Cerebras AI Smashes AWS, Writes Code 75x Faster Using World’s Biggest Chip!

Cerebras demo shows AI writing code in a fraction of the time the fastest GPU cloud needs, as the world's largest chip beats AWS's best GPU service 75x in head-to-head token generation

Cerebras claims its technology is 75 times faster than leading hyperscaler GPUs.

Cerebras has achieved a processing speed of 969 tokens per second with Meta's Llama 3.1 405B large language model, 75 times faster than the fastest GPU-based AI service on Amazon Web Services.

This performance was recorded on the Cerebras Inference cloud AI service, which runs on the company's third-generation Wafer Scale Engine rather than traditional GPUs from Nvidia or AMD. Since its launch in August, Cerebras Inference has been touted as significantly faster at generating tokens, the basic building blocks of a large language model's (LLM) responses. Initially, the service was reported to be about 20 times faster than Nvidia GPUs offered by cloud providers such as Amazon Web Services on smaller models like Llama 3.1 8B and Llama 3.1 70B.
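
For readers unfamiliar with the metric, a token is simply an integer ID drawn from a model's vocabulary. Here is a minimal Python sketch using OpenAI's tiktoken library purely as a stand-in; Llama models ship their own tokenizer, so exact counts differ:

```python
# pip install tiktoken
import tiktoken

# tiktoken is OpenAI's tokenizer, used here only to illustrate what a
# "token" is; Llama models use their own tokenizer, so counts will differ.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Cerebras Inference generates 969 tokens per second.")

print(len(tokens))  # a short sentence maps to roughly a dozen token IDs
print(tokens[:5])   # each token is just an integer index into the vocabulary
```

As a rough rule of thumb, one token corresponds to about 0.75 English words.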

In July, however, Meta introduced the Llama 3.1 405B model, which is significantly more complex, with 405 billion parameters versus the 70 billion of Llama 3.1 70B. Cerebras has now demonstrated that its Wafer Scale Engine processors can run this vast LLM at what it describes as "instant speed," delivering 969 tokens per second and a time-to-first-token of just 0.24 seconds: a record not only for Cerebras hardware but for the Llama 3.1 405B model itself.
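
The two figures quoted here, tokens per second and time-to-first-token (TTFT), are straightforward to measure against any streaming chat API. Below is a minimal sketch using the openai Python client; the base_url, api_key, and model name are placeholders, not Cerebras's actual endpoint:

```python
import time
from openai import OpenAI

# Placeholder endpoint and model name -- substitute your provider's values.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="llama-3.1-405b",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Explain wafer-scale chips."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first content arrives
        chunks += 1

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f} s")
print(f"~{chunks / elapsed:.0f} chunks/s (streamed chunks approximate tokens)")
```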

Compared with Nvidia GPUs available through AWS, Cerebras Inference ran 75 times faster; it was also 12 times quicker than the fastest Nvidia GPU deployment from Together AI. Even SambaNova, a competitor in AI processor design, was outperformed by a factor of six.

To put this into perspective, Cerebras prompted both Fireworks (the fastest GPU-based AI cloud service) and its own Inference service to write a chess program in Python. Cerebras Inference completed the task in just three seconds, while Fireworks needed 20.
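
Those two wall-clock times line up with the throughput figures above. A quick back-of-the-envelope check, assuming generation time dominates and both services produce a program of similar length (our assumptions, not figures from Cerebras):

```python
# Figures cited in the article:
cerebras_rate = 969   # tokens per second
cerebras_time = 3     # seconds to finish the chess program
fireworks_time = 20   # seconds for the same task

# Implied size of the generated program and the GPU cloud's implied rate:
program_tokens = cerebras_rate * cerebras_time        # ~2,900 tokens
implied_gpu_rate = program_tokens / fireworks_time    # ~145 tokens/s

print(f"~{program_tokens} tokens of generated code")
print(f"implied GPU-cloud rate: ~{implied_gpu_rate:.0f} tokens/s")
```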

Cerebras announced that Llama 3.1 405B on its system is the world's fastest frontier model, 12 times faster than GPT-4o and 18 times faster than Claude 3.5 Sonnet. The company attributes this to Meta's open approach paired with Cerebras's inference technology, which lets Llama 3.1 405B run more than 10 times faster than the other leading closed frontier models.


Even when the prompt was scaled up from 1,000 tokens to 100,000 tokens (roughly 75,000 words of input), Cerebras Inference sustained 539 tokens per second. Of the five other services able to handle a task this large, the next best managed only 49 tokens per second.
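
To see what that gap means in practice, here is the time each service would need to produce a 2,000-token answer to such a prompt (the answer length is an assumed figure for illustration):

```python
response_tokens = 2_000  # assumed answer length, for illustration only

for name, rate in [("Cerebras Inference", 539), ("next-best service", 49)]:
    print(f"{name}: {response_tokens / rate:.1f} s")  # ~3.7 s vs. ~40.8 s
```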

Cerebras also highlighted that a single second-generation Wafer Scale Engine outperformed the Frontier supercomputer by a factor of 768 in a molecular dynamics simulation. Frontier, which is powered by 9,472 AMD Epyc CPUs, was the world's fastest supercomputer until the recent debut of El Capitan.

Moreover, the Cerebras chip beat the Anton 3 supercomputer by 20%, a notable result given that Anton 3 was purpose-built for molecular dynamics simulations; the run also marked the first time any computer exceeded one million simulation steps per second.
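
Translating that milestone into wall-clock terms (the timestep size below is a typical molecular dynamics value we assume, not a figure from Cerebras):

```python
steps_per_second = 1_000_000
print(f"{1e6 / steps_per_second:.1f} microseconds per simulation step")

# If each step advances the system by ~2 femtoseconds (a common MD setting,
# assumed here), that is ~2 nanoseconds of simulated time per real second.
fs_per_step = 2
ns_per_second = steps_per_second * fs_per_step / 1e6
print(f"~{ns_per_second:.0f} ns of simulated time per wall-clock second")
```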
