Cerebras claims its technology is 75 times faster than leading hyperscaler GPUs.
Cerebras has achieved a processing speed of 969 tokens per second with Meta’s Llama 3.1 405B large language model, which the company says is 75 times faster than the quickest GPU-based AI service on Amazon Web Services.
This performance was recorded on the Cerebras Inference cloud AI service, which relies on the company’s third-generation Wafer Scale Engines rather than traditional GPUs from Nvidia or AMD. Since its launch in August, Cerebras Inference has been touted as significantly faster at generating tokens, the basic units that make up a large language model’s (LLM’s) responses. Initially, the service was reported to be about 20 times faster than Nvidia GPUs offered by cloud providers such as Amazon Web Services for smaller models like Llama 3.1 8B and Llama 3.1 70B.
In July, however, Meta introduced the Llama 3.1 405B model, which is significantly larger, containing 405 billion parameters compared to the 70 billion of Llama 3.1 70B. Cerebras has now demonstrated that its Wafer Scale Engine processors can run this vast LLM at what it describes as “instant speed,” delivering 969 tokens per second with a time-to-first-token of just 0.24 seconds: a record not only for Cerebras hardware but also for the Llama 3.1 405B model itself.
Compared with Nvidia GPUs available through AWS, Cerebras Inference ran 75 times faster, and it was still 12 times quicker than the fastest Nvidia GPU setup from Together AI. Even SambaNova, a rival AI processor designer, was outperformed by a factor of six.
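To make those multipliers concrete, here is a back-of-envelope sketch in Python built only from the figures quoted above. It assumes the 969 tokens-per-second rate is sustained for an entire response, and it derives the competitors’ implied throughput simply by dividing by the quoted multipliers.

```python
# Back-of-envelope math using only the figures quoted above. It assumes the
# 969 tokens/s rate is sustained for the full response.

TOKENS_PER_SECOND = 969        # Cerebras Inference running Llama 3.1 405B
TIME_TO_FIRST_TOKEN = 0.24     # seconds

def response_time(output_tokens: int) -> float:
    """Estimate end-to-end latency for an answer of a given length."""
    return TIME_TO_FIRST_TOKEN + output_tokens / TOKENS_PER_SECOND

# Throughput implied for the compared services by the quoted multipliers.
implied_rates = {
    "AWS (fastest GPU-based service)": TOKENS_PER_SECOND / 75,        # ~13 tokens/s
    "Together AI (fastest Nvidia GPU setup)": TOKENS_PER_SECOND / 12,  # ~81 tokens/s
    "SambaNova": TOKENS_PER_SECOND / 6,                                # ~162 tokens/s
}

if __name__ == "__main__":
    print(f"1,000-token answer on Cerebras Inference: ~{response_time(1000):.2f} s")
    for name, rate in implied_rates.items():
        print(f"{name}: ~{rate:.0f} tokens/s implied")
```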
To put this into perspective, Cerebras prompted both Fireworks (the fastest AI cloud service using GPUs) and its own Inference service to develop a chess program in Python. It took Cerebras Inference merely three seconds to complete the task, while Fireworks required 20 seconds.
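For anyone curious how such a wall-clock comparison might be reproduced, the following is a minimal sketch. It assumes the provider exposes an OpenAI-compatible chat endpoint; the base URL and model name are placeholders, not values taken from the article.

```python
# A minimal sketch of how a wall-clock comparison like this could be timed.
# Assumptions not taken from the article: the provider exposes an
# OpenAI-compatible chat endpoint, and the base URL and model name below
# are placeholders to be replaced with real values.
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://inference-provider.example/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

prompt = "Write a chess program in Python."

start = time.perf_counter()
response = client.chat.completions.create(
    model="llama-3.1-405b",  # placeholder model identifier
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.perf_counter() - start

answer = response.choices[0].message.content
print(f"Generated {len(answer)} characters in {elapsed:.1f} s")
```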
Here’s a glimpse of what instant 405B performance looks like: Cerebras versus the fastest GPU cloud. pic.twitter.com/d49pJmh3yT (November 18, 2024)
Cerebras announced that Llama 3.1 405B running on its system is the world’s fastest frontier model, 12 times quicker than GPT-4o and 18 times faster than Claude 3.5 Sonnet. The company attributes this to Meta’s open approach paired with its own inference technology, which together allow Llama 3.1 405B to run more than 10 times faster than the leading closed frontier models.
Even when the prompt size was increased from 1,000 tokens to 100,000 tokens (on the order of 75,000 words of input), Cerebras Inference still managed 539 tokens per second. Of the five other services capable of handling a task this large, the next best achieved only 49 tokens per second.
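The rough arithmetic behind that comparison is sketched below; the 0.75 words-per-token ratio is a common rule of thumb rather than a figure from the article, and the 1,000-token answer length is assumed purely for illustration.

```python
# Rough arithmetic for the long-context comparison above. The 0.75
# words-per-token ratio is a common rule of thumb, not a figure from the article.

PROMPT_TOKENS = 100_000
WORDS_PER_TOKEN = 0.75            # rough heuristic for English text
CEREBRAS_RATE = 539               # tokens/s with the 100,000-token prompt
NEXT_BEST_RATE = 49               # tokens/s, next-fastest of the services tested

print(f"Prompt length: ~{PROMPT_TOKENS * WORDS_PER_TOKEN:,.0f} words")

ANSWER_TOKENS = 1_000             # assumed answer length for illustration
for name, rate in [("Cerebras Inference", CEREBRAS_RATE),
                   ("Next-best service", NEXT_BEST_RATE)]:
    print(f"{name}: ~{ANSWER_TOKENS / rate:.1f} s to generate a {ANSWER_TOKENS:,}-token answer")
```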
Cerebras also highlighted that a single unit of its second-generation Wafer Scale Engine outperformed the Frontier supercomputer by 768 times in a molecular dynamics simulation. Frontier, which is powered by 9,472 AMD Epyc CPUs, was the world’s fastest supercomputer until the recent launch of El Capitan.
Moreover, the Cerebras chip surpassed the Anton 3 supercomputer’s performance by 20%, a notable achievement considering Anton 3 was specifically designed for molecular dynamics simulations; it also marked the first time a computer reached over one million simulation steps per second.