Nov 19, 2024, 01:07pm EST
“There is no supercomputer on earth, regardless of size, that can achieve this performance,” said Andrew Feldman, Co-Founder and CEO of the AI startup. As a result, scientists can now accomplish in a single day what took two years of GPU-based supercomputer simulations to achieve.
When Cerebras announced its record-breaking performance on the 70 billion parameter Llama 3.1, it was quite a surprise; Cerebras had previously focused on applying its Wafer Scale Engine (WSE) to the more difficult training part of the AI workflow. The memory on a CS3 is fast on-chip SRAM instead of the larger (and 10x slower) High Bandwidth Memory used in data center GPUs. Consequently, the Cerebras CS3 provides 7,000x more memory bandwidth than the Nvidia H100, addressing Generative AI's fundamental technical challenge: memory bandwidth.
And the latest result is just stupendous. Look at the charts above for performance over time, and below to compare the competitive landscape for Llama 3.1-405B. The entire industry occupies the upper left quadrant of the chart, showing output speeds below 100 tokens per second for the Meta Llama 3.1-405B model. Cerebras produced some 970 tokens per second, all at roughly the same price as GPU and custom ASIC services like SambaNova: $6 per million input tokens and $12 per million output tokens.
Compared to the competition, using 1,000 input tokens, Cerebras embarrassed GPUs, which all produced fewer than 100 tokens per second.
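To put the quoted figures in per-request terms, here is a rough back-of-the-envelope sketch in Python. The prices ($6 and $12 per million tokens), the 1,000 input tokens, and the 970 tokens-per-second rate come from the article; the 500-token response length is a hypothetical value chosen for illustration, and 100 tokens per second is used only as the upper bound the article gives for GPU services. The timing ignores time-to-first-token and network latency.

```python
# Prices and speeds are the figures quoted in the article.
INPUT_PRICE_PER_M = 6.0    # USD per million input tokens
OUTPUT_PRICE_PER_M = 12.0  # USD per million output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the quoted per-million-token prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream the output at a steady rate.

    Ignores time-to-first-token and network latency.
    """
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    input_tokens = 1000   # benchmark setting mentioned in the article
    output_tokens = 500   # hypothetical response length (assumption)

    print(f"cost per request: ${request_cost(input_tokens, output_tokens):.4f}")
    print(f"at 970 tok/s: {generation_seconds(output_tokens, 970):.2f} s")
    # 100 tok/s is the ceiling quoted for GPU services, so this is a best case.
    print(f"at 100 tok/s: {generation_seconds(output_tokens, 100):.2f} s")
```

Under these assumptions the price per request is the same on either kind of service, while the time to stream the answer drops from several seconds to well under one.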
https://www.forbes.com/sites/karlfreund ... ven-close/


