New Preprint: “Scalable MatMul-free Language Modeling” by Ph.D. Candidate Rui-Jie Zhu

The cost of serving language models is insane: ChatGPT’s compute alone is estimated at more than $100,000 per day to handle the billions of requests it receives.

Led by Rui-Jie Zhu, we have developed the first MatMul-free language model (free of both vector-matrix and matrix-matrix multiplications, i.e., VMM/MMM-free) to scale beyond a billion parameters. Our previous work with SpikeGPT tapped out at about 216M parameters, but our latest model scales to 2.7B parameters (limited only by our compute). We’re pretty certain it can keep going.
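How can a language model avoid matrix multiplication? The core trick is constraining weights to {-1, 0, +1}, so every “multiply” collapses into an add, a subtract, or a skip. The exact quantization recipe is in the preprint; as a rough illustration, here is a minimal PyTorch sketch assuming a BitNet-style absmean scheme (the function name and details are mine, not from our repo):

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a float weight matrix to {-1, 0, +1} with a per-tensor scale.

    Assumes a BitNet-b1.58-style absmean scheme as an illustration;
    see the preprint for the actual recipe.
    """
    scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)   # snap to {-1, 0, +1}
    return w_ternary, scale

# With ternary weights, y = x @ W needs no true multiplies:
# each output element is a signed sum of inputs, scaled once at the end.
x = torch.randn(4, 8)
w = torch.randn(8, 16)
wq, s = ternarize(w)
y = (x @ wq) * s   # this "matmul" degenerates to adds and subtracts
```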

We provide a GPU-optimized implementation that uses 61% less VRAM during training than an unoptimized baseline.

However, GPUs aren’t yet fully optimized for several of the operations this model relies on, such as ternary operations. So Ethan Sifferman, Tyler Sheaves and Dustin R. built a custom FPGA implementation to really milk it: we reach human-reading throughput at just 13 W, a little less than the power consumed by the human brain (~20 W).
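Why are ternary operations such a good fit for custom hardware? Because a “multiply” by -1, 0, or +1 needs no multiplier circuit at all. A toy scalar-level sketch of the idea (illustrative only, nothing like the actual FPGA RTL):

```python
def ternary_dot(x: list[float], w: list[int]) -> float:
    """Dot product where weights live in {-1, 0, +1}.

    No multiplier is needed: each term is an add, a subtract,
    or a skip, which is what makes this cheap to lay down on an FPGA.
    """
    acc = 0.0
    for xi, wi in zip(x, w):
        if wi == 1:
            acc += xi
        elif wi == -1:
            acc -= xi
        # wi == 0: skip the term entirely
    return acc

print(ternary_dot([0.5, -2.0, 3.0], [1, 0, -1]))  # 0.5 - 3.0 = -2.5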

Preprint: https://lnkd.in/gaWbg7ss

GitHub training code: https://lnkd.in/gKFzQs_z

Pre-trained models on HuggingFace: https://lnkd.in/gDXFjPdm
