The cost of running language models is insane. ChatGPT's compute alone is estimated to cost more than $100,000 per day to serve the billions of requests it receives.
Led by Rui-Jie Zhu, we have developed the first MatMul-free language model (VMM/MMM-free) to scale beyond a billion parameters. Our previous work, SpikeGPT, topped out at about 216M parameters, but our latest model scales to 2.7B parameters (limited only by our compute budget). We're pretty confident it can keep going.
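For anyone wondering what "MatMul-free" looks like in practice: the dense weights are constrained to ternary values {-1, 0, +1}. Here's a minimal NumPy sketch of an absmean-style ternary quantizer (an illustrative assumption, not necessarily the exact recipe; see the preprint for details):

```python
import numpy as np

def ternarize(w: np.ndarray) -> np.ndarray:
    """Quantize weights to {-1, 0, +1} using an absmean-style scale.
    Illustrative sketch only; the preprint gives the exact scheme."""
    scale = np.mean(np.abs(w)) + 1e-8  # guard against all-zero weights
    return np.clip(np.round(w / scale), -1.0, 1.0)
```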
We provide a GPU-optimized implementation that uses 61% less VRAM than an unoptimized implementation during training.
However, GPUs aren't yet fully optimized for several of the operations in this model, such as ternary operations. So Ethan Sifferman, Tyler Sheaves, and Dustin R. built a custom FPGA implementation to squeeze the most out of them, and it reaches human-reading throughput at just 13 W: a little less than the power consumed by the human brain.
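This is also why an FPGA is such a good fit: with ternary weights, a matrix-vector product reduces to additions and subtractions, with no multiplications at all. A self-contained toy sketch of that idea (a hypothetical helper, not our actual kernel):

```python
import numpy as np

def ternary_matvec(w_t: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product with {-1, 0, +1} weights: each output is a
    sum of some inputs minus a sum of others -- no multiplies needed."""
    return np.array([x[row == 1].sum() - x[row == -1].sum() for row in w_t])

W_t = np.array([[1, -1, 0, 1],
                [0, 1, -1, -1]], dtype=float)
x = np.array([0.5, -2.0, 3.0, 1.0])
assert np.allclose(ternary_matvec(W_t, x), W_t @ x)  # identical result
```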
Preprint: https://lnkd.in/gaWbg7ss
GitHub training code: https://lnkd.in/gKFzQs_z
Pre-trained models on HuggingFace: https://lnkd.in/gDXFjPdm