Large language models can be made 50 times more energy efficient with alternative mathematics and custom hardware, researchers at the University of California, Santa Cruz claim.
In an article titled “Scalable MatMul-free Language Modeling,” authors Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian describe how to moderate artificial intelligence’s energy hunger by eliminating matrix multiplication and adding a custom field programmable gate array (FPGA).
AI – by which we mean predictive, hallucinatory machine learning models – has been terrible for keeping the Earth livable because it consumes so much energy, much of which comes from fossil fuels. Operating data centers to provide AI services has increased Microsoft’s CO2 emissions by 29.1 percent since 2020, and AI-powered Google searches use 3.0 Wh each, ten times more than traditional Google searches.
Earlier this year, a report published by the International Energy Agency [PDF] projected that global data center energy consumption will nearly double by 2026, from 460 TWh in 2022 to just over 800 TWh. The hunger for energy to power AI has even revived interest in nuclear power, as accelerating fossil fuel consumption for the sake of chatbots, bland marketing copy, and on-demand image generation has become politically fraught, if not a potential crime against humanity.
Jason Eshraghian, assistant professor of electrical and computer engineering at the UC Santa Cruz Baskin School of Engineering and lead author of the paper, told The Register that the research results could deliver 50x energy savings using custom FPGA hardware.
“I have to keep in mind that our FPGA hardware was also very under-optimized,” Eshraghian said. “So there’s still a lot of room for improvement.”
Even so, the prototype is already impressive: a billion-parameter LLM can run on the custom FPGA drawing just 13 watts, compared to the 700 watts that would be required using a GPU.
To achieve this, the US-based researchers had to move away from matrix multiplication, the computationally expensive linear algebra operation at the heart of most machine learning. Instead of multiplying weights (the parameters that connect neural network layers) stored as floating-point numbers, the computer scientists constrain them to binary {0, 1} or ternary {-1, 0, 1} values, so the hardware only needs to add and subtract, placing far less demand on it.
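To get a feel for what that means in practice, here is a minimal NumPy sketch, written for this article rather than taken from the team's code, of a dense layer whose weights are confined to {-1, 0, 1}: the usual multiply-accumulate collapses into selecting inputs and adding or subtracting them. The function name and shapes are purely illustrative.

```python
import numpy as np

def ternary_matmul_free(x, w_ternary):
    """x: (batch, d_in) activations; w_ternary: (d_in, d_out) with entries in {-1, 0, 1}.

    Equivalent to x @ w_ternary, but computed by adding the inputs where the
    weight is +1 and subtracting them where it is -1, with no multiplications.
    """
    batch = x.shape[0]
    d_out = w_ternary.shape[1]
    out = np.zeros((batch, d_out), dtype=x.dtype)
    for j in range(d_out):
        plus = w_ternary[:, j] == 1     # input columns to add
        minus = w_ternary[:, j] == -1   # input columns to subtract
        out[:, j] = x[:, plus].sum(axis=1) - x[:, minus].sum(axis=1)
    return out

# Sanity check against the ordinary matrix product
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8)).astype(np.float32)
w = rng.integers(-1, 2, size=(8, 4)).astype(np.float32)
assert np.allclose(ternary_matmul_free(x, w), x @ w, atol=1e-5)
```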
Other researchers have explored alternative architectures for neural networks in recent years. One of these, BitNet, has shown promise as a way to reduce energy consumption through simpler math. As described in an article released in February, representing neural network parameters (weights) as {-1, 0, 1} instead of using 16-bit floating point precision can achieve high performance with far fewer computations.
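The quantization step itself is simple in spirit. The sketch below is loosely modelled on the absmean-style scheme described in the BitNet line of work, not lifted from it: scale a full-precision weight matrix by its mean absolute value, then round every entry to the nearest of {-1, 0, 1}.

```python
import numpy as np

def ternarize(w, eps=1e-5):
    """Quantize a float weight matrix to {-1, 0, 1} using an absmean-style scale."""
    scale = np.mean(np.abs(w)) + eps             # per-tensor scaling factor
    w_q = np.clip(np.round(w / scale), -1, 1)    # every entry becomes -1, 0, or 1
    return w_q, scale                            # scale is kept to rescale outputs later

w = np.random.randn(4, 4).astype(np.float32)
w_q, scale = ternarize(w)
print(np.unique(w_q))   # subset of [-1.  0.  1.]
```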
The work of Eshraghian and his co-authors shows what can be done with this architecture. Sample code has been published to GitHub.
Eshraghian said that using “ternary weights replaces multiplication with addition and subtraction, which is computationally much cheaper in terms of memory usage and the energy of operations actually performed.”
That, he said, is combined with the replacement of ‘self-attention’, the backbone of transformer models, with an ‘overlay’ approach.
“In self-attention, every element of a matrix interacts with every other element,” he said. “In our approach, one element only interacts with one other element. Fewer calculations lead to poorer performance by default. We compensate for this by having a model that evolves over time.”
Eshraghian explained that transformer-based LLMs process all the text at once. “Our model takes in the text piece by piece, so it tracks where a particular word sits in the broader context by taking time into account,” he said.
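A toy illustration of that idea, not the authors' actual recurrent unit, looks something like this: instead of building an attention matrix in which every token interacts with every other token, a hidden state is updated element-wise as tokens arrive in order, so each dimension only ever mixes with itself. (Element-wise products survive; it is the big dense matrix multiplications that go.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elementwise_recurrence(tokens, gate_inputs):
    """tokens, gate_inputs: (seq_len, d) arrays of per-token features (hypothetical)."""
    hidden = np.zeros(tokens.shape[1])
    for x_t, g_t in zip(tokens, gate_inputs):
        f = sigmoid(g_t)                        # element-wise forget gate in (0, 1)
        hidden = f * hidden + (1.0 - f) * x_t   # dimension i only ever touches dimension i
    return hidden                               # running summary, built up token by token

seq = np.random.randn(6, 8)      # six tokens, eight features each
gates = np.random.randn(6, 8)
print(elementwise_recurrence(seq, gates).shape)   # (8,)
```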
Relying on ternary representation of data hinders performance, Eshraghian acknowledged, but he and his co-authors have found ways to compensate for that effect.
“Given the same number of computations, we perform at the same level as Meta’s open source LLM,” he said. “However, our computations are ternary operations and therefore much cheaper (in terms of energy/power/latency). For a given amount of memory, we do much better.”
Even without the custom FPGA hardware, this approach looks promising. The paper claims that fused kernels in the GPU implementation of ternary dense layers can speed up training by 25.6 percent while reducing memory consumption by 61 percent compared to a GPU baseline.
“Additionally, using lower bit-optimized CUDA kernels increases inference speed by 4.57 times and reduces memory usage by a factor of 10 when the model scales to 13 billion parameters,” the article claims.
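Part of where savings on that scale come from is easy to see with back-of-the-envelope arithmetic: a ternary weight needs only two bits, against 16 bits for a half-precision float, so four weights fit into a single byte. The sketch below shows one way to pack them; it is purely illustrative and has nothing to do with the paper's CUDA kernels.

```python
import numpy as np

def pack_ternary(w_q):
    """w_q: flat array with entries in {-1, 0, 1}; returns four weights per uint8."""
    codes = (np.asarray(w_q) + 1).astype(np.uint8)    # map {-1, 0, 1} to {0, 1, 2}
    codes = np.pad(codes, (0, (-len(codes)) % 4))     # pad to a multiple of four
    codes = codes.reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6))  # four 2-bit codes per byte

weights = np.random.randint(-1, 2, size=1_000_000)
packed = pack_ternary(weights)
print(packed.nbytes)   # 250000 bytes, versus 2000000 bytes for the same weights in fp16
```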
“This work goes beyond just software implementations of lightweight models and shows how scalable, yet lightweight language models can reduce both computational requirements and energy consumption in the real world.” ®