Analysis AMD expects that within a few years its notebook chips will be able to run 30-billion-parameter large language models locally at a rate of 100 tokens per second.
Achieving this goal – which also calls for 100ms first token latency – isn’t as simple as it sounds, and will require optimizations on both the software and hardware fronts. As it stands, AMD claims its Ryzen AI 300-series Strix Point processors, announced at Computex last month, can run LLMs of up to around seven billion parameters at 4-bit precision, at a modest 20 tokens per second with one to four seconds of first token latency.
AMD aims to execute 30 billion parameter models at 100 tokens per second (Tok/sec), compared to 7 billion parameters and 20 Tok/sec today
Achieving the 30 billion parameter, 100 tokens per second “North Star” performance target isn’t just a matter of cramming in a bigger NPU. More TOPS or FLOPS will certainly help – especially with first token latency – but for running large language models locally, memory capacity and bandwidth matter far more.
In this regard, LLM performance on Strix Point is largely limited by the 128-bit memory bus, which when paired with LPDDR5x is good for somewhere in the neighborhood of 120-135GBps of bandwidth depending on how fast your memory is.
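As a sanity check on that figure, peak bandwidth is just the bus width in bytes multiplied by the transfer rate. A minimal sketch, using LPDDR5x speed grades of 7,500 and 8,533 MT/s as our own assumptions, lands in the same ballpark as the range quoted above:

    # Back-of-envelope peak bandwidth for a 128-bit LPDDR5x bus. The speed
    # grades are our assumptions; sustained bandwidth will be somewhat lower.
    BUS_WIDTH_BITS = 128
    for mega_transfers in (7500, 8533):
        gb_per_sec = (BUS_WIDTH_BITS / 8) * mega_transfers / 1000
        print(f"LPDDR5x-{mega_transfers}: ~{gb_per_sec:.0f} GB/s")
    # LPDDR5x-7500: ~120 GB/s, LPDDR5x-8533: ~137 GB/s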
If you take AMD's target literally, a 30 billion parameter model quantized to 4-bit precision will consume about 15GB of memory and demand more than 1.5TBps of bandwidth to hit that 100 tokens per second target. For reference, that's roughly the memory bandwidth of a 40GB Nvidia A100 PCIe card with HBM2 – a card with a far larger power budget than any notebook SoC.
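Those figures follow from simple arithmetic: generating each token means streaming essentially the full set of weights through the chip, so the bandwidth required is roughly model size multiplied by token rate. A rough sketch of that reasoning, with today's seven billion parameter case included for comparison:

    # Rough weight size and bandwidth needed to sustain a token rate, assuming
    # every weight is read once per generated token (ignoring the KV cache).
    def llm_requirements(params_billion, bits_per_param, tokens_per_sec):
        size_gb = params_billion * bits_per_param / 8
        bandwidth_gbps = size_gb * tokens_per_sec
        return size_gb, bandwidth_gbps

    print(llm_requirements(7, 4, 20))    # today:          ~3.5 GB,  ~70 GB/s
    print(llm_requirements(30, 4, 100))  # the North Star: ~15 GB, ~1,500 GB/s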
This means that without optimizations to make the models less demanding, future AMD SoCs will need much faster, higher-capacity LPDDR memory to achieve the chip designer's goal.
AI is evolving faster than silicon
These challenges have not gone unnoticed by Mahesh Subramony, a senior fellow and silicon design engineer who works on SoC development at AMD.
“We know how to get there,” Subramony told The Register. But while it may be possible to design a part today that could achieve AMD's goals, that doesn't do much good if no one can afford to use it or if there is nothing that can benefit from it.
“If proliferation starts by saying everybody should have a Ferrari, then cars won’t proliferate. You have to start by saying everybody’s going to have a great machine, and you have to start by showing what you can do with it responsibly,” he explained.
“We need to build a SKU that meets the needs of 95 percent of the people,” he continued. “I would rather have a $1,300 laptop and have my cloud run my 30 billion parameter model. It’s still cheaper today.”
When it comes to demonstrating the value of AI PCs, AMD is leaning heavily on its software partners. With products like Strix Point, that largely means Microsoft. “When Strix first started, we had a deep partnership with Microsoft that really drove our bounding box in a sense,” he recalled.
But while software can help determine the direction of new hardware, it can take years to develop and launch a new chip, Subramony explained. “Gen AI and AI use cases are evolving much faster than that.”
Now that ChatGPT has been around for two years, however, AMD has had time to chart the technology's evolution, and Subramony believes the company now has a better sense of where demand for compute is headed. That's undoubtedly one of the reasons AMD set this target.
Overcoming bottlenecks
There are several ways to get around the memory bandwidth challenge. For example, LPDDR5x could be swapped for high-bandwidth memory, but as Subramony noted, that isn't an attractive trade, since it would drastically increase both the cost and the power consumption of the SoC.
“If we can’t get to a 30 billion parameter model, we need to find something that delivers the same kind of reliability. That means making improvements in training to make those models smaller first,” Subramony explained.
The good news is that there are quite a few ways to do this, depending on whether you want to prioritize memory bandwidth or capacity.
One possible approach is to use a mixture of experts (MoE) model along the lines of Mistral AI's Mixtral. These MoEs are essentially a bundle of smaller models working together. Normally the entire MoE is loaded into memory, but because only one sub-model is active at a time, memory bandwidth requirements are significantly reduced compared to a monolithic model of equivalent size.
An MoE consisting of six models with five billion parameters each would need just over 250GBps of bandwidth to reach the target of 100 tokens per second, assuming 4-bit precision.
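That estimate falls out of the same bandwidth arithmetic, assuming a single active five billion parameter expert per token as described above (designs that activate more than one expert per token would scale the figure accordingly):

    # Bandwidth for a hypothetical 6 x 5B-parameter MoE at 4-bit precision,
    # assuming one expert's weights are streamed per generated token.
    experts, params_per_expert, bits, tokens_per_sec = 6, 5e9, 4, 100

    active_bytes_per_token = params_per_expert * bits / 8           # one expert
    bandwidth_gbps = active_bytes_per_token * tokens_per_sec / 1e9  # ~250 GB/s
    resident_gb = experts * params_per_expert * bits / 8 / 1e9      # all experts

    print(f"Resident in memory: ~{resident_gb:.0f} GB, "
          f"bandwidth needed: ~{bandwidth_gbps:.0f} GB/s")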
Another approach is to use speculative decoding – a process in which a small, lightweight draft model generates output that is then checked and corrected by a larger model. AMD told us this approach delivers significant performance improvements – but it doesn't necessarily address the fact that LLMs are memory-hungry.
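In broad strokes, the greedy variant of speculative decoding looks something like the sketch below. The draft_model and target_model objects and their methods are hypothetical stand-ins; the point is that the large, bandwidth-hungry model verifies a whole batch of drafted tokens in one pass instead of generating them one at a time:

    # Minimal sketch of greedy speculative decoding. draft_model and
    # target_model (and their methods) are hypothetical stand-ins.
    def speculative_step(draft_model, target_model, context, k=4):
        drafted = draft_model.generate(context, num_tokens=k)     # cheap guesses
        verified = target_model.greedy_tokens(context, drafted)   # one big pass
        accepted = []
        for guess, truth in zip(drafted, verified):
            accepted.append(truth)     # the big model's token always wins
            if guess != truth:         # first disagreement: stop here and
                break                  # re-draft from the corrected context
        return context + accepted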
Most models today are trained at bfloat16 (BF16) or FP16 precision, which means each parameter consumes two bytes. A model with 30 billion parameters would therefore require 60GB of memory to run at native precision.
But since that's impractical for the vast majority of users, models are commonly quantized to 8- or 4-bit precision. This trades away some accuracy and increases the chance of hallucinations, but shrinks the memory footprint to as little as a quarter of its FP16 size. As we understand it, this is how AMD runs a seven billion parameter model at about 20 tokens per second today.
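The footprint arithmetic is simple enough – bytes equals parameters multiplied by bits per parameter, divided by eight – and is worth spelling out for the model sizes discussed here:

    # Weight memory footprint at various precisions (weights only; the KV
    # cache and activations add more on top).
    def footprint_gb(params_billion, bits_per_param):
        return params_billion * bits_per_param / 8

    for params in (7, 30):
        for bits, name in ((16, "FP16/BF16"), (8, "Int8"), (4, "4-bit")):
            print(f"{params}B @ {name}: {footprint_gb(params, bits):.1f} GB")
    # 7B:  14.0 / 7.0 / 3.5 GB      30B: 60.0 / 30.0 / 15.0 GB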
New forms of acceleration can help
As a compromise of sorts, starting with Strix Point, the XDNA 2 NPU supports the Block FP16 datatype. Despite the name, it requires only nine bits per parameter, which it manages by having groups of eight floating point values share a single exponent. According to AMD, the format achieves accuracy nearly indistinguishable from native FP16 while consuming only slightly more space than Int8.
More importantly, we are told that the format does not require models to be retrained to take advantage of it: existing BF16 and FP16 models work without a quantization step.
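That nine-bit figure falls out of the shared exponent. Assuming each block holds eight values that each keep an eight-bit sign-and-mantissa and share one eight-bit exponent – our reading of AMD's description rather than a published spec – the storage per parameter works out like this:

    # Why Block FP16 works out to roughly nine bits per parameter, assuming
    # blocks of eight values sharing a single eight-bit exponent (an
    # illustration of AMD's description, not a published spec).
    block_size = 8
    bits_per_value = 8         # per-element sign and mantissa
    shared_exponent_bits = 8   # one exponent for the whole block

    bits_per_param = bits_per_value + shared_exponent_bits / block_size
    print(bits_per_param)                   # 9.0 bits per parameter
    print(30e9 * bits_per_param / 8 / 1e9)  # ~33.75 GB of weights for a 30B model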
But unless the average notebook comes with 48GB or more of memory, AMD will have to find better ways to shrink the size of the model.
Although not explicitly mentioned, it's not hard to imagine that future AMD NPUs and/or integrated graphics will support smaller block floating point formats [PDF], such as MXFP6 or MXFP4. We already know that AMD's CDNA datacenter GPUs support FP8, and that CDNA 4 will support FP4.
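For what it's worth, the OCP microscaling spec those formats come from uses 32-element blocks with a shared eight-bit scale, which puts effective storage at roughly 6.25 and 4.25 bits per parameter – enough, on paper, to bring a 30 billion parameter model's weights into laptop-friendly territory:

    # Effective bits per parameter for MX block floating point formats,
    # assuming the spec's 32-element blocks with a shared 8-bit scale, and
    # the resulting weight footprint for a 30 billion parameter model.
    BLOCK_SIZE, SCALE_BITS = 32, 8
    for name, elem_bits in (("MXFP6", 6), ("MXFP4", 4)):
        bits_per_param = elem_bits + SCALE_BITS / BLOCK_SIZE
        gb = 30e9 * bits_per_param / 8 / 1e9
        print(f"{name}: {bits_per_param} bits/param -> ~{gb:.1f} GB for 30B weights")
    # MXFP6: 6.25 bits/param -> ~23.4 GB; MXFP4: 4.25 bits/param -> ~15.9 GB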
Either way, it looks like PC hardware is going to change dramatically in the coming years as AI moves out of the cloud and into your devices.