With the first Zen 5 CPUs and SoCs shipping later this month, AMD offered a closer look at the architectural improvements behind the platform’s 16 percent increase in instructions per clock (IPC) during its Tech Day event in LA last week.
House of Zen’s 9000 series was announced at Computex in June and follows a similar pattern to previous Ryzen desktop chips, with a choice of six, eight, 12, or 16 cores and up to 64MB of L3 cache on top-tier SKUs.
These same cores are at the heart of Ryzen AI 300, AMD’s answer to Qualcomm’s X-chips for AI PCs. The notebook SoC, codenamed Strix Point, features 12 cores – four Zen 5 and eight Zen 5c – along with a 50 TOPS NPU based on the chip shop’s XDNA 2 architecture.
But while core count, cache, and power all play a role in processor performance, a large portion of AMD’s gains come from architectural tweaks to the Zen 5 core itself. Combined with a node shrink to TSMC’s 4nm process technology, those low-level changes contribute anywhere from a 10 to 35 percent increase in performance. In AMD’s internal benchmarks, anyway.
AMD claims its Zen 5 cores deliver between 10 and 35 percent higher instructions per clock than the previous generation
Strengthen Zen
According to AMD CTO Mark Papermaster, the biggest improvements to Zen 5’s cores are in the front end, accounting for about 39 percent of the claimed IPC increase.
Notably, AMD has widened the front end to enable more branch predictions per cycle – a key contributor to performance on modern CPU cores – and implemented dual decode pipelines, along with i-cache and op-cache improvements to reduce latency and increase bandwidth.
AMD has increased the front-end, execution, and back-end bandwidth of the Zen 5 core to improve IPC
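To get a feel for why prediction accuracy matters so much, consider a contrived C sketch of our own – an illustration, not AMD’s code. The same loop run over random versus sorted data stresses the branch predictor very differently, and on a deep, wide core every misprediction costs a pipeline flush:

```c
// Our illustration: identical work, wildly different branch behavior.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static long sum_over_threshold(const int *v, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] >= 128)   // hard to predict when v[] is random
            sum += v[i];
    }
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    if (!v) return 1;
    for (int i = 0; i < N; i++) v[i] = rand() % 256;

    clock_t t0 = clock();
    long u = sum_over_threshold(v, N);
    clock_t t1 = clock();

    qsort(v, N, sizeof *v, cmp_int);   // sorted input = near-perfect prediction
    clock_t t2 = clock();
    long s = sum_over_threshold(v, N);
    clock_t t3 = clock();

    printf("unsorted: %ld in %ld ticks, sorted: %ld in %ld ticks\n",
           u, (long)(t1 - t0), s, (long)(t3 - t2));
    free(v);
    return 0;
}
```

On most hardware the sorted pass runs several times faster despite executing the same instructions – the predictor simply guesses right almost every time.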
This broader front end is coupled with a larger integer execution engine that can now dispatch and retire up to eight instructions per cycle, compared to six on Zen 4. AMD also increased the number of arithmetic logic units (ALUs) from four to six and implemented a more unified scheduler to make execution more efficient.
To take advantage of that wider machine without stalling it, AMD has also grown Zen 5’s out-of-order execution window by about 40 percent.
“What this does is it brings new levels of performance because it combines with those front-end improvements. It allows us to consume those instructions and take advantage of the improved predictions that are coming to us through the pipeline,” Papermaster explains.
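To illustrate what a bigger execution window buys in software terms, here’s another contrived C sketch of ours – compile with optimization low enough (say, -O1 or -fno-tree-vectorize) that the compiler doesn’t vectorize the loops away. The first function is one long dependency chain; the second hands the scheduler four independent chains it can spread across those six ALUs:

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

// One long dependency chain: every add waits on the previous one,
// so a bigger out-of-order window can't extract much parallelism.
static long dependent_sum(const int *v, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

// Four independent chains: the scheduler can keep several ALUs busy
// at once - the kind of code a wider window and more ALUs reward.
static long independent_sum(const int *v, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    if (!v) return 1;
    for (int i = 0; i < N; i++) v[i] = i & 0xff;
    printf("%ld %ld\n", dependent_sum(v, N), independent_sum(v, N));
    free(v);
    return 0;
}
```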
Another 27 percent of Zen 5’s IPC gain can be attributed to increased data bandwidth on the back end. Compared to the previous generation, AMD has grown the L1 data cache from 32KB to 48KB and doubled the maximum bandwidth between the L1 cache and the floating-point unit – more on that later.
Here’s a rough breakdown of the architectural improvements that contributed to Zen 5’s IPC uplift
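Once chips are in hand, those cache claims are easy to sanity-check. A minimal C sketch against Linux’s sysfs interface – assuming a Linux system, where index0 is normally the L1 data cache – reads the figures back:

```c
#include <stdio.h>

int main(void) {
    // index0 is usually L1d on x86 Linux; the 'level' and 'type'
    // files confirm which cache we're actually looking at.
    const char *base = "/sys/devices/system/cpu/cpu0/cache/index0/";
    const char *files[] = { "level", "type", "size" };
    char path[128], buf[64];

    for (int i = 0; i < 3; i++) {
        snprintf(path, sizeof path, "%s%s", base, files[i]);
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        if (fgets(buf, sizeof buf, f))
            printf("%s: %s", files[i], buf);  // sysfs values include '\n'
        fclose(f);
    }
    return 0;   // a Zen 5 part should report "size: 48K", per AMD's claim
}
```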
The key takeaway is that AMD hasn’t just beefed up the branch predictor or execution engine; it has tried to balance every element of the core to avoid bottlenecks and added latency. The result is a core that can get through more instructions per cycle than previous generations.
Zen 5 revamps AVX-512 implementation
The biggest IPC gains came from workloads leveraging the AVX-512 vector extensions, whose implementation has been redesigned this generation to provide a full 512-bit data path, as opposed to the “double-pumped” 256-bit approach we saw in Zen 4 back in 2022.
The one minor exception to all of this is mobile chips like Strix Point, where AMD opted to stick with a double-pumped AVX-512 implementation – likely to optimize for performance-per-watt and thermal constraints.
While Papermaster claims that Zen 5 can now run full 512-bit AVX workloads without frequency penalties, these instructions have historically run very hot. That’s not a huge issue on the desktop or in workstations, but it is less than ideal for notebooks with limited thermal headroom.
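For the curious, here’s a minimal sketch of the kind of code this affects – our illustration, not AMD’s. Each intrinsic below touches 16 single-precision floats at once; Zen 4 would crack each 512-bit operation into two 256-bit halves, while Zen 5 desktop parts push it through the native 512-bit data path:

```c
// Build with AVX-512 support, e.g.: gcc -O2 -mavx512f add512.c
#include <immintrin.h>
#include <stdio.h>

// Add two float arrays 16 lanes at a time, with a scalar tail.
static void add_f32(const float *a, const float *b, float *out, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);   // one 512-bit load per operand
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));
    }
    for (; i < n; i++)                        // leftover elements
        out[i] = a[i] + b[i];
}

int main(void) {
    float a[40], b[40], out[40];
    for (int i = 0; i < 40; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
    add_f32(a, b, out, 40);                   // exercises vector and tail paths
    printf("out[39] = %.1f\n", out[39]);      // expect 117.0
    return 0;
}
```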
Unsurprisingly, Papermaster was quick to emphasize the potential of the vector extensions to accelerate AI workloads on the CPU. And in machine learning, AMD is claiming a 32 percent increase in single-core performance over Zen 4. With its mobile chips in particular, AMD has emphasized the concept of running machine learning in every domain, not just on the integrated GPU or NPU.
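As a flavor of the CPU-side inference work those extensions target – again a sketch of ours, assuming a compiler with AVX-512 VNNI support, which Zen cores have offered since Zen 4 – the dpbusd instruction fuses the int8 multiply-accumulate at the heart of quantized models:

```c
// Build with: gcc -O2 -mavx512f -mavx512vnni vnni_dot.c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

// Dot product of n int8 values (a unsigned, b signed), n a multiple of 64.
static int32_t dot_s8(const uint8_t *a, const int8_t *b, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        // 64 multiply-accumulates per instruction: u8 x s8 products
        // summed in groups of four into 16 int32 accumulators.
        acc = _mm512_dpbusd_epi32(acc, va, vb);
    }
    return _mm512_reduce_add_epi32(acc);   // horizontal sum of the 16 lanes
}

int main(void) {
    uint8_t a[64];
    int8_t b[64];
    for (int i = 0; i < 64; i++) { a[i] = 1; b[i] = 2; }
    printf("%d\n", dot_s8(a, b, 64));      // expect 128
    return 0;
}
```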
Amid all the revelations of AMD’s Tech Day, it became clear that the Zen 5 and compact Zen 5c cores are architecturally identical in terms of functionality. As the name suggests, the latter trades clock speed for die area.
More to come
The first Zen 5 cores are expected to hit the market on July 31, but we’ll have to wait a while before they arrive in data centers.
There’s still a lot we don’t know about AMD’s Turin generation of Epycs. However, at Computex we learned that rumors of another core-count bump were true.
With 5th-gen Epyc, AMD will increase core counts by 50 percent over the previous generation. The Zen 5c part – the spiritual successor to Bergamo – is expected to utilize TSMC’s 3nm node and will feature 192 cores and 384 threads, up from Bergamo’s 128 cores. Meanwhile, the frequency-optimized Turin parts are expected to top out at 128 cores and 256 threads.
Oddly enough, AMD doesn’t seem to distinguish Turin from what we’re calling “Turin-c” in its marketing. That’s not too surprising, since the only difference between the two – at least at the core level – comes down to the frequency-voltage curve. The smaller Zen 5c cores trade peak frequency for higher density, but are otherwise identical in terms of features.
We expect there to be a few more surprises in store for the Turin launch, scheduled for sometime in the second half of the year.
Competition is increasing
Zen 5 arrives at a time when AMD is facing some of its stiffest competition in years, with Qualcomm unveiling a powerful Arm-compatible notebook chip and Intel readying a series of beefed-up CPUs across its Xeon and Core product families.
Within the client space, Qualcomm’s 45-TOPS NPU gave it an early lead in Microsoft’s Copilot+ AI PC push. AMD’s Strix Point looks to remedy this, but will have to compete against Intel’s recently announced Lunar Lake SoCs, due later in Q3.
It’s a similar story in the data center, where things have gotten particularly interesting with the launch of Intel’s 144-core Sierra Forest and upcoming 128-core Granite Rapids Xeon 6 platforms. In addition to an architectural overhaul and a shift to a new chiplet design, these chips also make the move to the Intel 3 process node.
At the same time, more cloud providers than ever are leaning on custom Arm-based silicon for their hyperscale workloads. AWS’s Graviton is now in its fourth generation and generally available, while Microsoft and Google have both begun deploying their own Arm cores.
Whether AMD’s IPC gains and higher core counts in the data center will help it gain market share in this competitive arena remains to be seen. In any case, we’re told Zen 6 and Zen 6c are already in the works – as to when we’ll see them, your guess is as good as ours. ®