IBM’s New System Z CPU Offers 40 Percent More Performance per Socket, Integrated AI
IBM shared new details on its upcoming Telum CPU at Hot Chips, and the new microarchitecture looks to be a significant advance over the older z15. Telum will be IBM’s first 7nm CPU, and because it is built on Samsung’s EUV process, it is also a major opportunity for Samsung to demonstrate its EUV chops.
IBM’s Telum is a mainframe CPU, which means it operates in a very different compute environment than an x86 chip. Mainframes and servers are both integrated platforms with a large pool of nearby DRAM, various forms of attached storage, and a large number of CPU cores, but mainframes are architected for very different purposes than your typical x86 server.
Mainframes are designed to maximize system throughput and reliability to a degree that x86 servers don’t match. Where a traditional x86 system has moved as much processing away from accelerators and into the CPU or GPU as possible, mainframes make extensive use of offload hardware in order to keep the CPU available. Mainframes emphasize throughput, redundancy, and security with features that allow for hot-swapping processors or other components in ways x86 systems don’t support. Performance and feature comparisons between mainframes and servers can favor either the mainframe or the x86 system depending on what it is you’re trying to accomplish.
Generally speaking, mainframes are deployed in environments where throughput and reliability demands are high, component failure is unacceptable, and it’s better to pay for equipment that can withstand a CPU or RAM DIMM failure without crashing than to have to take the system offline for any length of time. Mainframes also maintain CPU responsiveness at very high levels of load. They take less of a latency penalty than x86 cores and they juggle I/O workloads more adroitly.
The IBM Telum is laid out differently than a typical x86 CPU because it has a somewhat different role within the system and because mainframes allocate resources very differently than a typical server.
The Telum is built on 7nm technology and measures 530 sq. mm. A chip like AMD’s Zen 2 Epyc with eight chiplets and an I/O die is roughly 592 sq. mm for the chiplets and 407 sq. mm for the I/O die. Since Epyc is a disaggregated chip and System Z uses off-die controllers to handle certain tasks, even comparing die size is a bit tricky. Each Telum contains eight CPU cores with SMT2 enabled, for a total of 16 threads per chip. A four-socket drawer contains eight chips in dual-chip modules (64 cores total) sharing a 2GB virtual cache, and four drawers can be connected for a total of 32 chips (256 cores / 512 threads).
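The core and thread totals above follow directly from the per-chip figures. As a quick sanity check, here is a minimal Python sketch of that arithmetic; the constant and function names are our own illustrative choices, not anything IBM publishes.

```python
# Topology figures quoted in the article; names are illustrative only.
CORES_PER_CHIP = 8
THREADS_PER_CORE = 2    # SMT2
CHIPS_PER_DRAWER = 8    # four sockets, each holding a dual-chip module
MAX_DRAWERS = 4


def totals(drawers):
    """Return chip, core, and thread counts for a given number of drawers."""
    chips = drawers * CHIPS_PER_DRAWER
    cores = chips * CORES_PER_CHIP
    return {"chips": chips, "cores": cores, "threads": cores * THREADS_PER_CORE}


if __name__ == "__main__":
    for drawers in (1, MAX_DRAWERS):
        print(drawers, "drawer(s):", totals(drawers))
```

One drawer works out to 64 cores, and a fully connected four-drawer system to 32 chips, 256 cores, and 512 threads, matching the figures above.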
Telum is a significant departure from IBM’s previous z15 architecture. The z15 offered just 12 cores per socket and relied on a large off-die cache and a separate System Control chip. Not only does Telum increase that to 16 cores per socket, but it also integrates new functionality on-die compared with previous z-machines.
Each Telum core has its own L1 and a 32MB L2. Because L2 cache data attached to one CPU core can be evicted to the L2 cache of a different core, the entire cache can also function as a 256MB “virtual” L3 for each Telum chip. Similarly, the L2 cache of a four-socket drawer can be addressed as a 2GB virtual L4 cache between all of the chips in the drawer. The L2 cache uses a 320GB/s bi-directional ring bus with an average latency of just 12ns. IBM claims that the Telum will run above 5GHz, which is no small achievement for a chip this complex.
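The virtual cache sizes are simply the per-core L2 multiplied up through the hierarchy. The short Python sketch below reproduces them and, as a rough illustration, converts the quoted 12ns average latency into clock cycles, assuming a clock of exactly 5GHz (IBM only says "above 5GHz", so treat the cycle count as a floor). All names here are hypothetical.

```python
# Cache figures quoted in the article; names are illustrative only.
L2_PER_CORE_MB = 32
CORES_PER_CHIP = 8
CHIPS_PER_DRAWER = 8


def virtual_l3_mb():
    """Combined L2 capacity of one chip, usable as a virtual L3."""
    return L2_PER_CORE_MB * CORES_PER_CHIP


def virtual_l4_gb():
    """Combined L2 capacity of one drawer, usable as a virtual L4."""
    return virtual_l3_mb() * CHIPS_PER_DRAWER // 1024


def latency_cycles(latency_ns, clock_ghz):
    """Nanoseconds times GHz gives clock cycles."""
    return latency_ns * clock_ghz


if __name__ == "__main__":
    print("virtual L3 per chip:", virtual_l3_mb(), "MB")
    print("virtual L4 per drawer:", virtual_l4_gb(), "GB")
    # Assumption: exactly 5 GHz; a faster clock means more cycles per 12 ns.
    print("12 ns at 5 GHz:", latency_cycles(12, 5), "cycles")
```

At the assumed 5GHz, a 12ns average latency corresponds to about 60 clock cycles.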
One new feature on Telum is an AI acceleration engine, which also serves to illustrate the different approach IBM takes to chip design compared with Intel or AMD. The new engine contains 128 processing tiles designed for 8-way FP16 operations and 32 tiles for 8-way FP32 / FP16 calculations, connected via a 600GB/s bus. If Intel or AMD ever built an AI acceleration unit, we would most likely see that functionality added per core. Intel’s AVX-512 instruction set is intended to increase AI calculation performance, for example, and it’s built into each x86 CPU core. If the microarchitecture offers one 512-bit vector unit per CPU core and you’ve got 12 cores, you’ve got 12 units. If you have 24 cores, you have 24 units.
IBM’s AI unit, in contrast, is not duplicated per core. It is a single on-chip engine that is equally addressable from any CPU core and serves multiple CPU cores at once, without data ever leaving the chip it’s being processed on. While this would also be true for AVX-512 instructions running on an Intel or future AMD CPU, many AI workloads are run on GPUs today, which means the data flows off the CPU by necessity. Mainframes are designed to be secure at every level in a way that consumer and server hardware isn’t, so keeping the data on-die is a valuable asset in this space. IBM is particularly playing up this capability as a value-add for customers who want to run background AI tasks without compromising CPU availability or responsiveness.
There are articles arguing in both directions on whether x86 servers can replace IBM mainframes or vice versa, and each side claims its preferred solution can run laps around the other. While either claim may be true for a given workload, it doesn’t seem to be the best way to frame the comparison. Mainframes and typical enterprise x86 systems are sold for different purposes. They run different operating systems, and after decades of differentiation, they focus on delivering top performance in specific metrics. If you don’t need the ability to hot-swap a CPU and RAM or 99.999999 percent uptime, mainframes may not be an appropriate solution. If you do need those things, a mainframe may be the smartest choice.
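To put the uptime figure in perspective, the Python sketch below converts an availability percentage into permitted downtime per year. It assumes a 365-day year and uses Decimal to avoid floating-point rounding; the function name is our own.

```python
from decimal import Decimal

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a non-leap year


def downtime_seconds_per_year(uptime_percent):
    """Seconds of downtime per year allowed by an uptime percentage (as a string)."""
    unavailable = (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)
    return unavailable * SECONDS_PER_YEAR


if __name__ == "__main__":
    for pct in ("99.999", "99.999999"):
        print(pct, "percent uptime:", downtime_seconds_per_year(pct), "seconds/year")
```

Under these assumptions, 99.999999 percent uptime allows roughly a third of a second of downtime per year, compared with a little over five minutes for the "five nines" (99.999 percent) often quoted for high-availability servers.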
It’s always interesting to see what IBM is working on, even if it doesn’t directly affect the x86 market much. If nothing else, IBM’s System Z represents a road not taken in consumer computing history, and a type of CPU that has remained relevant in an x86-dominated world by being very good at what it does. IBM claims Telum delivers a 40 percent increase in per-socket performance, which likely reflects the shift from 14nm to 7nm as well as the improved system architecture.