For Next-Generation CPUs, Not Moving Data Is the New 1GHz
One way the computer industry has slowly evolved in the past decade is a shift in where engineers are hunting for further performance and efficiency gains. The old focus on clock speed ended in 2004, when Intel canceled Tejas, Jayhawk, and the 4GHz Pentium 4. One could call 2004-2011 the first multi-core era. The median high-end enthusiast CPU core count rose between 4x and 6x in seven years. From 2011-2017, Intel held core counts steady and focused on improving power consumption at lower TDPs. In 2017, AMD effectively kicked the core count war off again.
The growth in six-core and eight-core CPUs has really been something to see. In July 2011, according to the Steam Hardware Survey, 43.3 percent of gamers had a quad-core CPU, 1.36 percent had a six-core CPU, and just 0.08 percent had an eight-core chip. In July 2017, 51.99 percent of gamers had quad-core CPUs, 1.48 percent had a six-core chip, and 0.49 percent had an eight-core chip. Today, 31.11 percent of gamers have six-core chips and 13.6 percent have an eight-core. That’s roughly a 21x and a 28x rise in popularity, respectively, over just four years, and we have the renewed competition between Intel and AMD to thank for it.
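The growth multiples follow directly from the survey shares. Here is a quick check in Python that uses only the percentages quoted in this article; the variable and function names are illustrative.

```python
# Steam Hardware Survey shares (percent of gamers) as quoted in this article.
july_2017 = {"six_core": 1.48, "eight_core": 0.49}
today = {"six_core": 31.11, "eight_core": 13.6}


def growth_multiple(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


# One multiple per core count, comparing July 2017 with today.
multiples = {
    name: growth_multiple(july_2017[name], today[name]) for name in july_2017
}

for name, multiple in multiples.items():
    print(f"{name}: {multiple:.1f}x")
```

This works out to about 21.0x for six-core chips and about 27.8x for eight-core chips.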
Unfortunately, ramping core counts also has its limits. There is a diminishing marginal return from adding new CPU cores in most cases, and the market is still digesting the core count increases of 2017-2019. Lithography no longer yields the performance improvements that it once did; the total cumulative improvement in performance and power consumption that TSMC is projecting from 7nm -> 5nm -> 3nm is approximately equal to the improvements it obtained from shrinking from 16nm -> 7nm. Intel and other semiconductor firms continue to research material engineering improvements, packaging improvements, and new interconnect methods that are more power-efficient or performant than what we have today, but one of the most effective ways to improve power efficiency in a modern system, it turns out, is to stop moving data all over the place.
After decades of power optimization and ever-improving lithography, performing work on one bit of data now consumes less electricity than retrieving that bit from memory to be worked upon. According to data published by Rambus, 62.6 percent of power is spent on data movement and 37.4 percent on compute, so computation accounts for little more than a third of the total.
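To get a feel for what that split implies, here is a minimal back-of-the-envelope sketch. It takes the Rambus shares at face value and assumes, as a simplification that does not come from Rambus, that avoiding data movement leaves the compute cost unchanged. The 50 percent figure is purely hypothetical.

```python
# Shares of total power as quoted from Rambus in this article.
MOVEMENT_SHARE = 62.6
COMPUTE_SHARE = 37.4


def remaining_power(movement_share: float, compute_share: float,
                    movement_avoided: float) -> float:
    """Fraction of the original power still needed after avoiding part of
    the data movement. Assumes compute cost stays the same (a simplification).
    """
    total = movement_share + compute_share
    kept_movement = movement_share * (1.0 - movement_avoided)
    return (kept_movement + compute_share) / total


# How much more power goes to moving data than to computing on it.
movement_to_compute = MOVEMENT_SHARE / COMPUTE_SHARE

# Hypothetical scenario: half of all data movement is avoided.
half_avoided = remaining_power(MOVEMENT_SHARE, COMPUTE_SHARE, 0.5)

print(f"movement costs {movement_to_compute:.2f}x compute")
print(f"power remaining if half the movement is avoided: {half_avoided:.1%}")
```

Under these assumptions, data movement costs about 1.67 times as much as compute, and eliminating half of it would cut total power by roughly 31 percent.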
One way to solve this problem is with computational storage. The idea is straightforward: Instead of treating the CPU as, well, a central processing unit, computational storage embeds processing capability directly into the storage device itself. This is more plausible with today’s solid-state drives than with older hard drives; NAND flash controllers already do a fair degree of data management under the hood. A recent paper examined the potential power savings of running applications in place, rather than in the conventional way, by building a fully functional prototype. The system exhibited a 2.2x increase in performance and a 54 percent reduction in energy consumption “for running multi-dimensional FFT benchmarks on different datasets.”
The idea of processing data in place has applications outside storage; Samsung announced a processor-in-memory stack earlier this year that combines HBM2 with an array of FP16 registers that can perform computations directly rather than on the CPU. In that case, Samsung claimed a 2x performance improvement with a 70 percent power reduction.
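Performance and power claims like these are easier to compare once they are reduced to energy per task. The sketch below does that arithmetic under a loud assumption: that each vendor's speedup and power or energy figure apply to the same workload at the same time, which the sources quoted here do not confirm. The function names are illustrative.

```python
def energy_per_task(speedup: float, power_reduction: float) -> float:
    """Relative energy per task (baseline = 1.0) when a task runs `speedup`
    times faster at a power draw reduced by `power_reduction` (0.0 to 1.0).
    Energy is power multiplied by time.
    """
    relative_power = 1.0 - power_reduction
    relative_time = 1.0 / speedup
    return relative_power * relative_time


def perf_per_watt_gain(speedup: float, power_reduction: float) -> float:
    """Improvement in performance per watt relative to the baseline."""
    return speedup / (1.0 - power_reduction)


def implied_relative_power(speedup: float, energy_reduction: float) -> float:
    """Average power relative to baseline when the *energy* reduction is
    what was reported. Power is energy divided by time.
    """
    relative_energy = 1.0 - energy_reduction
    relative_time = 1.0 / speedup
    return relative_energy / relative_time


# Samsung's processor-in-memory claim: 2x performance, 70 percent less power.
pim_energy = energy_per_task(2.0, 0.70)
pim_ppw = perf_per_watt_gain(2.0, 0.70)

# Computational storage prototype: 2.2x performance, 54 percent less energy.
storage_power = implied_relative_power(2.2, 0.54)

print(f"PIM energy per task: {pim_energy:.2f}x baseline")
print(f"PIM performance per watt: {pim_ppw:.2f}x baseline")
print(f"Storage prototype average power: {storage_power:.2f}x baseline")
```

If the assumption holds, the Samsung figures imply about 0.15x the energy per task, or roughly 6.7x the performance per watt. The storage prototype's average power draw comes out at about 1.01x the baseline, essentially unchanged, so its savings would come from finishing sooner.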
These technologies are in their infancy — we’re most likely years away from mainstream applications — but they illustrate how engineers can continue to improve system performance even as lithography scaling falters. Taking full advantage of these ideas will require rethinking the relationship between the various components inside a computer or within an SoC.
From Central Processing Unit to “Accelerator of Last Resort”
I’m willing to bet that all of us, at some point, got handed a block diagram of a computer with the CPU sitting at its center.
Computers are organized around the idea that many, if not most, general computation tasks happen on the CPU, and that the CPU serves as a sort of arbiter regarding the flow of data through the system. It was not always so. In the late 1990s, anyone with a high-performance storage array used a RAID card to handle it. Beginning in the early 2000s, CPUs became powerful enough for motherboard chipset manufacturers like VIA to integrate support for software RAID arrays into their southbridges. Other companies like AMD, Intel, Nvidia, and SiS did the same, with one notable difference: VIA was the only company willing to ship southbridges that caused unrecoverable storage errors if the end-user was also running a SoundBlaster Live.
As CPUs became more powerful, they absorbed more functions from the microcontrollers and specialized hardware chips that had once performed them. It was cheaper for many companies to allow the CPU to handle various tasks than to invest in continuing to build specialized silicon that could match or exceed Intel.
After several decades of optimization and continued manufacturing and material engineering improvements, the parameters of the problem have changed. Computers operate on huge data sets now, and hauling petabytes of information back and forth across the memory bus at the enterprise level is a tremendous energy burn.
Creating a more efficient computing model that relies less on moving data in and out of the CPU requires rethinking what data the CPU does and doesn’t process in the first place. It also requires a fundamental rethink of how applications are built. SemiEngineering recently published a pair of excellent stories on decreasing the cost of data movement and the idea of computational storage, and they spoke to Chris Tobias, senior director of Optane solutions and strategy at Intel. Some of Intel’s Optane products, like its Optane DC Persistent Memory, can be used as an enormous bank of non-volatile, DRAM-like memory, one much larger than any typical DRAM pool would be, but taking advantage of the option requires modifying existing software.
“The only way that you can take advantage of this is to completely restructure your software,” Tobias told SemiEngineering. “What you’re doing now is you’re saying is we have this piece of [an application] that the computational storage does a good job of. We’re going take that responsibility away from the server software, and then farm out multiple copies of this one piece to the SSD, and this is where they’re all going to execute that piece. Somebody’s got to do the chopping up of the server software into the piece that goes into the SSDs.”
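The restructuring Tobias describes can be illustrated with a toy simulation. The sketch below is not a real computational storage API: `SimulatedDrive`, `run_on_device`, the record size, and the search predicate are all hypothetical stand-ins. It compares the bytes that cross the bus when the host filters everything itself against the bytes that cross when the same filter is farmed out to each drive.

```python
from dataclasses import dataclass
from typing import Callable, List

RECORD_SIZE = 64  # bytes per record; an arbitrary illustrative value


@dataclass
class SimulatedDrive:
    """Toy stand-in for an SSD. This is not a real device interface."""
    records: List[int]
    bytes_transferred: int = 0

    def read_all(self) -> List[int]:
        """Conventional path: every record crosses the bus to the host."""
        self.bytes_transferred += len(self.records) * RECORD_SIZE
        return list(self.records)

    def run_on_device(self, predicate: Callable[[int], bool]) -> List[int]:
        """Offloaded path: the filter runs on the drive and only the
        matching records cross the bus.
        """
        matches = [r for r in self.records if predicate(r)]
        self.bytes_transferred += len(matches) * RECORD_SIZE
        return matches


def wanted(value: int) -> bool:
    """The piece of the application that gets farmed out (hypothetical)."""
    return value % 100 == 0


def host_side_search(drives: List[SimulatedDrive]) -> List[int]:
    """Pull everything to the host, then filter on the CPU."""
    return [r for d in drives for r in d.read_all() if wanted(r)]


def offloaded_search(drives: List[SimulatedDrive]) -> List[int]:
    """Send one copy of the filter to each drive and collect the matches."""
    return [r for d in drives for r in d.run_on_device(wanted)]


def make_drives() -> List[SimulatedDrive]:
    """Four drives, each holding 10,000 consecutive integer records."""
    return [
        SimulatedDrive(list(range(i * 10_000, (i + 1) * 10_000)))
        for i in range(4)
    ]


host_drives = make_drives()
host_result = host_side_search(host_drives)
host_bytes = sum(d.bytes_transferred for d in host_drives)

offload_drives = make_drives()
offload_result = offloaded_search(offload_drives)
offload_bytes = sum(d.bytes_transferred for d in offload_drives)

print(f"host-side search moved {host_bytes} bytes")
print(f"offloaded search moved {offload_bytes} bytes")
```

Both paths return the same 400 matching records, but in this toy setup the offloaded version moves 25,600 bytes instead of 2,560,000. The real-world ratio depends entirely on how selective the offloaded piece is, and the hard part, as Tobias notes, is chopping existing server software into pieces that can run this way.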
These types of efficiency improvements would improve CPU responsiveness and performance by allowing the chip to spend more of its time performing useful work and less time attending to I/O requests that could be better handled elsewhere. One interesting thing we found out about Apple’s M1 and macOS a few months ago is that Apple has improved overall system responsiveness by preferentially scheduling background tasks on the CPU’s small Icestorm cores, leaving the Firestorm cores free for more important tasks. Users have reported that M1 Macs feel snappier to use than conventional Intel devices, even when benchmarks don’t confirm an actual speed increase. Nvidia’s Atom-based Ion netbook platform from 10 years ago is another historical example of how improving latency (display and UI latency, in that case) made a system feel much faster than it actually was.
Nothing that requires a wholesale re-imagining of the storage stack is going to hit consumer products any time soon, but the long-term potential for improvements is real. For most of the computer industry’s history, we’ve improved performance by increasing the amount of work a CPU performed per cycle. The challenge of computational storage and other methods of moving workloads off the CPU is to improve the CPU’s performance by giving it less work to do per cycle, allowing it to focus on other tasks.
Under this model, the CPU would be a bit more like an accelerator itself. Specifically, the CPU becomes the “accelerator” of last resort. When a workload is complex, serialized, or full of branchy, unpredictable code that makes it unsuitable to the GPU and/or whatever future AI hardware AMD and Intel might one day ship, it gets kicked over to the CPU, which specializes in exactly this kind of problem. Move storage searches and some computation into SSDs and RAM, and the CPU has that many more clock cycles to actually crunch data.
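As a rough illustration of that routing idea, here is a minimal sketch of a dispatcher in which the CPU is the fallback rather than the default. The workload traits, the engine names, and the routing rules are all hypothetical simplifications, not a description of any shipping scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """A job described by a few coarse, illustrative traits."""
    name: str
    data_parallel: bool = False  # same operation over many elements
    branchy: bool = False        # unpredictable control flow
    storage_scan: bool = False   # mostly searching or filtering stored data


def pick_engine(workload: Workload) -> str:
    """Route a workload to a hypothetical engine, falling back to the CPU."""
    if workload.storage_scan and not workload.branchy:
        return "computational SSD"
    if workload.data_parallel and not workload.branchy:
        return "GPU"
    # Complex, serialized, or branchy work lands here.
    return "CPU"


jobs = [
    Workload("log search", storage_scan=True),
    Workload("matrix multiply", data_parallel=True),
    Workload("game logic", branchy=True),
]

assignments = {job.name: pick_engine(job) for job in jobs}

for name, engine in assignments.items():
    print(f"{name} -> {engine}")
```

In this sketch the CPU only sees the work nothing else can handle well, which is the sense in which it becomes the accelerator of last resort.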
That’s what makes not moving data the new “1GHz” target. In the mid-2010s, it was the race to 0W that defined x86 power efficiency, and Intel and AMD both reaped significant rewards from reducing idle power. Over the next decade, we may see a new race begin, one that focuses on how much data the CPU can avoid processing, as opposed to emphasizing how much information it can hoover up.