CPU Manufacturers Are Pushing the Boundaries of CMOS and Starting to Pay For It
CPUs almost never fail. Of all the components in a given PC, the CPU has historically been one of the least likely to fail. That has not yet changed, but there is troubling evidence that as process nodes shrink, reliability is becoming harder for AMD and Intel to guarantee.
Google researchers have published a paper describing what they call “mercurial” cores: cores that are subject to what Google calls “corrupt execution errors,” or CEEs. One critical property of CEEs is that they are silent.
We expect CPUs to fail in some noticeable way when they miscalculate a value, whether that results in an OS reboot, application crash, error message, or garbled output. That does not happen in these cases. CEEs are symptoms of what Google calls “silent data corruption”: data becoming corrupted when written, read, or at rest, without the corruption being immediately detected.
This work is still in the early stages, and the authors stress that there is much they do not know. What they have done is build a model of what a CEE failure generally looks like:
Failures appear to be non-deterministic and they appear at variable rates. Faulty cores fail repeatedly and intermittently. The problem tends to worsen over time. They write:
“We have some evidence that aging is a factor. In a multicore processor, typically just one core fails, often consistently. CEEs appear to be an industry-wide problem, not specific to any vendor, but the rate is not uniform across CPU products.”
Corruption rates are said to differ by “many orders of magnitude” across defective cores. Workload type, frequency, voltage, and temperature can all impact whether a core throws a CEE. The authors observed failure rates “on the order of a few mercurial cores per several thousand machines.” Keep in mind, a machine likely has somewhere between eight and 64 CPU cores, depending on how old it is.
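To put that rate in per-core terms, here is a rough back-of-the-envelope sketch. The figures of 3 mercurial cores and 3,000 machines are illustrative assumptions standing in for “a few” and “several thousand”; the paper does not give exact numbers, so treat the output as an order-of-magnitude guide only.

```python
# Illustrative assumptions only: the paper says "a few mercurial cores per
# several thousand machines" and gives no exact figures.
MERCURIAL_CORES = 3   # assumed stand-in for "a few"
MACHINES = 3_000      # assumed stand-in for "several thousand"


def cores_per_mercurial_core(cores_per_machine: int) -> float:
    """How many cores the fleet contains for each mercurial core."""
    total_cores = MACHINES * cores_per_machine
    return total_cores / MERCURIAL_CORES


# The article's range: eight to 64 cores per machine.
for cores in (8, 64):
    ratio = cores_per_mercurial_core(cores)
    print(f"{cores} cores/machine: about 1 mercurial core in {ratio:,.0f}")
```

Under these assumed inputs, the per-core rate lands somewhere between roughly one in several thousand and one in several tens of thousands, depending on how many cores each machine has.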
Google has evidence of mercurial cores violating lock semantics; corrupting data during load, store, and vector operations; corrupting data during storage garbage collection; flipping the same bit position in multiple strings; and corrupting the kernel state. There’s one observed problem worth quoting directly:
“A deterministic AES mis-computation, which was ‘self inverting’: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.”
The idea of generating ciphertext that can only be decrypted by one CPU core on Earth is fascinating from a security standpoint and terrifying from an operational one. Google does not disclose how it became aware of this problem, but an issue like this would certainly provoke a detailed analysis of the underlying cause.
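A toy model can show why a “self-inverting” fault is so hard to catch: a round-trip test on the affected core passes. The sketch below is not AES and does not reproduce the fault Google observed, whose mechanism is not described. It uses a hypothetical XOR stream cipher, and the `faulty` flag is an invented stand-in for a core that deterministically flips the same bit.

```python
import hashlib


def keystream(key: bytes, length: int, faulty: bool = False) -> bytes:
    """Derive a keystream from SHA-256 in counter mode (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    out = out[:length]
    if faulty:
        # Hypothetical deterministic fault: the same bit is wrong in every byte.
        out = bytearray(b ^ 0x04 for b in out)
    return bytes(out)


def xor_cipher(data: bytes, key: bytes, faulty: bool = False) -> bytes:
    """Encrypt or decrypt by XORing data with the keystream."""
    ks = keystream(key, len(data), faulty)
    return bytes(d ^ k for d, k in zip(data, ks))


message = b"payroll batch 42"
key = b"demo key"

ciphertext = xor_cipher(message, key, faulty=True)      # encrypted on the bad core
same_core = xor_cipher(ciphertext, key, faulty=True)    # decrypted on the bad core
other_core = xor_cipher(ciphertext, key, faulty=False)  # decrypted on a healthy core

print("same core round-trips:", same_core == message)
print("other core recovers message:", other_core == message)
```

Because the simulated fault is applied identically in both directions, it cancels out on the bad core and only surfaces when a different core tries to read the data. In this toy the damage is a single flipped bit per byte; with a real block cipher, a wrong intermediate value would scramble the output far more thoroughly.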
Google is still gathering data on this problem. The company does not believe it has necessarily detected every kind of CEE or identified the traits that make a particular chip more likely to develop one in the future. There are several references in the text to the idea that this problem can be triggered when application optimization causes new instructions to be used more frequently.
Google does not state whether optimizing for SIMD instruction sets like AVX-512 or AVX2 has been identified as a cause of these problems, or whether it was referring to other instructions. But it does confirm that code changes that emphasize different instructions can trigger a problem where none was previously known to exist.
We Were Warned This Would Happen
This is not a particularly surprising development. The more transistors packed onto a chip, the greater the chance some of those transistors are defective in some way. Modern chip architects duplicate some features within a design, under the assumption that some transistors won’t work properly. This consumes very little additional die space and increases yield.
The idea that CPUs would become less reliable as transistor density increased is a topic people like Bob Colwell, the lead designer on Intel’s 1995 Pentium Pro, were talking about 20 years ago. This is the first report I’ve ever seen in that time suggesting that CPUs from both AMD and Intel could now suffer from various silent errors that may otherwise go undetected in the moment and that the problem is industry-wide.
This incident has some similarities to the old Pentium FDIV bug, but only nominally. The FDIV flaw was silent in most cases, but the issue affected every Pentium Intel had built, and it affected them immediately. According to Google, some chips don’t show evidence of flaws until they’re at a certain age. Google is actively working on writing software to detect CEEs and it calls on both Intel and AMD to test CPUs more effectively before shipping them.
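For a sense of what such detection software might look like at its simplest, here is a minimal sketch, not Google's tool: it pins the process to each core in turn, runs the same deterministic computation, and flags any core whose answer differs from the majority. The function names `workload`, `screen_cores`, and `suspects` are hypothetical. Core pinning relies on `os.sched_setaffinity`, which Python exposes only on some Unix platforms such as Linux, so the sketch falls back to a single unpinned run elsewhere.

```python
import hashlib
import os


def workload() -> str:
    """Deterministic computation that should give the same result on every core."""
    digest = b"seed"
    for _ in range(2_000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def screen_cores() -> dict:
    """Run the workload pinned to each allowed core; return {core: result}."""
    if not hasattr(os, "sched_setaffinity"):
        # No pinning API on this platform: run once, unpinned.
        return {-1: workload()}
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for core in sorted(original):
            try:
                os.sched_setaffinity(0, {core})
            except OSError:
                continue  # pinning refused (e.g. restricted sandbox)
            results[core] = workload()
    finally:
        try:
            os.sched_setaffinity(0, original)
        except OSError:
            pass
    return results or {-1: workload()}


def suspects(results: dict) -> list:
    """Cores whose result differs from the most common answer."""
    values = list(results.values())
    majority = max(set(values), key=values.count)
    return [core for core, value in results.items() if value != majority]


results = screen_cores()
print("cores tested:", len(results))
print("suspect cores:", suspects(results))
```

A single clean pass proves very little. As described above, CEEs are intermittent and depend on workload, frequency, voltage, and temperature, so a realistic screen would need many varied workloads run repeatedly over a machine's lifetime.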