TSMC Mulls On-Chip Water-Cooling for Future High-Performance Silicon
Every few years, a major microprocessor manufacturer or research institution dives into the world of radical CPU water-cooling. TSMC recently gave a presentation of its own on the topic, exploring three potential methods of cooling a chip with on-die water channels.
Companies and organizations keep returning to this idea because an integrated, affordable on-die water-cooling system would solve a number of problems in advanced chip manufacturing. AMD is working on a version of its Zen 3 CPU core with an additional 64MB of stacked L3 cache per chiplet, but the company had to carefully position the cache chips to avoid creating hot spots within the Ryzen die. Vertical die stacking is supposed to drive semiconductor density improvements throughout the 2020s, continuing the overall trend we shorthand as “Moore’s Law.” Incidentally, this framing lets semiconductor foundries talk up the idea of extending Moore’s Law, despite the fact that the titular law says nothing about 3D chip stacking, as such.
The industry’s ability to stack chips on top of each other, however, is directly proportional to its ability to keep the silicon stack from roasting in its own heat. Nvidia’s A100 accelerator can be specced up to a 500W TDP, while Intel’s Ponte Vecchio has a 600W TDP. Manufacturers and designers continue to push the envelope, and some types of on-package (or on-die) water cooling would need to be at least partially integrated by the manufacturer. The specifics here depend on the type of solution considered.
TSMC tested three different methods of creating fluid channels and three different methods of mating the cooling solution to a Thermal Test Vehicle (TTV). The three channel geometries were square pillars, trenches, and a flat plane. The three cooler designs were direct water cooling, a cooler with a silicon-oxide thermal interface material (TIM), and a cooler with a liquid-metal TIM.
In the direct-water-cooling design, water channels were etched directly into a silicon layer on top of the CPU. In the second design, channels were etched into a separate silicon layer, with a silicon-oxide thermal interface material between the microfluidic structure and the actual silicon of the TTV. In the third, the silicon-oxide TIM was replaced with a liquid-metal TIM. TSMC’s data showed that the square-pillar channel geometry outperformed the other two, so we’ll focus on the figures reported for that design:
According to TSMC’s test data, its cooler design was able to dissipate up to 2.6kW of heat at a 5.8L/minute flow rate and a temperature delta of 63C. Direct water cooling performed the best, followed by the silicon-oxide TIM. Even the liquid-metal TIM, however, was capable of dissipating 1.8kW of heat. That’s far beyond anything available today, though obviously this is a thermal test vehicle and proof of concept, not a final product.
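A quick back-of-the-envelope check on those numbers, using the standard relation Q = ṁ·c·ΔT and textbook properties for liquid water (my calculation, not TSMC's), shows that the coolant itself would only warm by around 6C while carrying away 2.6kW at that flow rate. The reported 63C delta, then, presumably describes the chip-to-coolant temperature difference rather than how much the water heats up:

```python
# Sanity check on the reported figures: how much does the water itself
# warm up while carrying away 2.6 kW at 5.8 L/min?
# (Power and flow rate are from TSMC's reported data; water properties
# are standard approximate values.)

SPECIFIC_HEAT_WATER = 4186.0  # J/(kg*K), liquid water near room temp
DENSITY_WATER = 1.0           # kg/L, approximate

power_w = 2600.0              # heat dissipated, W
flow_l_per_min = 5.8          # reported coolant flow rate

mass_flow_kg_s = flow_l_per_min * DENSITY_WATER / 60.0   # ~0.097 kg/s
coolant_delta_c = power_w / (mass_flow_kg_s * SPECIFIC_HEAT_WATER)

print(f"coolant temperature rise: {coolant_delta_c:.1f} C")  # ~6.4 C
```

In other words, the flow rate is more than sufficient to haul the heat away; the hard part is getting 2.6kW out of a few hundred square millimeters of silicon and into the water in the first place.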
It’s interesting to think of the kind of products that might require that sort of cooling in the future. There is little chance that enthusiast computing would ever reach such lofty heights. A 15-amp circuit at 120 volts can provide a nominal 1800 watts. Enthusiast GPUs may continue to hit higher TDPs — I can’t predict what AMD and Nvidia will do on that front — but we’re a long way from 1kW GPUs, to say nothing of 2.6kW.
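To put the consumer-side ceiling in concrete terms, here's the arithmetic behind that 1800-watt figure, plus the common North American practice (my addition, not from the article) of derating a circuit to 80 percent of nominal for sustained loads:

```python
# Rough power budget for a standard North American household circuit.
# The 80% continuous-load derating is common electrical practice for
# sustained draws; the nominal figure matches the one cited above.

volts = 120.0
amps = 15.0

nominal_w = volts * amps          # 1800 W nominal circuit capacity
continuous_w = nominal_w * 0.80   # 1440 W for a sustained load

print(f"nominal: {nominal_w:.0f} W, continuous: {continuous_w:.0f} W")
```

A sustained 2.6kW load wouldn't just strain such a circuit, it would trip the breaker outright, which is one more reason this class of cooling points at data centers rather than desktops.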
The kind of cooling performance improvement that TSMC could potentially deliver would allow for higher clock speeds than anything we’ve seen to date, but not, I suspect, nearly as much as we might want. Silicon simply does not scale well past 5GHz, and manufacturers might not spend much effort trying to push raw performance that way. What’s more interesting is the kind of density improvements this kind of cooling might enable. While cooling hardware requires its own infrastructure, a system capable of dissipating up to 2.6kW of heat could cool much more hardware in a much smaller space than current server deployments.
A cooling system that can dissipate 2.6kW of heat is overkill for any reasonable consumer product, but not necessarily for the server and data center systems of the future, especially if more and more processing moves to the cloud. Thus far, these sorts of cooling solutions haven’t been commercialized because manufacturers haven’t yet been forced to adopt such expensive measures to keep delivering performance improvements. Hardware TDPs certainly aren’t trending down, however, and the shift toward 3D chip stacking may require a radical rethink of existing cooling strategies in the long term.