Intel Details Xe GPU Architecture, Enthusiast-Class HPG Hardware
For the last few years, Intel’s GPU marketing division has had a problem: How do you talk about a product you can’t talk about yet? For AMD and Nvidia, both of whom are established in the industry, this is less of an issue: Either company can talk about broad trends in gaming, or the evolution of specific features, or where the industry is headed in technical terms even if they aren’t ready to talk about the new features of an upcoming platform. Intel has been starting from scratch in a field where it’s been historically known for the GPUs it didn’t bring to market rather than the cards it shipped.
Intel, as of today, has sketched out four different microarchitectures for its upcoming Xe family: Xe-HPC, Xe-HP, Xe-HPG, and Xe-LP. We’ll focus on two of them here: Xe-LP and Xe-HPG. These are the architectures intended for low power / early developer work and for gaming, respectively. Xe-LP will debut in Tiger Lake as the underlying silicon behind the chip’s 96 EUs. It also powers the DG1 and SG1 that Intel is rolling out. DG1 is Intel’s discrete mobile GPU, intended to ship alongside Tiger Lake in laptops, while SG1 is its server counterpart.
SG1 is four DG1 chips combined into a single product that the company will sell as a replacement for its Xeon Visual Compute Accelerator. Those server-oriented cards were based on Intel’s previous integrated graphics and intended for video encoding workloads. Now that Intel has its own discrete GPU IP, it’ll cover this area as well.
In terms of GPU features, Xe-LP is a DirectX feature level 12_1 architecture, while Turing and RDNA2 both support feature level 12_2. Specifically, the GPU lacks support for mesh shaders, Tier 2 variable-rate shading, and sampler feedback. Ray tracing isn’t supported on these initial, lower-end chips either. That’s not particularly surprising: Nvidia isn’t supporting DXR ray tracing below the RTX 2060 right now, and AMD isn’t expected to launch a bunch of low-end RDNA2 cards to extend the capability to the bottom of the stack. Ray tracing will probably remain the province of mid-to-high-end cards through the next generation, at least. Until GPU manufacturers can guarantee an acceptable experience, they won’t want to push out the feature.
The Xe-LP GPUs in Tiger Lake will pack 96 EUs, 48 texture units, and 24 ROPs, implying an effective 768 GPU cores (eight ALU lanes per EU) and a 768:48:24 configuration. That’s very respectable in integrated terms, though GPU core counts alone aren’t enough to compare performance. Supported memory is up to LPDDR4X-4266, with a 128-bit bus to the integrated memory controller. Clock speeds are above 1.6GHz, compared with 1.1GHz for Ice Lake (ICL offers a 512:32:16 configuration).
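The core counts and the peak memory bandwidth implied by these specs follow from simple arithmetic. Here is a minimal Python sketch, assuming eight FP32 lanes per EU (which is what the 768 figure implies) and treating LPDDR4X-4266 as 4,266 mega-transfers per second across the full 128-bit bus. The function names are illustrative, not anything Intel publishes.

```python
LANES_PER_EU = 8  # assumption: eight FP32 ALU lanes per EU


def alu_count(eus, lanes_per_eu=LANES_PER_EU):
    """Effective 'GPU cores': EUs multiplied by ALU lanes per EU."""
    return eus * lanes_per_eu


def peak_dram_bandwidth_gbs(mega_transfers_per_s, bus_width_bits):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


tiger_lake_alus = alu_count(96)  # 96 EUs in Tiger Lake
ice_lake_alus = alu_count(64)    # 64 EUs in Ice Lake
lpddr4x_bw = peak_dram_bandwidth_gbs(4266, 128)

print(tiger_lake_alus, ice_lake_alus, round(lpddr4x_bw, 1))
```

The bandwidth figure is a theoretical ceiling of roughly 68GB/s, and on an integrated part it is shared with the CPU cores, so the GPU will see less in practice.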
Everything about the Xe design in Tiger Lake is 1.5x larger than its Ice Lake counterpart except for the front-end, which still dispatches 1 primitive per clock. The 1.45x increase in clock speed will translate into a hefty improvement here, and presumably, that’s enough to keep the GPU from becoming geometry-bound. This new 96 EU GPU is now one “slice” in Intel parlance. In Ice Lake, one GPU “slice” contained eight subslices, each with eight EUs. That’s 64 ALU cores per subslice. In Tiger Lake, a slice is now composed of six subslices, with 16 EUs per subslice. The number of ALU cores per subslice is now 128, up from 64. A number of changes have accompanied this shift: texture sampler throughput has doubled (eight texels per clock, up from four), and there’s now a 64KB L1 texture and data cache attached to each subslice.
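The slice reorganization and the scaling factors quoted above can be checked with a few lines of Python. This is a rough sketch: the eight-lanes-per-EU value is an assumption consistent with the ALU counts in the text, and the clocks are the approximate 1.6GHz and 1.1GHz figures rather than exact shipping frequencies.

```python
LANES_PER_EU = 8  # assumption: eight ALU lanes per EU

# (subslices per slice, EUs per subslice), as described in the text
ICE_LAKE = (8, 8)
TIGER_LAKE = (6, 16)


def eus_per_slice(layout):
    subslices, eus_per_subslice = layout
    return subslices * eus_per_subslice


def alus_per_subslice(layout):
    _, eus_per_subslice = layout
    return eus_per_subslice * LANES_PER_EU


eu_ratio = eus_per_slice(TIGER_LAKE) / eus_per_slice(ICE_LAKE)
clock_ratio = 1.6 / 1.1  # approximate Tiger Lake vs. Ice Lake GPU clocks

# The front-end still issues 1 primitive per clock, so peak geometry
# throughput scales with clock speed only, not with EU count.
geometry_ratio = clock_ratio

print(eus_per_slice(ICE_LAKE), eus_per_slice(TIGER_LAKE))
print(alus_per_subslice(ICE_LAKE), alus_per_subslice(TIGER_LAKE))
print(eu_ratio, round(clock_ratio, 2))
```

Shader resources grow by 1.5x on width and a further 1.45x or so on clock, while geometry throughput gains only the clock factor, which is why the question of becoming geometry-bound comes up at all.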
Additionally, there are changes to how Intel processes data within the subslices. In the past, each EU was a standalone block consisting of a thread control unit and two 4-wide SIMD blocks. One of these blocks was used for floating-point and integer functions, while the other handled floating-point and “special” math instructions. Wavefronts were dispatched in groups of eight threads, and each EU acted independently of the others. In Xe-LP, this changes. Now, two EUs share a single thread control unit. Instead of two four-wide SIMD blocks, Intel is using a single SIMD8 block alongside a SIMD2 block that handles special math functions. This is a different organization of the resources that were available in Gen11 (ICL), and the exact implications for performance are unclear. The point of this shift, according to Intel, is to prevent special math instructions from blocking the execution of floating-point code. The new thread control units are also capable of issuing instructions to the SIMD8 and SIMD2 units simultaneously. Intel will also shift scoreboarding control from hardware units over to software.
A number of these moves are similar to changes Nvidia made years ago with Kepler or that AMD made more recently with RDNA. Intel is changing its overall EU organization and shifting different workloads to different areas of the combined CPU + GPU in order to improve execution efficiency and performance. Floating-point instruction rates are unchanged: 16 FP32 ops/clock and 32 FP16 ops/clock are the same as Gen11. Integer throughput is where the gains are. INT32 throughput has doubled from 4 ops/clock to 8, INT16 throughput has quadrupled from 8 ops/clock to 32, and the Xe-LP architecture is capable of 64 ops/clock in INT8, a format Gen11 didn’t support at all. (All ops/clock metrics are per EU.)
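To put the per-EU rates in context, the sketch below tabulates the figures quoted above and scales FP32 to a whole GPU. The EU counts (64 and 96) come from the configurations described earlier. The 1.1GHz and 1.6GHz clocks are the approximate values given in this article, so the resulting peaks are estimates rather than official specifications.

```python
# ops/clock per EU, from the figures in the text; None = format not supported
GEN11 = {"FP32": 16, "FP16": 32, "INT32": 4, "INT16": 8, "INT8": None}
XE_LP = {"FP32": 16, "FP16": 32, "INT32": 8, "INT16": 32, "INT8": 64}


def speedup(fmt):
    """Per-EU, per-clock ratio of Xe-LP to Gen11, or None if Gen11 lacks the format."""
    old, new = GEN11[fmt], XE_LP[fmt]
    return None if old is None else new / old


def peak_gops(ops_per_clock_per_eu, eus, clock_ghz):
    """Peak throughput in giga-ops per second for the whole GPU."""
    return ops_per_clock_per_eu * eus * clock_ghz


for fmt in XE_LP:
    print(fmt, speedup(fmt))

# Assumed clocks: about 1.6GHz for Tiger Lake, about 1.1GHz for Ice Lake
xe_lp_fp32 = peak_gops(XE_LP["FP32"], 96, 1.6)
gen11_fp32 = peak_gops(GEN11["FP32"], 64, 1.1)
print(round(xe_lp_fp32), round(gen11_fp32))
```

Under those assumptions, the FP32 peak rises from roughly 1.1 to roughly 2.5 trillion ops per second, and all of that gain comes from the extra EUs and the higher clock rather than from any per-EU change.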
As for memory organization changes, the on-die, GPU-specific L3 cache can now be as large as 16MB, though Tiger Lake’s is only 3.8MB. The larger cache is likely reserved for parts like DG1 or even SG1, if Intel decides to configure those chips differently. Total bandwidth to the L3 has been doubled, giving a 1.6GHz IGP over 200GB/s of internal memory bandwidth. The L3 may not be large, but it’ll be extremely high-bandwidth. Ring bus bandwidth has also been doubled to account for these changes.
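The “over 200GB/s at 1.6GHz” figure implies a minimum L3 interface width, which a short calculation can back out. The 128 bytes/clock value below is an assumption chosen because it is the nearest power of two above that minimum. Intel’s actual interface width is not stated in this article.

```python
def bytes_per_clock_needed(bandwidth_gbs, clock_ghz):
    """Minimum bytes moved per clock to reach a given bandwidth."""
    return bandwidth_gbs / clock_ghz


def bandwidth_gbs(bytes_per_clock, clock_ghz):
    """Bandwidth in GB/s for a given interface width and clock."""
    return bytes_per_clock * clock_ghz


min_width = bytes_per_clock_needed(200, 1.6)

ASSUMED_WIDTH = 128  # hypothetical bytes/clock, not confirmed by Intel
assumed_bw = bandwidth_gbs(ASSUMED_WIDTH, 1.6)

print(round(min_width), round(assumed_bw, 1))
```

A 128-byte-per-clock interface at 1.6GHz would land just above 200GB/s, which is consistent with the wording of the claim, but other widths and clocks could produce the same headline number.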
There are a lot of things Intel hasn’t shared about DG1 yet, including whether the GPU will be able to work in concert with Tiger Lake’s integrated graphics for multi-GPU rendering or what kind of performance we should expect versus AMD’s Ryzen 4000 Mobile parts. Overall, Tiger Lake looks like a significant competitor for AMD on both the CPU and GPU side of the equation.
What About Enthusiast Gaming?
That brings us to Xe-HPG, the enthusiast-gamer part Intel is planning to launch. This GPU will support ray tracing and likely the full DirectX 12 Ultimate standard as well. Unlike Xe-HP (data center features, FP64) or Xe-HPC (Ponte Vecchio, built from a four-chip stack), Xe-HPG is entirely gamer-focused. Intel claims that it’ll pull clock speed improvements over from Xe-HPC and emphasize raw scalability for Xe-HPG. It isn’t clear how different any of these chips will be at the architectural level; Nvidia and AMD typically limit their consumer cards in a few ways compared with the data center variants, while much of the underlying architecture remains unchanged. Intel has confirmed, however, that Xe-HPG will use GDDR6 rather than HBM, with the memory controller licensed from a non-Intel source. Intel hasn’t confirmed its foundry partner for Xe-HPG, but it won’t be built at Intel, leaving TSMC or Samsung as options.
The Xe-HPG is expected to launch in 2021.