AMD has published a patent application on splitting the rendering load across multiple GPU chiplets, and it provides some interesting insights. To optimize shader utilization in games, a game scene is divided into individual blocks that are distributed to the chiplets, using a two-level binning scheme.
Table of contents
- A whole wave of new patent applications
- The classic distribution of the load on shaders
- New challenges with multi-chiplet GPUs
- AMD's approach: two-level binning
- When will Radeon come with GPU chiplets?
A whole wave of new patent applications
With a veritable spate of patent applications published over the last week, AMD has potentially revealed a lot about upcoming GPU and CPU technologies. On June 30 alone, 54 patent applications were published. It remains to be seen which patents will ultimately be granted and which will actually end up in products. Regardless, the applications provide interesting insights into the technological approaches AMD is pursuing.
Of particular interest is patent application US20220207827 for two-level binning of image data, which aims to better distribute the rendering load of a GPU across several chiplets. AMD had already filed the application at the end of December 2021.
The classic distribution of the load on shaders
Traditionally, the rasterization of image data on a GPU works relatively simply: Each shader unit (ALU) of the GPU can take on the same task, namely assigning a color to an individual pixel. To do this, the texture of the polygon located at the corresponding pixel's position in the game scene is mapped onto that pixel. Since the arithmetic task is in principle always the same and differs only in the texture data at different points in the scene, this way of working is called "Single Instruction – Multiple Data" (SIMD).
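The SIMD principle can be illustrated with a short Python sketch. All names here are illustrative, not from the patent, and a real GPU executes this in hardware across thousands of ALUs in lockstep:

```python
# Minimal SIMD illustration: ONE instruction (the shading function),
# applied to MANY data elements (pixels reading different texels).

def shade(texel, light=0.8):
    """The single instruction: modulate a texture color by a light factor."""
    r, g, b = texel
    return (int(r * light), int(g * light), int(b * light))

# Each pixel reads a different texel -- the "multiple data" part.
texels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128)]

# A SIMD unit applies the same operation to every element in lockstep;
# here map() stands in for that lockstep execution.
pixels = list(map(shade, texels))
print(pixels)  # -> [(204, 0, 0), (0, 204, 0), (0, 0, 204), (102, 102, 102)]
```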
In modern games, this calculation step called "shading" is no longer the only task of a GPU. Instead, many post-processing effects are now added as standard after the actual shading, contributing, for example, ambient occlusion, anti-aliasing and shadows. Raytracing, on the other hand, does not take place afterwards but in parallel to shading and represents a completely different method of calculation. There is more about this in the report How GPU rays are accelerated.
In games, the computing load on GPUs scales almost ideally across several thousand compute units – unlike on CPUs, where programs have to be written specifically to use more cores. This is made possible by the scheduler, which divides the work within the graphics card into smaller tasks that are processed by the compute units (CUs). This division is called binning. For this, the image to be rendered is divided into individual blocks with a certain number of pixels; each block is calculated by a sub-unit of the GPU and then synchronized and assembled. Pixels to be calculated are added to a block until the sub-unit of the graphics card is fully utilized. The processing power of the shaders, the memory bandwidth and the cache sizes are all taken into account in this process.
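Classic binning as described above can be sketched roughly in Python. The tile size and the round-robin assignment are simplifying assumptions for illustration; a real scheduler fills bins adaptively based on shader load, bandwidth and cache sizes:

```python
# Rough sketch of classic binning: split a frame into fixed-size tiles
# ("bins") and hand them out to compute units (CUs).
# Tile size and round-robin assignment are assumptions, not patent details.

def make_bins(width, height, tile):
    """Divide the frame into tiles; each tile is one bin of pixels."""
    bins = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Clamp edge tiles so they stay inside the frame.
            bins.append((x, y, min(tile, width - x), min(tile, height - y)))
    return bins

def assign_round_robin(bins, num_cus):
    """Distribute bins across compute units in round-robin order."""
    work = {cu: [] for cu in range(num_cus)}
    for i, b in enumerate(bins):
        work[i % num_cus].append(b)
    return work

bins = make_bins(1920, 1080, tile=256)
work = assign_round_robin(bins, num_cus=4)
print(len(bins))  # 8 columns x 5 rows = 40 tiles
```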
New challenges with multi-chiplet GPUs
As AMD points out in the patent text, the division and subsequent joining requires a very good data connection between the individual elements of a GPU. This is a hurdle for the chiplet strategy, since data connections outside of a die are slower and have higher latencies.
While the transition to chiplets was relatively easy with CPUs, because a task already split across several cores also runs well on chiplets, the same is not true for GPUs. In this respect, a GPU's scheduler today is where CPU software stood before the introduction of the first dual-core CPUs: a fixed split across several chiplets has not been possible so far.
AMD's approach: two-level binning
AMD wants to solve this problem by modifying the rasterization pipeline so that tasks can be split across multiple GPU chiplets. For this purpose, binning is extended and improved; AMD speaks of "two-level binning" or "hybrid binning".
Instead of dividing a game scene directly into blocks pixel by pixel, the division is carried out in two stages. First the geometry is processed, meaning that the 3D scene is converted into a two-dimensional image. This step, called vertex shading, is usually completed entirely before rasterization begins. With GPU chiplets, the vertex shading is only minimally prepared on the first GPU chiplet, and the game scene is then coarsely binned. This creates coarse blocks (coarse bins), each of which is processed by one GPU chiplet. Within these coarse bins, vertex shading is completed, after which the traditional steps such as rasterization and post-processing take place.
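The two-level idea, coarse bins handed to chiplets and fine bins processed within each coarse bin, can be sketched as follows. The strip-shaped coarse bins and the fine tile size are assumptions for illustration, not values from the patent:

```python
# Sketch of two-level ("hybrid") binning: the primary chiplet performs a
# minimal geometry pass and divides the frame into COARSE bins, one per
# chiplet; each chiplet then subdivides its coarse bin into FINE bins
# and rasterizes them. Bin shapes and sizes are illustrative assumptions.

def coarse_bins(width, height, num_chiplets):
    """Primary chiplet: split the frame into vertical coarse strips."""
    strip = width // num_chiplets
    return [(c * strip, 0, strip, height) for c in range(num_chiplets)]

def fine_bins(coarse, tile=64):
    """On each chiplet: subdivide its coarse bin into fine tiles."""
    x0, y0, w, h = coarse
    return [(x0 + x, y0 + y, min(tile, w - x), min(tile, h - y))
            for y in range(0, h, tile)
            for x in range(0, w, tile)]

strips = coarse_bins(1920, 1080, num_chiplets=4)   # 4 strips of 480 px
per_chiplet = [fine_bins(s) for s in strips]        # fine pass per chiplet
print(len(strips), len(per_chiplet[0]))
```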
The chiplet that performs the division is always the same one and is called the "primary chiplet". It is directly connected to the rest of the PC, primarily the CPU. The other chiplets take a back seat and only process tasks assigned to them. To do this, they work asynchronously and can keep working even while the primary chiplet is busy analyzing the scene for the next frame (the "visibility phase"). In general, maximizing the utilization of the processing units appears to be an enormous challenge: while the primary chiplet is busy with the coarse binning of the game scene, the other units wait for data, and a chiplet that finishes its block earlier than the rest waits again. Both would be inefficient.
In order to optimize the utilization of the chiplets, the patent provides for a dynamic division of the work in addition to a static one (chiplet 1 always works on block 1, chiplet 2 on block 2, and so on). The workload of each block is estimated up front so that the blocks can be distributed in such a way that all chiplets finish at the same time. The two principles are illustrated in figures "Fig. 4" and "Fig. 5" of the patent.
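The difference between the static and the dynamic division can be contrasted with a small sketch. The per-bin cost estimates and the greedy least-loaded assignment are assumptions; the patent only describes estimating each block's workload so that all chiplets finish together:

```python
import heapq

# Static assignment: bin i always goes to chiplet i % N, regardless of cost.
def assign_static(costs, num_chiplets):
    loads = [0.0] * num_chiplets
    for i, c in enumerate(costs):
        loads[i % num_chiplets] += c
    return loads

# Dynamic assignment: give the most expensive remaining bin to the
# least-loaded chiplet so all chiplets finish at roughly the same time
# (a greedy heuristic assumed here, not a scheme spelled out in the patent).
def assign_dynamic(costs, num_chiplets):
    heap = [(0.0, c) for c in range(num_chiplets)]
    heapq.heapify(heap)
    for cost in sorted(costs, reverse=True):
        load, chiplet = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, chiplet))
    return sorted(load for load, _ in heap)

costs = [9, 1, 7, 2, 8, 3, 6, 4]       # estimated cost per coarse bin
print(assign_static(costs, 2))          # uneven: [30.0, 10.0]
print(assign_dynamic(costs, 2))         # balanced: [20.0, 20.0]
```

With the static split, one chiplet finishes long before the other and idles; the dynamic split evens out the finish times, which is exactly the goal stated in the patent.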
AMD's approach also takes "light" computing loads into account: old games, for example, may demand so little of the GPU that splitting the work across several chiplets would not pay off. In that case, rasterization is handled entirely by the first chiplet; there is no overhead, and the remaining chiplets can be put into a power-saving state.
With its patent, AMD also protects a driver solution by describing a process via a “non-transitory computer readable medium”. The driver should provide instructions that enable the division of work among the GPU chiplets as described.
When will Radeon come with GPU chiplets?
It is currently unclear when the approach AMD describes for optimized shader utilization on multi-chiplet GPUs in games will become relevant in practice. AMD has since confirmed that RDNA 3, the basis of the Radeon RX 7000 series due at the end of the year, will use a chiplet approach, but not that there will be several GPU chiplets. Most recently it was said that several memory controllers with Infinity Cache chiplets would be used, but only one GPU chiplet. Whether these rumors turn out to be true remains to be seen.
CDNA 2 already relies on two GCDs (graphics compute dies) for the HPC graphics cards of the Instinct MI200 series, and CDNA 3 will build on this. The chiplets are connected via the "AMD Infinity Interconnect".
- Radeon RX 7000 & MI300: RDNA 3 comes with chiplets, but only CDNA 3 stacks them
- AMD Radeon RX 7000: Navi 3X and RDNA 3 are planned hybrid in 5 and 6 nm
The editors would like to thank community member @ETI1120 for pointing out this article.