Hot Chips 33: IBM's Telum architecture relies on 32 MB L2 cache – per core

IBM's new z-series chip “Telum” need not shy away from comparison with x86 CPUs: the refinements to the mainframe processor's design include 32 MB of L2 cache per core, and several of its other properties have no counterpart in the x86 world either. The first systems should be available at the beginning of 2022.

IBM z16 expected to rely on Telum

For decades, IBM has been one of the companies filing the most patent applications. The group has repeatedly ranked first at the end of the year, not only in its home market, the USA, but also globally. Many of these applications come from the CPU area, and IBM revealed some interesting details from this area at the Hot Chips 33 conference. The new chip architecture for the z-series is called “Telum”; processors based on it will probably succeed the current z15 as the z16.

Anyone who only skims the key architecture data on paper will at first find familiar territory. IBM's processors have always been large, for example, so 530 mm² is nothing unusual for a Telum chip. With Samsung's help, IBM fits 22.5 billion transistors into this area, using a 7 nm process that already employs EUV, at least for some of the 17 metal layers.

The fact that “only” eight cores fit on 530 mm², while AMD's Zen 3 packs eight cores including 32 MB of L3 cache into just 81 mm², is due to the design with its long pipeline, very large branch tables and, above all, its huge on-chip caches, some of which were still located off-chip in the predecessors. In addition to 256 KB of L1 cache per core, half for data and half for instructions, there is 32 MB of L2 cache, again per core. The 8-core design thus comes to 256 MB of L2 cache; a double ring bus with 320 GB/s of bandwidth connects the cores to the cache.
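To put the size comparison in numbers, here is a rough sketch that divides each die area by its core count. It uses only the figures quoted above; the variable and function names are illustrative. The comparison is deliberately crude: the 81 mm² Zen 3 die contains only cores and cache, while I/O sits on a separate die, so the gap is somewhat overstated.

```python
# Die figures as quoted in the article; all names here are illustrative.
TELUM_DIE_MM2 = 530.0
TELUM_CORES = 8
ZEN3_CCD_MM2 = 81.0   # compute die only, without the separate I/O die
ZEN3_CORES = 8


def area_per_core(die_mm2: float, cores: int) -> float:
    """Average die area per core, including that core's share of cache."""
    return die_mm2 / cores


telum_area = area_per_core(TELUM_DIE_MM2, TELUM_CORES)
zen3_area = area_per_core(ZEN3_CCD_MM2, ZEN3_CORES)
ratio = telum_area / zen3_area

print(f"Telum: {telum_area:.2f} mm² per core")
print(f"Zen 3: {zen3_area:.2f} mm² per core")
print(f"Telum uses about {ratio:.1f} times the area per core")
```

By this simple measure a Telum core, together with its share of cache and interconnect, occupies roughly six and a half times the area of a Zen 3 core.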

The huge L2 cache is not only an L2 cache but also a virtual L3 cache, an approach that has not been seen before. As a result, software tailored to the processors sees an L3 cache that appears to be physically present and shared across all cores. The scheme continues with a fourth cache level and scales from the dual-chip module (a z16 socket will probably hold two Telum chips) via the 4-socket drawer up to the complete system with four drawers and a total of 32 Telum chips (four drawers, each with four sockets, each with two chips), with 8 GB of L2 cache in total and the virtual cache levels built on top of it.
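The capacity figures follow directly from the topology described above. A minimal sketch of that arithmetic, using only the numbers from the article (the constant and function names are illustrative); it sums physical L2 capacity at each level and makes no claim about how IBM sizes the virtual cache levels:

```python
# Topology figures as given in the article; all names here are illustrative.
L2_PER_CORE_MB = 32
CORES_PER_CHIP = 8
CHIPS_PER_SOCKET = 2      # dual-chip module
SOCKETS_PER_DRAWER = 4
DRAWERS_PER_SYSTEM = 4


def l2_capacity_mb(chips: int) -> int:
    """Total physical L2 capacity in MB across the given number of chips."""
    return chips * CORES_PER_CHIP * L2_PER_CORE_MB


chips_per_drawer = CHIPS_PER_SOCKET * SOCKETS_PER_DRAWER
chips_per_system = chips_per_drawer * DRAWERS_PER_SYSTEM

chip_mb = l2_capacity_mb(1)
socket_mb = l2_capacity_mb(CHIPS_PER_SOCKET)
drawer_mb = l2_capacity_mb(chips_per_drawer)
system_mb = l2_capacity_mb(chips_per_system)

for name, mb in [("chip", chip_mb), ("socket", socket_mb),
                 ("drawer", drawer_mb), ("system", system_mb)]:
    print(f"{name:>6}: {mb} MB")
```

With 32 chips in the full system, the sum comes to 8192 MB, which matches the 8 GB quoted.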

With the new architecture, IBM is also throwing old ballast overboard; Telum really is a fresh start. The technologies being retired include the previous interconnect and fabric. SMT8 (as in the Power processors) and SMT4 are not offered; the new models rely on SMT2.

40 percent more performance per socket

At clock rates of over 5 GHz (the predecessor ran at 5.2 GHz), IBM promises 40 percent more performance per socket. Because the figure refers to the socket, however, the gain per core is far more modest: the predecessor z15 had twelve cores per socket, whereas there are now 16 in a dual-chip design. That leaves a performance increase of only around 7 percent per core.
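The per-core estimate can be retraced by dividing the socket-level gain by the growth in core count. The sketch below assumes that socket performance scales linearly with the number of cores, which is a simplification of this example and not a statement from IBM; the function name is illustrative.

```python
def per_core_gain(socket_gain: float, old_cores: int, new_cores: int) -> float:
    """Per-core gain implied by a per-socket gain, assuming linear scaling with cores."""
    return (1.0 + socket_gain) * old_cores / new_cores - 1.0


# Figures from the article: +40 percent per socket, 12 cores (z15) versus 16 cores (Telum).
gain = per_core_gain(0.40, old_cores=12, new_cores=16)
print(f"Implied per-core gain: {gain:.1%}")
```

With exactly 40 percent and this linear model, the result is about 5 percent, slightly below the roughly 7 percent quoted; the quoted figure may rest on more precise numbers than the rounded ones given here.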

IBM Telum alias z16 (Image: IBM)

To achieve even larger gains in selected applications, an AI accelerator is also included. It is available as a co-processor to all CPU cores in a Telum chip and works with software adapted to it. IBM has invested a lot of research and development here and had already revealed the first details in February of this year. The accelerator supports FP16 and FP32 operations and offers very low latency which, depending on the application, is said to remain almost unchanged across many cores and chips, so that performance should scale almost ideally.

IBM names real-time credit card transaction processing as an ideal application for the Telum architecture; the tuning for the architecture's AI capabilities was carried out in cooperation with a bank. With hundreds of thousands of transactions per second, latencies of 1.1 to 1.2 milliseconds should be achievable in real-world scenarios. This should make it possible to move from detecting fraud to preventing it.

The first systems with the new CPUs are to ship starting in the first half of the coming year. The new mainframes will then replace the z15 systems introduced around two years ago.