Dojo: Tesla shows its own D1 chip and ExaFLOPS supercomputer

For the training of artificial neural networks, Tesla still relies on accelerators from Nvidia. With Dojo and the specially developed “D1” processors, Tesla is now building its own supercomputer, which is supposed to deliver more performance with lower power consumption in less space. Dojo is expected to exceed 1 ExaFLOPS.

After the Full Self-Driving (FSD) computer in the car, specially developed hardware is now also coming to the Tesla data center. Nvidia loses out in both areas, because in the long term the Ampere accelerators are to be replaced by Tesla's own processors. For the training of artificial neural networks, Tesla currently relies on three clusters with a total of 11,544 Nvidia GPUs. A smaller cluster with 1,752 GPUs, 5 PB of NVMe storage and InfiniBand adapters for networking the components is used for automated labeling, while two larger clusters, one with 4,032 GPUs and 8 PB of NVMe storage and one with 5,760 GPUs and 12 PB of NVMe storage, are responsible for training with a total of 9,792 GPUs.

Previous structure with Nvidia GPUs (picture: Tesla)
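
The quoted cluster sizes add up as stated; a minimal sketch of the arithmetic (variable names are illustrative, not Tesla's):

```python
# Sanity check of the GPU cluster sizes quoted above (figures from the article).
auto_labeling = 1_752  # GPUs, 5 PB NVMe, used for automated labeling
training_a    = 4_032  # GPUs, 8 PB NVMe
training_b    = 5_760  # GPUs, 12 PB NVMe

print(training_a + training_b)                  # 9,792 GPUs dedicated to training
print(auto_labeling + training_a + training_b)  # 11,544 GPUs in total
```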

D1 chip has 50 billion transistors

With “Project Dojo”, Tesla wants to build its own supercomputer architecture. The centerpiece is the specially developed D1 chip with 50 billion transistors, manufactured on a 7 nm process, on an area of 645 mm². The processor provides a computing power of 362 TFLOPS for BF16 and CFP8 (Configurable Floating Point 8) and 22.6 TFLOPS for FP32. Tesla specifies the TDP of the chip as 400 watts.

A D1 consists of 354 training nodes, each of which houses a 64-bit superscalar CPU with four cores that is specially designed for 8 × 8 matrix multiplication and the formats FP32, BFP16, CFP8, INT32, INT16 and INT8. The training nodes have a modular structure and, according to Tesla, can be linked in all directions via a “low latency switch fabric” with an on-chip bandwidth of 10 TB/s. Around the D1, Tesla places an I/O ring with 576 lanes of 112 Gbit/s each for an off-chip bandwidth of 4 TB/s per side.
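
Taken at face value, the quoted chip figures work out to roughly 1 TFLOPS of BF16/CFP8 compute per training node; a minimal sketch assuming only the per-chip numbers above:

```python
# Per-node throughput derived from the per-chip figures quoted above.
nodes_per_d1 = 354
bf16_tflops  = 362.0  # BF16/CFP8 per D1 chip
fp32_tflops  = 22.6   # FP32 per D1 chip

print(round(bf16_tflops / nodes_per_d1, 2))  # ~1.02 TFLOPS BF16/CFP8 per training node
print(round(bf16_tflops / fp32_tflops))      # BF16/CFP8 throughput is roughly 16x the FP32 rate
```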

Scalability without bottlenecks

The advantage of the high bandwidth is the potential for scaling without bottlenecks. Tesla can, for example, link 1,500 D1 chips, and thus 531,000 training nodes, with one another without restrictions. On two sides of such a D1 configuration sit “Dojo Interface Processors”, which Tesla did not explain in more detail, but which connect to the D1 fabric on one side and via PCIe Gen4 to the hosts in the data center on the other.
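
The node count quoted for this configuration follows directly from the per-chip figure; a one-line check:

```python
# 1,500 D1 chips at 354 training nodes each
print(1_500 * 354)  # 531,000 training nodes
```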

Image gallery: the finished D1 chip, 10 TB/s on-chip bandwidth between nodes, 354 training nodes forming a compute array, the CPU of a training node in detail, scalability to 1,500 D1 chips and connectivity to the hosts in the data center (pictures: Tesla)

Training Tile with 28 liters of volume and 9 PetaFLOPS

The 1,500 D1 chips in total are not linked to one another directly, but are combined in 5 × 5 units on a so-called Training Tile. The Training Tile is then also the unit of measurement that Tesla uses for the entire Dojo supercomputer. 25 D1 dies are combined in a fan-out wafer process (presumably at TSMC) to form a Training Tile, which in turn has its own I/O ring with 9 TB/s in each of four directions and thus a bandwidth of 36 TB/s. Tesla calls the Training Tile the largest “organic multi-chip module” currently in the industry. For its design, Tesla had to develop completely new tools that did not exist before. A Training Tile of 25 D1 chips delivers 9 PetaFLOPS of BF16 or CFP8 compute.
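
The per-tile figures follow from the per-chip numbers above; a short sketch of the arithmetic (values as quoted, names illustrative):

```python
# Per-tile compute and I/O derived from the quoted figures.
d1_per_tile        = 25     # 5 x 5 dies on one fan-out wafer
bf16_tflops_per_d1 = 362.0
io_per_edge_tb_s   = 9      # TB/s on each of the four edges

print(d1_per_tile * bf16_tflops_per_d1 / 1_000)  # ~9.05 PFLOPS BF16/CFP8 per Training Tile
print(4 * io_per_edge_tb_s)                      # 36 TB/s aggregate tile I/O
```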

Cooling can dissipate 15 kilowatts

Power is supplied vertically via a self-developed voltage regulator module that is applied directly to the fan-out wafer. In addition to the electrical design with a 52-volt DC supply, Tesla also developed the entire mechanical structure, including cooling, in-house. The latter must dissipate at least 25 × 400 watts of waste heat for the D1 chips alone; including the other components, the solution is designed for 15 kilowatts. The finished module has a volume of less than one cubic foot, explains Tesla, which corresponds to around 28 liters. Last week, Tesla put the first functional Training Tile into operation for test purposes on a lab bench, running at a clock rate of 2 GHz with limited cooling.
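
The power and volume figures line up as follows; a brief sketch using the numbers quoted above (the cubic-foot conversion is a standard constant, not a Tesla figure):

```python
# Waste heat and volume of one Training Tile, from the quoted figures.
d1_per_tile      = 25
tdp_per_d1_w     = 400
cooling_design_w = 15_000

d1_heat_w = d1_per_tile * tdp_per_d1_w
print(d1_heat_w)                     # 10,000 W from the D1 dies alone
print(cooling_design_w - d1_heat_w)  # ~5 kW of headroom for VRMs and other components

liters_per_cubic_foot = 28.3168
print(liters_per_cubic_foot)         # "less than one cubic foot" corresponds to roughly 28 liters
```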

Image gallery: 25 D1 form a Training Tile, I/O of 36 TB/s per Training Tile, the voltage regulator module on the wafer, the mechanical structure of the Training Tile including cooling, the structure of the entire Training Tile and a Training Tile in test mode (pictures: Tesla)

The ExaPOD contains 120 tiles for 1.1 ExaFLOPS of BF16 performance

Tesla, in turn, combines the Training Tiles in trays of 2 × 3 tiles and two trays in a cabinet, so that more than 100 PetaFLOPS are available per server cabinet with a bidirectional bandwidth of 12 TB/s. The end product is the finished Dojo supercomputer called the “ExaPOD”, with 120 Training Tiles distributed over 10 cabinets and a total of 3,000 D1 chips, which in turn contain 1,062,000 training nodes. Tesla puts the total computing power at 1.1 ExaFLOPS for BF16/CFP8, so the computer does not win the worldwide exascale race, which is primarily about FP64 applications. Nevertheless, upon completion it is said to be the world's fastest AI training supercomputer, with four times the performance, 30 percent higher performance per watt and a five times smaller footprint, at the same cost as before with Nvidia.
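
The aggregate ExaPOD numbers can be recomputed from the per-tile figures; a sketch of the arithmetic (names are illustrative, values as quoted above):

```python
# ExaPOD totals recomputed from the per-tile figures.
tiles_per_tray    = 6    # 2 x 3 Training Tiles
trays_per_cabinet = 2
cabinets          = 10
d1_per_tile       = 25
nodes_per_d1      = 354
pflops_per_tile   = 9.0  # BF16/CFP8

tiles    = tiles_per_tray * trays_per_cabinet * cabinets  # 120 Training Tiles
d1_chips = tiles * d1_per_tile                            # 3,000 D1 chips
nodes    = d1_chips * nodes_per_d1                        # 1,062,000 training nodes

print(tiles, d1_chips, nodes)
print(tiles_per_tray * trays_per_cabinet * pflops_per_tile)  # 108 PFLOPS per cabinet ("more than 100")
print(tiles * pflops_per_tile / 1_000)                       # 1.08 ExaFLOPS, quoted as 1.1 ExaFLOPS
```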

Image gallery: 3 × 2 Training Tiles × 2 trays per cabinet, 120 Training Tiles connected to one another and the advantages of the ExaPOD (pictures: Tesla)