TESLA BOT « OPTIMUS » AND PROJECT DOJO

Thibaut BONACORSI
5 min read · Aug 24, 2021


The Tesla Bot “Optimus”

INTRODUCTION

The Dojo chip is a technology initially designed by Tesla to improve the computing power behind their Autopilot AI system, which was available in their cars at the time of the announcement.

The DPU (Dojo Processing Unit) is described by Tesla's AI team as a « virtual device that can be sized according to the application needs ».

The initial goals of project Dojo are to:

  • Achieve better AI training performance, especially to train the neural networks behind Tesla Autopilot,
  • Enable larger and more complex neural network models,
  • Be power efficient and cost effective.

The challenge is to scale computing power with a distributed compute architecture by maximizing bandwidth and minimizing latency, exploiting temporal and spatial locality.

Dojo project

1. SMALLEST UNIT OF SCALE « Training Node » 🎛️

This is a 64-bit superscalar CPU optimized around matrix multiply units and vector SIMD. It supports 32-bit floating point (FP32), bfloat16 (BF16) and a new configurable 8-bit floating-point format (CFP8). All of this is backed by 1.25 MB of fast, ECC-protected SRAM and the low-latency, high-bandwidth fabric that Tesla designed.

  • 1024 GFLOPS (BF16/CFP8), i.e. ~1 TFLOPS
  • 64 GFLOPS (FP32)
  • 512 GB/s in each cardinal direction of the network
  • 4-way multithreaded, allowing compute and data transfer to happen simultaneously
  • Custom ISA (Instruction Set Architecture) optimized for machine-learning kernels
Training node architecture
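To make the number formats concrete, here is a minimal Python sketch of how BF16 relates to FP32: bfloat16 keeps FP32's sign bit and all 8 exponent bits but only the top 7 mantissa bits, trading precision for the same dynamic range at half the memory. This simple truncation is an illustration only; real hardware usually rounds to nearest, and Tesla's CFP8 format is not reproduced here.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate an FP32 value to bfloat16: keep the sign bit,
    the 8 exponent bits, and the top 7 mantissa bits."""
    fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return fp32_bits >> 16  # drop the low 16 mantissa bits

def bf16_bits_to_fp32(bits: int) -> float:
    """Re-expand a bfloat16 bit pattern back to FP32 (low bits = 0)."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

x = 3.141592653589793
approx = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
print(approx)  # 3.140625: only ~3 decimal digits survive the 7-bit mantissa
```

The shared exponent width is why BF16 can usually stand in for FP32 during training without the loss-scaling tricks that FP16 needs.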

2. COMPUTE ARRAY « D1 » 🔳

  • A network of 354 training nodes
  • Delivering 362 TFLOPS of machine-learning compute with BF16 and CFP8
  • And 22.6 TFLOPS with FP32
  • 10 TB/s per direction of on-chip bisection bandwidth
  • 576 high-speed, low-power SerDes lanes at 112 Gbps
  • 4 TB/s per edge of off-chip bandwidth (I/O)
  • 400 W TDP (Thermal Design Power)
  • 645 mm² die area
  • 7 nm process technology
  • 50 billion transistors
  • 11+ miles of wires (~17 km)
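As a sanity check, the chip-level figures follow directly from the per-node figures above. A quick back-of-the-envelope in Python (illustrative arithmetic only, not Tesla code):

```python
# D1 headline numbers derived from the per-node specs.
NODES_PER_D1 = 354
NODE_BF16_GFLOPS = 1024   # BF16/CFP8 per training node
NODE_FP32_GFLOPS = 64     # FP32 per training node

d1_bf16_tflops = NODES_PER_D1 * NODE_BF16_GFLOPS / 1000
d1_fp32_tflops = NODES_PER_D1 * NODE_FP32_GFLOPS / 1000

print(f"{d1_bf16_tflops:.1f} TFLOPS BF16/CFP8")  # 362.5 -> quoted as 362
print(f"{d1_fp32_tflops:.2f} TFLOPS FP32")       # 22.66 -> quoted as 22.6
```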

D1 Accelerator chips (Compute + Logical Memory)

Dojo interface processors (Ingest + Shared Memory)

DOJO chip D1

The AI team's lead engineer describes it as having:

  • GPU-level compute,
  • CPU-level flexibility,
  • Twice the I/O of networking chips

3. UNIT OF SCALE SYSTEM « Training Tile » 🧱

Here is where things get serious:

  • Twenty-five D1 dies (x25) are placed onto a wafer (a thin slice of semiconductor) and seamlessly attached together with a fan-out wafer process.
  • High-bandwidth, high-density connectors preserve 9 TB/s of I/O on each side of the « Training Tile », which brings it to 36 TB/s of total off-tile bandwidth.
  • Power is supplied by a custom voltage-regulator module that can be reflowed directly onto this fan-out wafer.
  • Finally, they integrated the entire electrical, thermal and mechanical stack behind a 52-volt DC input.
  • Achieving 9 PFLOPS of compute.
The 7-layer training tile

4. NETWORK OF TILES « Training matrix » 💠

And here is the supercomputing part, beast mode engaged.

  • 2x3 tiles x 2 trays per cabinet
  • 100+ PFLOPS per cabinet
  • 12 TB/s bisection bandwidth
The training matrix: x6 training tiles

5. THE « ExaPOD » 🎚️

And finally: have you seen the movie “The Matrix”? Watching the ExaPOD will definitely give you pause for thought when you try to figure out what kind of hardware could have handled that technology.

  • 1.1 EFLOPS (exaflops) with BF16 and CFP8
  • 120 Training Tiles
  • 3000 D1 Chips
  • > 1M Training Nodes
  • Still with uniform high bandwidth and low-latency fabric
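The quoted figures compose cleanly across the whole hierarchy, from chip to tile to cabinet to ExaPOD. A short Python check (illustrative arithmetic based on the numbers in this article):

```python
# How the quoted figures compose across the Dojo hierarchy.
D1_BF16_TFLOPS = 362     # one D1 chip (BF16/CFP8)
D1_PER_TILE = 25         # training tile = 25 D1 dies
TILES_PER_CABINET = 12   # 2x3 tiles x 2 trays
TILES_PER_EXAPOD = 120

tile_pflops = D1_PER_TILE * D1_BF16_TFLOPS / 1000
cabinet_pflops = TILES_PER_CABINET * tile_pflops
exapod_eflops = TILES_PER_EXAPOD * tile_pflops / 1000

print(f"tile:    {tile_pflops:.2f} PFLOPS")          # ~9 PFLOPS
print(f"cabinet: {cabinet_pflops:.1f} PFLOPS")       # 100+ PFLOPS
print(f"exapod:  {exapod_eflops:.3f} EFLOPS")        # ~1.1 EFLOPS
print(f"chips:   {TILES_PER_EXAPOD * D1_PER_TILE}")  # 3000 D1 chips
```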
The Tesla Super Computer: EXAPOD

…Still hungry? 🍴

6. TRANSFERRING THE LEARNING TO THE TESLA BOT « Optimus » 🤖

Tesla AI integrated in “Optimus”

At the end of the presentation, Elon Musk introduced « Optimus » (meaning “good” in Latin), the Tesla humanoid bot. The point is hard to miss: Tesla is letting us see the feasibility of integrating all this powerful, compact technology into a humanoid robot.

Why? In Elon’s words, the goals of this bot are first and foremost to help humans with boring, repetitive and dangerous tasks. We know how much Elon fears boredom (cf. The Boring Company).

Specifications of “Optimus”

Everything is set! At least on paper. But looking at what Elon Musk has achieved, particularly with SpaceX over the last 15 years, we can safely say that we are going to hear from Optimus very soon.

So what would be the point of integrating the largest AI-based models into the most powerful compute architecture to do dangerous tasks? Look up and you will find your answer… 🪐

Optimus will certainly go to Mars — Credits : spaceexplored.com

IF YOU WANT TO KEEP IN TOUCH WITH ME 👋

· TWITTER

· LINKEDIN

· GITHUB


Thibaut BONACORSI

Python coder - AI Engineer - Quantitative - Looking to the stars