TESLA BOT « OPTIMUS » AND PROJECT DOJO

Thibaut BONACORSI
5 min read · Aug 24, 2021


The Tesla Bot “Optimus”

INTRODUCTION

The Dojo chip is a technology initially designed by Tesla to improve the computing power behind their Autopilot AI system, which was available in their cars at the time of the announcement.

The DPU (Dojo Processing Unit) is described by Tesla's AI team as a « virtual device that can be sized according to the application needs ».

The initial goals of project Dojo are to:

  • Achieve better AI training performance, especially to train the neural networks behind Tesla Autopilot,
  • Enable larger and more complex neural network models,
  • Be power efficient and cost effective.

The challenge is to scale computing power with a distributed compute architecture by maximizing bandwidth and minimizing latency, exploiting temporal and spatial locality.

Dojo project

1. SMALLEST UNIT OF SCALE « Training Node » 🎛️

This is a 64-bit superscalar CPU optimized around matrix multiply units and vector SIMD. It supports 32-bit floating point (FP32), bfloat16 (BF16) and a new configurable 8-bit floating-point format (CFP8). All of this is backed by 1.25 MB of fast, ECC-protected SRAM and the low-latency, high-bandwidth fabric that Tesla designed.

  • 1024 GFLOPS (BF16/CFP8), i.e. ~1 TFLOPS
  • 64 GFLOPS (FP32)
  • 512 GB/s in each cardinal direction of the network
  • 4-way multithreaded, allowing compute and data transfer to happen simultaneously
  • Custom ISA (Instruction Set Architecture) optimized for machine-learning kernels
Training node architecture
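To make the number formats concrete, here is a minimal Python sketch of how BF16 relates to FP32: bfloat16 keeps FP32's sign bit and all 8 exponent bits but only the top 7 mantissa bits, trading precision for the same dynamic range at half the memory. This simple truncation is an illustration only; real hardware usually rounds to nearest, and Tesla's CFP8 format is not reproduced here.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate an FP32 value to bfloat16: keep the sign bit,
    the 8 exponent bits, and the top 7 mantissa bits."""
    fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return fp32_bits >> 16  # drop the low 16 mantissa bits

def bf16_bits_to_fp32(bits: int) -> float:
    """Re-expand a bfloat16 bit pattern back to FP32 (low bits = 0)."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

x = 3.141592653589793
approx = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
print(approx)  # 3.140625: only ~3 decimal digits survive the 7-bit mantissa
```

The shared exponent width is why BF16 can usually stand in for FP32 during training without the loss-scaling tricks that FP16 needs.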

2. COMPUTE ARRAY « D1 » 🔳

  • A network of 354 training nodes
  • Delivering 362 TFLOPS of machine-learning compute with BF16 and CFP8
  • And 22.6 TFLOPS with FP32
  • 10 TB/s per direction of on-chip bisection bandwidth
  • 576 high-speed, low-power SerDes lanes at 112 Gbps
  • 4 TB/s per edge of off-chip bandwidth (I/O)
  • 400 W TDP (Thermal Design Power)
  • 645 mm² die area
  • 7 nm process technology
  • 50 billion transistors
  • 11+ miles of wires (~17 km)
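As a sanity check, the chip-level figures follow directly from the per-node figures above. A quick back-of-the-envelope in Python (illustrative arithmetic only, not Tesla code):

```python
# D1 headline numbers derived from the per-node specs.
NODES_PER_D1 = 354
NODE_BF16_GFLOPS = 1024   # BF16/CFP8 per training node
NODE_FP32_GFLOPS = 64     # FP32 per training node

d1_bf16_tflops = NODES_PER_D1 * NODE_BF16_GFLOPS / 1000
d1_fp32_tflops = NODES_PER_D1 * NODE_FP32_GFLOPS / 1000

print(f"{d1_bf16_tflops:.1f} TFLOPS BF16/CFP8")  # 362.5 -> quoted as 362
print(f"{d1_fp32_tflops:.2f} TFLOPS FP32")       # 22.66 -> quoted as 22.6
```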

D1 Accelerator chips (Compute + Logical Memory)

Dojo interface processors (Ingest + Shared Memory)

DOJO chip D1

The AI team's lead engineer describes it as having:

  • GPU-level compute,
  • CPU-level flexibility,
  • Twice the I/O of networking chips

3. UNIT OF SCALE SYSTEM « Training Tile » 🧱

Here is where things get serious:

  • Twenty-five D1 dies (x25) are placed onto a wafer (a thin slice of semiconductor) and seamlessly attached together with a fan-out wafer process.
  • High-bandwidth, high-density connectors preserve 9 TB/s of I/O on each side of the « Training Tile », which brings it to 36 TB/s of total off-tile bandwidth.
  • Power is supplied by a custom voltage-regulator module that can be reflowed directly onto this fan-out wafer.
  • Finally, they integrated the entire electrical, thermal and mechanical stack behind a 52-volt DC input.
  • Achieving 9 PFLOPS of compute.
The 7-layer training tile

4. NETWORK OF TILES « Training matrix » 💠

And here is the supercomputing part, beast mode engaged.

  • 2x3 tiles x 2 trays per cabinet
  • 100+ PFLOPS per cabinet
  • 12 TB/s bisection bandwidth
The training matrix: x6 training tiles

5. THE « ExaPOD » 🎚️

And finally: have you seen the movie “The Matrix”? Watching the ExaPOD will definitely give you pause for thought when you try to figure out what kind of hardware could have handled that technology.

  • 1.1 EFLOPS (exaflops) with BF16 and CFP8
  • 120 Training Tiles
  • 3000 D1 Chips
  • > 1M Training Nodes
  • Still with uniform high bandwidth and low-latency fabric
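The quoted figures compose cleanly across the whole hierarchy, from chip to tile to cabinet to ExaPOD. A short Python check (illustrative arithmetic based on the numbers in this article):

```python
# How the quoted figures compose across the Dojo hierarchy.
D1_BF16_TFLOPS = 362     # one D1 chip (BF16/CFP8)
D1_PER_TILE = 25         # training tile = 25 D1 dies
TILES_PER_CABINET = 12   # 2x3 tiles x 2 trays
TILES_PER_EXAPOD = 120

tile_pflops = D1_PER_TILE * D1_BF16_TFLOPS / 1000
cabinet_pflops = TILES_PER_CABINET * tile_pflops
exapod_eflops = TILES_PER_EXAPOD * tile_pflops / 1000

print(f"tile:    {tile_pflops:.2f} PFLOPS")          # ~9 PFLOPS
print(f"cabinet: {cabinet_pflops:.1f} PFLOPS")       # 100+ PFLOPS
print(f"exapod:  {exapod_eflops:.3f} EFLOPS")        # ~1.1 EFLOPS
print(f"chips:   {TILES_PER_EXAPOD * D1_PER_TILE}")  # 3000 D1 chips
```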
The Tesla Super Computer: EXAPOD

…Still hungry? 🍴

6. TRANSFERRING THE LEARNING TO THE TESLA BOT « Optimus » 🤖

Tesla AI integrated in “Optimus”

At the end of the presentation, Elon Musk introduced « Optimus » (meaning “good” in Latin), the Tesla humanoid bot. The point is hard to miss: Tesla is letting us see the feasibility of integrating all this powerful, compact technology into a humanoid robot.

Why? In Elon’s words, the goals of this bot are first and foremost to help humans with boring, repetitive and dangerous tasks. We know how much Elon fears boredom (cf. The Boring Company).

Specifications of “Optimus”

Everything is set! At least on paper. But looking at what Elon Musk has achieved, particularly with SpaceX over the last 15 years, we can safely say that we are going to hear from Optimus very soon.

So what would be the point of integrating the largest AI-based models into the most powerful compute architecture to do dangerous tasks? Look up and you will find your answer… 🪐

Optimus will certainly go to Mars — Credits : spaceexplored.com

IF YOU WANT TO KEEP IN TOUCH WITH ME 👋

· TWITTER

· LINKEDIN

· GITHUB


Thibaut BONACORSI

Python coder - AI Engineer - Quantitative - Looking to the stars