We Need to Talk About Elon Musk’s Supercluster

Ashraff Hathibelagal · Published in Predict · Jul 23, 2024


The Memphis supercluster is now believed to be the largest cluster ever built for training AI models.

Photo by Lightsaber Collection on Unsplash

Back in May 2024, xAI announced that they had raised USD 6 billion. The main investors were the usual suspects: Marc Andreessen, Ben Horowitz, Sequoia Capital, and Prince Alwaleed bin Talal. xAI said at the time that the money would go toward building advanced infrastructure capable of training huge AI models. It looks like they have delivered on that promise.

So, they now have a supercluster with 100,000 liquid-cooled NVIDIA H100 Tensor Core GPUs. That's a lot of compute! And quite a feat too. NVIDIA's NVLink Switch System lets you interconnect up to 256 H100 GPUs, so it's remarkable that they managed to network 100,000 of them into a single cluster.

Elon Musk has said that they used a technology called RDMA (Remote Direct Memory Access) network fabric, and that it is a single unified fabric. RDMA gives you high bandwidth and low latency by letting the network cards of different machines write data directly into each other's memory, with no extra copies along the way. In other words, it bypasses the CPU and the operating system. It is still, of course, not as fast as local RAM, but it's definitely faster than anything we had before.
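To make the idea concrete, here's a minimal, hypothetical sketch in C using the libibverbs API of what a one-sided RDMA write looks like. The queue-pair setup, the state transitions, and the out-of-band exchange of the peer's buffer address and rkey are stubbed out with placeholders (the qp, peer_addr, and peer_rkey values are assumptions for illustration, not anything xAI has published), so treat this as a sketch of the mechanism rather than a picture of their actual stack.

```c
/* Hypothetical sketch: posting a one-sided RDMA write with libibverbs.
 * Assumes a connected reliable-connection (RC) queue pair and that the
 * peer's buffer address and rkey were exchanged out of band (e.g. over
 * TCP). Those setup steps are omitted for brevity. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Post a single RDMA WRITE: the local NIC pushes `len` bytes from
 * `local_buf` straight into the remote node's registered memory at
 * `remote_addr`, without involving the remote CPU or OS. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* local source buffer       */
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,         /* key from ibv_reg_mr()     */
    };

    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,  /* one-sided write           */
        .send_flags = IBV_SEND_SIGNALED,  /* ask for a completion      */
    };
    wr.wr.rdma.remote_addr = remote_addr; /* peer's virtual address    */
    wr.wr.rdma.rkey        = remote_rkey; /* peer's remote access key  */

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}

int main(void)
{
    /* Minimal resource setup; error handling trimmed for brevity. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);
    struct ibv_pd      *pd   = ibv_alloc_pd(ctx);

    char *buf = malloc(4096);
    strcpy(buf, "hello over RDMA");

    /* Register the buffer so the NIC can DMA from it directly. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* A real program would now create a completion queue and an RC queue
     * pair, walk it through its state transitions with ibv_modify_qp(),
     * and swap QP numbers, buffer addresses, and rkeys with the peer.
     * The placeholders below stand in for that exchange. */
    struct ibv_qp *qp  = NULL;            /* assumption: connected RC QP */
    uint64_t peer_addr = 0;               /* assumption: from the peer   */
    uint32_t peer_rkey = 0;               /* assumption: from the peer   */

    if (qp && post_rdma_write(qp, mr, buf, strlen(buf) + 1,
                              peer_addr, peer_rkey) == 0)
        printf("RDMA write posted\n");

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```

The part that matters is post_rdma_write(): once the work request is posted, the NICs on both ends move the bytes on their own, which is why an RDMA fabric can keep latency low even across a cluster of this size.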

Elon Musk has also said that he’s going to have the “world’s most powerful” AI model by…
