1–7x Consumer GPU Scaling for Large Language Modeling Experiments

Xiang Zhang
Daimon Labs
Jul 14, 2023
Daimon’s First Server, Lambda One!

Introduction

At Daimon Labs, we are building emotion-capable dialogue systems based on the latest developments in transformer-based large language models. These models often require enormous compute for training and inference, sometimes on the scale of tens of thousands of GPUs. At the outset of our startup in early 2022, we realized that such compute was far beyond our means, so we began exploring improvements across hardware, software, and AI modeling to reduce the monetary cost of deploying our product as a whole. Eventually, our combined systems and modeling improvements achieved a measurable cost reduction of more than 300x compared to a vanilla GPT-3 deployment as documented in OpenAI's technical report.

In this blog post, we introduce how we built a scalable GPU system out of consumer hardware, which by itself yields a 5x cost reduction compared to renting the same compute from cloud providers over one year. In fact, since these systems have already been running for over a year without interruption, the savings are even better. This post is the first in a series covering all aspects of our hardware, software, and modeling efforts toward that cost reduction, which ultimately allowed us to own our large language models without raising a huge seed round.

System Setup

Server Configuration

At the time we built our systems, the best consumer GPU was the NVIDIA GeForce RTX 3090. That was also when Ethereum mining was about to end due to the transition to proof-of-stake, so the market was flooded with cheap used mining 3090s. We decided to buy these cards instead of new ones, since they were frequently sold at half of MSRP. Luckily for us, all the 3090s we purchased second-hand have been running fine to this day, with only minor hardware issues, such as broken fans, that are easy to fix ourselves.

The two components we decided not to go cheap on are the CPU and the RAM. At Daimon Labs, researchers frequently check up on the latest developments in language model research, and in the first half of 2022 it was already obvious that models were going to be huge and that quantization would play a key role in future deployments. This was all before ChatGPT made the news and before the release of quantization-based deployment software like llama.cpp. Thanks to these decisions, we were later able to adopt such tools immediately and deploy large language models without changing our infrastructure again. Specifically, we went for the best HEDT (high-end desktop) CPU at the time, the AMD Ryzen Threadripper PRO 3995WX, paired with 1TB of DDR4 registered ECC RAM.
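
To put the sizing in perspective, here is a quick back-of-the-envelope sketch of how much memory a model's weights occupy at different quantization levels. It is only a rough illustration; the parameter counts are generic examples, not our production models.

```python
# Rough weight-only memory footprint for hosting an LLM, a minimal sketch
# of the sizing math behind the 1TB RAM decision. Parameter counts below
# are illustrative examples, not our production models.

def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in GB (ignores the KV cache,
    activations, and runtime overhead)."""
    return n_params * bits_per_weight / 8 / 1e9

for n_params, name in [(7e9, "7B"), (65e9, "65B"), (175e9, "175B")]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{model_memory_gb(n_params, bits):.0f} GB")

# A 175B model needs ~350 GB at 16-bit -- far beyond any consumer GPU,
# but it fits comfortably in 1TB of system RAM, and drops to ~88 GB at 4-bit.
```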

With these decisions made, we were ready to pick out the remaining components. The first was the motherboard, an ASUS Pro WS WRX80E-SAGE SE WIFI, which works with all of the hardware above. It also comes with 8 SATA ports, which connect two 8TB SATA SSDs for the root mount and five 18TB SATA HDDs set up as an mdadm RAID-5 array. Three consumer-grade 1600W PSUs are needed to provide enough power to the GPUs.
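
As a sanity check on these numbers, here is a small sketch of the capacity and power arithmetic. The wattages are nominal spec-sheet TDPs, not measurements of our actual server.

```python
# Quick sanity checks on the storage and power configuration above.

# An mdadm RAID-5 array across n drives keeps n-1 drives' worth of usable
# capacity: one drive's worth goes to parity, so the array survives a
# single drive failure.
n_drives, drive_tb = 5, 18
print(f"RAID-5 usable capacity: {(n_drives - 1) * drive_tb} TB")  # 72 TB

# Power budget: 7x RTX 3090 at a nominal 350W TDP each, plus a 280W TDP
# CPU, against 3x 1600W PSUs (leaving headroom for transient spikes).
gpu_w, n_gpus, cpu_w, psu_w, n_psus = 350, 7, 280, 1600, 3
load_w = n_gpus * gpu_w + cpu_w
print(f"Steady-state load: ~{load_w} W of {n_psus * psu_w} W available")
```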

The most creative part of our server is the chassis, or rather the lack thereof: all components sit on an open three-layer frame. From experience, we found it important to buy good-quality PCIe extension cables, since the cheap ones have a high failure rate. Even then, we were only able to achieve full PCIe 3.0 x16 speeds, not the PCIe 4.0 x16 that both the cables and the GPUs nominally support.
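
The gap between the two link generations is easy to quantify from the spec-sheet signaling rates. A short sketch, assuming 128b/130b encoding and ignoring protocol overhead:

```python
# Theoretical one-directional PCIe bandwidth at x16, gen 3 vs. gen 4.
# PCIe 3.0 signals at 8 GT/s per lane and PCIe 4.0 at 16 GT/s, both with
# 128b/130b encoding (128 payload bits per 130 transferred bits).

def pcie_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """Peak one-directional bandwidth in GB/s for a PCIe 3.0+ link."""
    return gt_per_s * (128 / 130) / 8 * lanes

gen3 = pcie_gb_per_s(8.0)    # ~15.8 GB/s at x16
gen4 = pcie_gb_per_s(16.0)   # ~31.5 GB/s at x16
print(f"x16 gen3: {gen3:.1f} GB/s, gen4: {gen4:.1f} GB/s "
      f"({gen4 / gen3:.0f}x)")
```

In other words, the fallback halves the theoretical host-to-GPU transfer rate.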

Conclusion

Building and scaling up this server has been a fun project for the team. It has also been running stably ever since it was put into production, except for one minor fix to a failing GPU fan. Having a server like this in house has made us immune to the extreme competition for cloud resources that came with the recent hype around large language models. It has also enabled us to run quick experiments and stay agile amid the exciting open-source development happening in the community.

Since the server went into production, there have been plenty of new hardware developments on the market. These include a new generation of NVIDIA consumer GPUs and the wide availability of SlimSAS / OCuLink adapters, which can hopefully deliver full PCIe 4.0 x16 speeds and replace the unstable extension cables. We are keen to keep up with these developments and to ensure that Daimon Labs has what it needs to deploy large language models easily and cheaply.
