How to Run Massive Large Language Models on Any Device

Yes, including LLaMA 2-70B

Harshita Sharma
Accredian
4 min read · Sep 6, 2023


Introduction

In a world where Large Language Models sit at the centre of ideas and computation, constantly evolving toward artificial general intelligence, they remain inaccessible to the majority of independent developers and consumers.

Even though powerful models like LLaMA 2, BLOOM, and MPT are open source, they require extremely high-end GPUs costing thousands of dollars, and cloud-service prices are often just as prohibitive. For obvious reasons, neither is a suitable option for most people.

This is where Petals comes into play. Petals is an open-source, distributed network for running text-generating AI, and it aims to democratize AI by driving down the cost of running these models.

What is Petals?

Petals is a decentralized way of running and fine-tuning large language models.

Petals AI

The coolest fact about Petals is that it runs on technology that has been around for decades! To set the stage for everything that follows, think of it as BitTorrent for AI. You remember torrents, right?? (*cough* movies *cough*)

How does Petals work?

On a surface level, Petals works as a decentralized pipeline designed for fast inference of neural networks. Basically, it splits a given model into several small blocks (or layers) that are hosted on different servers (just ordinary consumer-grade computers) around the world.

Of course, it's not a neat equal split: Petals assigns blocks based on how much each device can handle, so a powerful workstation hosts a bigger chunk of the model than an old laptop.
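To make the idea concrete, here is a toy sketch of proportional block assignment in pure Python. This is just an illustration of the principle described above, not Petals' actual scheduler; the function name and capacity numbers are made up for the example.

```python
def assign_blocks(num_blocks, capacities):
    """Split a model's transformer blocks across servers in
    proportion to each server's capacity (a toy sketch of the
    idea, not Petals' real assignment logic)."""
    total = sum(capacities)
    # ideal (fractional) share of blocks for each server
    shares = [num_blocks * c / total for c in capacities]
    counts = [int(s) for s in shares]
    # hand leftover blocks to the servers with the largest
    # fractional remainders
    leftover = num_blocks - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    # convert per-server counts into contiguous block ranges
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges

# a powerful workstation (capacity 8) vs. two old laptops (capacity 1 each)
print(assign_blocks(80, [8, 1, 1]))
# → [range(0, 64), range(64, 72), range(72, 80)]
```

The workstation ends up hosting 64 of the 80 blocks, while each laptop hosts just 8: the same "bigger chunk for the bigger machine" behaviour the paragraph describes.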

These servers can be spread out across continents, and anybody can connect their own GPU! In turn, users can connect to this network as a client and apply the model to their data.

So you store a little piece of the model on your computer, and combined with everyone else around the world doing the same thing (basically how a torrent works), you suddenly have one of the most powerful AI computers in the world!!

Client requests are routed through a chain of servers optimized for minimal forward-pass time. Servers adapt by choosing the block sets that best relieve bottlenecks. Once the network reaches sufficient capacity, it can even power chatbots and other interactive apps.
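The routing objective above can be sketched as a small optimization problem: pick one server per pipeline stage so that compute time plus network hops is minimal. The brute-force search and all latency numbers below are illustrative assumptions, not Petals' actual router.

```python
import itertools

def best_chain(stage_servers, compute_ms, link_ms):
    """Brute-force the server chain with the minimal total
    forward-pass time: per-server compute time plus the link
    latency between consecutive servers in the chain.
    (A toy sketch of the routing objective, not Petals' router.)"""
    best, best_cost = None, float("inf")
    for chain in itertools.product(*stage_servers):
        cost = sum(compute_ms[s] for s in chain)
        cost += sum(link_ms[a][b] for a, b in zip(chain, chain[1:]))
        if cost < best_cost:
            best, best_cost = chain, cost
    return best, best_cost

# two candidate servers per stage, with made-up latencies
stage_servers = [["A", "B"], ["C", "D"]]
compute_ms = {"A": 30, "B": 10, "C": 20, "D": 25}
link_ms = {"A": {"C": 5, "D": 50}, "B": {"C": 40, "D": 5}}

chain, cost = best_chain(stage_servers, compute_ms, link_ms)
print(chain, cost)  # → ('B', 'D') 40
```

Note that the fastest chain is not simply the fastest server at each stage; the B→D route wins because its link latency is low, which is exactly why routing must consider the whole chain.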

An overview of Petals

How to use it?

As an end-user client, you don't even need to know how it works to use it. And because it behaves like a torrent, the more people contribute, the better the network gets.

The Petals repository contains several tutorials and examples showing how to use it for different tasks. They even provide a Colab notebook so that you don't have to run anything on your local machine!

For running LLaMA 2-70B, you can use this wonderful Colab notebook by vrsen. It needs only around 2 GB of GPU memory, compared to the more than 16 GB it would normally require. You will need to request access to the model on Hugging Face to get started.
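For reference, a minimal client looks along these lines (adapted from the Petals README; the exact model name is an assumption, and running it requires `pip install petals`, a gated-model access grant on Hugging Face, and a live connection to the public swarm, so treat it as a sketch rather than a copy-paste recipe):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# gated model: request access on Hugging Face first
model_name = "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# connects to the public swarm; only embeddings/head run locally,
# the transformer blocks are served by volunteers around the world
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A quick brown fox", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0]))
```

From the client's point of view this is just the familiar `transformers` generate loop; the distribution happens behind the `from_pretrained` call.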

This is what the benchmarks of the models look like:

Benchmarks for BLOOM-176B

Conclusion

Given the direction in which Petals is heading, it would be no surprise if the team implemented incentives based on each user's level of contribution; it's a torrent after all, and contributions are its foundation.

So how do you incentivize people to donate their idle GPU time to the broader network? One idea that comes to mind almost immediately is rewarding people for their compute power, which, perhaps surprisingly, is the very premise of blockchains!

For now, Petals supports LLaMA and BLOOM, two of the most powerful open-source models. Let's wait and see what more exciting things it has in store for us!!
