Decentralized Distributed LLM: Using AI in a Cost-Effective and Environment-Friendly Way

Vishal Mysore
7 min read · Feb 4, 2024

Background

As you all might already know, I'm currently working on my open-source project, CookGPT, a personalized chef for Indian recipes. As I plan for the long term, I'm exploring a decentralized deployment approach. Centralized models are not only expensive but also have environmental implications; you can calculate the environmental impact of your ML compute here.

I am working on an approach to distribute the model or its computations across a network, with the aim of optimizing costs and contributing to a more sustainable solution.

Try out the CookGPT demo here

Overview

When diving into large language model-based projects, we face issues like not having fancy GPUs, thinking about the environment, and dealing with the cost of these high-tech computer parts. Take a look around: how many of your friends have those top-notch GPUs? And let's not forget, these things need maintenance and special cooling systems.

The code for this article is here, but as always, I would encourage you to read the full article for an in-depth understanding.

Costly GPUs are essential for large language models, but they're not just expensive in terms of money. The environmental impact is significant: these powerful GPUs consume a lot of energy, contributing to a bigger carbon footprint. They're not just burning a hole in our wallets, they're also leaving a mark on the environment. Plus, the energy demand ripples out to the supporting hardware, from cooling systems to backup batteries, adding another layer of impact to consider. So it's not just a financial luxury; it's an environmental concern too. On top of that, many small organizations are struggling to get enough GPU resources, and chip shortages and other supply issues have created bottlenecks in AI innovation and rollouts.

Now, imagine a way of working together that doesn’t rely on everyone having a super expensive computer setup. It’s not just cost-effective; it’s also kinder to the environment. Win-win, right?

We do have some practical and affordable options to tackle this problem right now.

Decentralized Distributed LLM Framework:

The proposed framework draws inspiration from decentralization principles, seeking to distribute the computational load across diverse regions. By creating a network of contributors, computational resources and GPUs can be shared in a decentralized manner, mitigating the financial burden on individual organizations. This collaborative approach not only optimizes resource utilization but also fosters a global community working towards shared AI goals.

One such platform is Petals, an open-source, BitTorrent-style distributed platform for inference and fine-tuning of LLMs. You can read about Petals here.

A simple example of how to get started with Petals is shown below:

from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM
# Choose any model available at https://health.petals.dev
model_name = "petals-team/StableBeluga2" # This one is fine-tuned Llama 2 (70B)
# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
# Run the model as if it were on your computer
inputs = tokenizer("I ate dosa today", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))

You can load any model that is distributed and has users serving it. You can check the available models here.

AutoDistributedModelForCausalLM does the magic here: it enables users to team up and collectively perform inference or fine-tuning tasks, with each participant loading only a small part of the model.

Imagine this: during India's working hours, computational units in Europe, which would otherwise remain idle, actively contribute to the shared network. This cross-continental collaboration ensures a continuous flow of computational power, optimizing resource utilization across different time zones.

I was able to run this on a CPU thanks to the distributed architecture. Petals also has LangChain integration for inference and can be combined with RAG.

import os
from google.colab import userdata
from langchain_community.llms import Petals  # LangChain wrapper around Petals

os.environ["HUGGINGFACE_API_KEY"] = userdata.get('huggingface_api_key')
llm = Petals(model_name="petals-team/StableBeluga2")
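The llm_chain used in the next snippet is not defined above, so here is a minimal sketch of how such a chain could be wired up with LangChain's PromptTemplate and LLMChain. The exact template in my setup is an assumption; the "Let's think step by step" phrasing simply mirrors the answer printed further down.

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Illustrative chain-of-thought template (inferred from the answer shown later)
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Chain the prompt with the distributed Petals-backed LLM created above
llm_chain = LLMChain(prompt=prompt, llm=llm)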

Prompting the distributed LLM using LangChain:

question = "what is paneer butter masala?"
llm_chain.run(question)

And here is the answer:

Question: what is paneer butter masala?

Answer: Let's think step by step.

Paneer is an Indian cheese made by curdling milk with lemon juice or other acidic ingredients. It is a fresh, soft cheese with a mild, milky flavor and a crumbly texture.

Butter masala is a creamy, tomato-based sauce that is a popular dish in Indian cuisine. It is made with a blend of spices, tomatoes, and cream or butter, giving it a rich and flavorful taste.

Now, when we combine these two ingredients, we get paneer butter masala - a delicious dish made by cooking paneer (the Indian cheese) in a creamy, tomato-based butter masala sauce. This dish is often served with rice, naan, or other Indian breads.</s>

Choosing many small GPUs over one big GPU is like having a team of friends to help with work instead of doing it all alone. Each small GPU does a part of the job, making things faster and more efficient. This teamwork, known as horizontal scaling, is like everyone in the group pitching in their efforts. It’s not only faster but also costs less. Unlike one big GPU trying to do everything, this way, tasks get done quicker and we save resources. It’s like having many hands to make light work!

What is Petals exactly?

The PETALS system is designed for collaborative inference and fine-tuning of large language models. It offers an alternative to RAM offloading and hosted APIs for very large models, and it exposes the hidden states of served models, which enables efficient fine-tuning methods. The system incorporates dynamic quantization and request prioritization to reduce communication delays and improve performance. Additionally, PETALS includes algorithms for reliable routing and load balancing. Overall, PETALS aims to broaden access to large language models and enable new applications and research opportunities.

Client and Server

Each participant in the PETALS system can run a server, a client, or both. A server hosts a subset of model layers, typically Transformer blocks, and handles requests from clients. The number of layers each participant can serve therefore depends on the hardware and resources available to them.
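To join as a server, a participant runs the Petals server module from the command line and lets it pick which blocks to host; the command below uses the same model as the earlier snippet and follows the Petals documentation:

# Host a subset of StableBeluga2's Transformer blocks on this machine
# (the server announces its blocks to the swarm and starts serving clients)
python -m petals.cli.run_server petals-team/StableBeluga2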

Distributed Inference

In the PETALS system, when generating tokens, a client stores the model’s token embeddings locally and relies on servers to run Transformer blocks. Each server holds several consecutive blocks, and the number of blocks depends on the server’s available GPU memory. Clients can form a chain of pipeline-parallel consecutive servers to run the inference of the entire model. The system also includes client-side routing, load balancing, and fault-tolerant protocols to ensure reliable inference and training despite nodes joining, leaving, or failing at any time.
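For multi-step generation, the client can keep a single session open so that the same chain of servers (and their attention caches) is reused across steps. Below is a minimal sketch reusing the model and tokenizer from the first snippet; the max_length value is just an example and the exact API details may differ between Petals versions.

# Keep one inference session so generation reuses the same chain of servers
with model.inference_session(max_length=256) as session:
    prompt = tokenizer("Recipe for paneer butter masala:", return_tensors="pt")["input_ids"]
    # The Transformer blocks run on remote servers; only the embeddings and
    # the generated tokens live on this client
    outputs = model.generate(prompt, session=session, max_new_tokens=32)
    print(tokenizer.decode(outputs[0]))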

Distributed Finetuning

Fine-tuning involves distributed parameter-efficient training methods such as adapters or prompt tuning. Each client stores trainable soft prompts (task-specific input embeddings) and a linear classification head. During fine-tuning, the client routes its data through a chain of remote servers to compute sentence representations, obtains predictions with the classifier head, and computes the cross-entropy loss. The system also includes algorithms for reliable routing, load balancing, and fault tolerance to ensure efficient and effective fine-tuning across the distributed network.
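Here is a rough sketch of what client-side prompt tuning looks like, loosely based on the Petals prompt-tuning examples. The tuning_mode and pre_seq_len arguments and the tiny single-sentence "dataset" are illustrative; exact argument names may differ between Petals versions.

import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# "ptune" keeps the remote Transformer blocks frozen and adds trainable
# soft-prompt embeddings that live entirely on the client
model = AutoDistributedModelForCausalLM.from_pretrained(
    model_name, tuning_mode="ptune", pre_seq_len=16
)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

batch = tokenizer("Paneer butter masala is a creamy tomato-based curry.",
                  return_tensors="pt")
# The forward pass flows through the chain of remote servers; gradients are
# computed only for the locally stored trainable parameters
loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()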

Load balancing

Load balancing is achieved by having servers spread themselves evenly across the model's Transformer blocks. Each server periodically announces its active blocks to a distributed hash table. When a new server joins, it uses this information to pick a contiguous interval of blocks whose throughput is currently the worst. Once the server has selected its layers, it measures its own throughput (both network and compute) and announces it to the distributed hash table. By always covering the worst-performing blocks, the system maximizes total model throughput and eliminates potential bottlenecks.
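As a toy illustration of that selection rule (not Petals code), a joining server could scan the per-block throughputs announced in the DHT and grab the contiguous window whose bottleneck is currently the weakest:

# Toy sketch: pick the contiguous window of blocks whose bottleneck throughput
# is the worst, so the new server strengthens the weakest part of the pipeline
def choose_interval(block_throughputs, num_blocks):
    best_start, worst_bottleneck = 0, float("inf")
    for start in range(len(block_throughputs) - num_blocks + 1):
        window = block_throughputs[start:start + num_blocks]
        bottleneck = min(window)  # the slowest block limits the whole chain
        if bottleneck < worst_bottleneck:
            worst_bottleneck, best_start = bottleneck, start
    return best_start, best_start + num_blocks

# Example: 8 blocks, a joining server that can host 3 of them
print(choose_interval([9, 7, 5, 1, 3, 8, 8, 9], 3))  # -> (1, 4): covers the weakest block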

CPU or GPU?

The system allows for clients to run on CPUs. For simple greedy inference, any CPU that runs PyTorch can be used. However, more advanced algorithms such as beam search may require a GPU for efficient processing. Therefore, the system is designed to accommodate a range of hardware configurations, including the use of CPUs for certain types of inference and fine-tuning tasks.

The client stores the model's token embeddings, which typically comprise a small fraction of the total parameter count. This allows the client to efficiently handle the local storage of essential model components while relying on remote servers to run the bulk of the model's computations.
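You can see this on the client itself: for the distributed model loaded in the first snippet, only the embeddings and a few small layers are materialized locally, so counting local parameters gives a small number compared with the roughly 70B parameters of the underlying Llama 2 model (a quick check, assuming the model object from that snippet).

# Only locally held parameters (embeddings, norms, LM head) show up here;
# the remote Transformer blocks are not materialized on the client
local_params = sum(p.numel() for p in model.parameters())
print(f"Parameters held locally: {local_params / 1e9:.2f}B of roughly 70B total")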

Future?

Distributed Inferencing can potentially be combined with blockchain technology to enhance security and trust within the network. By leveraging blockchain, the system could implement economically motivated approaches to address issues related to faulty or malicious servers. For example, servers could deposit a pledge and use cryptographic hashes to verify the correctness of outputs, with potential rewards for maintaining integrity. This integration could provide additional layers of security and accountability within the distributed network.

Blockchain Integration for Cost-Effective Resource Allocation:

Blockchain technology can be seamlessly integrated to ensure fair and cost-effective allocation of resources. Smart contracts automate the distribution of GPU resources, providing contributors with access to surplus computing power in a transparent and equitable manner. This blockchain-driven approach minimizes costs by efficiently allocating resources and promoting a sustainable model of AI development. One such solution is Hypertensor; you can read about it here.

I chose the owl as a symbol for my article on AI not just for its association with wisdom, but also because it stands for ‘Web Ontology Language’ (OWL) in the realm of technology. It signifies the structured and intelligent way in which AI processes information, making it a fitting metaphor for the journey into the world of artificial intelligence.
