Is it possible to run Llama 2 with 70B parameters on Azure Kubernetes Service with LangChain agents and tools?

Saverio Proto
Published in Microsoft Azure
Jul 27, 2023 · 9 min read

In June 2023, I authored an article that provided a comprehensive guide on running the Falcon-40B-instruct model on Azure Kubernetes Service. Following a similar approach, it is also possible to deploy the Llama 2 models on Kubernetes. This article demonstrates the process of running and testing the Llama-2-70b-chat-hf model. The key points I will be covering are as follows:

- Infrastructure
- Testing the model endpoint with curl
- Testing with LangChain agents and tools
- Conclusion

Infrastructure

Please refer to my previous article and navigate to the section titled “Create a cluster with GPUs.” In my testing, I used the SKU Standard_NC48ads_A100_v4, which offers a total of 160 GB of GPU memory (2 x 80 GB). If you intend to run both the Llama-2-70b-chat-hf and Falcon-40B-instruct models at the same time, you will need two virtual machines (VMs) to ensure the necessary number of GPUs is available.

It’s essential to note that the Llama-2-70b-chat-hf model is categorized as a “gated” model, meaning that direct downloading is not possible without authentication, and you must first accept the license agreement.

To proceed with accessing the Llama-2-70b-chat-hf model, visit the Llama downloads page and register using the same email address associated with your huggingface.co account.

Next, visit the following link: https://huggingface.co/meta-llama/Llama-2-70b-chat-hf and log in to your Hugging Face account. Follow the provided instructions to gain access to the Llama 2 model on Hugging Face.

Access Llama2 on Hugging Face

Once you have gained access to the gated models, go to the tokens settings page and generate a token.

Add the token to the following YAML file to pass it as an environment variable to the container (replacing “ACCESSTOKENVALUEHERE”), and you are ready to start the model:

---
apiVersion: v1
kind: Pod
metadata:
  name: llama-2-70b-chat-hf
  labels:
    run: llama-2-70b-chat-hf
spec:
  containers:
  - name: text-generation-inference
    image: ghcr.io/huggingface/text-generation-inference:0.9.3
    resources:
      limits:
        nvidia.com/gpu: 2
    env:
    - name: RUST_BACKTRACE
      value: "1"
    - name: HUGGING_FACE_HUB_TOKEN
      value: "ACCESSTOKENVALUEHERE"
    command:
    - "text-generation-launcher"
    - "--model-id"
    - "meta-llama/Llama-2-70b-chat-hf"
    - "--num-shard"
    - "2"
    ports:
    - containerPort: 80
      name: http
    volumeMounts:
    - name: llama270b
      mountPath: /data
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: llama270b
    persistentVolumeClaim:
      claimName: llama270b
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
  nodeSelector:
    agentpool: gpunp
  tolerations:
  - key: sku
    operator: Equal
    value: gpu
    effect: NoSchedule
  restartPolicy: Never
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama270b
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-csi-premium
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Service
metadata:
  name: llama-2-70b-chat-hf
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: llama-2-70b-chat-hf
  type: ClusterIP

You can exec into the pod and run nvidia-smi to check that the memory of both GPUs is almost full:

root@text-generation-inference:/usr/src# nvidia-smi
Wed Jul 26 12:45:35 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...   On  | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P0    72W / 300W |  72713MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...   On  | 00000002:00:00.0 Off |                    0 |
| N/A   32C    P0    72W / 300W |  72681MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

As expected, the Llama-2-70b-chat-hf model needs more memory than the falcon-40b-instruct model, because of the jump from 40B to 70B parameters.
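A quick back-of-the-envelope calculation, assuming the weights are loaded in 16-bit precision (no --quantize flag is set in the manifest above), lines up with what nvidia-smi reports:

params = 70e9                  # Llama-2-70b parameter count
bytes_per_param = 2            # 16-bit weights (no quantization configured)
total_gib = params * bytes_per_param / 2**30
per_gpu_gib = total_gib / 2    # --num-shard 2 splits the weights across both GPUs
print(f"{total_gib:.0f} GiB total, ~{per_gpu_gib:.0f} GiB of weights per GPU")
# ~130 GiB total, ~65 GiB per GPU; the ~71 GiB per GPU reported by nvidia-smi
# also includes the KV cache and CUDA buffers allocated by the server.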

Testing the model endpoint with curl

Just as we did in the past to test falcon-40b-instruct, we will run the same utility Pod to make an HTTP POST request with curl:

kubectl run -ti --rm shell --image=nicolaka/netshoot -- /bin/bash

From the Pod shell, connect to the Kubernetes Service llama-2-70b-chat-hf:

shell:~# time curl -s http://llama-2-70b-chat-hf/generate \
> -X POST \
> -d '{"inputs":"Can you give me step to step instructions to prepare Tiramisu?","parameters":{"max_new_tokens":1000}}' \
> -H 'Content-Type: application/json' | jq -r .generated_text


Answer: Certainly! Tiramisu is a classic Italian dessert made with ladyfingers soaked in coffee and liqueur, layered with a creamy mascarpone cheese mixture. Here's a step-by-step guide to preparing Tiramisu:

Ingredients:

* 12-16 ladyfingers
* 1 cup of strong brewed coffee
* 2 tablespoons of unsweetened cocoa powder
* 2 tablespoons of rum or other liqueur (optional)
* 8 ounces of mascarpone cheese
* 8 ounces of whipping cream
* 1/2 cup of granulated sugar
* 1/2 teaspoon of vanilla extract
* Cocoa powder or powdered sugar for dusting

Instructions:

1. Start by brewing a cup of strong coffee and mixing it with 2 tablespoons of unsweetened cocoa powder. If desired, add 2 tablespoons of rum or other liqueur to the coffee mixture.
2. Dip each ladyfinger into the coffee mixture for about 3-5 seconds on each side. They should be soft and pliable but not too wet.
3. In a large mixing bowl, combine the mascarpone cheese, whipping cream, granulated sugar, and vanilla extract. Beat the mixture with an electric mixer until it's smooth and creamy.
4. To assemble the Tiramisu, start with a layer of ladyfingers in the bottom of a large serving dish. You may need to trim the ladyfingers to fit the dish.
5. Spread half of the mascarpone mixture over the ladyfingers.
6. Repeat the layers, starting with the ladyfingers, then the coffee mixture, and finally the remaining mascarpone mixture.
7. Dust the top of the Tiramisu with cocoa powder or powdered sugar.
8. Cover the dish with plastic wrap and refrigerate for at least 3 hours or overnight.
9. Slice and serve.

That's it! Your Tiramisu is now ready to be enjoyed. You can also garnish it with cocoa powder or chocolate shavings before serving. Buon appetito!

real 0m29.113s
user 0m0.038s
sys 0m0.000s

The model is functional, but its speed leaves something to be desired. It took 29 seconds to generate this response. In contrast, when I tried the falcon-40b-instruct model for a quick comparison, it only took 10 seconds.

Here, I am sharing the updated versions of the YAML definitions required to deploy falcon-40b-instruct. Before proceeding, please ensure that your Kubernetes cluster has an adequate number of nodes equipped with GPUs in the node pool. This will allow you to make a direct comparison between the two models side by side.

---
apiVersion: v1
kind: Pod
metadata:
  name: falcon-40b-instruct
  labels:
    run: falcon-40b-instruct
spec:
  containers:
  - name: text-generation-inference
    image: ghcr.io/huggingface/text-generation-inference:0.9.3
    resources:
      limits:
        nvidia.com/gpu: 2
    env:
    - name: RUST_BACKTRACE
      value: "1"
    command:
    - "text-generation-launcher"
    - "--model-id"
    - "tiiuae/falcon-40b-instruct"
    - "--num-shard"
    - "2"
    ports:
    - containerPort: 80
      name: http
    volumeMounts:
    - name: falcon-40b-instruct
      mountPath: /data
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: falcon-40b-instruct
    persistentVolumeClaim:
      claimName: falcon-40b-instruct
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
  nodeSelector:
    agentpool: gpunp
  tolerations:
  - key: sku
    operator: Equal
    value: gpu
    effect: NoSchedule
  restartPolicy: Never
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: falcon-40b-instruct
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-csi-premium
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Service
metadata:
  name: falcon-40b-instruct
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: falcon-40b-instruct
  type: ClusterIP

Now you can verify that making the exact same call to the falcon-40b-instruct model takes approximately 10 seconds:

shell:~# time curl -s http://falcon-40b-instruct/generate \
-X POST \
-d '{"inputs":"Can you give me step to step instructions to prepare Tiramisu?","parameters":{"max_new_tokens":1000}}'\
-H 'Content-Type: application/json' | jq -r .generated_text

Sure! Here are the steps to prepare Tiramisu:

Ingredients:
- 3 eggs
- 1/2 cup sugar
- 1/2 cup mascarpone cheese
- 1/2 cup heavy cream
- 1/4 cup espresso
- 1/4 cup rum
- 1/2 cup ladyfingers
- 1/4 cup cocoa powder

Instructions:
1. Separate the eggs and beat the yolks with sugar until light and creamy.
2. In a separate bowl, beat the egg whites until stiff peaks form.
3. In another bowl, mix the mascarpone cheese and heavy cream until smooth.
4. Add the egg yolk mixture to the cheese mixture and mix well.
5. In a shallow dish, mix the espresso and rum.
6. Dip the ladyfingers in the espresso mixture and arrange them in a 9x9 inch baking dish.
7. Spread half of the cheese mixture over the ladyfingers.
8. Repeat steps 6 and 7 with the remaining ladyfingers and cheese mixture.
9. Sprinkle cocoa powder over the top.
10. Cover and refrigerate for at least 4 hours.
11. Serve chilled and enjoy!

real 0m10.202s
user 0m0.031s
sys 0m0.005s
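If you prefer scripting the comparison instead of using curl, the same /generate endpoint can be called from Python, as long as the script runs from a Pod inside the cluster where the Service names resolve. The snippet below is a minimal sketch (the time_generation helper name is just for illustration) that sends the same payload and measures the elapsed time:

import time
import requests

def time_generation(base_url: str, prompt: str, max_new_tokens: int = 1000) -> float:
    # Send the same JSON payload used with curl to the /generate endpoint
    # of the text-generation-inference server and measure the elapsed time.
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    start = time.time()
    response = requests.post(f"{base_url}/generate", json=payload, timeout=600)
    response.raise_for_status()
    print(response.json()["generated_text"][:200])
    return time.time() - start

prompt = "Can you give me step by step instructions to prepare Tiramisu?"
print(time_generation("http://llama-2-70b-chat-hf", prompt))
print(time_generation("http://falcon-40b-instruct", prompt))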

Testing with LangChain agents and tools

Will the Llama-2-70b-chat-hf model be compatible and function well with LangChain agents and tools?

In my previous article, I attempted to test the utilization of LangChain agents and tools with the LangChain-based Cheshire Cat framework. Unfortunately, I encountered difficulties using the Tools with the falcon-40b-instruct model.

To simplify the process this time, I have created the simplest LangChain agent possible, along with the most basic tool. My aim is to ask the model a simple question like “what time is it?” while passing a one-line function as the tool that retrieves the current time.

Here is the testing code that I have packaged in a Docker container specifically designed to run on the Kubernetes cluster. I am providing both the Python code and the corresponding Dockerfile:
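As a reference, a minimal sketch of what such an agent can look like, assuming LangChain’s HuggingFaceTextGenInference wrapper pointed at the in-cluster Service defined earlier (the parameter values here are illustrative, not the exact code shipped in the container image):

from datetime import datetime

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms import HuggingFaceTextGenInference


def get_the_time(query: str) -> str:
    # One-line helper that returns the current time as a string.
    return str(datetime.now())


# Point the LLM wrapper at the text-generation-inference Service
# exposed inside the cluster (see the manifest above).
llm = HuggingFaceTextGenInference(
    inference_server_url="http://llama-2-70b-chat-hf",
    max_new_tokens=200,
    temperature=0.1,
)

tools = [
    Tool(
        name="get_the_time",
        func=get_the_time,
        description="Useful to answer questions about the current time.",
    )
]

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

print(agent.run("what time is it?"))

Running this agent against either model is what produces the parsing failure described below.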

Using this straightforward code, I came to a couple of crucial realizations:

- The entire setup is highly fragile: even a minor modification to the prompt can lead the model to generate uninterpretable output.
- Although Llama2 performs slightly better when running the get_the_time tool, the resulting output remains unparseable by the LangChain agent.

While both models demonstrate the capability to employ the tools, the output they generate proves incompatible with LangChain’s parsing mechanism. Both models fail with the following exception:

langchain.schema.output_parser.OutputParserException: Parsing LLM output produced both a final answer and a parse-able action

This comes from the following LangChain code:

class MRKLOutputParser(AgentOutputParser):
    """MRKL Output parser for the chat agent."""

    def get_format_instructions(self) -> str:
        return FORMAT_INSTRUCTIONS

    def parse(self, text: str) -> Union[AgentAction, AgentFinish]:
        includes_answer = FINAL_ANSWER_ACTION in text
        regex = (
            r"Action\s*\d*\s*:[\s]*(.*?)[\s]*Action\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        )
        action_match = re.search(regex, text, re.DOTALL)
        if action_match:
            if includes_answer:
                raise OutputParserException(
                    "Parsing LLM output produced both a final answer "
                    f"and a parse-able action: {text}"
                )
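To see why this exception fires, consider a hypothetical completion of the kind both models tend to produce, where the text contains an Action block and a Final Answer at the same time (the exact wording below is illustrative):

import re

# Hypothetical model output: it both invokes the tool and states a final answer.
text = """Thought: I should use the get_the_time tool.
Action: get_the_time
Action Input: now
Final Answer: The current time is 12:45."""

regex = r"Action\s*\d*\s*:[\s]*(.*?)[\s]*Action\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
print(bool(re.search(regex, text, re.DOTALL)))  # True: a parse-able action is present
print("Final Answer:" in text)                  # True: a final answer is present too
# Both conditions being true is exactly the case MRKLOutputParser rejects.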

In the file advanced_agent.py, I attempted a more sophisticated solution by creating my own class that overrides AgentOutputParser. However, I ended up with an intricate and fragile approach that could never reliably parse the model’s output and combine it with the results from the Tool.
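For illustration, a skeleton of that idea, a custom parser that simply prefers the final answer when the text contains both, could look like this (LenientOutputParser is a hypothetical name, not the class from advanced_agent.py):

from typing import Union

from langchain.agents import AgentOutputParser
from langchain.schema import AgentAction, AgentFinish


class LenientOutputParser(AgentOutputParser):
    # Prefer the final answer whenever one is present, instead of raising
    # when the text also contains an Action block.
    def parse(self, text: str) -> Union[AgentAction, AgentFinish]:
        if "Final Answer:" in text:
            answer = text.split("Final Answer:")[-1].strip()
            return AgentFinish({"output": answer}, text)
        # Otherwise fall back to treating the whole text as the answer.
        return AgentFinish({"output": text.strip()}, text)

Such a parser can then be plugged into the agent (for example through initialize_agent’s agent_kwargs), but in practice this kind of workaround stays fragile for exactly the reasons described above.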

Conclusion

After a few days of experimenting with various open source models, I found myself unable to create a simple and concise software application that effectively utilized LangChain Agents and Tools in conjunction with an open source large language model.

The most vulnerable aspect of this process turned out to be the prompt template. It became evident that there is a pressing need for a standardized prompt template that developers can readily employ across different language models. Presently, models are published with varying, model-specific prompt keywords, forcing developers to adapt their prompts significantly when switching between LLMs. This is particularly evident when trying to transition LangChain examples from OpenAI GPT models to other open source LLMs; such a shift requires considerable effort and modification of the prompts.
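To illustrate the point: Llama-2-chat models are trained with their own [INST] and <<SYS>> markers, which a prompt written for an OpenAI chat model simply does not contain. A minimal sketch of a Llama-2-specific LangChain PromptTemplate (the template string follows Meta’s published chat format) could look like this:

from langchain.prompts import PromptTemplate

# Llama-2-chat expects its own special markers; other models expect different
# ones, so the same prompt rarely transfers unchanged between LLMs.
LLAMA2_CHAT_TEMPLATE = (
    "[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"
)

prompt = PromptTemplate(
    input_variables=["system_prompt", "user_prompt"],
    template=LLAMA2_CHAT_TEMPLATE,
)

print(prompt.format(
    system_prompt="You are a helpful assistant.",
    user_prompt="what time is it?",
))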

In conclusion, the success of large language models lies not only in the model’s capabilities but also in the ease of integration with developer tools and frameworks. Establishing a uniform prompt template would greatly enhance the efficiency and adaptability of these models, making them more accessible and user-friendly for developers across the board. As the field of language models continues to evolve, standardization efforts like this are crucial for advancing the technology and promoting widespread adoption in various applications.

Saverio Proto
Customer Experience Engineer @ Microsoft. Opinions and observations expressed in this blog post are my own.