One-Click Deployment of Llama2 and Other Open-Source LLMs Using Hugging Face Inference Endpoint

Moto DEI
6 min read · Jul 30, 2023


In my earlier articles, I walked through setting up a ChatGPT-like user interface on your personal computer, using the OpenAI API (GPT-3.5 and GPT-4) as well as Llama2.

In the second and third articles in particular, I showed that you can host Llama2 and run a chat UI on it using the GGML version of the model. If you’ve tried this, you may have noticed that the conversation is quite slow unless your local machine has ample memory and a capable GPU.

This brings us to the ideal solution: leveraging a cloud instance with superior specifications, such as more memory and GPU cores, to host the API service. Your local machine would then simply call this service, without having to run the LLM inference. This setup bears similarities to the OpenAI API, but with a crucial difference: you retain control over the instance and the model. This means you don’t need to send data externally, nor do you have to pay per API call. You only pay for the usage of the instance.

So, let’s delve into how we can accomplish this! In this article, we won’t be tinkering with the chat UI. Instead, we’ll concentrate solely on deploying Llama2.

Hugging Face Inference Endpoints

There are several cloud vendors, such as Azure and AWS, that officially support LLM hosting services. If you’re planning to fine-tune the model and load a custom LLM onto the instance, I highly recommend using these vendors due to their seamless integration with other services. However, if you’re simply looking to deploy the base model on a more powerful machine without the need for extensive instance preparation, Hugging Face Inference Endpoints is the most straightforward option.

To get started, open the Hugging Face site and navigate to Solutions -> Inference Endpoints.

Image by the author

Your window will then look like this.

Image by the author

Select the model you wish to deploy. If you can’t find your preferred model, it’s likely that support for that particular model hasn’t been implemented yet. Too bad :(

For this demonstration, I’ve chosen meta-llama/Llama-2-7b-chat-hf. This is the non-GGML version of the Llama2 7B chat model, which I can’t run locally due to insufficient memory on my laptop.

Below, you’ll find options to select your cloud vendor, region, instance type, and other advanced configurations. For now, you can leave these settings at their default values.

Image by the author

Also, take note that the platform provides an estimate of how much you’ll be charged for using the instance.

Once you’re ready, click the Create Endpoint button to start the deployment process. That’s all there is to it!

In my experience, it typically takes around 5 minutes for the deployment endpoint to be ready.

Test Endpoint on Hugging Face Site

If you scroll down to the bottom of the endpoint Overview page, you’ll find a section for testing the endpoint. Here, you can experiment with your own examples and observe the responsiveness of the Llama2 7B model.

Image by the author

Don’t worry if the LLM’s response appears to be cut short in this section. This is merely a test to ensure that the model is functioning correctly and providing relevant responses.

Use Deployed Endpoint

To utilize the endpoint you’ve just created, locate the endpoint URL on the Overview page and make a note of it. We’ll be using this URL in the later steps.

Image by the author

Next, you’ll need to obtain your Hugging Face access token. You can find this on your account page (under Settings -> Access Tokens). Make sure to note this down as well, as we’ll be using it shortly.

Image by the author

Next, navigate to your project repository. Within the .env file located in the project's home folder, input the environment variables you've just noted down.

HF_API_KEY=<YOUR HUGGING FACE API KEY>
# Something like https://xxxxxxx.us-east-n.aws.endpoints.huggingface.cloud
HF_API_LLAMA2_BASE=<YOUR ENDPOINT URL FROM HUGGING FACE>

For the purpose of this post, you won’t need many packages. Your requirements.txt file should look like this:

python-dotenv==1.0.0
text-generation==0.6.0
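
Install them with pip install -r requirements.txt before running the code below.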

The text-generation library offers a convenient interface for interacting with a text-generation-inference instance running on Hugging Face Inference Endpoints.

Now that we have everything set up, let’s interact with the endpoint using some actual code. For this, we’ll be using Jupyter Notebook. Here are the initial setup steps:

import os
from dotenv import load_dotenv, find_dotenv
from text_generation import Client

# Load HF_API_KEY and HF_API_LLAMA2_BASE from the .env file
_ = load_dotenv(find_dotenv())

Now, let’s test the following code:

# Point the client at the deployed endpoint, authenticating with the access token
client = Client(
    os.environ["HF_API_LLAMA2_BASE"],
    headers={"Authorization": f"Bearer {os.environ['HF_API_KEY']}"},
    timeout=120,
)

prompt = "What is Super Bowl?"
client.generate(prompt, max_new_tokens=1000).generated_text

Here’s the output from my test (it took less than a second to generate):

Image by the author

It appears to have been successful!
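
Incidentally, the text-generation client is essentially a thin wrapper around the endpoint’s HTTP interface, so you can also call the endpoint directly. Here is a rough sketch using the requests library (an extra dependency not listed in the requirements.txt above), assuming the usual text-generation-inference JSON payload:

import os
import requests

# POST the prompt as JSON, authenticating with the Hugging Face token
resp = requests.post(
    os.environ["HF_API_LLAMA2_BASE"],
    headers={
        "Authorization": f"Bearer {os.environ['HF_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={"inputs": "What is Super Bowl?",
          "parameters": {"max_new_tokens": 1000}},
    timeout=120,
)
print(resp.json())  # typically something like [{"generated_text": "..."}]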

There are several parameters within the client.generate() function that you can experiment with. If the model's response isn't quite to your liking, try adjusting these parameters.
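
For instance, a sketch like the following enables sampling and tunes a few of the commonly supported knobs (the exact parameter set depends on your text-generation client version, so treat these as illustrative):

# A more deliberately tuned request; lower temperature -> more deterministic output
response = client.generate(
    "What is Super Bowl?",
    max_new_tokens=500,       # cap on the length of the generated answer
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.7,          # randomness of the sampling
    top_p=0.9,                # nucleus-sampling cutoff
    repetition_penalty=1.1,   # discourage repeating the same phrases
)
print(response.generated_text)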

For Llama2, if you’re running more than one question-and-answer turn with the model, remember that the Llama2 chat model expects a specific prompt format, as I discussed in my previous post.

A conversation history should be reformatted into a single string, separated by markers such as <s>, <<SYS>>, and [INST]. Here's an example:

messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Reply your answer in markdown format."},
    {"role": "user", "content": "Who directed The Dark Knight?"},
    {"role": "assistant", "content": "The director of The Dark Knight is Christopher Nolan."},
    {"role": "user", "content": "What are the other movies he directed?"},
]
llama_v2_prompt(messages)
# '<s>[INST] <<SYS>>\nYou are a helpful AI assistant. Reply your answer in markdown format.\n<</SYS>>\n\nWho directed The Dark Knight? [/INST] The director of The Dark Knight is Christopher Nolan. </s><s>[INST] What are the other movies he directed? [/INST]'

Also, refer to my previous post to understand what the llama_v2_prompt() function entails.
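
If you just want to follow along here, below is a minimal sketch of such a helper; it reproduces the string shown above for this conversation, though the previous post has the fuller implementation and discussion:

def llama_v2_prompt(messages):
    """Flatten a list of chat messages into a single Llama-2-chat prompt string.

    Assumes the conversation ends with a user message awaiting an answer.
    """
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    # Fold the system message into the first user message
    if messages[0]["role"] == "system":
        messages = [{
            "role": messages[1]["role"],
            "content": B_SYS + messages[0]["content"] + E_SYS + messages[1]["content"],
        }] + messages[2:]

    # Wrap each completed user/assistant exchange in <s>[INST] ... [/INST] ... </s>
    prompt = "".join(
        f"<s>{B_INST} {user['content']} {E_INST} {answer['content']} </s>"
        for user, answer in zip(messages[::2], messages[1::2])
    )

    # The final user message is left open for the model to answer
    prompt += f"<s>{B_INST} {messages[-1]['content']} {E_INST}"
    return prompt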

For some reason, this prompt format isn’t clearly documented anywhere on Meta’s official pages; it only appears in a Hugging Face blog post about the Llama2 release (I don’t get why!). As far as I know, that blog post is the most complete description of the format.

Beyond that, Meta’s Llama2 page on the Hugging Face Hub only briefly mentions the format. The model card for TheBloke/Llama-2-7B-Chat-GGML is somewhat easier to follow (see the “Prompt template: Llama-2-Chat” section), and there is also a Reddit post by the “Chief Llama Officer at Hugging Face”.

Now, if I use this long string with the new endpoint and display it in markdown format, it will appear as follows:

from IPython.display import display, Markdown

prompt = '<s>[INST] <<SYS>>\nYou are a helpful AI assistant. Reply your answer in markdown format. Answer concisely and don\'t make things up. If you don\'t know the answer, just say I don\'t know. \n<</SYS>>\n\nWho directed The Dark Knight? [/INST] The director of The Dark Knight is Christopher Nolan. </s><s>[INST] What are the other movies he directed? [/INST]'

# Render the model's markdown answer in the notebook
display(Markdown(client.generate(
    prompt,
    temperature=0.01,
    max_new_tokens=1000).generated_text))

Image by the author

Without a doubt, it’s working exceptionally well!

So, we’ve successfully deployed the non-GGML Llama2 model on a cloud instance and are able to interact with it through API calls. Now we’re ready to delve into more advanced uses of the LLM!


Moto DEI

Principal Engineer/Data Scientist and Actuary with 20 yrs exp in media, marketing, insurance, and healthcare. https://www.linkedin.com/in/moto-dei-358abaa/