Implementing Locally-Hosted Llama2 Chat UI Using Streamlit

Moto DEI
8 min read · Jul 23, 2023


In this blog post, we’re going to build upon the chat feature I introduced in my previous article. In that piece, I utilized the OpenAI API to enable the chat to respond to my queries. However, this approach had its drawbacks, including the costs associated with API calls and the necessity for data exchange over the internet between your environment and OpenAI API endpoints.

Today, we’re going to switch gears and leverage the newly open-sourced LLM, Llama2, as our foundational model. By hosting the model locally and directing our chat queries to this local model, we can enjoy secure, cost-free chat conversations. So, let’s dive in!

The following image shows what the finished app will look like by the end of this post. It now includes a new LLM option, llama-2-7b-chat.ggmlv3.q2_K.

Screenshot of the final chat UI built in this post. Image by the author.


Setting the Stage: Essential Preparation

In this post we won't use the OpenAI API to test Llama2, but obtaining an OpenAI API key and putting it in the .env file is still essential preparation to avoid a possible "path does not exist" error. Please follow the instructions in my previous post if you haven't done so yet.

There are a couple of additional preparations for the setup:

Download a Llama2 model to your local environment

First things first, we need to download a Llama2 model to our local machine. You can find these models readily available in a Hugging Face repository. Given that we’re working with a local PC, it’s advisable to opt for the GGML version rather than the full version. But you might be wondering, what exactly is GGML?

GGML is a machine learning library designed to handle large models and deliver high performance on standard hardware. It uses a quantized representation of model weights, which essentially means it uses ‘approximated’ parameters as opposed to the full version, resulting in slightly less accuracy. However, the trade-off is a model that requires 4x less RAM, 4x less RAM bandwidth, and offers faster inference on the CPU. This makes GGML an ideal starting point for most local machines, particularly those not equipped with GPUs for machine learning or with limited RAM. Once you’ve got your setup ready, you can consider transitioning to a server-hosted Llama2. Many vendors, including AWS, have already released their support for Llama2.

Llama2 comes in various flavors, differentiated by the number of parameters (7 billion, 13 billion, or 70 billion) or by the tuning target (such as a plain version or one optimized for chat conversations). Given the constraints of my local PC, I've chosen the llama-2-7b-chat.ggmlv3.q2_K.bin model, which you can download here. This model is the most resource-efficient member of the Llama2 family, requiring the least RAM and disk space.

Run the following wget command on your terminal:

wget https://huggingface.co/localmodels/Llama-2-7B-Chat-ggml/resolve/main/llama-2-7b-chat.ggmlv3.q2_K.bin

Then move the downloaded file into the ./models/ folder of your project repo (for example, mkdir -p models && mv llama-2-7b-chat.ggmlv3.q2_K.bin models/).

Pip install llama-cpp-python

llama.cpp is the library we need to run Llama2 models, and llama-cpp-python provides its Python bindings, which we can install with pip. Alongside the libraries we discussed in the previous post, our complete requirements.txt file should now look as follows:

langchain==0.0.234
openai==0.27.8
python-dotenv==1.0.0
streamlit==1.24.1
llama-cpp-python==0.1.65
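
Before wiring everything into Streamlit, you can optionally confirm that llama-cpp-python and the downloaded GGML file work together. The short script below is a minimal sanity check of my own (the file path assumes you placed the model under ./models/ as described above; the prompt and max_tokens values are arbitrary):

# sanity_check.py (optional)
from llama_cpp import Llama

# Load the quantized GGML model from the local ./models/ folder
llm = Llama(model_path="./models/llama-2-7b-chat.ggmlv3.q2_K.bin")

# Generate a short completion just to confirm the model loads and responds
output = llm("Q: Who directed The Dark Knight? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])

If the model loads and prints a short answer, the setup is ready for the Streamlit app.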

Main Code

Now, our complete code in app.py looks like this:

# app.py
from typing import List, Union

from dotenv import load_dotenv, find_dotenv
from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
import streamlit as st


def init_page() -> None:
    st.set_page_config(
        page_title="Personal ChatGPT"
    )
    st.header("Personal ChatGPT")
    st.sidebar.title("Options")


def init_messages() -> None:
    clear_button = st.sidebar.button("Clear Conversation", key="clear")
    if clear_button or "messages" not in st.session_state:
        st.session_state.messages = [
            SystemMessage(
                content="You are a helpful AI assistant. Reply your answer in markdown format.")
        ]
        st.session_state.costs = []


def select_llm() -> Union[ChatOpenAI, LlamaCpp]:
    model_name = st.sidebar.radio("Choose LLM:",
                                  ("gpt-3.5-turbo-0613", "gpt-4",
                                   "llama-2-7b-chat.ggmlv3.q2_K"))
    temperature = st.sidebar.slider("Temperature:", min_value=0.0,
                                    max_value=1.0, value=0.0, step=0.01)
    if model_name.startswith("gpt-"):
        return ChatOpenAI(temperature=temperature, model_name=model_name)
    elif model_name.startswith("llama-2-"):
        callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
        return LlamaCpp(
            model_path=f"./models/{model_name}.bin",
            input={"temperature": temperature,
                   "max_length": 2000,
                   "top_p": 1
                   },
            callback_manager=callback_manager,
            verbose=False,  # True
        )


def get_answer(llm, messages) -> tuple[str, float]:
    if isinstance(llm, ChatOpenAI):
        with get_openai_callback() as cb:
            answer = llm(messages)
        return answer.content, cb.total_cost
    if isinstance(llm, LlamaCpp):
        return llm(llama_v2_prompt(convert_langchainschema_to_dict(messages))), 0.0


def find_role(message: Union[SystemMessage, HumanMessage, AIMessage]) -> str:
    """
    Identify role name from langchain.schema object.
    """
    if isinstance(message, SystemMessage):
        return "system"
    if isinstance(message, HumanMessage):
        return "user"
    if isinstance(message, AIMessage):
        return "assistant"
    raise TypeError("Unknown message type.")


def convert_langchainschema_to_dict(
        messages: List[Union[SystemMessage, HumanMessage, AIMessage]]) \
        -> List[dict]:
    """
    Convert the chain of chat messages in list of langchain.schema format to
    list of dictionary format.
    """
    return [{"role": find_role(message),
             "content": message.content
             } for message in messages]


def llama_v2_prompt(messages: List[dict]) -> str:
    """
    Convert the messages in list of dictionary format to Llama2 compliant format.
    """
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    BOS, EOS = "<s>", "</s>"
    DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

    if messages[0]["role"] != "system":
        messages = [
            {
                "role": "system",
                "content": DEFAULT_SYSTEM_PROMPT,
            }
        ] + messages
    messages = [
        {
            "role": messages[1]["role"],
            "content": B_SYS + messages[0]["content"] + E_SYS + messages[1]["content"],
        }
    ] + messages[2:]

    messages_list = [
        f"{BOS}{B_INST} {(prompt['content']).strip()} {E_INST} {(answer['content']).strip()} {EOS}"
        for prompt, answer in zip(messages[::2], messages[1::2])
    ]
    messages_list.append(
        f"{BOS}{B_INST} {(messages[-1]['content']).strip()} {E_INST}")

    return "".join(messages_list)


def main() -> None:
    _ = load_dotenv(find_dotenv())

    init_page()
    llm = select_llm()
    init_messages()

    # Supervise user input
    if user_input := st.chat_input("Input your question!"):
        st.session_state.messages.append(HumanMessage(content=user_input))
        with st.spinner("ChatGPT is typing ..."):
            answer, cost = get_answer(llm, st.session_state.messages)
        st.session_state.messages.append(AIMessage(content=answer))
        st.session_state.costs.append(cost)

    # Display chat history
    messages = st.session_state.get("messages", [])
    for message in messages:
        if isinstance(message, AIMessage):
            with st.chat_message("assistant"):
                st.markdown(message.content)
        elif isinstance(message, HumanMessage):
            with st.chat_message("user"):
                st.markdown(message.content)

    costs = st.session_state.get("costs", [])
    st.sidebar.markdown("## Costs")
    st.sidebar.markdown(f"**Total cost: ${sum(costs):.5f}**")
    for cost in costs:
        st.sidebar.markdown(f"- ${cost:.5f}")


# streamlit run app.py
if __name__ == "__main__":
    main()

About the function llama_v2_prompt()

One important aspect of our code to note is the function llama_v2_prompt(). Unlike OpenAI's GPT APIs, Llama2 doesn't accept the conversation history as a list of system/user/assistant dictionaries in the following manner:

messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Reply your answer in markdown format."},
    {"role": "user", "content": "Who directed The Dark Knight?"},
    {"role": "assistant", "content": "The director of The Dark Knight is Christopher Nolan."},
    {"role": "user", "content": "What are the other movies he directed?"}
]

Instead, it requires the whole conversation to be formatted into a single string, with each participant's turns separated by markers such as <s>, <<SYS>>, and [INST]. See this blog post about the Llama2 release for further explanation. The function llama_v2_prompt() is designed to convert the dictionary list shown above into a string that's compatible with Llama2. Here's an example:

messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Reply your answer in markdown format."},
    {"role": "user", "content": "Who directed The Dark Knight?"},
    {"role": "assistant", "content": "The director of The Dark Knight is Christopher Nolan."},
    {"role": "user", "content": "What are the other movies he directed?"}
]
llama_v2_prompt(messages)
# '<s>[INST] <<SYS>>\nYou are a helpful AI assistant. Reply your answer in markdown format.\n<</SYS>>\n\nWho directed The Dark Knight? [/INST] The director of The Dark Knight is Christopher Nolan. </s><s>[INST] What are the other movies he directed? [/INST]'
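
The function also covers the very first turn of a conversation, where the history contains only the system message and one user question. With an illustrative message list of my own, the resulting string looks like this:

llama_v2_prompt([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Who directed The Dark Knight?"}
])
# '<s>[INST] <<SYS>>\nYou are a helpful AI assistant.\n<</SYS>>\n\nWho directed The Dark Knight? [/INST]'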

I owe a great deal to a discussion in this Hugging Face forum post, which helped me develop this function.

Launching the Streamlit Chat UI

We’re now ready to launch the Streamlit chat UI as demonstrated at the beginning of this post. Simply execute the following command, and voila! You’ll have your chat UI up and running on your localhost.

streamlit run app.py

It’s important to remember that we’re intentionally using a less accurate variant of Llama2 to facilitate chat services on your local PC. As a result, the chat accuracy might not be as high as that of ChatGPT, and the speed may also be somewhat slower. However, once we transition to running the model on a dedicated instance (most likely on a cloud service) with more robust specifications, these limitations should be mitigated. In other words, with the right setup, we can achieve a level of performance that’s more than satisfactory.
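
If you do move to a machine with more RAM or a GPU, langchain's LlamaCpp wrapper also lets you pass runtime and sampling parameters directly as constructor arguments. The sketch below is only an illustration of that idea, not part of the app above: the 13B q4_K_M file name is hypothetical (use whichever GGML variant you actually download), and n_gpu_layers only has an effect if your llama-cpp-python build supports GPU offloading.

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/llama-2-13b-chat.ggmlv3.q4_K_M.bin",  # hypothetical larger, less-quantized variant
    temperature=0.0,
    n_ctx=2048,        # context window size; Llama2 supports up to 4,096 tokens
    n_gpu_layers=40,   # number of layers to offload to the GPU, if available
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=False,
)

Swapping such an instance into select_llm() is all it takes for the rest of the app to keep working, since get_answer() only checks the LlamaCpp type.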

Looking Toward the Next Steps

There are several intriguing possibilities I’m eager to explore:

  • First, I plan to deploy this setup on a cloud instance with more robust specifications, thereby improving the quality and speed of the chat service. In addition to the official support from various cloud vendors, Hugging Face Inference Endpoints offer another promising avenue to explore. AWS also offers SageMaker JumpStart for one-click deployment and Amazon Bedrock as a fully managed LLM host.
Screenshot from Hugging Face Inference Endpoints. Image by the author.

I've actually attempted this once before writing this post, but encountered some issues launching the instance with Llama2 models. I intend to delve deeper into this in the future.

  • As I mentioned in my previous post, I’m also considering augmenting the model’s ability to reference external documents when responding to questions. This feature would allow users to engage in conversations based on the information contained in the referenced documents, which could include anything from text files and CSVs, to PDFs and even YouTube videos.
  • The importance of LLM fine-tuning cannot be overstated. It's a crucial step in tailoring the model to specific tasks or domains, improving its response accuracy and therefore its overall effectiveness and utility. OpenAI has a nice page showing how we can fine-tune some of their models.
  • Lastly, as discussed in the previous post, ChatGPT’s Code Interpreter feature has been a remarkable addition to the chat functionality. A potential extension of my Streamlit chat could merge these features with an open-source LLM to emulate the Code Interpreter. This would enable users to run confidential data through the LLM for analysis, enhancing the utility and versatility of the chat interface.
