Building an LLM Chatbot on a Mac M1
This is a tutorial on how to run a generative model as a local server using Llama 2 and, **GASP**, CPU-like resources.
This tutorial is for students who want to locally query a large language model (LLM). For learning how to deploy in the cloud, I recommend other resources such as this AWS tutorial, and for a more polished local implementation, LocalAI. Even Reddit has some good resources for building locally. The purpose of this tutorial is to let you implement a chatbot in Python on a Mac or Linux system. All the code associated with this post is available on GitHub.
Installing requirements
The requirements for running this on an M1 are partly captured in the GitHub requirements.txt file, which can be used to build an Anaconda environment. If you do not have Anaconda, find it here. Download the GitHub folder and build the chatbot-llm environment with the following commands:
conda create -n chatbot-llm --file requirements.txt python=3.10
conda activate chatbot-llm
Next, we need to install some other packages using pip that are not available via conda. In addition, for the LLM to work on a Mac or Linux system, we must set the CMake arguments when installing llama-cpp-python. Note that CMAKE_ARGS must be set on the same command line as pip (or exported beforehand), or it will not affect the build:
# Linux and Mac
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
pip install sse_starlette
pip install starlette_context
pip install pydantic_settings
Downloading and activating the LLAMA-2 model
Now it is time to download the model. For this example, we are using a relatively small LLM (only?!?! about 4.78 GB). You can download the model from Hugging Face.
mkdir -p models/7B
wget -O models/7B/llama-2-7b-chat.Q5_K_M.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
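Downloads of this size occasionally get truncated. As a quick sanity check, a GGUF file begins with the 4-byte magic string `GGUF`; the small helper below (my own sketch, not part of the repo) verifies that the downloaded file at least looks like a valid model:

```python
from pathlib import Path

def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the 4-byte GGUF magic string."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

if __name__ == "__main__":
    model = "models/7B/llama-2-7b-chat.Q5_K_M.gguf"
    if Path(model).exists():
        print("valid GGUF:", looks_like_gguf(model))
```

If the check prints False, the download was corrupted or incomplete and you should re-run the wget command.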
Once the model has been downloaded and the packages installed, we are ready to run the LLM locally. We begin by calling llama_cpp.server with the downloaded Llama 2 model. This combination is analogous to ChatGPT (the server) and GPT-4 (the model).
python3 -m llama_cpp.server --model models/7B/llama-2-7b-chat.Q5_K_M.gguf
Querying the model
This will start a server on localhost:8000 that we can query in the next step. The server and model are now ready for user input. We query them using query.py with our question of choice. To begin querying, open a new terminal tab and activate the conda environment again.
conda activate chatbot-llm
In the current query.py file, the content field within the messages list is what you can change to get a different response from the model. The max_tokens parameter lets you adjust the maximum length of the LLM's response. **Note:** if max_tokens is smaller than the projected response, the text may be cut off mid-sentence. Our prompt is as follows:
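The actual query.py lives in the GitHub repo; as a rough sketch of what such a script does, llama_cpp.server exposes an OpenAI-compatible REST API, so a minimal client (hypothetical names, assuming the server from the previous step is running on localhost:8000) might look like this:

```python
import json
import urllib.request

def build_chat_payload(content: str, max_tokens: int = 500) -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }

def query_local_llm(content: str, max_tokens: int = 500,
                    url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the prompt to the local llama_cpp.server and return the reply text."""
    body = json.dumps(build_chat_payload(content, max_tokens)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(query_local_llm(
        "Tell me about the starter Pokémon from the first generation of games."))
```

Editing the string passed to query_local_llm (the content of the messages list) changes the prompt, and max_tokens caps the response length, just as described above.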
“Tell me about the starter Pokémon from the first generation of games.”
To run the query against the model, we call the query script.
export MODEL="models/7B/llama-2-7b-chat.Q5_K_M.gguf"
python query.py
After running the query script, there is a pause that can be substantial depending on your question. In our case, the model took almost 3 MINUTES?!?! (179.966 s) to respond. That seems like a long time, and it is compared to running models online, but all the computation is performed locally on the available hardware. Limited memory, CPU processing speeds, and the lack of other optimizations make this process much longer. Even though it takes a while, here is the output with max_tokens = 500:
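If you want to measure response time on your own hardware, a tiny helper (my own addition, not part of the original scripts) can wrap any call:

```python
import time
from typing import Any, Callable, Tuple

def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Example: wrap whatever function issues the request, e.g.
# answer, seconds = timed(lambda: run_query())  # run_query is hypothetical
```

time.perf_counter is preferable to time.time here because it is a monotonic, high-resolution clock intended for interval measurement.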
“Tell me about the starter Pokémon from the first generation of games.”
Of course! The first generation of Pokémon games, also known as Generation I, includes the following starter Pokémon:
1. Bulbasaur (Grass/Poison-type) — A plant-like Pokémon with a green and brown body, Bulbasaur is known for its ability to photosynthesize and use its vines to attack its opponents.
2. Charmander (Fire-type) — A lizard-like Pokémon with a orange and yellow body, Charmander is known for its fiery personality and its ability to breathe fire.
3. Squirtle (Water-type) — A turtle-like Pokémon with a blue and red body, Squirtle is known for its speed and agility in the water, as well as its ability to shoot powerful water jets.
Each of these starter Pokémon has unique abilities and characteristics that make them well-suited to different battle strategies and playstyles. Which one would you like to know more about?
This response is really detailed given the bluntness of the query, and an exciting demonstration of the power of LLMs. I would not recommend running these models with serial processing (CPUs, or the CPU-like cores on an M1) due to the time it takes to complete a response. If available, run local models on a GPU to speed up processing, or just be like me and use ChatGPT from OpenAI.
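As an aside, on Apple Silicon llama.cpp can also target the on-chip GPU via Metal. At the time of writing, reinstalling llama-cpp-python with the Metal CMake flag looks roughly like this (flag names may change between llama.cpp versions, so check the current docs):

```shell
# macOS (Apple Silicon) only; flag names may differ across llama.cpp versions
CMAKE_ARGS="-DLLAMA_METAL=on" \
  pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
```

The server can then offload model layers to the GPU via its n_gpu_layers option, which typically cuts response times dramatically compared to the CPU-only run above.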
Recap and acknowledgments
In this demonstration, we installed an LLM server (llama_cpp.server) and a model (Llama 2) locally on a Mac and deployed our very own local LLM. We then queried the server/model and adjusted the size of the response. Congratulations, you have built your very own LLM chatbot! The inspiration for this work and some of the code building blocks are derived from Youness Mansar. Feel free to use or share the code, which is available on GitHub. I can be found on LinkedIn. Be sure to check out some of my other articles for projects spanning a wide range of data science and machine learning topics.