Use Llama2 as an MLflow Model
This article outlines a method for registering Meta’s Llama2 model as an MLflow model. We can then use the registered MLflow model for two tasks — chat and embedding generation.
Doing so brings the benefits of MLflow model management: sharing models with colleagues, authorizing that sharing, and managing the model lifecycle, for example by moving model versions through the Staging, Production, and Archived states.
The model and software packages that I am going to use for this purpose are:
- Llama 2 model and free license from Meta
- The llama.cpp open source project and its Python bindings
- A custom MLflow pyfunc model that I developed for this purpose
The first half of this article is an explanation of these three components, and the second half is step by step instructions for implementing this design.
Meta’s Llama2 Model, llama.cpp and MLflow
Meta’s Llama2 model is a high quality LLM with a free license that permits commercial use. The two main restrictions are that products with more than 700 million monthly active users require a separate license from Meta, and that the output of Llama2 inference may not be used to train other models.
llama.cpp is a nice piece of software useful for LLM inference projects. Quoting from llama.cpp’s website: ‘The main goal of llama.cpp is to run the LLaMA model using 4-bit integer quantization on a MacBook’.
The combination of Meta’s Llama2 models and llama.cpp is very useful. You can run something similar to ChatGPT on your own laptop or desktop. So, where does MLflow fit in the picture?
MLflow includes the industry’s most popular model management tools. Models can be registered with MLflow and then shared among users. Enterprise-quality MLflow offerings such as Databricks and InfinStor also include functionality for ‘authorization’ of MLflow models: users and user groups can be configured with read, write, or admin privileges for MLflow models. MLflow model lifecycle management is a key part of MLflow; administrators can mark specific model versions as ‘Production’, ‘Staging’, or ‘Archived’. Finally, MLflow can be used to track finetuned versions of LLMs, along with all the details of the training process used to perform the finetuning.
MLflow Custom pyfunc model
The mlflow.pyfunc module defines utilities for creating custom pyfunc models using frameworks and inference logic that may not be natively included in MLflow. See Creating custom Pyfunc models.
MLflow’s custom pyfunc model is a well-thought-out interface that makes it possible for us to add support for Llama2/llama.cpp.
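To make the interface concrete, a custom pyfunc model is essentially a class that implements load_context and predict. The sketch below mimics that shape in plain Python so it can be read at a glance; in the actual logmodel code the class subclasses mlflow.pyfunc.PythonModel, and the predict body shown here is a hypothetical stand-in, not the real llama.cpp inference logic.

```python
# Minimal sketch of the custom pyfunc interface that logmodel implements.
# In the real code the class subclasses mlflow.pyfunc.PythonModel; the
# method names below follow that interface.

class Llama2ChatModel:
    def load_context(self, context):
        # MLflow calls this once at load time. context.artifacts holds
        # the path to the gguf model file, which the real loader would
        # pass to llama_cpp.Llama(model_path=...).
        self.model_path = context.artifacts.get("model")

    def predict(self, context, model_input):
        # model_input carries rows with 'role' and 'message' columns,
        # per the input schema discussed below. This stub just echoes
        # the last message instead of running inference.
        last = model_input[-1]
        return f"assistant> (stub reply to: {last['message']})"

# A tiny stand-in for MLflow's context object, for illustration only.
class StubContext:
    artifacts = {"model": "ggml-model-q8_0.gguf"}

model = Llama2ChatModel()
model.load_context(StubContext())
reply = model.predict(StubContext(), [{"role": "user", "message": "Hi"}])
```

At serving time, MLflow constructs the real context from the logged artifacts and routes caller input into predict.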
Source Code
All of the source code for this project is in the github project logmodel: https://github.com/jagane-infinstor/logmodel.git
Model Input Schema for the Chat task
If you look at logmodel/llama2-gguf/log.py in the above source tree, the input schema that we use for the Chat task is as follows:
input_schema = Schema([ColSpec(DataType.string, "role", False), ColSpec(DataType.string, "message", False)])
The two columns we send as input to the model are role and message. The acceptable values for role are system, user, and assistant. These are the Llama2 prompt entities, as explained in the Hugging Face blog on Llama2.
For best results, the prompt must be constructed in the same format that was used when the model was trained. The source code in logmodel/llama2-gguf/customloader/chatloader.py builds the prompt in the format described above and sends it to the model.
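As a rough illustration of that format, here is a simplified sketch of a prompt builder that consumes role/message rows like the schema above. The details in chatloader.py may differ; this follows the [INST]/<<SYS>> layout described in the Hugging Face blog.

```python
# Simplified sketch of the Llama 2 chat prompt format described in the
# Hugging Face blog. The real logic lives in
# logmodel/llama2-gguf/customloader/chatloader.py and may differ in detail.

def build_prompt(messages):
    """messages: list of {'role': ..., 'message': ...} dicts matching the
    role/message input schema. Roles: system, user, assistant."""
    system = ""
    if messages and messages[0]["role"] == "system":
        # The system prompt is folded into the first user turn.
        system = f"<<SYS>>\n{messages[0]['message']}\n<</SYS>>\n\n"
        messages = messages[1:]
    prompt = ""
    for i, m in enumerate(messages):
        if m["role"] == "user":
            text = (system + m["message"]) if i == 0 else m["message"]
            prompt += f"<s>[INST] {text} [/INST]"
        elif m["role"] == "assistant":
            prompt += f" {m['message']} </s>"
    return prompt

p = build_prompt([
    {"role": "system", "message": "You are a helpful assistant."},
    {"role": "user", "message": "What is the capital of California?"},
])
```

A prompt built this way ends with an open [/INST], cueing the model to generate the assistant turn.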
Step by Step Instructions
Here are step by step instructions for using this model and software.
Step 1: Create conda env
First, let’s create a conda environment suitable for llama-cpp-python, the python bindings package for llama.cpp. The following are the commands that I used to create a conda environment called llamacpp
conda create -n llamacpp python=3.9
conda activate llamacpp
pip install transformers[torch]
pip install "mlflow>=2.6.0" numpy scipy pandas scikit-learn cloudpickle sentencepiece infinstor_mlflow_plugin
FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Note that I am not compiling the llama.cpp package with Nvidia GPU support; the commands above build it for CPU-only inference.
Step 2: Llama2 License
Llama2 is available for commercial use, but you must obtain a free license from Meta. Go to Meta’s Llama2 website and sign up for access to the Llama2 models.
Step 3: Download HF Llama2 model
Now, download the huggingface Llama2 model:
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Note that you must log in with your huggingface username and access token to complete the above command, and that this same huggingface account must be authorized to download Llama2 models.
Step 4: Convert Llama2 to gguf format
llama.cpp requires models to be in a specific format called the gguf format. Check out the llama.cpp source code from github:
git clone https://github.com/ggerganov/llama.cpp.git
Next, use the convert.py utility included in llama.cpp to convert the downloaded model to gguf format
(cd Llama-2-7b-chat-hf; python ../llama.cpp/convert.py --outtype q8_0 .)
The output should look similar to the following:
-rw-rw-r-- 1 jagane jagane 7161089696 Sep 10 21:53 ggml-model-q8_0.gguf
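A quick sanity check on the converted file is to look at its magic bytes: per the GGUF specification, a valid file begins with the four ASCII bytes ‘GGUF’. A small sketch (the path in the comment is the hypothetical output of the conversion step above):

```python
# Sanity-check a converted model file: GGUF files begin with the
# four ASCII magic bytes b"GGUF".

def is_gguf(path):
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example, using the output path from the conversion step:
# is_gguf("Llama-2-7b-chat-hf/ggml-model-q8_0.gguf")
```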
Step 5: Log Model to MLflow for Chat
I have developed an MLflow custom pyfunc model for the llama.cpp Llama2 model. This custom pyfunc model supports two tasks:
- Chat
- Embedding generation
This software, called logmodel, is Apache licensed and available on github. We now use logmodel to log the Meta Llama2 model, in llama.cpp gguf format, as an MLflow model. The following example is for the chat task. Note that the model file is large, so the second command below may take some time.
git clone https://github.com/jagane-infinstor/logmodel.git
(cd logmodel/llama2-gguf; python log.py --data_path ../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf --task chat)
Step 6: Register Model
Login to the MLflow web ui and choose the experiment/run created by the above command.
Now, press the ‘Register Model’ button and register a new model named llama2-gguf-chat.
Step 7: Test the logged Chat model
The program chat.py included in the logmodel github tree is useful for testing the logged model:
python chat.py --model models:/llama2-gguf-chat/1
The output will be something like the following:
> What is the capital of California?
llama_print_timings: load time = 3552.53 ms
llama_print_timings: sample time = 51.67 ms / 78 runs ( 0.66 ms per token, 1509.58 tokens per second)
llama_print_timings: prompt eval time = 3552.49 ms / 94 tokens ( 37.79 ms per token, 26.46 tokens per second)
llama_print_timings: eval time = 20680.57 ms / 77 runs ( 268.58 ms per token, 3.72 tokens per second)
llama_print_timings: total time = 24444.20 ms
assistant> Thank you for asking! The capital of California is Sacramento. It is located in the northern part of the state, along the Sacramento River. Sacramento has a rich history and culture, and it is home to many important government buildings and institutions, including the California State Capitol. I hope that helps! Let me know if you have any other questions.
> How far is it from San Francisco?
Llama.generate: prefix-match hit
llama_print_timings: load time = 3552.53 ms
llama_print_timings: sample time = 63.20 ms / 101 runs ( 0.63 ms per token, 1598.08 tokens per second)
llama_print_timings: prompt eval time = 3425.73 ms / 96 tokens ( 35.68 ms per token, 28.02 tokens per second)
llama_print_timings: eval time = 25718.24 ms / 100 runs ( 257.18 ms per token, 3.89 tokens per second)
llama_print_timings: total time = 29382.87 ms
assistant> Great question! Sacramento is located approximately 150 miles (241 kilometers) northeast of San Francisco. The drive from San Francisco to Sacramento typically takes about 2-3 hours, depending on traffic and the route you take. There are also public transportation options available, such as buses and trains, which can take a bit longer but offer a convenient alternative to driving. I hope that helps! Let me know if you have any other questions.
>
Step 8: Log Model to MLflow for Embedding Generation
Next, we use logmodel to log the Meta Llama2 model in llama.cpp gguf format for the embedding generation task.
git clone https://github.com/jagane-infinstor/logmodel.git
(cd logmodel/llama2-gguf; python log.py --data_path ../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf --task embedding-generation)
The above command causes the llama.cpp gguf format model to be logged as an MLflow model for the embedding generation task. You can now go to the MLflow GUI and select the specific experiment/run. The model will be displayed in the artifacts pane for that run. Go ahead and register the model, for example as llama2-gguf-embedding.
Step 9: Test the logged Embeddings model
The program embeddings.py included in the logmodel github tree is useful for testing the logged model:
python embeddings.py --model models:/llama2-gguf-embedding/1
Now, when you type in a sentence at the > prompt, the program will print out the embeddings generated by the model.
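Once you have embeddings, a common next step is comparing them with cosine similarity. The sketch below is plain Python and assumes each embedding is a flat list of floats, which is the shape that an embedding-generation model typically returns:

```python
import math

# Compare two embedding vectors with cosine similarity.
# Assumes each embedding is a flat list of floats, as produced by the
# embedding-generation model logged above.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

With real embeddings from the model, sentences with similar meaning should score closer to 1.0 than unrelated ones.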
That’s all folks!