Use Llama2 as an MLflow Model
This article outlines a method for registering Meta’s Llama2 model as an MLflow model. We can then use the registered MLflow model for two tasks — chat and embedding generation.
Doing so brings the benefits of MLflow model management: sharing models with colleagues, authorizing that sharing, and managing the model lifecycle, for example by moving model versions through the Staging, Production, and Archived states.
The model and software packages that I am going to use for this purpose are:
- Llama 2 model and free license from Meta
- The llama.cpp open source project and its Python bindings
- A custom MLflow pyfunc model that I developed for this purpose
The first half of this article is an explanation of these three components, and the second half is step by step instructions for implementing this design.
Meta’s Llama2 Model, llama.cpp and MLflow
Meta’s Llama2 model is a high quality LLM with a free license that permits commercial use. The two main restrictions are that products with more than 700 million monthly active users require a separate license from Meta, and that the output of Llama2 inference may not be used to train other models.
llama.cpp is a nice piece of software useful for LLM inference projects. Quoting from llama.cpp’s website: ‘The main goal of llama.cpp is to run the LLaMA model using 4-bit integer quantization on a MacBook’.
The combination of Meta’s Llama2 models and llama.cpp is very useful. You can run something similar to ChatGPT on your own laptop or desktop. So, where does MLflow fit in the picture?
MLflow includes the industry’s most popular model management tools. Models can be registered with MLflow and then shared among users. Enterprise-quality MLflow offerings such as Databricks and InfinStor also include functionality for ‘authorization’ of MLflow models: users and user groups can be configured with read, write, or admin privileges for MLflow models. MLflow model lifecycle management is a key part of MLflow; administrators can mark specific model versions as ‘Production’, ‘Staging’, or ‘Archived’. Finally, MLflow can be used to track finetuned versions of LLMs, along with all the details of the training process used to perform the finetuning.
MLflow Custom pyfunc model
The mlflow.pyfunc module defines utilities for creating custom pyfunc models using frameworks and inference logic that may not be natively included in MLflow. See Creating custom Pyfunc models.
MLflow’s custom pyfunc model is a well-thought-out interface that makes it possible for us to add support for Llama2/llama.cpp.
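To make the interface concrete, a custom pyfunc model is essentially a class that implements load_context and predict. The sketch below mimics that shape in plain Python so it can be read at a glance; in the actual logmodel code the class subclasses mlflow.pyfunc.PythonModel, and the predict body shown here is a hypothetical stand-in, not the real llama.cpp inference logic.

```python
# Minimal sketch of the custom pyfunc interface that logmodel implements.
# In the real code the class subclasses mlflow.pyfunc.PythonModel; the
# method names below follow that interface.

class Llama2ChatModel:
    def load_context(self, context):
        # MLflow calls this once at load time. context.artifacts holds
        # the path to the gguf model file, which the real loader would
        # pass to llama_cpp.Llama(model_path=...).
        self.model_path = context.artifacts.get("model")

    def predict(self, context, model_input):
        # model_input carries rows with 'role' and 'message' columns,
        # per the input schema discussed below. This stub just echoes
        # the last message instead of running inference.
        last = model_input[-1]
        return f"assistant> (stub reply to: {last['message']})"

# A tiny stand-in for MLflow's context object, for illustration only.
class StubContext:
    artifacts = {"model": "ggml-model-q8_0.gguf"}

model = Llama2ChatModel()
model.load_context(StubContext())
reply = model.predict(StubContext(), [{"role": "user", "message": "Hi"}])
```

At serving time, MLflow constructs the real context from the logged artifacts and routes caller input into predict.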
Source Code
All of the source code for this project is in the github project logmodel: https://github.com/jagane-infinstor/logmodel.git
Model Input Schema for the Chat task
If you look at logmodel/llama2-gguf/log.py in the above source tree, the input schema that we use for the Chat task is as follows:
input_schema = Schema([ColSpec(DataType.string, "role", False), ColSpec(DataType.string, "message", False)])
The two columns we send as input to the model are role and message. The acceptable values for role are system, user, and assistant. These are the Llama2 prompt entities, as explained in the Hugging Face blog on Llama2.
For best results, the prompt must be constructed in the same format that was used when the model was trained. The source code in logmodel/llama2-gguf/customloader/chatloader.py builds the prompt in the format described above and sends it to the model.
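As a rough illustration of that format, here is a simplified sketch of a prompt builder that consumes role/message rows like the schema above. The details in chatloader.py may differ; this follows the [INST]/<<SYS>> layout described in the Hugging Face blog.

```python
# Simplified sketch of the Llama 2 chat prompt format described in the
# Hugging Face blog. The real logic lives in
# logmodel/llama2-gguf/customloader/chatloader.py and may differ in detail.

def build_prompt(messages):
    """messages: list of {'role': ..., 'message': ...} dicts matching the
    role/message input schema. Roles: system, user, assistant."""
    system = ""
    if messages and messages[0]["role"] == "system":
        # The system prompt is folded into the first user turn.
        system = f"<<SYS>>\n{messages[0]['message']}\n<</SYS>>\n\n"
        messages = messages[1:]
    prompt = ""
    for i, m in enumerate(messages):
        if m["role"] == "user":
            text = (system + m["message"]) if i == 0 else m["message"]
            prompt += f"<s>[INST] {text} [/INST]"
        elif m["role"] == "assistant":
            prompt += f" {m['message']} </s>"
    return prompt

p = build_prompt([
    {"role": "system", "message": "You are a helpful assistant."},
    {"role": "user", "message": "What is the capital of California?"},
])
```

A prompt built this way ends with an open [/INST], cueing the model to generate the assistant turn.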
Step by Step Instructions
Here are step by step instructions for using this model and software.
Step 1: Create conda env
First, let’s create a conda environment suitable for llama-cpp-python, the python bindings package for llama.cpp. The following are the commands that I used to create a conda environment called llamacpp
conda create -n llamacpp python=3.9
conda activate llamacpp
pip install transformers[torch]
pip install "mlflow>=2.6.0" numpy scipy pandas scikit-learn cloudpickle sentencepiece infinstor_mlflow_plugin
FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Note that I am not compiling the llama.cpp package with Nvidia GPU support; the commands above build it for CPU-only inference.
Step 2: Llama2 License
Llama2 is available for commercial use, but you must obtain a free license from Meta. Go to Meta’s Llama2 website and sign up for access to the Llama2 models.
Step 3: Download HF Llama2 model
Now, download the huggingface Llama2 model:
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Note that you must log in with your huggingface username and access token to complete the above command, and that this same huggingface account must be authorized to download Llama2 models.
Step 4: Convert Llama2 to gguf format
llama.cpp requires models to be in a specific format called the gguf format. Check out the llama.cpp source code from github:
git clone https://github.com/ggerganov/llama.cpp.git
Next, use the convert.py utility included in llama.cpp to convert the downloaded model to gguf format
(cd Llama-2-7b-chat-hf; python ../llama.cpp/convert.py --outtype q8_0 .)
The output should look similar to the following:
-rw-rw-r-- 1 jagane jagane 7161089696 Sep 10 21:53 ggml-model-q8_0.gguf
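A quick sanity check on the converted file is to look at its magic bytes: per the GGUF specification, a valid file begins with the four ASCII bytes ‘GGUF’. A small sketch (the path in the comment is the hypothetical output of the conversion step above):

```python
# Sanity-check a converted model file: GGUF files begin with the
# four ASCII magic bytes b"GGUF".

def is_gguf(path):
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example, using the output path from the conversion step:
# is_gguf("Llama-2-7b-chat-hf/ggml-model-q8_0.gguf")
```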
Step 5: Log Model to MLflow for Chat
I have developed an MLflow custom pyfunc model for the llama.cpp Llama2 model. This custom pyfunc model supports two tasks:
- Chat
- Embedding generation
This software, called logmodel, is Apache licensed and available on github. We now use logmodel to log the Meta Llama2 model, in llama.cpp gguf format, as an MLflow model. The following example is for the chat task. Note that the model file is large, so the second command below may take some time.
git clone https://github.com/jagane-infinstor/logmodel.git
(cd logmodel/llama2-gguf; python log.py --data_path ../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf --task chat)
Step 6: Register Model
Login to the MLflow web ui and choose the experiment/run created by the above command.
Now, press the ‘Register Model’ button and register a new model named llama2-gguf-chat.
Step 7: Test the logged Chat model
The program chat.py included in the logmodel github tree is useful for testing the logged model:
python chat.py --model models:/llama2-gguf-chat/1
The output will be something like the following:
> What is the capital of California?
llama_print_timings: load time = 3552.53 ms
llama_print_timings: sample time = 51.67 ms / 78 runs ( 0.66 ms per token, 1509.58 tokens per second)
llama_print_timings: prompt eval time = 3552.49 ms / 94 tokens ( 37.79 ms per token, 26.46 tokens per second)
llama_print_timings: eval time = 20680.57 ms / 77 runs ( 268.58 ms per token, 3.72 tokens per second)
llama_print_timings: total time = 24444.20 ms
assistant> Thank you for asking! The capital of California is Sacramento. It is located in the northern part of the state, along the Sacramento River. Sacramento has a rich history and culture, and it is home to many important government buildings and institutions, including the California State Capitol. I hope that helps! Let me know if you have any other questions.
> How far is it from San Francisco?
Llama.generate: prefix-match hit
llama_print_timings: load time = 3552.53 ms
llama_print_timings: sample time = 63.20 ms / 101 runs ( 0.63 ms per token, 1598.08 tokens per second)
llama_print_timings: prompt eval time = 3425.73 ms / 96 tokens ( 35.68 ms per token, 28.02 tokens per second)
llama_print_timings: eval time = 25718.24 ms / 100 runs ( 257.18 ms per token, 3.89 tokens per second)
llama_print_timings: total time = 29382.87 ms
assistant> Great question! Sacramento is located approximately 150 miles (241 kilometers) northeast of San Francisco. The drive from San Francisco to Sacramento typically takes about 2-3 hours, depending on traffic and the route you take. There are also public transportation options available, such as buses and trains, which can take a bit longer but offer a convenient alternative to driving. I hope that helps! Let me know if you have any other questions.
>
Step 8: Log Model to MLflow for Embedding Generation
Next, we use logmodel to log the Meta Llama2 model in llama.cpp gguf format for the embedding generation task.
git clone https://github.com/jagane-infinstor/logmodel.git
(cd logmodel/llama2-gguf; python log.py --data_path ../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf --task embedding-generation)
The above command causes the llama.cpp gguf format model to be logged as an MLflow model for the embedding generation task. You can now go to the MLflow GUI and select the specific experiment/run. The model will be displayed in the artifacts pane for that run. Go ahead and register the model, for example as llama2-gguf-embedding.
Step 9: Test the logged Embeddings model
The program embeddings.py included in the logmodel github tree is useful for testing the logged model:
python embeddings.py --model models:/llama2-gguf-embedding/1
Now, when you type in a sentence at the > prompt, the program will print out the embeddings generated by the model.
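Once you have embeddings, a common next step is comparing them with cosine similarity. The sketch below is plain Python and assumes each embedding is a flat list of floats, which is the shape that an embedding-generation model typically returns:

```python
import math

# Compare two embedding vectors with cosine similarity.
# Assumes each embedding is a flat list of floats, as produced by the
# embedding-generation model logged above.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

With real embeddings from the model, sentences with similar meaning should score closer to 1.0 than unrelated ones.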
That’s all folks!