Generating Product Descriptions with Mistral-7B-Instruct-v0.2 and vLLM Serving

Date: December 2023

Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.

Navigating the dynamic world of Large Language Models, we have witnessed an unprecedented era of collective problem-solving and innovation this year. Given the ethical considerations, data privacy concerns, and potential for misuse surrounding popular proprietary models, open source LLMs are particularly interesting: their inherent transparency fosters learning, experimentation, and growth across various disciplines.

In this blog post, we will walk through another simple demo for generating product descriptions using a relatively new open-source model, Mistral-7B-Instruct-v0.2 from Mistral AI, and serving it with the vLLM framework. We will build a Docker container and deploy it to Snowpark Container Services (now in Public Preview in selected AWS regions!) for an easier and more flexible development experience.

Why are Mistral AI models worth considering?

Mistral AI offers two flavors of open source models (as of December 2023): Mistral 7B and Mixtral 8x7B.

(Source: https://docs.mistral.ai/models/#sizes)

The Mistral 7B model, which we are using in this demo, is a 7.3B parameter model with the following characteristics (source: https://mistral.ai/news/announcing-mistral-7b/):

  • Outperforms Llama 2 13B on all benchmarks
  • Outperforms Llama 1 34B on many benchmarks
  • Approaches CodeLlama 7B performance on code, while remaining good at English tasks
  • Uses Grouped-query attention (GQA) for faster inference
  • Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost

For its size, Mistral 7B is the best performing model as of December 2023.

Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) with open weights, is a newer model with a lot more capabilities as listed below: (Source: https://mistral.ai/news/mixtral-of-experts/)

  • It gracefully handles a context of 32k tokens.
  • It handles English, French, Italian, German and Spanish.
  • It shows strong performance in code generation.
  • It can be finetuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.

According to Mistral AI's blog post, "Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference."

What is the significance of vLLM framework when serving large(r) models?

vLLM is a framework developed to serve autoregressive LLMs more efficiently on limited resources. vLLM uses PagedAttention, an attention algorithm that stores continuous keys and values in non-contiguous memory space, to manage the KV cache for high-throughput serving of LLMs. With these capabilities, vLLM can batch many concurrent requests while wasting far less GPU memory than naive serving. vLLM seamlessly supports Hugging Face models and provides flexibility around serving options, such as an OpenAI-compatible API server and streaming. Installation is fairly easy and well-documented.

If you are new to vLLM, here is a great quickstart to start with: https://docs.vllm.ai/en/latest/getting_started/quickstart.html
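To give a quick feel for the library, here is a minimal offline-inference sketch in Python (distinct from the OpenAI-compatible server we use later in this post); it assumes vllm is installed and a GPU is available:

from vllm import LLM, SamplingParams

# Load the model once; vLLM manages the KV cache with PagedAttention under the hood
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# A batch of prompts; vLLM schedules them together for high throughput
outputs = llm.generate(
    ["[INST] Write a one-line tagline for an ergonomic office chair. [/INST]"],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)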

After this brief introduction to the Mistral AI models and vLLM, let's dive into the steps for creating a simple demo that generates product descriptions for a fictitious company called EJOffice, which sells office supplies. We will use Snowpark Container Services to host the Mistral 7B model and Jupyter to generate the descriptions.

The GitHub repo for the code can be found here.

Here is the sample data we store in a Snowflake table, with some product specifications and product features as well as target customer segments:

Product specifications like weight and dimensions could also be stored and used for generating descriptions in a real-world implementation.
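For reference, a table like this could be created and populated with Snowpark Python along the following lines; the table name, column names, and sample rows here are illustrative assumptions rather than the repo's exact schema:

from snowflake.snowpark import Session

# Fill in your own account, user, and authentication details;
# role/warehouse/database/schema mirror the env vars in spec.yaml below
connection_parameters = {
    "account": "<Snowflake acct>",
    "user": "<user>",
    "password": "<password>",
    "role": "SYSADMIN",
    "warehouse": "BI_WH",
    "database": "MISTRAL_VLLM_DB",
    "schema": "PUBLIC",
}
session = Session.builder.configs(connection_parameters).create()

# Hypothetical rows; the real demo stores product features and target segments similarly
rows = [
    ("EJ-1001", "Ergonomic Mesh Office Chair",
     "adjustable lumbar support; breathable mesh back; 360-degree swivel",
     "remote workers"),
    ("EJ-1002", "Dual-Tray Desk Organizer",
     "stackable trays; anti-slip base; recycled materials",
     "small business offices"),
]
df = session.create_dataframe(
    rows, schema=["PRODUCT_ID", "PRODUCT_NAME", "PRODUCT_FEATURES", "TARGET_SEGMENT"]
)
df.write.mode("overwrite").save_as_table("PRODUCTS")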

To build our container, we are using the following spec.yaml file, where we specify the mistralai/Mistral-7B-Instruct-v0.2 model to be pulled from Hugging Face.

spec:
  containers:
  - name: vllm
    image: "<Snowflake acct>.registry.snowflakecomputing.com/mistral_vllm_db/public/images/mistral"
    volumeMounts:
    - name: stage
      mountPath: /workspace/stage
    env:
      LLM_MODEL: mistralai/Mistral-7B-Instruct-v0.2
      HUGGINGFACE_TOKEN: <Your HuggingFace token>
      API_BASE: 'http://127.0.0.1:8000'
      SNOW_ROLE: SYSADMIN
      SNOW_WAREHOUSE: BI_WH
      SNOW_DATABASE: mistral_vllm_db
      SNOW_SCHEMA: PUBLIC
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
  volumes:
  - name: stage
    source: "@FILES"
    uid: 1000
    gid: 1000
  endpoints:
  - name: mistral
    port: 8000
    public: true
  - name: jupyter
    port: 8888
    public: true

We are using a single GPU with a Snowpark Container Services compute pool (GPU_3 instance family) that has 1x NVIDIA A10G, 8 vCPUs, and 32 GB of memory.

For this demo, we chose to use the vLLM OpenAI compatible serving approach, which is very simple to host:

#!/bin/bash
# Launch the vLLM inference endpoint
# Log in to Hugging Face so the model weights can be downloaded
huggingface-cli login --token $HUGGINGFACE_TOKEN
# Start the OpenAI-compatible API server in the background, capturing stdout/stderr
nohup python -m vllm.entrypoints.openai.api_server --model $LLM_MODEL > vllm.out 2>vllm.err &
# Block until the server logs "Uvicorn running", i.e. it is ready to accept requests
( tail -f -n0 vllm.err & ) | grep -q "Uvicorn running"
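Once the script unblocks, a quick sanity check (optional, and not part of the repo's script) is to list the models the server is hosting via the OpenAI-compatible endpoint:

import requests

# The server should report mistralai/Mistral-7B-Instruct-v0.2 once the weights are loaded
resp = requests.get("http://127.0.0.1:8000/v1/models")
print(resp.json())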

We create our service using the code below:

CREATE SERVICE mistral
IN COMPUTE POOL GPU_3_COMPUTE_POOL -- only 1 GPU needed for vllm
FROM @YAMLS
SPEC='spec.yaml'
MIN_INSTANCES=1
MAX_INSTANCES=1;

CALL SYSTEM$GET_SERVICE_LOGS('MISTRAL_VLLM_DB.PUBLIC.MISTRAL', '0', 'vllm');
CALL SYSTEM$GET_SERVICE_STATUS('MISTRAL_VLLM_DB.PUBLIC.MISTRAL', 100);


-- let's get the endpoints vLLM and Jupyter
SHOW ENDPOINTS IN SERVICE MISTRAL;

And we get the endpoints generated by Snowpark Container Services:

At this point, we can launch Jupyter using the jupyter endpoint. After authenticating with Snowflake, we can test the Mistral AI model endpoint with a simple HTTP call from a terminal in Jupyter (File > New > Terminal):
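In Python terms, that test boils down to a single call against the OpenAI-compatible completions endpoint; the prompt below is just a placeholder:

import requests

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "[INST] Say hello in one short sentence. [/INST]",
    "max_tokens": 30,
}
resp = requests.post("http://127.0.0.1:8000/v1/completions", json=payload)
print(resp.json()["choices"][0]["text"])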

Once successful, we can start executing the code in the sample notebook provided in the GitHub repo and generate product descriptions for our products using the features we defined in our Snowflake table. Here is an example product description generated by the model:

Finally, we store the generated descriptions in a Snowflake table called PRODUCT_DESCRIPTIONS.
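Putting the pieces together, the generate-and-store step might look roughly like the sketch below; the prompt wording, sampling parameters, and column names are assumptions for illustration, not the repo's exact code:

import os
import requests

api_base = os.getenv("API_BASE", "http://127.0.0.1:8000")

def build_prompt(name, features, segment):
    # Mistral instruct models expect prompts wrapped in [INST] ... [/INST] tags
    return (
        f"[INST] Write a short, engaging product description for '{name}', "
        f"an office product sold by EJOffice. Key features: {features}. "
        f"Target customer segment: {segment}. [/INST]"
    )

# Reuse the Snowpark session from earlier to read the product rows
descriptions = []
for row in session.table("PRODUCTS").collect():
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": build_prompt(row["PRODUCT_NAME"], row["PRODUCT_FEATURES"], row["TARGET_SEGMENT"]),
        "max_tokens": 250,
        "temperature": 0.7,
    }
    resp = requests.post(f"{api_base}/v1/completions", json=payload)
    descriptions.append((row["PRODUCT_ID"], resp.json()["choices"][0]["text"].strip()))

# Persist the results; append keeps any previously generated descriptions
desc_df = session.create_dataframe(descriptions, schema=["PRODUCT_ID", "PRODUCT_DESCRIPTION"])
desc_df.write.mode("append").save_as_table("PRODUCT_DESCRIPTIONS")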


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science
