Generating Product Descriptions with Mistral-7B-Instruct-v0.2 and vLLM Serving

Date: December 2023

Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.

Navigating the dynamic world of Large Language Models, we have witnessed an unprecedented era of collective problem-solving and innovation this year. Given the ethical considerations, data privacy concerns, and potential for misuse surrounding popular proprietary models, open source LLMs are particularly interesting: their inherent transparency fosters learning, experimentation, and growth across various disciplines.

In this blog post, we will walk through another simple demo for generating product descriptions using a relatively new open-source model, Mistral-7B-Instruct-v0.2 from Mistral AI, and serving it with the vLLM framework. We will build a Docker container and deploy it to Snowpark Container Services (now in Public Preview in selected AWS regions!) for an easier and more flexible development experience.

Why are Mistral AI models worth considering?

Mistral AI offers two flavors of open source models (as of December 2023): Mistral 7B and Mixtral 8x7B.

(Source: https://docs.mistral.ai/models/#sizes)

The Mistral 7B model, which we are using in this demo, is a 7.3B parameter model with the following characteristics (source: https://mistral.ai/news/announcing-mistral-7b/):

  • Outperforms Llama 2 13B on all benchmarks
  • Outperforms Llama 1 34B on many benchmarks
  • Approaches CodeLlama 7B performance on code, while remaining good at English tasks
  • Uses Grouped-query attention (GQA) for faster inference
  • Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost

For its size, Mistral 7B is the best performing model as of December 2023.

Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) with open weights, is a newer model with a lot more capabilities as listed below: (Source: https://mistral.ai/news/mixtral-of-experts/)

  • It gracefully handles a context of 32k tokens.
  • It handles English, French, Italian, German and Spanish.
  • It shows strong performance in code generation.
  • It can be finetuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.

According to Mistral AI's blog post, "Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference."

What is the significance of vLLM framework when serving large(r) models?

vLLM is a framework developed to serve autoregressive LLMs more efficiently on limited resources. vLLM uses PagedAttention, an attention algorithm that stores continuous keys and values in non-contiguous memory space, to manage the KV cache for high-throughput serving of LLMs. With these capabilities, vLLM can batch many concurrent requests while wasting far less GPU memory than naive serving. vLLM seamlessly supports Hugging Face models and provides flexibility around serving options, such as an OpenAI-compatible API server and streaming. Installation is fairly easy and well-documented.

If you are new to vLLM, here is a great quickstart to start with: https://docs.vllm.ai/en/latest/getting_started/quickstart.html
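To give a quick feel for the library, here is a minimal offline-inference sketch in Python (distinct from the OpenAI-compatible server we use later in this post); it assumes vllm is installed and a GPU is available:

from vllm import LLM, SamplingParams

# Load the model once; vLLM manages the KV cache with PagedAttention under the hood
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# A batch of prompts; vLLM schedules them together for high throughput
outputs = llm.generate(
    ["[INST] Write a one-line tagline for an ergonomic office chair. [/INST]"],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)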

After this brief introduction to the Mistral AI models and vLLM, let's dive into the steps for creating a simple demo that generates product descriptions for a fictitious company called EJOffice, which sells office supplies. We will use Snowpark Container Services to host the Mistral 7B model and Jupyter to generate the descriptions.

The GitHub repo for the code can be found here.

Here is the sample data we store in a Snowflake table, with some product specifications and product features as well as target customer segments:

Product specifications like weight and dimensions could also be stored and used for generating descriptions in a real-world implementation.
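For reference, a table like this could be created and populated with Snowpark Python along the following lines; the table name, column names, and sample rows here are illustrative assumptions rather than the repo's exact schema:

from snowflake.snowpark import Session

# Fill in your own account, user, and authentication details;
# role/warehouse/database/schema mirror the env vars in spec.yaml below
connection_parameters = {
    "account": "<Snowflake acct>",
    "user": "<user>",
    "password": "<password>",
    "role": "SYSADMIN",
    "warehouse": "BI_WH",
    "database": "MISTRAL_VLLM_DB",
    "schema": "PUBLIC",
}
session = Session.builder.configs(connection_parameters).create()

# Hypothetical rows; the real demo stores product features and target segments similarly
rows = [
    ("EJ-1001", "Ergonomic Mesh Office Chair",
     "adjustable lumbar support; breathable mesh back; 360-degree swivel",
     "remote workers"),
    ("EJ-1002", "Dual-Tray Desk Organizer",
     "stackable trays; anti-slip base; recycled materials",
     "small business offices"),
]
df = session.create_dataframe(
    rows, schema=["PRODUCT_ID", "PRODUCT_NAME", "PRODUCT_FEATURES", "TARGET_SEGMENT"]
)
df.write.mode("overwrite").save_as_table("PRODUCTS")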

To build our container, we are using the following spec.yaml file, where we specify the mistralai/Mistral-7B-Instruct-v0.2 model to be pulled from Hugging Face.

spec:
  containers:
  - name: vllm
    image: "<Snowflake acct>.registry.snowflakecomputing.com/mistral_vllm_db/public/images/mistral"
    volumeMounts:
    - name: stage
      mountPath: /workspace/stage
    env:
      LLM_MODEL: mistralai/Mistral-7B-Instruct-v0.2
      HUGGINGFACE_TOKEN: <Your HuggingFace token>
      API_BASE: 'http://127.0.0.1:8000'
      SNOW_ROLE: SYSADMIN
      SNOW_WAREHOUSE: BI_WH
      SNOW_DATABASE: mistral_vllm_db
      SNOW_SCHEMA: PUBLIC
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
  volumes:
  - name: stage
    source: "@FILES"
    uid: 1000
    gid: 1000
  endpoints:
  - name: mistral
    port: 8000
    public: true
  - name: jupyter
    port: 8888
    public: true

We are using a single GPU with a Snowpark Container Services compute pool (GPU_3 instance family) that has 1x NVIDIA A10G, 8 vCPUs, and 32 GB of memory.

For this demo, we chose to use the vLLM OpenAI compatible serving approach, which is very simple to host:

#!/bin/bash
# Launch the vLLM inference endpoint
# Log in to Hugging Face so the model weights can be downloaded
huggingface-cli login --token $HUGGINGFACE_TOKEN
# Start the OpenAI-compatible API server in the background, capturing stdout/stderr
nohup python -m vllm.entrypoints.openai.api_server --model $LLM_MODEL > vllm.out 2>vllm.err &
# Block until the server logs "Uvicorn running", i.e. it is ready to accept requests
( tail -f -n0 vllm.err & ) | grep -q "Uvicorn running"
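Once the script unblocks, a quick sanity check (optional, and not part of the repo's script) is to list the models the server is hosting via the OpenAI-compatible endpoint:

import requests

# The server should report mistralai/Mistral-7B-Instruct-v0.2 once the weights are loaded
resp = requests.get("http://127.0.0.1:8000/v1/models")
print(resp.json())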

We create our service using the code below:

CREATE SERVICE mistral
IN COMPUTE POOL GPU_3_COMPUTE_POOL -- only 1 GPU needed for vllm
FROM @YAMLS
SPEC='spec.yaml'
MIN_INSTANCES=1
MAX_INSTANCES=1;

CALL SYSTEM$GET_SERVICE_LOGS('MISTRAL_VLLM_DB.PUBLIC.MISTRAL', '0', 'vllm');
CALL SYSTEM$GET_SERVICE_STATUS('MISTRAL_VLLM_DB.PUBLIC.MISTRAL', 100);


-- let's get the endpoints vLLM and Jupyter
SHOW ENDPOINTS IN SERVICE MISTRAL;

And we get the endpoints generated by Snowpark Container Services:

At this point, we can launch Jupyter using the jupyter endpoint. After authenticating with Snowflake, we can test the Mistral AI model endpoint with a simple HTTP call from a terminal in Jupyter (File > New > Terminal):
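In Python terms, that test boils down to a single call against the OpenAI-compatible completions endpoint; the prompt below is just a placeholder:

import requests

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "[INST] Say hello in one short sentence. [/INST]",
    "max_tokens": 30,
}
resp = requests.post("http://127.0.0.1:8000/v1/completions", json=payload)
print(resp.json()["choices"][0]["text"])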

Once successful, we can start executing the code in the sample notebook provided in the GitHub repo and generate product descriptions for our products using the features we defined in our Snowflake table. Here is an example product description generated by the model:

Finally, we store the generated descriptions in a Snowflake table called PRODUCT_DESCRIPTIONS.
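Putting the pieces together, the generate-and-store step might look roughly like the sketch below; the prompt wording, sampling parameters, and column names are assumptions for illustration, not the repo's exact code:

import os
import requests

api_base = os.getenv("API_BASE", "http://127.0.0.1:8000")

def build_prompt(name, features, segment):
    # Mistral instruct models expect prompts wrapped in [INST] ... [/INST] tags
    return (
        f"[INST] Write a short, engaging product description for '{name}', "
        f"an office product sold by EJOffice. Key features: {features}. "
        f"Target customer segment: {segment}. [/INST]"
    )

# Reuse the Snowpark session from earlier to read the product rows
descriptions = []
for row in session.table("PRODUCTS").collect():
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": build_prompt(row["PRODUCT_NAME"], row["PRODUCT_FEATURES"], row["TARGET_SEGMENT"]),
        "max_tokens": 250,
        "temperature": 0.7,
    }
    resp = requests.post(f"{api_base}/v1/completions", json=payload)
    descriptions.append((row["PRODUCT_ID"], resp.json()["choices"][0]["text"].strip()))

# Persist the results; append keeps any previously generated descriptions
desc_df = session.create_dataframe(descriptions, schema=["PRODUCT_ID", "PRODUCT_DESCRIPTION"])
desc_df.write.mode("append").save_as_table("PRODUCT_DESCRIPTIONS")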


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science
