A Brief Introduction to Optimized Batched Inference with vLLM

Growth Acceleration Partners
7 min read · Apr 12, 2024


by Sergio Morales, Principal Data Engineer at Growth Acceleration Partners.

Get ready to optimize your NLP models with vLLM! In his latest blog post, Sergio Morales explores the practicalities of streamlining NLP models for real-world efficiency. In it, you’ll learn about:

- The vLLM library for efficient model serving.
- Practical strategies for conserving computing resources during inference.
- The art of batched inference for seamless AI deployments.

In a previous article, we talked about how off-the-shelf, pre-trained models made available through Hugging Face’s model hub could be leveraged to fulfill a wide range of Natural Language Processing (NLP) tasks. These included text classification, question answering and content generation, either by taking advantage of the models’ base knowledge or by fine-tuning them to create specialized models that home in on particular subject matters or contexts.

In this article, we will introduce the vLLM library as a way to optimize the performance of these models, along with a mechanism for taking advantage of a large language model’s (LLM’s) text generation capabilities to perform more specific, context-sensitive tasks.

Having access to efficient inference backends plays a pivotal role in optimizing the deployment and usage of natural language processing models, and in making them available to all kinds of teams and organizations. Given the memory and resource costs associated with LLMs, the ability to conserve computing resources during inference, reducing latency and improving scalability, is of great value.

Streamlining inference processes not only enhances real-time applications, but is also crucial for minimizing operational costs, making it more economically viable to deploy large-scale language models. Industries that rely on resource-intensive tasks stand to benefit from being able to instantiate sophisticated language models in a sustainable and accessible way.

The vLLM Python Package

vLLM is a library designed for the efficient inference and serving of LLMs, similar to the transformers backend made available by Hugging Face. It provides high serving throughput and efficient attention key-value memory management using PagedAttention and continuous batching; PagedAttention is an optimized version of the classic attention algorithm, inspired by virtual memory and paging.

It seamlessly integrates with a variety of LLMs, such as Llama, OPT, Mixtral, StableLM and Falcon, sharing many commonalities with Hugging Face in terms of available models. Per its developers, it’s capable of delivering up to 24x higher throughput than Hugging Face’s transformers, without requiring any model architecture changes.

Once installed on a suitable Python environment, the vLLM API is simple enough to use. In the following example, we instantiate a text generation model off of the Hugging Face model hub (jondurbin/airoboros-m-7b-3.1.2):

from vllm import LLM, SamplingParams


# Create an LLM
llm = LLM(model="jondurbin/airoboros-m-7b-3.1.2",
          gpu_memory_utilization=0.95,
          max_model_len=512)


# Provide prompts
prompts = ["Here are some tips for taking care of your skin: ",
           "To cook this recipe, we'll need "]


# Adjust sampling parameters as necessary for the task
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=100,
                                 min_p=0.15,
                                 top_p=0.85)


# Generate texts from the prompts
outputs = llm.generate(prompts, sampling_params)


# Print outputs
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Output: {output.outputs[0].text}")
    print("-" * 30)

In the above fragment, the model is referenced using the same model hub addressing system one would use when working with the transformers library. Some additional parameters are passed into the LLM class initialization, such as gpu_memory_utilization (the fraction of GPU memory to be used for the model executor, set to 0.95 in this case) and max_model_len (the maximum length of the model context).
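The LLM constructor accepts several other options as well. As a minimal sketch, and noting that the exact set of supported arguments varies between vLLM releases, a configuration like the following illustrates a few commonly used ones:

# A minimal sketch of additional LLM options; check the documentation for your vLLM version
llm = LLM(model="jondurbin/airoboros-m-7b-3.1.2",
          gpu_memory_utilization=0.95,  # fraction of GPU memory for the model executor
          max_model_len=512,            # maximum context length (prompt + generated tokens)
          dtype="float16",              # precision used for model weights and activations
          seed=42,                      # makes sampling reproducible across runs
          tensor_parallel_size=1)       # number of GPUs to shard the model across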

Before inference, we also instantiate a SamplingParams object to control the model’s generation behavior. The temperature parameter is often associated with “creativity”: it governs the randomness of sampling (the process of choosing the next token when generating text), so a lower value makes the model more deterministic, while a higher value makes outputs harder to predict. The min_p and top_p values further constrain which tokens are considered during sampling.
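To see this in practice, the short sketch below (reusing the llm object from above) contrasts a near-deterministic configuration with a more exploratory one; the actual outputs will vary by model and hardware:

# Near-greedy sampling: low temperature, narrow nucleus
conservative = SamplingParams(temperature=0.1, top_p=0.5, max_tokens=50)

# More exploratory sampling: higher temperature, wider nucleus
creative = SamplingParams(temperature=0.9, top_p=0.95, max_tokens=50)

for params in (conservative, creative):
    result = llm.generate(["Here are some tips for taking care of your skin: "], params)
    print(f"temperature={params.temperature}, top_p={params.top_p}")
    print(result[0].outputs[0].text)
    print("-" * 30)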

The output of the first code fragment, as generated on an Amazon EC2 g5.4xlarge instance using a Databricks notebook, is as follows:

Prompt: Here are some tips for taking care of your skin: 
Output:

1. Cleanse your skin regularly: Use a gentle cleanser to remove dirt, oil, and makeup from your face. This will help prevent clogged pores and breakouts.

2. Exfoliate: Regular exfoliation can help remove dead skin cells, unclog pores, and improve the appearance of your skin.

3. Moisturize: Applying a moisturizer can help keep your skin hydrated and prevent dryness
------------------------------
Prompt: To cook this recipe, we'll need
Output: 10 ingredients:

- 1 cup of uncooked white rice
- 2 cups of water
- 1 tablespoon of butter
- 1/2 teaspoon of salt
- 1/4 teaspoon of black pepper
- 1/2 cup of diced onion
- 1/2 cup of diced bell pepper
- 1/2 cup of diced celery
- 1/2 cup of diced carrot
------------------------------

Executing Batched Inferences on a Large Dataset

As can be expected, batched inference refers to the practice of processing multiple input sequences simultaneously during the inference phase, rather than one at a time. This strategy capitalizes on the parallelization capabilities of modern hardware, allowing the model to handle multiple inputs in a single computational step.

Batched inference significantly improves overall inference speed and efficiency, as the model processes several sequences concurrently, reducing the computational overhead associated with individual predictions. This technique is especially crucial in scenarios where large-scale language models are deployed for real-time applications, as it helps maximize the utilization of computational resources and ensures faster response times for tasks such as text generation, translation and sentiment analysis.
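As a rough illustration of the difference, the following sketch times sequential generation against a single batched call, reusing the llm object from earlier; the exact speedup will depend on the model, the batch size and the hardware:

import time

prompts = [f"Write a one-sentence tagline for product number {n}: " for n in range(32)]
params = SamplingParams(temperature=0.2, max_tokens=50)

# Sequential: one generate() call per prompt
start = time.perf_counter()
for p in prompts:
    llm.generate([p], params)
sequential_secs = time.perf_counter() - start

# Batched: all prompts handled in a single generate() call
start = time.perf_counter()
llm.generate(prompts, params)
batched_secs = time.perf_counter() - start

print(f"Sequential: {sequential_secs:.1f}s, batched: {batched_secs:.1f}s")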

In the following example, we’ll set up a simple pipeline that executes a specific, domain-sensitive task on a regular pandas DataFrame containing a column of unstructured text. We will construct a dynamic, few-shot prompt using the model’s context, which allows us to provide a consistent prompt to the model while feeding it distinct inputs to perform inference on. We will continue to use the airoboros model seen above, which uses the llama-2 chat templating format.

First, we follow the templating rules to create a template prompt. Notice the <<SYS>> tag used to establish a system context, followed by the few-shot examples and the dynamic <<input>> placeholder.

summary_few_shot = """[INST] <<SYS>> You are a helpful assistant who converts long-form board game descriptions into a short standardized format. <</SYS>>
The following are paragraphs describing board games themes and mechanics. Following each record is a single sentence describing the theme and discernible gameplay mechanics:


Record: Earthborne Rangers is a customizable, co-operative card game set in the wilderness of the far future. You take on the role of a Ranger, a protector of the mountain valley you call home: a vast wilderness transformed by monumental feats of science and technology devised to save the Earth from destruction long ago.
Description: Mechanics: Cooperative, Cards. Themes: Wilderness, Conservationism


Record: Ostia is a strategy game for 1-4 players. Players lead a large fleet to explore the ocean, trade and develop the port.
Make good use of the Mancala system to strengthen your personal board and aim for the highest honor!
Description: Mechanics: strategy, trading, mancala. Themes: Ocean exploration


Follow the above examples, and provide a description of the following:
Record: <<input>> [/INST]
Description:"""

We will feed a DataFrame containing long-form descriptions of board games sourced from the website BoardGameGeek, such as the ones used as examples in the above prompt, to the model. This happens through a function that prepares the text for each generation, inserting it into the prompt template, and then runs the batched inference process all at once. Finally, a new column is created in the input DataFrame containing the generated summaries:

def create_summaries(df, template, llm):
    # Insert each description into the prompt template
    prompts = [template.replace('<<input>>', t) for t in df['description']]

    sampling_params = SamplingParams(temperature=0.2, max_tokens=100, min_p=0.15, top_p=0.85)

    # Run batched inference over all prompts at once
    outputs = llm.generate(prompts, sampling_params)

    # Store the generated summaries in a new column
    df['outputs'] = [output.outputs[0].text for output in outputs]

    return df


create_summaries(df, summary_few_shot, llm)

Upon execution, the resulting DataFrame is displayed. The description column holds the original description used as input, and the outputs column contains the summary generated by the LLM, following the instructions provided by the context.
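For very large DataFrames, one optional refinement is to process the rows in chunks rather than building every prompt in memory at once. A minimal sketch of that idea, reusing the create_summaries function above (the chunk size is purely illustrative and should be tuned to your data and hardware):

import pandas as pd

chunk_size = 1000  # illustrative value; tune to your dataset and GPU memory
chunks = []

for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size].copy()
    chunks.append(create_summaries(chunk, summary_few_shot, llm))

df_with_summaries = pd.concat(chunks, ignore_index=True)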

Conclusion

As can be seen from the above, building a batched inference pipeline with a package like vLLM can be relatively easy, as long as there is a clear understanding of the underlying data structures, the specific requirements of the inference task and the limits of the technology involved. This approach proves particularly advantageous when dealing with large-scale language models, ensuring the deployment of sophisticated AI solutions remains both rapid and resource-efficient.

At GAP, our expertise not only resides in our ability to leverage state-of-the-art tools and frameworks such as vLLM to fulfill your AI needs, but in integrating them into a bigger data infrastructure that emphasizes efficiency and smart resource utilization. Whether it’s deploying LLMs for natural language understanding, sentiment analysis or other NLP tasks, our approach encompasses an integrated understanding of your organizational objectives.

By combining cutting-edge technologies with a strategic framework, GAP engineers ensure your AI efforts exceed expectations, delivering solutions that are both innovative and resource-efficient.
