LLM Inference QueryCraft for NL2SQL

Step 4. Exploring Techniques, Tools, and Insights for Efficient LLM inference using vLLM and Multi-Threading

Himadri Talukder
Towards Generative AI
6 min read · Jun 10, 2024


LLM inference refers to the process of generating text or making predictions based on the learned patterns and representations encoded within the language model. It involves feeding input text into the model and leveraging its knowledge to produce desired outputs.

In the inference phase, an AI model actively engages with real-time data, analyzing the user’s input by referencing the knowledge absorbed during training, which is encoded within its weights or parameters.

We developed the QueryCraft framework to provide an easy solution for fine-tuning Large Language Models (LLMs) to generate SQL queries from natural language (Text2SQL, Text2GraphQL, NL2Query). The framework simplifies the process of quickly building complete GenAI pipelines. In our text2sql QueryCraft framework, we provide various LLM inference options. Below, I will demonstrate how to implement different kinds of inference with the codellama/CodeLlama-7b-Instruct-hf model, compare their performance, and make observations using the following libraries and techniques on a 2 x Tesla V100-PCIE-32GB GPU server.

  • Hugging Face Transformers
  • Multi-threaded batch inference
  • vLLM inference
LLM inference diagram

1. Hugging Face Transformers:

The Transformers library provides a comprehensive set of tools and utilities for working with pre-trained models in natural language processing (NLP) tasks. It includes functionalities such as model loading, tokenization, model architecture definition, fine-tuning, inference, and evaluation. To implement LLM inference using Hugging Face Transformers:

  • Install the library using pip install transformers.
  • Load a pre-trained model using from transformers import AutoModelForCausalLM, AutoTokenizer.
  • Tokenize input text using the tokenizer.
  • Generate text using the model’s generate() method with desired parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

## Prepare the prompt
def prepare_context(question: str, context: str):
    text = f"""
[INST] Write SQLite query to answer the following question given the database schema. Please wrap your code answer using ```: Schema: {context} [/INST] Here is the SQLite query to answer to the question:{question} ```
"""
    return text

device_map = "auto"
base_model = "codellama/CodeLlama-7b-Instruct-hf"

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
eos_token_id = tokenizer.eos_token_id

# Function to process a single row
def generation(row, _index):
    print(f"processing row {_index}")
    text = prepare_context(row["question"], row["context"])
    input_tokens = tokenizer(text, return_tensors="pt").to("cuda")

    with torch.inference_mode():
        sequences = model.generate(
            **input_tokens,
            num_return_sequences=1,
            eos_token_id=eos_token_id,
            pad_token_id=eos_token_id,  # reuse EOS as the padding token
            max_new_tokens=400,
            do_sample=False,
            # num_beams=5,
        )
    outputs = tokenizer.batch_decode(sequences, skip_special_tokens=True)
    torch.cuda.empty_cache()

    # Keep only the generated SQL (text after the prompt, up to the closing ```)
    result = outputs[0][len(text):].split("```")[0]
    return result
Hugging Face serial inference time taken: 591.43 minutes for 1034 rows

Users have fine-grained control over the entire model pipeline, allowing customization at various stages of processing. This library is best suited for development and experimentation.

2. Multi-threading:

Multi-threading can be employed to parallelize LLM inference tasks, improving throughput and responsiveness. To implement multi-threading:

  • Use concurrent programming libraries like threading or concurrent.futures in Python.
  • Distribute input text or inference requests across multiple threads.
  • Ensure proper synchronization and coordination between threads to avoid data races and maintain consistency.

I’m still using the same generation function, but now dispatching rows to it from multiple threads.

import concurrent.futures
import logging

# Configuration
model_name = "codellama/CodeLlama-7b-Instruct-hf"
num_threads = 5  # Adjust as needed
total_rows = 1000
data_path = "<DATA PATH>"

with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    futures = [executor.submit(generation, row, _index) for _index, row in df_validation.iterrows()]

    # Wait for all threads to finish and gather results in submission order,
    # so they stay aligned with the rows of df_validation
    results = []
    for future in futures:
        rs = future.result()
        results.append(rs)
        logging.info(f"query: {rs}")

df_validation['mlops'] = results
# print(f"length {len(results)}")

## Saving result to a csv
df_validation.to_csv("result.csv")
Hugging Face multi-threaded inference (5 threads) time taken: 464.38 minutes for 1034 rows

Observation: There is not much improvement over the sequential process when using multi-threading with 5 threads. Multithreading lets multiple tasks run concurrently within one program, which means all of those threads share the resources of that single process, including the CPU and the GPU. When we create threads, they all submit work to the same GPU, so the generate() calls largely serialize on the device, and we can also run into problems such as exhausting GPU memory or slowing down inference. Using multithreading therefore isn’t the best solution for this problem. We ran 100 requests with different numbers of threads but did not find any performance improvement with the multi-threading approach.

Some Suggestions

Here are some tips for improving performance with Hugging Face Transformers and accelerator libraries.

  1. Batch Processing: Process multiple inputs simultaneously by batching them together. This reduces the overhead of processing individual inputs and allows for parallel execution, leading to improved throughput (a minimal batched-generation sketch follows this list).
  2. Mixed Precision: Utilize mixed precision training and inference if your GPU supports it. The Hugging Face Accelerated Inference API supports mixed precision inference for certain models, resulting in faster execution.
  3. Caching and Memoization: Cache frequently used inputs and their corresponding outputs to avoid redundant computations.
  4. Model Optimization: Fine-tune or quantize the model to reduce its size and computational requirements while preserving performance. Hugging Face provides tools for model optimization, including model pruning, quantization, and distillation.
  5. Asynchronous Inference: Implement asynchronous inference to handle multiple requests concurrently without waiting for each request to complete before processing the next one.
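
To illustrate tip 1, here is a minimal batched-generation sketch that reuses the model, tokenizer, prepare_context, and df_validation from the snippets above; batch_size and the padding setup are assumptions you would tune for your GPU memory.

# Minimal batched-generation sketch (reuses model, tokenizer, prepare_context,
# and df_validation from the snippets above; batch_size is an assumption)
tokenizer.pad_token = tokenizer.eos_token   # needed so a batch can be padded
tokenizer.padding_side = "left"             # left-pad for decoder-only models

batch_size = 8
prompts = [prepare_context(row["question"], row["context"])
           for _index, row in df_validation.iterrows()]

results = []
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    with torch.inference_mode():
        sequences = model.generate(
            **inputs,
            max_new_tokens=400,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.batch_decode(sequences, skip_special_tokens=True)
    # Strip the prompt and keep the SQL up to the closing ```
    results.extend(out[len(p):].split("```")[0] for p, out in zip(batch, decoded))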

3. vLLM Inference:

vLLM is a fast and easy-to-use library for LLM inference and serving, designed for the efficient deployment of large language models (LLMs). It offers high serving throughput and optimized attention key-value (KV) memory management through features like the PagedAttention mechanism and continuous batching.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
  • Optimized CUDA kernels

For more information, see the vLLM documentation: https://docs.vllm.ai/en/latest/index.html

conda create -n vllm python=3.9 -y
conda activate vllm
pip install vllm

import torch
from vllm import LLM, SamplingParams

llm = LLM(
    model="codellama/CodeLlama-7b-Instruct-hf",
    dtype=torch.float16,
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=400,
    use_beam_search=True,  # note: some vLLM versions also require best_of > 1 (and temperature=0) for beam search
    # best_of=2,
)

def create_prompt(row) -> str:
    return prepare_context(row["question"], row["context"])

## Prepare the batch input
prompts = [create_prompt(row) for _index, row in df_validation.iterrows()]

## Batch inference: simply pass the whole array of prompts
vllm_batch_outputs = llm.generate(prompts, sampling_params)
Processed prompts: 100%|██████████| 1034/1034 [04:41<00:00,  3.68it/s]
CPU times: user 4min 42s, sys: 320 ms, total: 4min 42s
Wall time: 4min 42s
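
The snippet above stops at llm.generate. As a rough sketch of the post-processing step (the 'vllm' column name and output file name below are my own choices, not part of QueryCraft), the generated SQL can be pulled out of the returned RequestOutput objects and saved the same way as before:

# Each element of vllm_batch_outputs is a RequestOutput; its .outputs list holds
# the generated completions, whose .text we trim the same way as earlier
results = []
for output in vllm_batch_outputs:
    generated = output.outputs[0].text
    results.append(generated.split("```")[0])

df_validation['vllm'] = results           # column name is an assumption
df_validation.to_csv("result_vllm.csv")   # file name is an assumption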

vLLM has shown promising performance without compromising the accuracy of the generated SQL. It is easy to use, but it has two shortcomings:

  • It doesn’t support all Transformer models. The list of supported models is available here: https://docs.vllm.ai/en/v0.3.3/models/supported_models.html
  • Since it utilizes the PagedAttention mechanism, there may be some accuracy differences, so it’s advisable to run an evaluation process on vLLM output; this is recommended for any type of inference, whether it involves vLLM or not (a minimal execution-accuracy check is sketched below). IBM Watsonx.Governance can help you achieve this with model monitoring, evaluation, and governance.
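
As a minimal sketch of such an evaluation, the execution-accuracy check below runs the generated and gold queries against the same SQLite database and compares the result sets; the gold_query column, the vllm column, and the database path are assumptions for illustration, and QueryCraft’s own evaluation component covers this more completely.

import sqlite3

# Minimal execution-accuracy sketch: execute the generated and gold SQL against
# the same SQLite database and compare result sets (order-insensitive).
# 'gold_query', 'vllm', and '<DB PATH>' are assumptions for illustration.
def execution_match(db_path: str, generated_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(generated_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
        return sorted(map(repr, pred)) == sorted(map(repr, gold))
    except Exception:
        return False  # a query that fails to execute counts as a miss
    finally:
        conn.close()

matches = [
    execution_match("<DB PATH>", row["vllm"], row["gold_query"])
    for _index, row in df_validation.iterrows()
]
print(f"Execution accuracy: {sum(matches) / len(matches):.2%}")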
Performance comparison for 1,034 requests (from the runs above):

  • Hugging Face Transformers, serial: 591.43 minutes
  • Hugging Face Transformers, 5 threads: 464.38 minutes
  • vLLM batch inference: ~4.7 minutes

Conclusion

LLM inference represents a pivotal advancement in natural language processing, enabling a wide range of applications across diverse domains. By understanding the underlying techniques and leveraging state-of-the-art tools and technologies like Hugging Face Transformers, vLLM, Hugging Face Accelerate, and multi-threading, developers can unlock the full potential of LLMs for various text generation tasks. As LLMs continue to evolve and improve, they promise to reshape the landscape of human-computer interaction and drive innovation in artificial intelligence.

Ready to take your NL2SQL models to the next level? Explore QueryCraft’s evaluation framework today and unlock the full potential of your LLMs!

To explore the specific functionalities of each evaluation component in greater detail, please refer to our detailed blog posts on Fine-Tuning NL2SQL, Improving NL2SQL Accuracy, and NL2SQL Execution Accuracy Measurements.

Follow Towards Generative AI for more content on the latest advancements in AI.
