Deploying Large Language Models (LLMs) in Real-World Applications

Deepak Babu P R
Apr 28, 2023


The growing capabilities of Large Language Models (LLMs) have led to increasing demand for their integration into production systems. However, deploying LLMs in low-latency applications such as Ads and Search (think bidding models, relevance engines, CTR prediction, etc.) is challenging because of their massive size and complexity. In this blog post, we present a strategy for deploying LLMs in latency-sensitive systems using a combination of knowledge distillation, model compression, and optimized deployment techniques.

The traditional separation of teams into software engineering (SWE) and modeling (ML) creates complexities when deploying large language models (LLMs). Some organizations have addressed this by introducing Machine Learning Engineer (MLE) roles that act as a bridge between scientists and engineers. We also discuss a mental model for innovating meaningfully and making progress without excessively overstepping the boundaries of individual roles.

The definition of “large” in Large Language Models (LLMs) has evolved over time: GPT-1 had 117M parameters, GPT-2 had 1.5B, and the recent GPT-3 boasts 175B. The increase in size has been driven by the improved state-of-the-art accuracy achieved on various NLP benchmarks as models scale up. Furthermore, there are significant architectural differences, such as parametric models, non-parametric models with external knowledge bases or retrieval augmentation, and instruction-tuned models. The larger parameter counts have enabled LLMs to act as world models capable of reasoning and planning, which has led to substantial progress on complex problems. However, gains observed in offline experiments can be challenging to transfer to practical applications with tight latency constraints. The following sections lay out a strategy for carrying LLM gains from experimentation to production, broadly applicable to any class of LLMs.

Let’s take the example of a company interested in building CTR prediction models for ad bidding using LLMs:

  1. Engineers/MLEs work backwards from the use case to determine the acceptable latency for such a system, which leads to a particular choice of parameter size, a suitable decoder architecture (T5, BART, GPT, etc.), and input/output sequence lengths. Let’s call this the student model (a minimal latency-benchmark sketch follows the figure below). Factors that influence online bidding latency include the ad exchange, the competitive nature of bidders, the missed opportunity of a late bid, and so on. Online advertising is a highly competitive business, with players in the ecosystem ranging from small publishers to aggregators to ad exchanges that maintain an ecosystem of ad networks.
  2. In parallel, the science or ML team iterates on LLMs with different architectures (RAG, parametric, LLM agents, instruction-tuned LLMs, etc.) to find one or more approaches/models that deliver good CTR prediction accuracy. Let’s call these the teacher models, as they are expected to be better CTR predictors than the smaller-parameter LLM (the student) chosen for run-time serving would be if trained on its own.
Figure: Typical LLM setup showing multiple teacher models and a distilled student model that is small enough to run under the latency constraints demanded by an online production system.
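To make step 1 concrete, here is a minimal sketch of benchmarking a candidate student model against a latency budget. The t5-small checkpoint, the 50 ms budget, and the prompt format are illustrative assumptions, not recommendations.

```python
# Sketch: "work backwards" from a latency budget by benchmarking a candidate
# student model. All concrete values below are hypothetical.
import time
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

LATENCY_BUDGET_MS = 50          # hypothetical budget dictated by the bidding path
CANDIDATE_STUDENT = "t5-small"  # ~60M-parameter stand-in for the student model

tokenizer = AutoTokenizer.from_pretrained(CANDIDATE_STUDENT)
model = AutoModelForSeq2SeqLM.from_pretrained(CANDIDATE_STUDENT).eval()

sample = "predict ctr | query: running shoes | ad: lightweight trail runners"
inputs = tokenizer(sample, return_tensors="pt")

latencies = []
with torch.no_grad():
    for _ in range(100):
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=4)
        latencies.append((time.perf_counter() - start) * 1000)

latencies.sort()
p99 = latencies[int(0.99 * len(latencies)) - 1]
print(f"p99 latency: {p99:.1f} ms (budget: {LATENCY_BUDGET_MS} ms)")
```

If the p99 latency blows past the budget, the parameter size, sequence lengths, or architecture choice is revisited before any distillation work begins.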

The following are some popular techniques used to productionize LLMs:

  • Knowledge distillation [Gou et al.] [Hinton et al.] — a technique in which a smaller, runtime-ready LLM (the “student”) is trained to mimic the behavior of a larger LLM (the “teacher”). This is achieved by using the larger teacher LLM to generate training data on which the smaller student LLM is fine-tuned. To control the quality of the machine-generated training data, we can use model confidence, among other domain constraints, to selectively sample the examples generated by the teacher (see the distillation sketch after this list). The effectiveness of distillation is measured by distillation efficiency, the ratio of student model accuracy to teacher model accuracy.
    There are variants of knowledge distillation that jointly train the teacher and student with a loss function combining the cross-entropy between the student output and the ground truth with the KL divergence between the student’s and teacher’s probability distributions for the same input. However, the former method is preferred for its simplicity and for being a generic framework applicable to multiple teacher models.
  • Model Compression [Dettmers et al.] — Model compression uses a collection of techniques, including pruning and quantization, to reduce model size and complexity. Pruning typically involves removing encoder/decoder layers or blocks and decreasing the number of hidden units per layer, thus lowering the model’s parameter count. Quantization, on the other hand, represents weights and activations with fewer bits, such as converting weights from FP32 to lower-precision formats like FP16 or even INT8, which decreases computational overhead during inference. For example, a BART-large model with 12 encoder and 12 decoder layers, totaling roughly 400M parameters, can be compressed by removing 4 encoder and 4 decoder blocks, resulting in a substantially smaller model better suited for real-time inference (a pruning and quantization sketch follows this list).
  • Model Deployment — Optimizing deployment involves selecting open formats such as ONNX to benefit from hardware acceleration and runtime optimization. ONNX (Open Neural Network Exchange) is a widely used open-source standard for representing deep learning models; it supports cross-platform deployment (desktops, servers, and mobile devices) while abstracting hardware optimization away from the platform. Models in ONNX format are interoperable across deep learning frameworks such as PyTorch and TensorFlow (an export/inference sketch follows this list).
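As a concrete illustration of the confidence-filtered distillation recipe above, here is a minimal sketch. It assumes the teacher has already been fine-tuned for CTR prediction; the model names (roberta-large as teacher, distilroberta-base as student) and the 0.9 confidence threshold are stand-in assumptions, not values from this post.

```python
# Sketch: teacher labels unlabeled query-ad pairs, low-confidence labels are
# dropped, and the surviving pairs become fine-tuning data for the student.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TEACHER_NAME = "roberta-large"       # stand-in for the large teacher LLM
STUDENT_NAME = "distilroberta-base"  # stand-in for the runtime-ready student

tok = AutoTokenizer.from_pretrained(TEACHER_NAME)
# Assumption: in practice this checkpoint would already be fine-tuned for CTR.
teacher = AutoModelForSequenceClassification.from_pretrained(
    TEACHER_NAME, num_labels=2
).eval()

unlabeled = [
    "query: running shoes [SEP] ad: lightweight trail runners",
    "query: running shoes [SEP] ad: discount office chairs",
]

distillation_set = []
with torch.no_grad():
    for text in unlabeled:
        logits = teacher(**tok(text, return_tensors="pt")).logits
        probs = torch.softmax(logits, dim=-1)[0]
        confidence, label = probs.max(dim=-1)
        if confidence.item() >= 0.9:  # confidence-based selective sampling
            distillation_set.append({"text": text, "label": label.item()})

# distillation_set now feeds a standard fine-tuning loop for STUDENT_NAME.
# Distillation efficiency = student accuracy / teacher accuracy on a held-out set.
```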
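Next, a rough sketch of the pruning and quantization step. Which layers to drop is an illustrative choice here, and a pruned model typically needs additional fine-tuning to recover accuracy.

```python
# Sketch: drop a subset of encoder/decoder layers from BART-large, then apply
# dynamic INT8 quantization to the Linear layers for CPU inference.
import torch
from torch.nn import ModuleList
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def keep_layers(layers, drop_every=3):
    # Drop every third layer (4 of 12), keeping 8 layers per stack.
    return ModuleList([l for i, l in enumerate(layers) if (i + 1) % drop_every != 0])

model.model.encoder.layers = keep_layers(model.model.encoder.layers)
model.model.decoder.layers = keep_layers(model.model.decoder.layers)
model.config.encoder_layers = len(model.model.encoder.layers)
model.config.decoder_layers = len(model.model.decoder.layers)

print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters after pruning")

# Dynamic quantization: Linear weights are stored and executed in INT8 at inference.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
# `quantized` is the CPU-friendly model that would be fine-tuned/evaluated next.
```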
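Finally, a minimal sketch of the deployment step: exporting a (distilled, compressed) student to ONNX and serving it with onnxruntime. The model name and shapes are placeholders; in practice, encoder-decoder exports are often handled by dedicated tooling such as Hugging Face Optimum rather than a hand-rolled export call.

```python
# Sketch: export a classification-style student model to ONNX and run inference.
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NAME = "distilroberta-base"  # stand-in for the distilled student
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    NAME, num_labels=2, return_dict=False
).eval()

dummy = tok("query: running shoes [SEP] ad: trail runners", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "student_ctr.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)

# Serve the exported graph; the execution provider abstracts the hardware.
session = ort.InferenceSession("student_ctr.onnx", providers=["CPUExecutionProvider"])
logits = session.run(["logits"], {k: v.numpy() for k, v in dummy.items()})[0]
print("CTR logits:", logits)
```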

In addition to these, there are considerations around hosting infrastructure, i.e., whether to use GPUs or CPUs and the appropriate choice of hardware, which are beyond the scope of this post. As we conclude, we’d love to hear your thoughts, suggestions, or experiences related to deploying Large Language Models in production. Have you encountered any unique challenges or discovered innovative solutions during your own journey? Please feel free to share your insights in the comments section below. Your input can greatly benefit the community as we continue to explore the vast potential of LLMs in real-world applications.


Deepak Babu P R

Principal Scientist | ML/AI, NLP, IR and speech | love travelling, reading, trekking and photography. https://prdeepakbabu.github.io/