Deploying a production-ready embeddings service with PyTorch

Clément Michaud
5 min read · Jun 25, 2023


Introduction

In the dynamic realm of Artificial Intelligence, deploying Natural Language Processing (NLP) models for practical use is often a significant challenge. This becomes particularly noticeable when one attempts to deploy a service that computes embeddings with a model like all-MiniLM-L6-v2 for sentence-similarity tasks. Despite the plethora of online resources, a comprehensive guide for serving this model with TorchServe, the open-source model serving framework from PyTorch, remains surprisingly elusive. Recognizing this gap, I have created a public GitHub repository that not only offers a hands-on, ready-to-use solution but also serves as an in-depth guide for those eager to explore the nuances of model deployment. The repository and this article build on the invaluable foundation laid by the great TorchServe tutorial from Stane Aurelius [2], which I adapted and customized to deploy the model for my specific needs.

Sentence similarity embeddings

The chosen model for this project, all-MiniLM-L6-v2, is a powerful derivative of the MiniLM developed by Microsoft Research, known for its compact size and strong performance across various NLP tasks. That performance is achieved through a distinctive two-step training process: a larger BERT-style model is first pre-trained on vast datasets, and its knowledge is then distilled into the smaller MiniLM. The resulting all-MiniLM-L6-v2, fine-tuned from the 6-layer MiniLM-L6-H384-uncased model, delivers impressive results. For a thorough review of its benchmarks, please refer to the Sentence-Transformers model overview [4].

The decision to utilize the all-MiniLM-L6-v2 model in this project was informed by its robust performance, compact size, and widespread popularity for sentence similarity tasks. This model provides efficient language representation without the demanding resource requirements typically associated with larger models, making the all-MiniLM-L6-v2 an ideal choice for deploying an embedding service.
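
To get a feel for what the model produces before serving it, you can compute embeddings locally with the sentence-transformers library. This is a quick sanity check, independent of TorchServe:

# Quick local sanity check (pip install sentence-transformers);
# this runs the model directly, without any serving infrastructure.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["hello, how are you?", "hi, what is up?"])
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence

The printed shape confirms the property everything below relies on: each sentence maps to a single 384-dimensional vector.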

TorchServe: A Robust Solution for Serving PyTorch Models

TorchServe is a purpose-built tool for serving PyTorch models, designed to simplify the process of deploying models at scale. It offers a host of features such as model versioning, which ensures better control over various iterations of models; logging, which aids in monitoring the model’s performance; and metrics, which provide insightful data to optimize the model further. What sets TorchServe apart is its minimal dependencies on other libraries, making it a streamlined solution that lowers the complexity of deployment infrastructure. By reducing dependencies, it also limits potential conflicts and issues that may arise from integrating multiple libraries. Therefore, TorchServe stands as an excellent choice for deploying models like all-MiniLM-L6-v2, providing a robust, efficient, and less cumbersome deployment solution.
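
To make this more concrete, here is a sketch of what a custom TorchServe handler for this model might look like. This is an illustrative outline rather than the repository's exact code: it parses the JSON request, tokenizes the sentences, runs the transformer, and applies the mean pooling and L2 normalization that sentence-transformers models typically use.

import json

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

class EmbeddingHandler(BaseHandler):
    # Illustrative handler; the actual handler in the repository may differ.
    def initialize(self, context):
        model_dir = context.system_properties.get("model_dir")
        self.device = torch.device("cpu")  # the Docker image targets CPU inference
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModel.from_pretrained(model_dir).to(self.device).eval()
        self.initialized = True

    def handle(self, data, context):
        body = data[0].get("body") or data[0].get("data")
        if isinstance(body, (bytes, bytearray)):
            body = json.loads(body)
        sentences = body["input"]
        encoded = self.tokenizer(sentences, padding=True, truncation=True,
                                 return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model(**encoded)
        # Mean pooling over token embeddings, ignoring padding tokens,
        # followed by L2 normalization.
        mask = encoded["attention_mask"].unsqueeze(-1).float()
        summed = (output.last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        embeddings = F.normalize(summed / counts, p=2, dim=1)
        return [embeddings.tolist()]  # one response per incoming request

Such a handler is packaged into a .mar archive with the torch-model-archiver tool and registered under a model name (my_model in the curl example further below).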

Running all-MiniLM-L6-v2 with TorchServe

The GitHub repository provides all the necessary components to deploy a production-ready service for computing sentence embeddings with the all-MiniLM-L6-v2 model. This guide walks you through the Docker-based method, although the repository also provides instructions for running the server as a standalone process. Please note that the current Docker image performs inference on the CPU, because I initially planned to deploy it on a server without a GPU, but only a small tweak to the Docker image should be enough to make it work on a GPU. The non-Docker method described in the repository, however, selects the GPU by default when one is available, if you want to test that specifically.
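
In the handler sketch above, the device was pinned to the CPU to match the Docker image. Making the code prefer a GPU when one is visible is typically a one-line change (again an illustration, not the repository's exact code):

import torch

# Prefer a CUDA device when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")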

Let’s start with our test using the pre-built Docker image.

Ensure Docker is Installed

Docker is necessary to run the server as a container. If you haven’t already installed Docker, follow the official Docker installation guide.

Pull and Run the Docker Image

Execute the following command in your terminal. It pulls the Docker image and runs the server, binding TorchServe to your local port 8080.

docker run -p 8080:8080 -it ghcr.io/clems4ever/torchserve-all-minilm-l6-v2:latest

Interact with the Service

After the server is up and running, you can query it by sending a curl request as follows:

curl --location 'http://127.0.0.1:8080/predictions/my_model' \
--header 'Content-Type: application/json' \
--data '{
  "input": ["hello, how are you?", "hi, what is up?"]
}'

This command sends the sentences “hello, how are you?” and “hi, what is up?” to the server, which responds with the computed embeddings: a list of two vectors of 384 floats, each vector representing the embedding of one sentence.

[
  [
    0.019096793606877327,
    0.03446517512202263,
    0.09162796288728714,
    0.0701652243733406,
    -0.029946573078632355,
    ...
  ],
  [
    -0.06470940262079239,
    -0.03830110654234886,
    0.013061972334980965,
    -0.0003482792235445231,
    ...
  ]
]
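
If you would rather call the service from Python than from curl, a minimal client using the requests library could look like this (same endpoint and payload as above):

import requests

response = requests.post(
    "http://127.0.0.1:8080/predictions/my_model",
    json={"input": ["hello, how are you?", "hi, what is up?"]},
)
embeddings = response.json()
print(len(embeddings), len(embeddings[0]))  # 2 sentences, 384 floats each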

Those embeddings can then be stored in any vector database, such as Pinecone or open-source alternatives like Weaviate, Qdrant, or Milvus.
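
Before wiring the service into one of those databases, you can sanity-check the embeddings directly: cosine similarity between two vectors yields a sentence-similarity score. A small helper, reusing the embeddings variable from the Python client above:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two vectors; if the service returns
    # L2-normalized embeddings, this reduces to a plain dot product.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings[0], embeddings[1]))

The two example sentences are close paraphrases, so you should see a fairly high score.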

Note that the server in this repository exposes the service over HTTP for demonstration purposes, but TorchServe also provides options to run a gRPC server that can integrate with the rest of your gRPC ecosystem.

Wrapping up

This journey to deploy the all-MiniLM-L6-v2 model using TorchServe began as a personal endeavor to build a robust service for computing sentence embeddings over my own document corpus. But my commitment to the ethos of open source, fostering shared knowledge and collective advancement, propelled me to take this experience beyond my personal notebooks. The result is this article and the corresponding GitHub repository. It’s my sincere hope that this will prove helpful to others on a similar journey, and perhaps even spur a community of shared experiences and mutual learning. If you find it beneficial, I’d appreciate a star on the repository.

Happy deploying!

Acknowledgments

  • Stane Aurelius for his great tutorial on TorchServe.
  • The contributors to the all-MiniLM-L6-v2 model.
  • The Microsoft Research team who produced the paper about MiniLM.
  • The HuggingFace team, which hosts the models and makes them easily available to everyone.

References

[1] Model all-MiniLM-L6-v2 on HuggingFace — https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

[2] Tutorial for serving a model with TorchServe — https://supertype.ai/notes/serving-pytorch-w-torchserve/

[3] Research Paper “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers” — https://arxiv.org/abs/2002.10957

[4] Performance review of all-MiniLM-L6-v2 — https://www.sbert.net/docs/pretrained_models.html#model-overview

[5] TorchServe documentation — https://pytorch.org/serve/

[6] TorchServe GitHub repository — https://github.com/pytorch/serve

[7] The GitHub repository described in this article — https://github.com/clems4ever/torchserve-all-minilm-l6-v2


Clément Michaud

Principal Software Engineer at Aviatrix - Author of Authelia - Open-Source & ML Enthusiast