Elevating LLM Deployment with FastAPI and React: A Step-By-Step Guide

George Wen
4 min readMar 27, 2024


In a previous exploration, I delved into creating a Retrieval-Augmented Generation (RAG) demo, utilising Google’s Gemma model, Hugging Face, and Meta’s FAISS, all within a Python notebook. That demonstration showcased the potential to build a locally-run, RAG-powered application.

The conceptual flow of using RAG with LLMs. (Source)

This article advances that groundwork by deploying the model and RAG functionality via FastAPI, then consuming the API through a straightforward ReactJS frontend. A notable enhancement in this iteration is the integration of the open-source Mistral 7B model and the Chroma vector database. The Mistral 7B model is acclaimed for its strong balance between size and performance, surpassing the Llama 2 13B model across benchmarks and matching the capabilities of Google’s Gemma model.

Chroma is a leading AI-native, open-source embedding database and among the most popular vector databases. While we are running a local instance here, Chroma also offers a cloud-based Platform as a Service (PaaS), which would remove the need to manage our own infrastructure.

The FastAPI serving layer implementation is remarkably straightforward. Its structure echoes the previously explored RAG sample, comprising two primary classes: one for the retriever and another for the Q&A assistant. FastAPI serves to define the API endpoint and facilitate calls to the assistant.

Below, I present a snippet from the retriever class, tasked with scanning PDF files within a specified directory and generating vector indexes for Chroma database storage. This initialization process is a one-time operation.
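Since the original embedded snippet is not reproduced here, below is a minimal sketch of such a retriever class built with LangChain, Chroma, and a Hugging Face embedding model. The directory paths, embedding model name, and chunking parameters are illustrative assumptions rather than the original values.

```python
# Sketch of a retriever that indexes PDFs into a local Chroma store.
# Paths, model name, and chunk sizes are assumptions for illustration.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter


class Retriever:
    def __init__(self, pdf_dir: str = "data/pdfs", persist_dir: str = "chroma_db"):
        # Load every PDF found in the target directory.
        documents = PyPDFDirectoryLoader(pdf_dir).load()

        # Split the documents into overlapping chunks suitable for embedding.
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunks = splitter.split_documents(documents)

        # Embed the chunks and persist the vectors in a local Chroma database.
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        self.vector_store = Chroma.from_documents(
            chunks, embeddings, persist_directory=persist_dir
        )

    def as_retriever(self):
        # Expose a LangChain retriever for downstream similarity search.
        return self.vector_store.as_retriever(search_kwargs={"k": 4})
```

Because the vectors are persisted to disk, this indexing step only needs to run once; subsequent starts can reload the existing Chroma collection.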

Next, we examine the Assistant class. It accepts a retriever instance as constructor input and defines a method to run the RAG chain. This involves executing a similarity search, combining the results with the LLM, and producing the final response. One of the key differences from the previous example is the adoption of LlamaCpp instead of the Hugging Face transformers pipeline, for better performance.
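Here is a minimal sketch of what such an Assistant class can look like. It assumes a quantised GGUF build of Mistral 7B is available on local disk; the model path, context size, and use of LangChain’s RetrievalQA chain are illustrative choices, not the original implementation.

```python
# Sketch of the Assistant: LlamaCpp runs Mistral 7B locally, and a
# RetrievalQA chain wires the retriever into the prompt.
# The model path and parameters below are assumptions for illustration.
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA


class Assistant:
    def __init__(self, retriever):
        # LlamaCpp executes the quantised Mistral 7B model via llama.cpp.
        self.llm = LlamaCpp(
            model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
            n_ctx=4096,
            temperature=0.1,
        )
        # The chain performs the similarity search and stuffs the retrieved
        # chunks into the prompt before calling the model.
        self.chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=retriever.as_retriever(),
        )

    def ask(self, question: str) -> str:
        # Run the RAG chain and return the generated answer text.
        return self.chain.invoke({"query": question})["result"]
```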

Furthermore, here’s a snippet outlining the API endpoint definition. FastAPI’s design significantly streamlines the development of Python-based APIs. Within this API setup, we instantiate retriever and Q&A assistant objects, forwarding the necessary parameters to produce the final response.
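A minimal sketch of that endpoint is shown below. The route name, request schema, and module paths (app.retriever, app.assistant) are hypothetical, but the structure follows the description above: the retriever and assistant are instantiated once, and each request is forwarded to the assistant to produce the response.

```python
# Sketch of the FastAPI layer; route name, schema, and imports are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

from app.retriever import Retriever   # hypothetical module layout
from app.assistant import Assistant

app = FastAPI()

# Build the index and the assistant once at startup, not per request.
retriever = Retriever()
assistant = Assistant(retriever)


class Question(BaseModel):
    text: str


@app.post("/ask")
def ask(question: Question):
    # Forward the user's question to the RAG chain and return the answer.
    return {"answer": assistant.ask(question.text)}
```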

With the code fully assembled, initiating the web server is accomplished via the following command:

sudo uvicorn app.main:app --port 8080

Navigating to http://localhost:8080/docs in your browser will present the OpenAPI Swagger UI. You can test the API from this interface by clicking the ‘Try it out’ button.
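You can also exercise the endpoint outside the Swagger UI. The short sketch below assumes the hypothetical /ask route from the earlier snippet and the requests package installed.

```python
# Quick manual test of the (assumed) /ask endpoint served by uvicorn.
import requests

response = requests.post(
    "http://localhost:8080/ask",
    json={"text": "What does the indexed PDF say about vector databases?"},
)
print(response.json()["answer"])
```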

Here is the output from the web server console, which gives an overview of the requests being served and other operational details.

To round off this journey, I showcase a simple chatbot client developed in ReactJS. The foundational code was produced by ChatGPT, with modifications to call our FastAPI endpoint.

In conclusion, with the advancements in AI, particularly open-source LLMs, implementing LLM-powered applications has become increasingly accessible. However, production deployments require careful consideration of factors like scalability, security, and governance.

While local LLM deployments might be necessary for specific use cases, leveraging SaaS-based solutions often proves more cost-effective, especially for moderate consumption volumes. This project serves as a simplified educational setup, demonstrating the core functionalities. For production environments, explore additional features and best practices to ensure a robust and secure implementation.
