How to Build an OpenAI-Compatible API using Any FastAPI Application with LLM Integration: A Step-by-Step Guide

Subhrajit Mohanty
9 min read · Jun 19, 2024


An OpenAI-compatible wrapper simplifies interactions with LLM providers by exposing the same interface as OpenAI's API. It streamlines integration, allowing developers to easily incorporate AI-powered functionality into their applications. With familiar methods for generating text, handling conversations, and managing responses, it hides the complexity of direct API calls. Such a wrapper is useful for both beginners and advanced users, making it quick to build projects on top of powerful language models.

In this tutorial, we'll create a FastAPI application that acts as an OpenAI-compatible interface in front of the Groq API, supporting both batch and streaming outputs. Our goal is to configure this wrapper so it works seamlessly with the OpenAI SDK: we simply point the SDK at the FastAPI URL as its base_url, pass an API key, and receive responses exactly as if we were calling OpenAI directly.

Take a closer look at the code structure

OpenAI-SDK-compatible-API
├── Readme.md
├── __version__.py
├── app.py
├── notebooks
├── requirements.txt
├── routes
│   ├── __init__.py
│   ├── completion.py
│   └── health.py
├── run.py
└── run.sh

Let’s start by examining the code snippet provided

Setting Up the Project

Here’s a glimpse into the structure of app.py

import os
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from __version__ import version, title, description
from routes import health, completion

root_path = "/"
if os.getenv("ROUTE"):
root_path = os.getenv("ROUTE")

app = FastAPI(
title=title,
version=version,
description=description,
root_path=root_path,
)

origins = ["*"]

app.add_middleware(
CORSMiddleware,
allow_origins=origins,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)

app.include_router(health.router)
app.include_router(completion.router)

Importing Modules

import os
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from __version__ import version, title, description
from routes import health, completion

  • os: This module provides a way to interact with the operating system, allowing us to fetch environment variables.
  • FastAPI: This is the main framework we're using to create our API.
  • CORSMiddleware: Middleware from FastAPI to handle CORS, which is essential for enabling resource sharing between different origins.
  • __version__: This custom module holds metadata about the application, such as its version, title, and description (a sketch of its likely contents follows this list).
  • routes: This module includes our route definitions, specifically health and completion.
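
The __version__.py module itself isn't shown in the article; a minimal sketch of what it might contain (the values below are assumptions, not the repository's actual metadata):

# __version__.py - assumed contents; adjust to your own project metadata
version = "0.1.0"
title = "OpenAI-SDK-compatible API"
description = "A FastAPI wrapper exposing an OpenAI-compatible chat completions endpoint."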

Configuring the Root Path

root_path = "/"
if os.getenv("ROUTE"):
root_path = os.getenv("ROUTE")

Here, we’re setting up a root_path for our API. By default, it is set to "/", but it can be overridden by an environment variable ROUTE. This is useful for deploying the application in different environments where the base path might change.

Initializing the FastAPI Application

app = FastAPI(
    title=title,
    version=version,
    description=description,
    root_path=root_path,
)

We initialize the FastAPI application with several parameters:

  • title: The title of the API, imported from __version__.
  • version: The version of the API, also from __version__.
  • description: A description of the API.
  • root_path: The base path for the API, which we've just configured.

Adding CORS Middleware

origins = ["*"]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

To handle CORS, we add CORSMiddleware to our application. This allows our API to be accessible from any origin (origins = ["*"]), which is useful for development but might need to be restricted in a production environment.

  • allow_origins: Specifies the origins that are allowed to make requests. Here, ["*"] means all origins are allowed.
  • allow_credentials: If set to True, cookies are supported.
  • allow_methods: Specifies the HTTP methods allowed when accessing the resource. ["*"] allows all methods.
  • allow_headers: Specifies the headers that are allowed. ["*"] allows all headers.
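
If you later tighten CORS for production, only the origins list needs to change; for example (the domain below is just a placeholder):

# In production, allow only known frontend origins instead of "*" (placeholder domain).
origins = ["https://your-frontend.example.com"]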

Including Routes

app.include_router(health.router)
app.include_router(completion.router)

Finally, we include our routers. Routers in FastAPI help in organizing the endpoints by grouping them into separate modules.

  • health.router: Exposes a simple health-check endpoint used to confirm the API is running.
  • completion.router: Exposes the chat completions endpoint that forwards requests to the LLM backend.

Here’s an overview of the structure found in routes/__init__.py:

from typing import List, Optional

from pydantic import BaseModel

class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = "llama3-70b-8192"
    messages: List[Message]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False
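
As a quick illustration of how these defaults behave, instantiating the request model with only messages supplied fills in the remaining fields automatically (a standalone sketch, run from the project root):

from routes import ChatCompletionRequest, Message

req = ChatCompletionRequest(messages=[Message(role="user", content="Hello!")])
print(req.model, req.max_tokens, req.temperature, req.stream)
# llama3-70b-8192 512 0.1 False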

Here’s an overview of the structure within routes/completion.py

Importing Necessary Modules

from fastapi import APIRouter, HTTPException, Depends
from routes import ChatCompletionRequest, Message
from starlette.responses import StreamingResponse
from fastapi.security.api_key import APIKeyHeader
from dotenv import load_dotenv
from openai import OpenAI
from typing import Optional, List
import asyncio
import time
import json
import os

We start by importing various modules:

  • APIRouter, HTTPException, Depends from FastAPI for routing and dependency injection.
  • ChatCompletionRequest, Message from our routes module.
  • StreamingResponse from Starlette for streaming responses.
  • APIKeyHeader from FastAPI for API key security.
  • load_dotenv from python-dotenv to load environment variables.
  • OpenAI from the openai package, used here to call Groq's OpenAI-compatible endpoint.
  • Optional, List from typing for type hinting.
  • Standard Python modules asyncio, time, json, and os.

Loading Environment Variables

load_dotenv()

BASE_URL = "https://api.groq.com/openai/v1"
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

Here, we load environment variables from a .env file. BASE_URL points to Groq's OpenAI-compatible endpoint, and GROQ_API_KEY is read from the environment.
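
The .env file does not appear in the project tree above; it only needs to hold the Groq key (the value below is a placeholder, not a real key):

GROQ_API_KEY=gsk_your_key_here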

Setting Up API Key Security

API_KEY = "1234"  # Replace with your actual API key
API_KEY_NAME = "Authorization"
api_key_header = APIKeyHeader(name=API_KEY_NAME, auto_error=False)

We define the API key and its header name. APIKeyHeader helps in extracting the API key from request headers.

Initializing the API Router and OpenAI Client

router = APIRouter()

client = OpenAI(
    api_key=GROQ_API_KEY,
    base_url=BASE_URL
)

An instance of APIRouter is created for routing, and the OpenAI client is initialized with the API key and base URL.

Verifying the API Key

def verify_api_key(api_key: str = Depends(api_key_header)):
    if api_key is None:
        print("API key is missing")
        raise HTTPException(status_code=403, detail="API key is missing")
    if api_key != f"Bearer {API_KEY}":
        print(f"Invalid API key: {api_key}")
        raise HTTPException(status_code=403, detail="Could not validate API key")
    print(f"API key validated: {api_key}")

This function checks that an API key is provided and valid; if the key is missing or incorrect, it raises a 403 Forbidden HTTP exception. Because the OpenAI SDK sends the key in an Authorization: Bearer <key> header, the comparison includes the Bearer prefix.

Creating the Asynchronous Response Generator

async def _resp_async_generator(messages: List[Message], model: str, max_tokens: int, temperature: float):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": m.role, "content": m.content} for m in messages],
        max_tokens=max_tokens,
        temperature=temperature,
        stream=True
    )

    for chunk in response:
        chunk_data = chunk.to_dict()
        yield f"data: {json.dumps(chunk_data)}\n\n"
        await asyncio.sleep(0.01)
    yield "data: [DONE]\n\n"

This asynchronous generator handles streaming responses from the upstream Groq API. It iterates over the response chunks, serializes each one to JSON, and yields it in the server-sent-events style the OpenAI SDK expects (data: ... lines, terminated by data: [DONE]). The small await asyncio.sleep(0.01) yields control back to the event loop between chunks.

Defining the Chat Completions Endpoint

@router.post("/v1/chat/completions", dependencies=[Depends(verify_api_key)])
async def chat_completions(request: ChatCompletionRequest):
    if request.messages:
        if request.stream:
            return StreamingResponse(
                _resp_async_generator(
                    messages=request.messages,
                    model=request.model,
                    max_tokens=request.max_tokens,
                    temperature=request.temperature
                ),
                media_type="application/x-ndjson"
            )
        else:
            response = client.chat.completions.create(
                model=request.model,
                messages=[{"role": m.role, "content": m.content} for m in request.messages],
                max_tokens=request.max_tokens,
                temperature=request.temperature,
            )
            return response
    else:
        raise HTTPException(status_code=400, detail="No messages provided")

This endpoint handles POST requests to /v1/chat/completions. It depends on the verify_api_key function to check for a valid API key.

  • If the request contains messages and requests a stream, it returns a StreamingResponse using the asynchronous generator.
  • If no streaming is requested, it directly calls the OpenAI API and returns the response.
  • If no messages are provided, it raises a 400 Bad Request exception.
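
Once the server is running (the run command appears later), you can also exercise this endpoint directly with the requests library rather than the OpenAI SDK; a non-streaming sketch using the placeholder key "1234" defined earlier:

import requests

# Direct call to the wrapper without the OpenAI SDK (non-streaming).
resp = requests.post(
    "http://localhost:8050/v1/chat/completions",
    headers={"Authorization": "Bearer 1234"},  # must match the Bearer check in verify_api_key
    json={
        "model": "llama3-70b-8192",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 64,
    },
)
print(resp.status_code)
print(resp.json())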

Here’s an overview of the structure of routes/health.py

Setting Up the APIRouter

from fastapi import APIRouter, HTTPException
router = APIRouter()

We begin by importing APIRouter and HTTPException from FastAPI. APIRouter allows us to define routes within a modular component, making our code organized and easier to manage.

Defining the Health Check Endpoint

@router.get("/health")
async def healthCheck():
    return {"message": "success"}

Here, we define a GET endpoint /health using the @router.get decorator. This endpoint is designed to perform a health check on our application.

Endpoint Functionality Explained

  • Decorator (@router.get("/health")): Specifies that this function handles GET requests to the /health endpoint within our router.
  • Function (async def healthCheck():): This asynchronous function executes when a GET request is made to /health.
  • Return Value (return {"message": "success"}): Responds with a JSON object containing a "message" key with the value "success". This simple response indicates that the application is running and healthy.
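
Once the server is running, a quick check with the requests library (a sketch; port 8050 matches the run command below) confirms the service responds:

import requests

print(requests.get("http://localhost:8050/health").json())  # expected: {"message": "success"}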

To run the application, use the following command:

gunicorn -k uvicorn.workers.UvicornWorker --workers 1 --threads=1 --max-requests 512 --bind 0.0.0.0:8050 app:app

Let’s dissect each part of this command:

  • gunicorn: This is the command to start the Gunicorn server.
  • -k uvicorn.workers.UvicornWorker: Specifies the type of worker class Gunicorn should use. Here, uvicorn.workers.UvicornWorker is used, which is specifically designed for running ASGI applications like FastAPI. Uvicorn is the ASGI server that FastAPI uses, and using UvicornWorker ensures compatibility and optimal performance.
  • --workers 1: This option specifies the number of worker processes Gunicorn should spawn to handle requests. In this case, only one worker process (--workers 1) is used. Adjusting the number of workers depends on your application's load and the resources available on your server. Increasing the number of workers can improve concurrency and handle more requests simultaneously.
  • --threads=1: This sets the number of threads per worker process. Each worker process can handle multiple threads to handle concurrent requests. In this command, each worker process is configured to use only one thread (--threads=1). Adjusting the number of threads per worker depends on the nature of your application and workload. FastAPI typically benefits more from scaling with multiple processes (--workers) rather than threads.
  • --max-requests 512: Defines the maximum number of requests a worker process will handle before it is restarted. This is useful for preventing memory leaks or other issues that may accumulate over long periods of continuous operation. After handling 512 requests (--max-requests 512), Gunicorn will gracefully restart the worker process. Adjust this value based on your application's memory usage and stability needs.
  • --bind 0.0.0.0:8050: Specifies the socket to bind Gunicorn to. Here, it binds to all available network interfaces (0.0.0.0) on port 8050. This means Gunicorn will listen for incoming HTTP requests on port 8050, making your FastAPI application accessible via http://<your-server-ip>:8050.
  • app:app: This tells Gunicorn where to find the application: the first app is the Python module (app.py) and the second app is the FastAPI instance created inside it.
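
The project tree also lists run.py and run.sh, which aren't shown in the article. A minimal sketch of what run.py might contain, assuming it simply starts the app with Uvicorn for local development (the repository's actual script may differ):

# run.py - assumed contents for local development; the repository's version may differ
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8050)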

Now that the application is up and running at http://localhost:8050, let’s explore how to utilize the OpenAI SDK to interact with the API, both for streaming and batch processing.

First, we need to import the OpenAI class from the openai library. Then, we initialize the client and connect it to our local server.

from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="1234",
    base_url="http://localhost:8050/v1"  # change the default port if needed
)

In this snippet, we initialize the OpenAI client with an API key ("1234") and set the base_url to point to our local server running on port 8050. You can change the port number if your server is running on a different port.

Configuring Streaming

Next, we set up a variable to determine whether we want to handle responses synchronously or as a stream.

stream = True

By setting stream to True, we prepare to handle streaming responses. If set to False, the response will be handled synchronously.

Making the API Call

We make an API call to get a chat completion response

# call API
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?",
        }
    ],
    model="llama3-70b-8192",
    stream=stream
)

Here, we call the client.chat.completions.create method with the following parameters:

  • messages: A list containing a dictionary with the user's message.
  • model: Specifies the model to use, in this case, llama3-70b-8192.
  • stream: Indicates whether the response should be streamed.

Handling the Response

Depending on the value of stream, we handle the response differently.

Synchronous Response

If stream is set to False, we print the response directly.

if stream == False:
    print(chat_completion.choices[0].message.content)

In this case, we access the message content from the response and print it. The choices list contains the completion options, and we select the first option (choices[0]).

Streaming Response

If stream is set to True, we handle the response as a stream.

if stream == True:
    for chunk in chat_completion:
        # The final chunk may carry no content, so guard against printing None.
        print(chunk.choices[0].delta.content or "", end="", flush=True)

Here, we iterate over the response chunks. Each chunk contains a partial completion of the response. We print each chunk’s content, ensuring it appears continuously by setting end="" and flush=True.

You can access the complete code repository here: OpenAI-SDK-compatible-API. Follow the provided steps to easily customize it according to your specific requirements. This repository offers a practical starting point for integrating OpenAI’s SDK into your applications, allowing flexibility for adaptation and enhancement based on individual project needs.

Conclusion

In conclusion, integrating OpenAI’s LLM models into a FastAPI application empowers developers to build sophisticated AI-driven APIs efficiently. By following this step-by-step guide, developers can seamlessly incorporate advanced language generation capabilities into their applications. Leveraging FastAPI’s intuitive framework alongside OpenAI’s powerful LLM models not only enhances API functionality but also opens doors to diverse applications in natural language processing. This comprehensive approach ensures robust, scalable solutions that harness cutting-edge AI technologies for a wide range of use cases.


Subhrajit Mohanty

Head of Engineering at Katonic, expert in data engineering, DevOps, MLOps, AI, and generative AI, solving complex AI challenges with innovative solutions.