Building an Offline API like ChatGPT: Ensuring Privacy and Autonomy

Ihor Khrypchenko
Dec 2, 2023


In the era of AI, accessing powerful language models usually means sending data to someone else's servers, which raises privacy and security concerns. Anything you send to popular API-driven services like ChatGPT leaves your machine and can be leaked or used for training. For sensitive data, that is unacceptable.

This guide shows how to create an autonomous, local API similar to ChatGPT that serves a model running entirely on your own machine. For this solution I used Llama 2.

Key advantages of this approach

1. Data Processing Without Sending It Online

Unlike traditional cloud-based models, an offline API processes your data entirely on your own hardware, with no internet connection required.

2. Fine-tuning for Your Dataset

Because the model lives on your machine, you are free to fine-tune it specifically for your own dataset.

3. Free Usage and Cost-Efficiency

Cloud-based APIs charge per request, and the costs add up quickly under heavy usage; a local model costs nothing beyond the hardware it runs on.

4. Privacy Assurance

Privacy is paramount, especially when dealing with sensitive data. With a local model, your data never leaves your machine.

Preparing Llama 2

We just need to clone the Llama repository:

git clone git@github.com:facebookresearch/llama.git

Then download the model you need; I am using 7b-chat. The script asks for the presigned download URL that Meta emails you after you request access:

cd llama
./download.sh

Creating API

For our API we’ll use Python with Flask for the HTTP layer, argparse for the CLI flags, and the llama package from the repository we just cloned. Let’s create an api.py file in the same repository and start with the imports:

import argparse

from llama import Llama
from torch.multiprocessing import Process, Queue
from flask import Flask, request, jsonify
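
Every snippet below reads its settings from a module-level args object. A minimal sketch of its definition, with flag names chosen to match the run command at the end (the defaults, in particular the checkpoint and tokenizer paths, are assumptions to adjust for your setup):

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="llama-2-7b-chat/")      # checkpoint dir created by download.sh
parser.add_argument("--tokenizer", default="tokenizer.model")   # tokenizer file from the repo
parser.add_argument("--port", type=int, default=5033)
parser.add_argument("--world_size", type=int, default=1)        # number of model-parallel workers
parser.add_argument("--temperature", type=float, default=0.6)
parser.add_argument("--top_p", type=float, default=0.9)
args = parser.parse_args()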

We’ll use a pair of queues per worker, one for requests and one for responses, to exchange messages with the Llama processes:

request_queues = [Queue() for _ in range(args.world_size)]
response_queues = [Queue() for _ in range(args.world_size)]

Write the main function that starts the workers and then the API:

def main():
    print("Initializing Llama...")
    processes = []

    # start all Llama 2 worker processes
    for rank in range(args.world_size):
        p = Process(target=init_process, args=(rank, args.world_size, run, request_queues[rank], response_queues[rank]))
        p.start()
        processes.append(p)

    # block until every worker signals that Llama 2 has loaded
    for rank in range(args.world_size):
        response_queues[rank].get()

    print("Starting API...")

    app = Flask(__name__)
    app.route("/chat", methods=["POST"])(message_route)
    app.run("0.0.0.0", port=args.port)

    for p in processes:
        p.join()

The entry point goes at the very bottom of api.py, after all of the functions below are defined:

if __name__ == "__main__":
    main()
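
main hands each worker to an init_process helper. Llama.build initializes torch.distributed and model parallelism on its own from the usual torchrun-style environment variables, so a minimal sketch of the helper only needs to provide them and call the worker function (the master address and port are arbitrary local values, an assumption of this sketch):

import os

def init_process(rank, world_size, fn, request_queue, response_queue):
    # Llama.build reads these variables to set up torch.distributed
    # and model parallelism before loading the checkpoint
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    fn(request_queue, response_queue)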

Create the worker function that runs Llama: it pulls messages from the request queue, sends them to the model, and puts the answer on the response queue. Note that Llama.build requires the checkpoint directory, the tokenizer path, and sequence/batch limits; the limits below are assumed values, so tune them for your hardware:

def run(request_queue, response_queue):
    # initialize Llama 2; the sequence/batch limits are assumed values
    generator = Llama.build(
        ckpt_dir=args.model,
        tokenizer_path=args.tokenizer,
        max_seq_len=512,
        max_batch_size=1,
    )

    # send initialization signal
    response_queue.put("init")

    while True:
        # block until the API puts a dialog on the queue
        dialogs = [request_queue.get()]

        # send messages to Llama 2
        results = generator.chat_completion(
            dialogs,
            temperature=args.temperature,
            top_p=args.top_p,
        )

        # get the response from Llama 2 and hand it back to the API
        response = results[0]["generation"]
        response_queue.put(response)

Finally, create a message route:

def message_route():
    # get the list of messages from the request body
    messages = request.json.get("messages")

    # add the messages to every Llama 2 worker's queue
    for rank in range(args.world_size):
        request_queues[rank].put(messages)

    # wait for the response (each rank produces the same answer)
    for rank in range(args.world_size):
        response = response_queues[rank].get()

    # return a regular JSON response
    return jsonify(response)

Try to run it:

python api.py --model llama-2-7b-chat --port=5033 --world_size=1
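
Once the server is up, any HTTP client can talk to it. Here is a quick sanity check from Python, assuming the server is listening locally on port 5033 and the dialog follows Llama 2's chat message format:

import requests

response = requests.post(
    "http://localhost:5033/chat",
    json={"messages": [{"role": "user", "content": "Hello! Who are you?"}]},
)
print(response.json())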

We’ve built a simple, fully offline API that you can safely use with your private data.
