Building an Offline API like ChatGPT: Ensuring Privacy and Autonomy

Ihor Khrypchenko
Dec 2, 2023


In the era of AI, accessing powerful language models usually means sending data to someone else's servers, which raises privacy and security concerns. Anything you send to popular API-driven services like ChatGPT leaves your machine and can be leaked or used for training. For sensitive data, that is unacceptable.

This guide shows how to create an autonomous, local API similar to ChatGPT that serves a model running entirely on your own machine. For this solution I used Llama 2.

Key advantages of this approach

1. Data Processing Without Sending It Online

Unlike traditional cloud-based models, an offline API processes your data entirely on your own hardware, with no internet connection required.

2. Fine-tuning for Your Dataset

Because the model lives on your machine, you are free to fine-tune it specifically for your own dataset.

3. Free Usage and Cost-Efficiency

Cloud-based APIs charge per request, and the costs add up quickly under heavy usage; a local model costs nothing beyond the hardware it runs on.

4. Privacy Assurance

Privacy is paramount, especially when dealing with sensitive data. With a local model, your data never leaves your machine.

Preparing Llama 2

We just need to clone the Llama repository:

git clone git@github.com:facebookresearch/llama.git

Then download the model you need; I am using 7b-chat. The script asks for the presigned download URL that Meta emails you after you request access:

cd llama
./download.sh

Creating API

For our API we’ll use Python with Flask for the HTTP layer, argparse for the CLI flags, and the llama package from the repository we just cloned. Let’s create an api.py file in the same repository and start with the imports:

import argparse

from llama import Llama
from torch.multiprocessing import Process, Queue
from flask import Flask, request, jsonify
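
Every snippet below reads its settings from a module-level args object. A minimal sketch of its definition, with flag names chosen to match the run command at the end (the defaults, in particular the checkpoint and tokenizer paths, are assumptions to adjust for your setup):

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="llama-2-7b-chat/")      # checkpoint dir created by download.sh
parser.add_argument("--tokenizer", default="tokenizer.model")   # tokenizer file from the repo
parser.add_argument("--port", type=int, default=5033)
parser.add_argument("--world_size", type=int, default=1)        # number of model-parallel workers
parser.add_argument("--temperature", type=float, default=0.6)
parser.add_argument("--top_p", type=float, default=0.9)
args = parser.parse_args()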

We’ll use a pair of queues per worker, one for requests and one for responses, to exchange messages with the Llama processes:

request_queues = [Queue() for _ in range(args.world_size)]
response_queues = [Queue() for _ in range(args.world_size)]

Write the main function that starts the workers and then the API:

def main():
    print("Initializing Llama...")
    processes = []

    # start all Llama 2 worker processes
    for rank in range(args.world_size):
        p = Process(target=init_process, args=(rank, args.world_size, run, request_queues[rank], response_queues[rank]))
        p.start()
        processes.append(p)

    # block until every worker signals that Llama 2 has loaded
    for rank in range(args.world_size):
        response_queues[rank].get()

    print("Starting API...")

    app = Flask(__name__)
    app.route("/chat", methods=["POST"])(message_route)
    app.run("0.0.0.0", port=args.port)

    for p in processes:
        p.join()

The entry point goes at the very bottom of api.py, after all of the functions below are defined:

if __name__ == "__main__":
    main()
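
main hands each worker to an init_process helper. Llama.build initializes torch.distributed and model parallelism on its own from the usual torchrun-style environment variables, so a minimal sketch of the helper only needs to provide them and call the worker function (the master address and port are arbitrary local values, an assumption of this sketch):

import os

def init_process(rank, world_size, fn, request_queue, response_queue):
    # Llama.build reads these variables to set up torch.distributed
    # and model parallelism before loading the checkpoint
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    fn(request_queue, response_queue)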

Create the worker function that runs Llama: it pulls messages from the request queue, sends them to the model, and puts the answer on the response queue. Note that Llama.build requires the checkpoint directory, the tokenizer path, and sequence/batch limits; the limits below are assumed values, so tune them for your hardware:

def run(request_queue, response_queue):
    # initialize Llama 2; the sequence/batch limits are assumed values
    generator = Llama.build(
        ckpt_dir=args.model,
        tokenizer_path=args.tokenizer,
        max_seq_len=512,
        max_batch_size=1,
    )

    # send initialization signal
    response_queue.put("init")

    while True:
        # block until the API puts a dialog on the queue
        dialogs = [request_queue.get()]

        # send messages to Llama 2
        results = generator.chat_completion(
            dialogs,
            temperature=args.temperature,
            top_p=args.top_p,
        )

        # get the response from Llama 2 and hand it back to the API
        response = results[0]["generation"]
        response_queue.put(response)

Finally, create a message route:

def message_route():
    # get the list of messages from the request body
    messages = request.json.get("messages")

    # add the messages to every Llama 2 worker's queue
    for rank in range(args.world_size):
        request_queues[rank].put(messages)

    # wait for the response (each rank produces the same answer)
    for rank in range(args.world_size):
        response = response_queues[rank].get()

    # return a regular JSON response
    return jsonify(response)

Try to run it:

python api.py --model llama-2-7b-chat --port=5033 --world_size=1
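
Once the server is up, any HTTP client can talk to it. Here is a quick sanity check from Python, assuming the server is listening locally on port 5033 and the dialog follows Llama 2's chat message format:

import requests

response = requests.post(
    "http://localhost:5033/chat",
    json={"messages": [{"role": "user", "content": "Hello! Who are you?"}]},
)
print(response.json())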

We’ve built a simple, fully offline API that you can safely use with your private data.
