Building an Offline API like ChatGPT: Ensuring Privacy and Autonomy
In the era of AI, accessing powerful language models usually means sending data over the internet, which raises privacy and security concerns. Data sent to popular API-driven models like ChatGPT leaves your machine and can be compromised or used for training, which is unacceptable for many kinds of sensitive data.
This guide shows how to build an autonomous, local API similar to ChatGPT, using a model that runs entirely on your own machine. For this solution I used Llama 2.
Key advantages of this approach
1. Data Processing Without Sending It Online
Unlike traditional cloud-based models, an offline API processes your data entirely on your own hardware, with no internet connection required.
2. Fine-tuning for Your Dataset
Because the model lives on your machine, you can fine-tune it on your own dataset without uploading that data anywhere.
3. Free Usage and Cost-Efficiency
Cloud-based APIs charge per request, which adds up quickly with heavy usage; a local model costs nothing beyond the hardware you already run it on.
4. Privacy Assurance
Privacy is paramount when dealing with sensitive data, and with an offline API nothing ever leaves your infrastructure.
Preparing Llama 2
We just need to clone the Llama repository:
git clone git@github.com:facebookresearch/llama.git
And download the model weights (I am using 7b-chat); the script will ask for the download URL that Meta emails you after you request access:
./download.sh
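Our api.py will import the llama package directly, so the cloned repository also needs to be installed together with its dependencies. One way to do that (assuming you are in the repository root) is an editable install:
pip install -e .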
Creating the API
For our API we’ll use Python, Flask, Fire, and the llama package. Let’s create an api.py file in the same repository and add the imports:
from llama import Llama
import fire
from torch.multiprocessing import Process, Queue
from flask import Flask, request, jsonify
import torch
import json
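The snippets below also reference a module-level args object holding the CLI settings, which is never defined in the article. Here is one minimal way to build it with argparse; the argument names, their defaults, and the ckpt_dir derivation are assumptions chosen to match the directory layout produced by download.sh:
import argparse

parser = argparse.ArgumentParser(description="Offline Llama 2 chat API")
parser.add_argument("--model", default="7b-chat", help="model variant downloaded with download.sh")
parser.add_argument("--tokenizer_path", default="tokenizer.model")
parser.add_argument("--max_seq_len", type=int, default=512)
parser.add_argument("--max_batch_size", type=int, default=4)
parser.add_argument("--temperature", type=float, default=0.6)
parser.add_argument("--top_p", type=float, default=0.9)
parser.add_argument("--port", type=int, default=5033)
parser.add_argument("--world_size", type=int, default=1)
args = parser.parse_args()

# download.sh stores the weights in a llama-2-<model>/ directory (assumed layout)
args.ckpt_dir = f"llama-2-{args.model}"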
We’ll use one request queue and one response queue per Llama 2 worker:
request_queues = [Queue() for _ in range(args.world_size)]
response_queues = [Queue() for _ in range(args.world_size)]
Write the main function that starts the workers and the API:
def main():
    print("Initializing Llama...")
    processes = []
    # start one Llama 2 worker process per rank
    for rank in range(args.world_size):
        p = Process(
            target=init_process,
            args=(rank, args.world_size, run, request_queues[rank], response_queues[rank]),
        )
        p.start()
        processes.append(p)
    # wait until every worker reports that Llama 2 has loaded
    for rank in range(args.world_size):
        response = response_queues[rank].get()
    print("Starting API...")
    app = Flask(__name__)
    app.route("/chat", methods=["POST"])(message_route)
    app.run("0.0.0.0", port=args.port)
    for p in processes:
        p.join()

if __name__ == "__main__":
    main()
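main() hands each worker to an init_process helper that is not shown here. Below is a minimal sketch of it, assuming its job is to set the environment variables that torch.distributed (used inside Llama.build) expects and then start the worker loop; the master address and port values are placeholders:
import os

def init_process(rank, world_size, fn, request_queue, response_queue):
    # assumed helper: configure the distributed environment that Llama.build relies on
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    # fn is the run() worker loop defined below
    fn(request_queue, response_queue)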
Create the worker function that runs Llama 2: it takes a dialog from the request queue, passes it to the model, and puts the answer on the response queue:
def run(request_queue, response_queue):
    # initialize Llama 2 (paths and limits come from the CLI arguments defined above)
    generator = Llama.build(
        ckpt_dir=args.ckpt_dir,
        tokenizer_path=args.tokenizer_path,
        max_seq_len=args.max_seq_len,
        max_batch_size=args.max_batch_size,
    )
    # send initialization signal
    response_queue.put("init")
    while True:
        # load the next dialog from the queue
        dialogs = [request_queue.get()]
        # send it to Llama 2
        results = generator.chat_completion(
            dialogs,
            temperature=args.temperature,
            top_p=args.top_p,
        )
        # put the model's answer on the response queue
        response = results[0]["generation"]
        response_queue.put(response)
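Each element of dialogs is one chat history in the format chat_completion expects: a list of messages, each with a role (system, user, or assistant) and a content string. A single dialog might look like this (the content is only illustrative):
dialog = [
    {"role": "system", "content": "Answer as concisely as possible."},
    {"role": "user", "content": "What is an offline LLM API?"},
]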
Finally, create a message route:
def message_route():
    # read the messages from the request body
    messages = request.json.get("messages")
    # hand the dialog to every Llama 2 worker
    for rank in range(args.world_size):
        request_queues[rank].put(messages)
    # wait for the workers to answer
    for rank in range(args.world_size):
        response = response_queues[rank].get()
    # return the answer as a regular JSON response
    return jsonify(response)
Try to run it:
python api.py --model 7b-chat --port=5033 --world_size=1
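Once the server prints "Starting API...", you can send a chat request from another terminal. A minimal example using the requests library (the URL, port, and message content are only illustrative):
import requests

resp = requests.post(
    "http://127.0.0.1:5033/chat",
    json={
        "messages": [
            {"role": "user", "content": "Why does an offline LLM API help with privacy?"}
        ]
    },
)
print(resp.json())  # e.g. {"role": "assistant", "content": "..."}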
We have built a simple, fully offline API that you can safely use with your private data.