Train ChatGPT with custom data and create your own chatbot on macOS

Sohaib Shaheen
7 min read · Mar 25, 2023


Building your own AI-powered chatbot has never been easier. With OpenAI’s ChatGPT, you can train a language model using custom data tailored to your specific needs.

In this article, I will guide you through the process of training ChatGPT with custom data on a macOS. By the end of this guide, you will have a working knowledge of how to set up, prepare your data, and fine-tune your chatbot.

Step 1: Install Python

You need Python 3 to start. Before jumping into the installation, I recommend checking whether you already have python3 installed. You can do that by running the following command:

python3 --version

If you see a version number after executing the command, python3 is already installed and you can skip this step. If you see a "command not found" error, proceed with the installation below:

Head over to the following link and download the Python installer: https://www.python.org/downloads/

Once downloaded, run the installer and wait for it to finish. Afterwards, run the above command again and it should output the Python version, as shown in the screenshot.

Step 2: Upgrade Pip

Python comes with pip pre-packaged, but in case you are using an older installation, it is always a good idea to update pip to the latest version. If you are wondering what pip is, it is basically a package manager for Python, similar to Composer for PHP. You can upgrade it with a very simple command:

python3 -m pip install -U pip

If you already have the latest pip installed, it will print a message such as: Requirement already satisfied: pip in [location-here]

If you don’t have the latest version of pip, it will install it. You can then verify that it is installed properly by executing the following command:

pip3 --version

It will tell you the version and location of the package.

Unfortunately, your troubles might not end here. The installer may place pip in a directory that is not in your PATH variable. To add it to your PATH, run the following command in Terminal:

nano ~/.bash_profile

and then add the installation directory to PATH. Depending on your existing file, it might look similar to this (the exact directory depends on your Python version):

export PATH="/usr/local/bin:$PATH"
export PATH="$HOME/Library/Python/3.11/bin:$PATH"

Notice the second PATH line that I have added for Python.

Step 3: Install libraries

Before diving into the actual training process, you’ll need to install some libraries. Open the Terminal application on your Mac and run the following commands one by one:

The first command installs the OpenAI library:

pip3 install openai

Next, install GPT Index, also known as LlamaIndex. It allows the LLM to connect to external data, which is our knowledge base.

pip3 install gpt_index

Once done, run the following command:

pip3 install PyPDF2

It is a Python-based PDF parsing library, needed if you are going to feed PDF files to the model.

And finally, run:

pip3 install gradio

which creates a simple web UI to interact with the chatbot.

With the libraries installed, we can move on to preparing the data and creating the training script.

Step 4: Get OpenAI key

Before diving into the script, let's get an API key from OpenAI. Head over to:

https://platform.openai.com/account/api-keys

If you haven't logged in already, it will ask you to log in. You can then click on "Create new secret key" to generate a key for our script:

Remember that once the key is generated, you won't be able to see it again. Copy it and save it somewhere secure so you can access it later.

Step 5: Prepare data

Create a new directory named 'docs' anywhere you like and put PDF, TXT or CSV files inside it. You can add multiple files if you like, but remember that the more data you add, the more tokens will be used. Free accounts are given $18 worth of credit to use.

Step 6: Create script

Now that everything is in place, our next step is to create a Python script to train the chatbot with custom data. It will use the files inside the docs directory that we created above and generate a JSON file.
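What that JSON file stores is essentially a vector index of your documents: each chunk of text gets a numeric representation, and at question time the closest chunks are handed to the LLM. Here is a toy illustration of the retrieval idea, using bag-of-words counts instead of real embeddings (this is not the actual gpt_index internals):

```python
from collections import Counter
import math

def embed(text):
    # Toy "embedding": a bag-of-words count (real indexes use learned embeddings).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(chunks, question):
    # Return the stored chunk most similar to the question.
    q = embed(question)
    return max(chunks, key=lambda c: cosine(embed(c), q))
```

The real index does the same kind of lookup with OpenAI embeddings, then passes the retrieved chunks to the completion model along with your question.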

You can use any text editor to create the script. macOS comes with TextEdit, which works fine, or even better, use Visual Studio Code if you have it.

Create a new file and copy in the following code:

from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain import OpenAI
import gradio as gr
import sys
import os

os.environ["OPENAI_API_KEY"] = ''

def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name="text-davinci-003", max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)

    index.save_to_disk('index.json')

    return index

def chatbot(input_text):
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact")
    return response.response

iface = gr.Interface(fn=chatbot,
                     inputs=gr.inputs.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="My AI Chatbot")

index = construct_index("docs")
iface.launch(share=True)
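A quick note on the chunk settings near the top of construct_index: they control how your documents are split before indexing, with a small overlap between chunks so context isn't cut mid-thought. Here is a toy sketch of that overlap idea (illustrative only, not the actual PromptHelper internals):

```python
def chunk_words(words, size=600, overlap=20):
    """Split a word list into chunks of `size` words, with `overlap`
    words shared between consecutive chunks (mirrors the idea behind
    chunk_size_limit / max_chunk_overlap)."""
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break
    return chunks
```

Each chunk is embedded separately, which is one reason very large files inflate token usage.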

Once copied, you need to add your OpenAI key to the code before saving it.

Notice the OPENAI_API_KEY variable in the code? Copy the OpenAI key that we generated in Step 4 between the single quotes, like this:

os.environ["OPENAI_API_KEY"] = 'your-key-goes-here'

and then save the file as app.py in the same location as your docs directory.

Notice how the docs folder and app.py are at the same level.
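If you would rather not hardcode the key in app.py, you can export it in Terminal (export OPENAI_API_KEY="sk-...") before launching the script, and replace the hardcoded assignment with a check like this (a small variation on the script above):

```python
import os

def require_api_key():
    """Return the OpenAI key from the shell environment, failing loudly if absent."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise SystemExit('OPENAI_API_KEY is not set; run: export OPENAI_API_KEY="sk-..." first')
    return key
```

This keeps the secret out of the file, which matters if you ever share or commit the script.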

Step 7: Let the fun begin

Now that we have everything in place, we can finally run the script and see the magic.

Navigate to where you have app.py and the docs directory; in my case it is the train directory on the desktop, as you can see in the screenshot above.

So I am going to open Terminal and run the following command:

cd ~/Desktop/train

I am now in the 'train' directory. Next I am going to execute the Python file:

python3 app.py

This will start training our custom chatbot. It might take some time depending on how much data you have fed it. Once done, it will output a link where you can test the responses using a simple UI.

As you can see, it outputs the local URL: http://127.0.0.1:7860

You can open this in any browser and start testing your custom trained chatbot. Notice that the port number above might be different for you.

You can ask questions on the left side and it will respond in the right column. Remember that questions cost tokens, so the more questions you ask, the more tokens will be deducted from your OpenAI account. Training also uses tokens, based on how much data you feed it.

Since I fed it data about a local organic honey company which specializes in pure, unaltered honey, let's see if it answers questions based on the data I provided.

Organic Sidr honey is one of their specialities, so it answers according to the knowledge base. Now we can ask it more specific questions, because it is trained on custom data.

Notice how the bot is context aware and knows that "you", in this context, means the company, and hence gives their contact number.

To train on more or different data, stop the script with CTRL + C, change the files, and then run the Python file again.
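Note that app.py as written rebuilds the index (and spends tokens) on every launch. A small tweak, sketched here with a generic builder callback, rebuilds only when index.json is missing, so you delete the file only when your docs change:

```python
import os

def ensure_index(build, path="index.json", force=False):
    """Call `build()` (which is expected to save the index to `path`)
    only when the saved index is missing or a rebuild is forced."""
    if force or not os.path.exists(path):
        return build()
    return None  # reuse the index already on disk
```

In app.py you would replace the unconditional construct_index("docs") call with ensure_index(lambda: construct_index("docs")).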

If this article was helpful, I would really appreciate a clap and a share. I will continue to share more as I dive further into ChatGPT, starting with building a custom bot UI to integrate into websites.

IMPORTANT UPDATE:

With the help of Soheil Sarmadi, I was able to discover some issues which need code modifications to work.

Use the code below if you get an error similar to: got an unexpected keyword argument 'llm_predictor'

Since gpt-index is now llama-index, the following is the updated code using that package:

from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, ServiceContext
from langchain import OpenAI
import gradio as gr
import os

os.environ["OPENAI_API_KEY"] = '---your open ai key --'

def construct_index(directory_path):
    num_outputs = 512

    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name="text-davinci-003", max_tokens=num_outputs))

    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

    docs = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex.from_documents(docs, service_context=service_context)

    index.save_to_disk('index.json')

    return index

def chatbot(input_text):
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact")
    return response.response

iface = gr.Interface(fn=chatbot,
                     inputs=gr.inputs.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="Custom-trained AI Chatbot")

index = construct_index("docs")
iface.launch(share=True)

For this, you need to install llama-index instead of gpt-index, i.e.

pip3 install llama-index

To make things easier, here are all the versions I installed:

python 3.11.2

pip 22.3.1

llama-index 0.5.4

PyPDF2 3.0.1

gradio 3.24.1
