Creating Custom ChatGPT with Your Own Dataset using OpenAI GPT-3.5 Model, LlamaIndex, and LangChain


Large Language Model (LLM)

A Large Language Model (LLM) is a type of artificial intelligence (AI) algorithm that uses deep learning techniques and massively large data sets to understand, summarize, generate, and predict new content. The term generative AI is closely connected with LLMs, which are, in fact, a type of generative AI specifically architected to generate text-based content. LLMs are purpose-built and extensively trained for natural language processing tasks. These models are trained on vast quantities of text data, enabling them to generate text that closely resembles human language. They can grasp contextual nuances and answer questions, and they can also be fine-tuned for particular tasks such as translation, summarization, and sentiment analysis. The GPT (Generative Pre-trained Transformer) model series developed by OpenAI is a well-known example of an LLM. These GPT models are the core components of the widely recognized ChatGPT application, which we will delve into in the following section.

GPT Models

OpenAI is the research organization that pioneered the development of the GPT model series. These models have been trained to understand natural language and code, and to produce text outputs in response to their inputs. Their GPT-3 and GPT-4 models (which were used to build the well-known ChatGPT app) are game-changers. Before GPT-3 and GPT-4, there were GPT-1 and GPT-2, both impressive language models but with limitations in their datasets and capabilities. GPT-3 has 175 billion parameters, which enable it to produce human-like responses; it is often hard to distinguish a GPT-3 response from a human one. GPT-4 is OpenAI’s most advanced system and improves on GPT-3 considerably, although OpenAI has not publicly disclosed its parameter count. If you’re interested in delving deeper into how GPT models are constructed and trained, I recommend referring to this comprehensive research paper.

ChatGPT

ChatGPT is a web-based chatbot application that has been specifically designed and fine-tuned for optimal dialogue interactions. It leverages OpenAI’s powerful GPT models (GPT-3.5 and GPT-4) to facilitate seamless and engaging conversations with humans. The focus of ChatGPT lies in dialogue, enabling it to generate text in a chat-like fashion for tasks such as explaining code or even composing poems. Essentially, ChatGPT functions as an application, with a GPT model serving as its underlying intelligence. The name ChatGPT stems from the fact that it is a chat-oriented application built upon the foundation of the GPT model series.

OpenAI API

As I mentioned previously, OpenAI built the GPT LLM model series, including GPT-3 and GPT-4. Using these GPT models, you can build applications to draft documents, write computer code, answer questions about a knowledge base, analyze texts, and more. OpenAI provides APIs to interact with and use these models in our own applications. To use a GPT model via the OpenAI API, we send a request containing the input and our API key, and receive a response containing the model’s output.
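To make this concrete, here is a minimal sketch of such a request using the openai Python package (the pre-1.0 API, which matches the package versions pinned later in this post); the prompt text is just an illustration.

import os
import openai

# read the api key from the environment
openai.api_key = os.environ["OPENAI_API_KEY"]

# send a single chat message to the gpt-3.5-turbo model
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize what an LLM is in one sentence."}],
    max_tokens=64,
)

# print the model's reply
print(response["choices"][0]["message"]["content"])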

Models like GPT-3 and GPT-4 come pre-trained on massive public datasets, giving them incredible natural language processing capabilities out of the box. However, their utility is limited without access to your own private data. OpenAI offers APIs that allow us to harness the power of their models with custom datasets: rather than re-training the model, we index our proprietary data and supply the relevant pieces to the model at query time. In this illustration, I will delve into the process of connecting the gpt-3.5-turbo model to a collection of research papers provided in PDF format. Subsequently, I will demonstrate the creation of a chatbot, akin to ChatGPT, capable of responding to inquiries based on the content within these research papers.

LlamaIndex

LlamaIndex (previously known as gpt-index) is a data framework that provides a simple, flexible interface to connect LLMs with external data (e.g., your private data). It lets developers connect data from files such as PDFs and PowerPoints, apps such as Notion and Slack, and databases such as Postgres and MongoDB to LLMs. The framework includes connectors to ingest various data sources and formats, as well as ways to structure data so that it can be easily used with LLMs. The data is indexed into intermediate representations optimized for LLMs. LlamaIndex then allows natural language querying and conversation with your data via query engines, chat interfaces, and LLM-powered data agents. It enables your LLMs to access and interpret private data at scale without retraining the model on new data.

LlamaIndex creates a vectorized index from your document data, making it highly efficient to query. It uses this index to identify the most relevant sections of the documents, based on the similarity between the query and the indexed data. The retrieved information is then incorporated into the prompt sent to the GPT model, providing it with the necessary context to answer your question.
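Under the hood this is the classic retrieve-then-prompt pattern. The sketch below illustrates the idea with hypothetical helper functions; in practice LlamaIndex handles all of this internally, and the embeddings would come from the OpenAI embedding API.

import numpy as np

def cosine_similarity(a, b):
    # similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(query_embedding, chunk_embeddings, chunks, k=3):
    # rank document chunks by similarity to the query and keep the best k
    scores = [cosine_similarity(query_embedding, e) for e in chunk_embeddings]
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in best]

def build_prompt(question, retrieved_chunks):
    # splice the retrieved context into the prompt sent to the GPT model
    context = "\n---\n".join(retrieved_chunks)
    return f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"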

LangChain

LangChain is a robust library designed to streamline interaction with large language model (LLM) providers like OpenAI. It supports other LLM providers such as Cohere, Bloom, and Hugging Face as well. LangChain’s unique proposition is its ability to create Chains, which are logical links between one or more LLMs.

The complexity of LLMs, with their frequent updates and large numbers of parameters, has created intense competition among providers. To simplify the process of utilizing these models, LangChain provides APIs that abstract away many of the challenges associated with cloning code, downloading trained weights, and manually configuring settings. Basically, LangChain provides application programming interfaces (APIs) to access and interact with LLMs and facilitate seamless integration, allowing you to harness the full potential of LLMs for various use cases.
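As a quick illustration, here is a minimal sketch of calling gpt-3.5-turbo through LangChain’s ChatOpenAI wrapper, using the 0.0.x-era API pinned later in this post; the prompt is only an example.

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# ChatOpenAI wraps the OpenAI chat completion API behind a uniform interface
llm = ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo")

# send a single message and print the model's reply
reply = llm([HumanMessage(content="Explain a vector index in one sentence.")])
print(reply.content)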

LlamaIndex employs LangChain’s LLM modules under the hood and offers the flexibility to customize the underlying LLM, with the default option being OpenAI’s text-davinci-003 model. The selected LLM serves as the foundation for constructing responses within LlamaIndex and sometimes also plays a role during index creation.

The seamless combination of LlamaIndex and LangChain provides an effortless approach to connecting GPT models with proprietary datasets and developing applications on top of them. The following steps outline the process of indexing custom data and creating a chatbot application backed by a GPT model. In this scenario, I’ve utilized the GPT-3.5 model (gpt-3.5-turbo). Data indexing is handled by LlamaIndex, while integration with the OpenAI API is facilitated by LangChain. You can find all the relevant source code for this post on GitLab. Simply clone the repository and follow the steps below to proceed.

1. Install required packages

First, you need to install the following required Python packages: openai, PyPDF2 (a Python library for reading PDF files), llama_index, langchain, and gradio (a Python UI library). Note that langchain and llama-index are pinned to specific 0.x versions, since the code in this post uses their older APIs.

pip install openai
pip install PyPDF2
pip install langchain==0.0.148
pip install llama-index==0.5.6
pip install gradio

2. Create OpenAI API Key

To engage with OpenAI’s APIs for utilizing the GPT models, an API key must be generated. This API key can be obtained from platform.openai.com/account/api-keys.

The generated API key should be set as the environment variable OPENAI_API_KEY within the program.

import os

os.environ["OPENAI_API_KEY"] = 'your-openai-api-key'
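Avoid committing a real key to source control. A safer pattern is to export OPENAI_API_KEY in your shell before running the program, in which case this line can be dropped and the key will already be present in the environment.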

Next, visit platform.openai.com/account/usage and ensure that you have sufficient credits remaining. If you’ve utilized all your free credits, you’ll need to add a payment method to your OpenAI account or create a new OpenAI account with a different email address. Then, generate an API key from that account.

3. Create LlamaIndex

This step entails the creation of a LlamaIndex from the provided documents. In my case, I used research papers as the custom dataset. The papers were consolidated within a designated directory named docs, serving as the foundation for constructing the LlamaIndex. Throughout the index creation process, LlamaIndex calls the OpenAI text embedding API through the LangChain framework. The resulting index is saved as an index.json file for future use. Importantly, the index need not be generated every time; it can be constructed once, stored, and later loaded for querying.

from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, ServiceContext, PromptHelper
from langchain.chat_models import ChatOpenAI

def init_index(directory_path):
    # model params
    # max_input_size: maximum size of input text for the model.
    # num_outputs: number of output tokens to generate.
    # max_chunk_overlap: maximum overlap allowed between text chunks.
    # chunk_size_limit: limit on the size of each text chunk.
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    # llm predictor with langchain ChatOpenAI
    # ChatOpenAI is part of the LangChain library and is used to interact with the gpt-3.5-turbo model provided by OpenAI
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    # read documents from the docs folder
    documents = SimpleDirectoryReader(directory_path).load_data()

    # init index with document data
    # the index processes the document content and builds a vector index to facilitate efficient querying
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    # save the created index to disk
    index.save_to_disk('index.json')

    return index

4. Querying GPT model

After the index is generated, it can be saved and used for data querying. When a user submits a question, the system first searches for pertinent segments within the index. These document segments are then paired with the user’s query and transmitted to the GPT model API (gpt-3.5-turbo) via the LangChain framework. The response generated by the model is then presented to the user, offering a comprehensive answer that directly addresses their query.

from llama_index import GPTSimpleVectorIndex

def chatbot(input_text):
    # load the previously saved index from disk
    index = GPTSimpleVectorIndex.load_from_disk('index.json')

    # query the index to get a response for the question
    response = index.query(input_text, response_mode="compact")

    return response.response
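For a quick smoke test without the UI, you can call this function directly; this assumes init_index("docs") has already been run so that index.json exists, and the question is only an example.

# assumes index.json was created by init_index("docs")
print(chatbot("What is the main contribution of the paper?"))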

5. Build UI

To facilitate interaction with the custom chatbot, I’ve developed a simple user interface (UI) using the gradio library. This UI provides an interface for users to engage with the chatbot and receive responses based on their input queries.

import gradio as gr

# create ui to interact with the gpt-3.5 model
iface = gr.Interface(fn=chatbot,
                     inputs=gr.components.Textbox(lines=7, placeholder="Enter your question here"),
                     outputs="text",
                     title="Frost AI ChatBot: Your Knowledge Companion Powered-by ChatGPT",
                     description="Ask any question about rahasak research papers")
iface.launch(share=True)

6. Full Program

Here is the complete program. I’ve stored it in a file named model.py and executed it.

import os

os.environ["OPENAI_API_KEY"] = 'your-openai-api-key'

from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, ServiceContext, PromptHelper
from langchain.chat_models import ChatOpenAI
import gradio as gr

def init_index(directory_path):
    # model params
    # max_input_size: maximum size of input text for the model.
    # num_outputs: number of output tokens to generate.
    # max_chunk_overlap: maximum overlap allowed between text chunks.
    # chunk_size_limit: limit on the size of each text chunk.
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    # llm predictor with langchain ChatOpenAI
    # ChatOpenAI is part of the LangChain library and is used to interact with the gpt-3.5-turbo model provided by OpenAI
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    # read documents from the docs folder
    documents = SimpleDirectoryReader(directory_path).load_data()

    # init index with document data
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    # save the created index to disk
    index.save_to_disk('index.json')

    return index

def chatbot(input_text):
    # load the previously saved index from disk
    index = GPTSimpleVectorIndex.load_from_disk('index.json')

    # query the index to get a response for the question
    response = index.query(input_text, response_mode="compact")

    return response.response

# create index
init_index("docs")

# create ui to interact with the gpt-3.5 model
iface = gr.Interface(fn=chatbot,
                     inputs=gr.components.Textbox(lines=7, placeholder="Enter your question here"),
                     outputs="text",
                     title="Frost AI ChatBot: Your Knowledge Companion Powered-by ChatGPT",
                     description="Ask any question about rahasak research papers")
iface.launch(share=True)

To execute the program, simply run python model.py. This will create an index from the data located in the docs folder and save it as index.json. It's important to be aware that rate limits might be encountered from the OpenAI API during index creation, particularly for large document sets. The gradio web app will start at http://127.0.0.1:7860; gradio also prints a temporary public share URL (in my run, https://dc97c2bc37874fa808.gradio.live). Users can conveniently engage with the chatbot via these URLs.

❯❯ python model.py
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 56297 tokens
Running on local URL: http://127.0.0.1:7860

Thanks for being a Gradio user! If you have questions or feedback, please join our Discord server and chat with us: https://discord.gg/feTf9x3ZSB
Running on public URL: https://dc97c2bc37874fa808.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)

What’s Next?

I have written a variety of blog posts featuring advanced examples and discussions on LLMs, OpenAI GPT models, LangChain, and more. In the next post, I delve into the development of a session-based custom ChatGPT model for website content utilizing the OpenAI GPT-4 LLM, LangChain ConversationalRetrievalChain, and MongoDB conversational memory. Happy reading :)

