Building a private GPT with Haystack, part 2: code in detail

Felix van Litsenburg
6 min read · Jul 26, 2023


This article explains in detail how to build a private GPT with Haystack, and how to customise certain aspects of it. A private GPT lets you apply Large Language Models (LLMs), such as GPT-4, to your own documents in a secure, on-premise environment. This puts into practice the principles and architecture outlined in the previous post.

Revisiting the architecture

Architecture for private GPT using Promptbox

Recall the architecture outlined in the previous post. We use Streamlit for the front-end, Elasticsearch for the document database, Haystack for orchestration, and HuggingFace for our models.

Let’s first download everything we need. This consists of the following:

  1. The Promptbox repo, which contains the code for our front-end and also handles the bulk of the orchestration
  2. The Haystack repo, which contains one of our key libraries, as well as the capabilities to run orchestration
  3. The Large Language Models (LLMs) from HuggingFace. A very lightweight option is Google’s flan-t5-base

The requirements are:

  1. Python, we use 3.8 here
  2. An environment manager (conda or pyenv)
  3. Git
  4. Docker
  5. We run on Ubuntu 20.04

Promptbox can run on both a GPU and a CPU, although the former will of course be much quicker! You do not need to configure any settings to use the GPU; Promptbox should automatically detect whether a compatible GPU is available.
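Under the hood, this kind of detection typically boils down to a PyTorch check along the lines of the sketch below; Promptbox’s own logic may differ slightly.

import torch

# Use the first CUDA device if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")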

Downloading the necessities

Alright, let’s get to it. Assuming you’re in your home folder in Ubuntu, let’s carry out the following commands.

Downloading the language models

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

mkdir hf
cd hf

git clone https://huggingface.co/google/flan-t5-base

This will create a new directory, ‘hf’ (for Hugging Face), where the language models are stored locally. The Google flan-t5-base model will be stored there.
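If you want to double-check that the clone worked, a quick sanity check is to load the model locally with transformers. The path below assumes the ~/hf layout created above, and the example prompt is just an illustration.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "hf/flan-t5-base"  # relative to your home folder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

inputs = tokenizer("Translate to German: How old are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))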

Downloading Promptbox and Haystack

git clone https://github.com/fvanlitsenburg/promptbox.git
git clone -b v1.18.0 https://github.com/deepset-ai/haystack.git

So, we’ve downloaded Promptbox and Haystack. Let’s install Haystack and get it running in Docker. First, however, make sure to activate the virtual environment you want to run this in!

pip install --upgrade pip

# Install Haystack in editable mode
cd haystack
pip install -e '.[all]'

# Run Haystack
docker-compose up

With Haystack running, we have ticked off the document database as well as document preprocessing. Running docker-compose up spins up the ‘vanilla’ Haystack API, which covers document upload, preprocessing, and indexing, as well as an Elasticsearch document database.
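To verify that both services are up, you can run a quick check from Python. The ports assume the defaults in Haystack’s docker-compose file: 8000 for the REST API and 9200 for Elasticsearch; adjust them if your compose file differs.

import requests

# Both calls should return 200 once the containers have finished starting
print(requests.get("http://localhost:9200").status_code)       # Elasticsearch
print(requests.get("http://localhost:8000/docs").status_code)  # Haystack REST API docs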

To continue, open a new terminal window and install Promptbox’s requirements:

# Install Promptbox requirements
cd promptbox
pip install -r requirements.txt

Running Promptbox

Simply run Promptbox using Streamlit to kick things off.

# Run Promptbox through Streamlit
streamlit run ui/Home.py

The following should appear in your terminal:

Click on the Network URL link and Promptbox will begin loading. You will see the text “Running load_models()” appear on screen, as the flan-t5-base model is being loaded.
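For context, here is a minimal sketch of what a load_models() helper along these lines could look like; the actual implementation in ui/Home.py may differ, but the idea is to load each model once and keep it in memory across Streamlit reruns.

import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_models(modellist):
    # shown in the terminal; the real app surfaces this message in the UI
    print("Running load_models()")
    # one text2text-generation pipeline per model, loaded from the local hf/ folder
    return {name: pipeline("text2text-generation", model=f"hf/{name}") for name in modellist}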

It will look like this once loaded:

And there you have it! Click through the menu on the left to try out the use cases presented in the previous document.

Customising Promptbox

Using multiple Large Language Models

In this demo, Promptbox uses Google’s flan-t5-base only. It is, however, possible to use multiple LLMs. Be wary of the heavy computational resources that LLMs require, and don’t use too many!

Adding a new LLM for Promptbox to use is very simple.

  1. Select a text2text-generation LLM from HuggingFace’s hub, for example FastChat-T5
  2. Download the LLM:
cd hf
git clone https://huggingface.co/lmsys/fastchat-t5-3b-v1.0

3. Add it to the ‘modellist’ list in ui/Home.py:

st.session_state['modellist'] = ['flan-t5-base','fastchat-t5-3b-v1.0']

From this point on, the model will be available to use in the two use cases.

Caching models

If you are using multiple models, you may want to think about caching them. This will save you the hassle of reloading them for a second or third use. Promptbox, by default, stores your models in cache. If you’re running on GPU, especially with larger models, this might cause out-of-memory errors.

To keep only one model in cache at any time, go to ui/Home.py and change the following lines:

modelbase = load_models(st.session_state.modellist) # Comment this line to run with only one model in cache

# modelbase = load_models(['flan-t5-base']) # Uncomment this line to run with only one model in cache
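If you want something a bit more flexible than commenting lines in and out, a hypothetical alternative is to cap the Streamlit cache at one entry and free GPU memory before switching models. The function names here are illustrative, not part of Promptbox.

import gc
import streamlit as st
import torch
from transformers import pipeline

@st.cache_resource(max_entries=1)  # keep at most one model in cache
def load_single_model(name):
    return pipeline("text2text-generation", model=f"hf/{name}")

def free_gpu_memory():
    # release cached CUDA memory before loading the next model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()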

Using Retrieval-Augmented Generation with Embeddings

It is relatively straightforward to run Retrieval-Augmented Generation (RAG) with the current set-up, and as Promptbox progresses this might be an interesting feature to support out of the box.

Retrieval-Augmented Generation essentially means: skip the step of finetuning an LLM on your own data. Instead, give it the most relevant information from your own documents as context, so it can generate more accurate answers.

Strictly speaking, Promptbox already is a “light” form of RAG. We use a BM25 retriever to extract relevant passages, and feed these as context to the LLM. Most RAG implementations, however, use vector databases such as Weaviate, storing embedded documents, for the retrieval component.
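Condensed into a few lines, this “light” RAG flow looks roughly like the sketch below, using Haystack v1 classes; the actual Promptbox code in ui/utils.py is organised differently, and the query and prompt are only examples.

from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever
from transformers import pipeline

document_store = ElasticsearchDocumentStore(index="document", host="localhost")
retriever = BM25Retriever(document_store=document_store, top_k=3)
generator = pipeline("text2text-generation", model="hf/flan-t5-base")

query = "What does the contract say about termination?"
context = " ".join(doc.content for doc in retriever.retrieve(query))
prompt = f"Answer the question based on the context.\nContext: {context}\nQuestion: {query}"
print(generator(prompt, max_length=200)[0]["generated_text"])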

If you have made it this far, you probably already know what embeddings and vector databases are. In case you do not, this article explains it relatively well. The same words can have different meanings, or different words can have similar meanings. For example, a tree can refer to a plant, but also to a mathematical concept. A birch is a type of tree. In the vector representation of words, the tree in “Johnny climbed a tree” will be closer to birch than the tree in “Johnny realised every tree was a bipartite graph”.

Using embeddings means that if we ask a RAG “Should the president produce more cars?”, the LLM is more likely to be fed context about the president of Ford motors than the President of the USA.
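You can see this effect directly with sentence-transformers, using the same embedding model we will use for retrieval below; the example sentences are made up for illustration.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "Should the president produce more cars?",
    "The president of Ford announced a new production line.",
    "The President of the USA signed a new bill.",
]
embeddings = model.encode(sentences)
# cosine similarity of the question against the two candidate contexts
print(util.cos_sim(embeddings[0:1], embeddings[1:]))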

Enough context, let’s build it. We need to adjust our setup so that a) documents are embedded, and b) retrieval uses those embeddings.

Go to the Haystack folder, open rest_api/rest_api/pipeline/pipelines.haystack-pipeline.yml, and change the components section to the below:

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
      embedding_dim: 384 # we need to specify the dimensions of the embedding model
  - name: Retriever
    type: EmbeddingRetriever # this replaces the BM25 retriever
    params:
      document_store: DocumentStore
      top_k: 10
      embedding_model: sentence-transformers/all-MiniLM-L6-v2
  - name: Reader # custom-name for the component; helpful for visualization & debugging
    type: FARMReader # Haystack Class name for the component
    params:
      model_name_or_path: deepset/roberta-base-squad2
      context_window_size: 500
      return_no_answer: true
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 1000
  - name: FileTypeClassifier
    type: FileTypeClassifier
Now, whenever you upload a document, it will be preprocessed and then embedded into a 384-dimension vector space.
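You may need to restart the Haystack containers for the pipeline change to take effect. Once the API is back up, one way to index a document is to POST it to the REST API’s file-upload endpoint, for example from Python; the port assumes the default docker-compose setup, and example.pdf is just a placeholder.

import requests

# send a local file to the Haystack REST API for conversion, preprocessing and indexing
with open("example.pdf", "rb") as f:
    response = requests.post("http://localhost:8000/file-upload", files={"files": f})
print(response.status_code)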

To make sure that Promptbox also correctly retrieves these embedded documents, make these changes in ui/utils.py:

ESdocument_store = ElasticsearchDocumentStore(index="document",host='localhost') # Comment to change to embedding retrieval

# ESdocument_store = ElasticsearchDocumentStore(index="document",host='localhost',embedding_dim=384) # Uncomment for embedding retrieval

Retriever = BM25Retriever(document_store=ESdocument_store) # Comment to change to embedding retrieval

#Retriever = EmbeddingRetriever(document_store=ESdocument_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2", model_format="sentence_transformers", top_k=5) # Uncomment for embedding retrieval

And there we go! Now, Promptbox embeds your documents automatically, and uses sentence-transformers to retrieve the embedded documents.
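One caveat: documents that were already indexed before the switch will not have embeddings yet. Below is a hedged sketch of backfilling them with Haystack’s update_embeddings(); newly uploaded files are embedded by the pipeline automatically.

from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = ElasticsearchDocumentStore(index="document", host="localhost", embedding_dim=384)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    model_format="sentence_transformers",
)
# compute and store embeddings for documents that were indexed before the change
document_store.update_embeddings(retriever)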

Next steps

Currently, Promptbox is a proof-of-concept that can be enhanced as follows:

  1. Dockerizing certain steps, such as automatically creating a UI image or downloading Haystack
  2. Moving certain elements to environment variables or a config file, for example which language models to use, or whether to use embeddings or not
  3. Creating more routing options — at the moment, users can ‘route’ documents with the BinarySelection tool. A document that gets a ‘no’ answer will be analysed further; a document that gets a ‘yes’ answer will be skipped. Here, I can imagine many more use cases: classification into more than ‘yes’/‘no’ would be one, or giving users the option to create a more sophisticated chain (e.g. treating ‘no’ documents differently from ‘yes’ documents, or comparing documents to each other).
  4. Leveraging the Haystack API more — currently, the generative and retrieval components of Promptbox run on ui/utils.py. With the proper setup, this could be done through a Haystack pipeline. This would require some custom nodes and adapting the Haystack plumbing quite a bit
  5. Giving users more flexibility and control by creating a drag-and-drop interface, similar to Flowise, that allows them to set up their own workflows.
