Charting the Course of ChatGCV: A Journey into AI, ML, and Beyond

George Karsas
Oct 16, 2023 · 10 min read

--

From PDFs to RAG, OpenAI to Vector Databases: My Odyssey in Crafting Chat George's Career Voyage

Introduction

Hello 👋 I'm George, a seasoned software engineer, leader, blockchain specialist and tech enthusiast. I've been in tech professionally for around 13 years and have delved into many corners of it: from VMs to containers to serverless applications; from on-prem deployments in data centres to Raspberry Pis at home and multiple cloud providers. I've also worked with many data stores and many other facets of technology.

In 2019 (after some years of trading) I decided to become the go-to guy for blockchain. So, I started working with smart contracts, understanding the ecosystem of blockchain technologies, and the world of crypto.

This year, though, something new sparked my interest. I had previously worked with neural networks for fun, but nothing really came of it; training your own models is overly expensive if you don't have the hardware to assist. OpenAI's release of ChatGPT re-sparked my interest in AI and ML and got me questioning how everything fits together.

I decided the only way for me to understand this was to build my own version of ChatGPT, except it would be purpose-driven: it would know everything about me, George. I mean, what better way to check you're doing things correctly than asking an AI about yourself, to test if it knows everything it should based on your training data?

Thus began ChatGCV.

The thought process

I had a simple use case: use existing documentation about me (CVs, cover letters, completed projects, etc.) and create an interface for anyone to ask things about George.

This gave me a basic list of requirements:

  • A user interface in a modern technology (React.js)
  • Must be hosted in a cloud platform (Azure)
  • Backend must be solid and secure (.NET 6)
  • Should be containerized for scalability (Azure Container Apps)
  • Should be delivered autonomously via a CI/CD pipeline (Azure DevOps)
  • Must use OpenAI.

The Challenges

Although I've had years of experience, I had never really sat down and understood the intricacies of AI and ML and how everything fits together. My challenge was purely a lack of understanding.

Learning AI/ML

It started with me asking ChatGPT to act as if I were a solutions architect looking to get into AI/ML, then giving it this prompt:

"Could you create a course with some materials for me to become the best AI/ML architect I can be."

It spat out a bunch of terms that I didn't understand, and already I knew I would need to spend a lot of time researching and understanding before I delved into the code. #GetTheBasicsRight

It took hundreds of Medium stories, YouTube videos, a short course on Udemy and two weeks of late nights for me to finally say: okay, I sort of understand this world.

After I thought I understood this world of AI/ML, I decided to deep-dive into the APIs available from OpenAI in their platform documentation. This gave me some insight into the services available for me to consume. So, of course, I jumped into making some completions requests via Postman to check out what kind of parameters I needed to understand and what kind of responses I could expect. This was straightforward.
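Under the hood, such a request is just a JSON payload sent to OpenAI's chat completions endpoint. A minimal sketch of building that payload in Python (the model name, system message and parameter values are illustrative, and no network call is made here):

```python
import json

# Shape of a request to OpenAI's chat completions endpoint
# (POST https://api.openai.com/v1/chat/completions).
# Model name, system message and parameter values are illustrative.
def build_completion_request(question: str, model: str = "gpt-3.5-turbo") -> str:
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You answer questions about George."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,  # lower = more deterministic answers
        "max_tokens": 256,   # cap response length (and cost)
    }
    return json.dumps(payload)

request_body = build_completion_request("What does George do?")
```

Send that body with an `Authorization: Bearer <api key>` header (exactly what Postman does for you) and you get back a JSON response containing the model's reply.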

Embeddings and Vector Databases

In order to use my own data (rather than just passing a blurb of information to OpenAI for context), I followed some suggestions to use vector databases with semantic search: query for what you need, and use the result as part of the request to OpenAI. So, I drew out a little sequence diagram (nothing special).

[Diagram: Embedding Store]

So, you may now be asking yourself, what are these things? Letā€™s break it down:

Embeddings

An embedding, in simple terms, is a way to represent complex data, like words or items, as points in a space (usually a high-dimensional space). Imagine you have a huge, messy closet full of different items: shoes, shirts, hats, and so on. Embedding is like organizing that closet in a way where similar items are placed close together. So, all sports shoes might be on one shelf, while formal shoes are on another.

In the context of machine learning, for example, words can be embedded in a space where similar words are close together. So, in this space, the word "king" might be close to "queen" but far from "apple". This allows the computer to understand relationships between words or items based on their proximity in this space. - ChatGPT

An embedding is represented as a numerical array (a vector), a form that can be queried efficiently via techniques like semantic search.
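As a toy illustration of that proximity idea (hand-made 2D vectors standing in for real embeddings, which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two vectors: 1.0 = same direction, negative = opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made 2D "embeddings"; real models emit much longer vectors.
embeddings = {
    "king":  [0.90, 0.80],
    "queen": [0.88, 0.82],
    "apple": [-0.70, 0.20],
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])  # near 1
king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])  # negative
```

Cosine similarity is the usual yardstick: vectors pointing the same way score near 1, unrelated ones score near 0 or below.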

So how do you structure them?

I spent a lot of time going back and forth on my embeddings. My first attempt used something from LangChain called PDFLoader to pull a PDF from a directory (my CV), extract the text, chunk it (break it up into smaller documents), embed the chunks using OpenAI's ada embedding model, and store them in my vector database.
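That chunking step can be sketched in plain Python (this is the idea, not LangChain's actual implementation; the sizes are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted text into overlapping chunks for embedding.

    The overlap keeps sentences that straddle a chunk boundary
    available in both neighbouring chunks."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Stand-in for the text extracted from a CV PDF:
cv_text = "George is a software engineer with 13 years of experience. " * 60
chunks = chunk_text(cv_text)
```

Each chunk then gets embedded and stored individually, so the search can return just the relevant slice instead of the whole document.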

This worked: when I pulled data from the vector DB it had context. The problem was that the retrieved document was always pretty much the whole CV. I mean, it worked, but what's the point of the semantic search then?

I needed to take a different approach: really refine my data to make sense in the context of the question (the search), embed that, and use the "answer" as the content to inject into the OpenAI request.

I went from around 4 indexes in my vector database to around 40, by structuring each document (automagically) to be meaningful. This also brought the token count of my requests down from around 400 to only 60-100, which made them cheaper and the responses more meaningful.
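The refined documents ended up as small, self-contained Q/A-style records: the question side is what gets embedded and searched, the answer side is what gets injected into the prompt. A sketch of the shape (the field names and contents here are illustrative):

```python
# Each record is one small, focused document. The "question" text is what
# gets embedded and searched; the "answer" is what gets injected into the
# OpenAI request. All contents here are illustrative.
documents = [
    {
        "id": "experience-years",
        "question": "How many years of experience does George have?",
        "answer": "George has around 13 years of professional experience.",
    },
    {
        "id": "specialty",
        "question": "What does George specialise in?",
        "answer": "George specialises in blockchain and smart contracts.",
    },
]

# Short answers keep the injected context (and the token count) small.
answer_words = [len(doc["answer"].split()) for doc in documents]
```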

Semantic Search

Semantic search is a type of search method that focuses on understanding the meaning or intent behind a user's query, rather than just matching keywords. It aims to provide more relevant and accurate results by considering the context, synonyms, and the relationships between words.

In essence, semantic search goes beyond mere word matching and delves into the meaning behind the words to offer more relevant search results.

Semantic search also gives us the concept of "scoring": a query returns the top x results, ranked from highest to lowest in terms of similarity and relevance to the search made.

Vector Databases

We need a way to store these embeddings and query against them. There are many vector databases available, from ChromaDB and Redis to some Postgres extensions, but I decided to go with Pinecone.

A vector database is essentially just the datastore for embeddings, along with metadata that holds information about each embedding.

As per the "Embedding Store" diagram: after we have converted the data (text, in my case) into an embedding, we store that data and its embedding representation in a vector database for later use.
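Conceptually, the store boils down to something like this tiny in-memory sketch (Pinecone and friends add persistence, scale and approximate-nearest-neighbour indexing on top; the API here is made up for illustration):

```python
import math

class EmbeddingStore:
    """Toy in-memory stand-in for a vector database such as Pinecone."""

    def __init__(self):
        self._records = []  # list of (vector, metadata) pairs

    def upsert(self, vector, metadata):
        self._records.append((vector, metadata))

    def query(self, vector, top_k=3):
        """Return the top_k records ranked by cosine similarity (the 'score')."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norms

        scored = [(cosine(vector, v), meta) for v, meta in self._records]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

store = EmbeddingStore()
store.upsert([0.9, 0.1], {"text": "George has 13 years of experience."})
store.upsert([0.1, 0.9], {"text": "George runs Raspberry Pis at home."})
matches = store.query([0.85, 0.15], top_k=1)  # nearest to the first record
```

The metadata travelling alongside each vector is what you actually read back after a match; the vectors themselves are only for ranking.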

So how do we bring all these things together? How do we use our information alongside OpenAI to return relevant information about George?

RAG and Fine-Tuning

We investigated two main methods: RAG (Retrieval-Augmented Generation) and fine-tuning.

RAG (Retrieval-Augmented Generation):

  • Purpose: RAG models are designed for tasks that involve both generating text and retrieving relevant information from a large external knowledge source, such as a database or the internet. These models combine a generative language model (like GPT-3) with a retrieval mechanism to enhance their ability to provide contextually relevant information.
  • Architecture: In a RAG model, there are two main components: a retriever and a generator. The retriever locates relevant information from the knowledge source, and the generator generates text based on that retrieved information. The retrieval process helps the model access specific information to answer questions or provide context-aware responses.
  • Use Cases: RAG models are particularly useful for question-answering tasks, information retrieval, and content summarization, where access to external knowledge is essential.

Fine-Tuning:

  • Purpose: Fine-tuning is a process where a pre-trained language model (such as GPT-3) is further trained on a specific dataset or task to adapt it to a particular application or domain. It allows the model to specialize in a specific task without requiring training from scratch.
  • Process: Fine-tuning involves taking a pre-trained model and exposing it to new data with labels or tasks relevant to the target application. The model's weights are adjusted during this process to make it perform better on the specific task it's being fine-tuned for.
  • Use Cases: Fine-tuning is commonly used for a wide range of NLP tasks, including sentiment analysis, named entity recognition, machine translation, text classification, and more. It allows the model to be adapted for domain-specific or task-specific requirements.
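For a feel of what fine-tuning data looks like, OpenAI-style training sets are typically JSONL files of example conversations. A sketch of producing one line (the contents are made up):

```python
import json

# One training example per line of a JSONL file, in the chat-style format
# accepted by OpenAI's fine-tuning endpoints. Contents are illustrative.
example = {
    "messages": [
        {"role": "system", "content": "You answer questions about George."},
        {"role": "user", "content": "How long has George been in tech?"},
        {"role": "assistant", "content": "Around 13 years professionally."},
    ]
}
jsonl_line = json.dumps(example)  # write many of these, one per line
```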

In essence, depending on your appetite, you are either building a great way to retrieve the relevant data and then using it to query OpenAI, or you are taking a pre-trained (or even new) model and extending it with your own knowledge base.

The latter requires a deeper dive and more cost overhead, as training models is GPU-intensive and can become expensive.

RAG, on the other hand, is just a fancy way of retrieving data: you can use traditional DB lookups or APIs to "enrich" your prompt, or do what I did and use a semantic search over a vector database to get what I care about.

User Interface

We now have a basic understanding of all the moving parts; next I needed a nice way to showcase them in a digestible fashion. I tend to call myself more of a backend engineer than a front-end engineer: CSS is my kryptonite.

I needed to make an interface like ChatGPT's, but one that obviously served a single purpose: accept a prompt, send the request through to my API, and return the information about George relevant to the prompt. Scouring the internet, I came across some terrible open-source implementations of a "ChatGPT" interface, although one did catch my eye: Chatpad. It has more than what I needed, so it did require some changes on my side, but it essentially embodied what I was looking for in terms of a simple interface to speak to my NLP.

It even had dark mode, a critical theme required to be a real engineer. 😎

Application Logical Design

The request process is:

  1. User types a question and submits via the ChatGCV Client (React.js Front end)
  2. Request is sent to ChatGCV Server (.NET 6)
  3. Request is sent through to Pinecone semantic search to retrieve matching documents
  4. Matching documents injected into OpenAI completions request
  5. Response streamed back to ChatGCV Client
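The steps above can be sketched end to end in a few lines (the embedding, search and completion calls are stubbed out here; all names are illustrative):

```python
def answer_question(question, embed, search, complete):
    """RAG in miniature: embed the question, retrieve context,
    inject it into the prompt, and ask the model."""
    query_vector = embed(question)                    # steps 1-2: question arrives
    matches = search(query_vector, top_k=3)           # step 3: semantic search
    context = "\n".join(m["text"] for m in matches)   # step 4: inject matches
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )
    return complete(prompt)                           # step 5: model responds

# Stubbed dependencies to show the flow without any network calls:
reply = answer_question(
    "How experienced is George?",
    embed=lambda q: [0.1, 0.2],
    search=lambda v, top_k: [{"text": "George has 13 years of experience."}],
    complete=lambda p: "George has around 13 years of professional experience.",
)
```

In the real application, `embed` hits OpenAI's embeddings endpoint, `search` hits Pinecone, and `complete` is the completions call whose response streams back to the client.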

Key Takeaways

After spending some time building out ChatGCV, here are my key takeaways:

  • AI/ML is not actually that scary - take some time, understand the process, research what you can, and use ChatGPT and the thousands of resources available to learn about the world of AI/ML.
  • Clean data is important - after many iterations of playing around with the right structure for loading my documents as embeddings into my Pinecone vector database, a simple Q/A structure ended up working best for ChatGCV.
  • Take note of your learnings - with anything you do, I advise keeping a little scratch pad available to write down things you would like to explore further, or just general key notes.
  • Don't reinvent the wheel - for ChatGCV, I didn't need to create my own LLM that only holds information about me; I leveraged OpenAI's GPT-4 model and my own data with RAG methods to achieve exactly what I needed.
  • Be ethical in your journey in AI - I'm planning on writing a whole blog around ethics in AI, but while negative-testing ChatGCV, I found some interesting ways to "sway" OpenAI into giving me wonderful responses even when untrue. So be cognizant of anything you build and how you prompt for your outcome.

The future

I will be posting more stories and blogs around AI/ML/Blockchain and General engineering in the near future.

For now, ChatGCV will remain my AI learning playground.

Let me know if there is anything else you would like me to talk about, or if you have any questions on the blog above.

I will be sharing the link to ChatGCV after some additions to the application itself.

Thanks for reading!


George Karsas

Seasoned software engineer, leader, crypto enthusiast and blockchain specialist