Revolutionizing Conversational AI with OpenAI Embeddings

Jaykishan B
Simform Engineering
8 min read · Apr 7, 2023

A step-by-step guide to enhance your chatbot capabilities using OpenAI embeddings.

Recently, OpenAI launched ChatGPT, an AI-powered chatbot built on a large language model (LLM) that can answer almost any question with impressive accuracy! And as if that’s not enough, it is continuously being improved.

Have you ever wondered if it could be utilized for your business or project? Well, guess what? It can be! There are different approaches to utilizing ChatGPT’s capabilities to enhance traditional chatbots.

Before we start

Before we get started, let’s get some ideas about OpenAI. OpenAI is a research and development company that provides different APIs for text completion, code completion (Codex), image generation (DALL-E), embedding, fine-tuning, and more.

In this example, we are going to utilize text completion and embedding APIs.

Let’s also understand what semantic search and embedding are. Semantic search is a way of searching by understanding the searcher’s intent, query context, and the relationship between words to generate accurate answers.

[Image: lexical search vs. semantic search (credit: Seobility)]

Embedding is the process of converting high-dimensional data into a lower-dimensional list of real-valued numbers (a vector) in such a way that semantically similar inputs end up with similar vectors.

[Image credit: OpenAI]

Now that we have understood these two concepts, let’s dive into the interesting part!

Getting started

The main objective of this guide is to demonstrate how embeddings can be used to expand your bot’s knowledge. Currently, there are three main ways to extend the GPT models with your own knowledge base:

Fine-tuning: a straightforward approach, but you have little control over the model’s responses beyond the initial prompt engineering.

Embeddings: a better approach to extend the model’s domain-specific knowledge, allowing more flexibility and control over the generated model output.

Codex: this approach helps if we have a SQL database as a data source. With this approach, we generate and run SQL queries against the database based on user input.

In this article, we will go through the embedding approach.

The Dataset

The first step towards creating a chat assistant is to prepare the data that will be used as a knowledge base. For this, we are using an Amazon product dataset. You can get the CSV file from here as well.

[Image: a preview of the Amazon product dataset]

From the available columns in the dataset, we are more interested in the product title, product description, category, brand, price, and availability.

We will create a new column titled “text,” which contains data from all the columns mentioned above. We can use this new column (text) to generate embeddings.

# Note: "Stock Availibility" matches the dataset's (misspelled) column name.
data['text'] = (
    "Category: " + data["Category"]
    + "; Product Title: " + data["Product Title"]
    + "; Product Description: " + data["Product Description"]
    + "; Brand: " + data["Brand"]
    + "; Price: " + data["Price"].astype(str)
    + "; Stock Availability: " + data["Stock Availibility"].astype(str)
)
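To make the format concrete, here is what the combined text field would look like for a single, made-up row (the values below are illustrative, not from the real dataset):

```python
# A hypothetical product row, used only to show the combined text format
row = {
    "Category": "Electronics",
    "Product Title": "USB-C Charging Cable",
    "Product Description": "1m braided fast-charging cable",
    "Brand": "Acme",
    "Price": "9.99",
    "Stock Availibility": "In Stock",  # dataset's (misspelled) column name
}

text = (
    f"Category: {row['Category']}; Product Title: {row['Product Title']}; "
    f"Product Description: {row['Product Description']}; Brand: {row['Brand']}; "
    f"Price: {row['Price']}; Stock Availability: {row['Stock Availibility']}"
)
print(text)
```

One flat, labeled string per product is what we will embed in the next step.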

Creating Embeddings

There are several models available for creating embeddings. The size of the embedding depends on the model that we choose. Below is a list of different OpenAI models, along with their respective embedding sizes.

Note that models with more dimensions cost more and tend to produce more accurate results.
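The model comparison table from the original post is an image; as a rough reference, these were the document-embedding sizes of the first-generation text-search models per OpenAI’s documentation at the time (treat the exact values as an assumption if you are on a newer API):

```python
# Document-embedding dimensions of OpenAI's first-generation
# text-search models (values as documented at the time of writing)
EMBEDDING_DIMS = {
    "text-search-ada-doc-001": 1024,
    "text-search-babbage-doc-001": 2048,  # the model used in this guide
    "text-search-curie-doc-001": 4096,
    "text-search-davinci-doc-001": 12288,
}

model = "babbage"
print(EMBEDDING_DIMS[f"text-search-{model}-doc-001"])  # 2048
```

The 2048 figure for Babbage matches the dim=2048 we will use for the Milvus vector field later in this guide.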

To proceed further, we need an OpenAI API key. You can create an account and get the API key from here. Additionally, we need to install an OpenAI client to access the API.

# Requirement
pip install openai

import openai
from openai.embeddings_utils import get_embedding

openai.api_key = "<API_KEY>"
model = 'babbage'

data['embeddings'] = data.text.apply(
    lambda x: get_embedding(x, engine=f'text-search-{model}-doc-001')
)
data.head()

We can use any model to generate the embeddings. Here, we are choosing Babbage to balance between accuracy and cost.

We also need to generate a token count for each row. The following code snippet will add a column titled “n_tokens,” which contains the total count of tokens for each row (only for the “text” column).

# requirement
pip install transformers
from transformers import GPT2TokenizerFast


tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
data['n_tokens'] = data.text.apply(lambda x: len(tokenizer.encode(x)))

Here is a sample of the embedding column.

NOTE: If you hit a rate-limit error here, perform this operation in batches. The snippet below will do the job.

from time import sleep

# Create an embedding for each row in a loop.
# If the rate limit hits, jump to the exception, wait for 60 seconds, retry, and continue.
embed_list = []
for id, row in data.iterrows():
    try:
        embed = get_embedding(str(data.text[id]), engine=f'text-search-{model}-doc-001')
        embed_list.append(embed)
    except Exception:
        print(id)
        sleep(60)
        embed = get_embedding(str(data.text[id]), engine=f'text-search-{model}-doc-001')
        embed_list.append(embed)
        continue

data['embeddings'] = embed_list

Choosing a vector database

Now, we need to choose a vector database to store the embeddings we have generated. But wait, what is a vector database, and why do we need one?

In this era of AI/ML, we need systems that can store collections of numeric representations of words, sentences, paragraphs, images, or documents, called embeddings. As deep learning models power more AI-based applications, we need to store and retrieve these large vectors in considerable quantities in real time.

This is where a vector database comes into the picture. These databases can store and index embeddings and help perform semantic search rather than an exact match with higher speed, accuracy, and flexibility.
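The core operation a vector database performs, nearest-neighbour search over embeddings, can be illustrated with a minimal, self-contained sketch. The three-dimensional “embeddings” below are made up for illustration and are far smaller than real ones:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented values; real OpenAI
# embeddings have 1024+ dimensions)
store = {
    "red apple, fresh fruit": [0.9, 0.1, 0.0],
    "green pear, fresh fruit": [0.7, 0.4, 0.1],
    "laptop charger, electronics": [0.0, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]  # stand-in for the embedding of a query like "apples"
best = max(store, key=lambda k: cosine_sim(query, store[k]))
print(best)  # red apple, fresh fruit
```

A real vector database does exactly this, but with approximate indexes (e.g. IVF_FLAT) so the search stays fast over millions of vectors.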

There are mainly two options to choose from:

  1. Self-hosted open-source database
  2. Managed cloud database

In the open-source category, we have options like Milvus, Weaviate, and Typesense. You may also use managed services like Pinecone or Redis Enterprise, or a similarity-search library such as Faiss.

Here, we are going with Milvus, an open-source vector database. You can find the installation guide here.

# Requirements
pip install pymilvus python-dotenv

import os
from traceback import format_exc

from dotenv import load_dotenv
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections
from pymilvus.exceptions import (
    CollectionNotExistException,
    MilvusException,
    SchemaNotReadyException,
)

load_dotenv()

DEFAULT_INDEX_PARAMS = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 2048},
}

INDEX_DATA = {
    "field_name": "embeddings",
    "index_params": DEFAULT_INDEX_PARAMS,
    "index_name": "amzn_semantic_search",
}


# You need to define these variables in the environment file
def connect_db(
    alias: str = os.getenv("VECTOR_DB_ALIAS"),
    host: str = os.getenv("VECTOR_DB_HOST"),
    port: str = os.getenv("VECTOR_DB_PORT"),
    user: str = os.getenv("VECTOR_DB_USER"),
    password: str = os.getenv("VECTOR_DB_PASSWORD"),
):
    """Connect to the database.

    Args:
        alias (str): Connection alias.
        host (str): Database host.
        port (str): Database port.
        user (str): Database user.
        password (str, optional): Database password.
    """
    connections.connect(
        alias=alias,
        host=host,
        port=port,
        user=user,
        password=password,
    )


# Fields
id = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, description="ID")
embeddings = FieldSchema(
    name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=2048, description="Embeddings"
)
metadata = FieldSchema(
    name="metadata", dtype=DataType.VARCHAR, max_length=20000, description="Metadata"
)
schema = CollectionSchema(
    fields=[id, embeddings, metadata], description="Amazon Product search"
)


# Create collection
def create_collection(name: str = "amzn_data"):
    collection = Collection(name=name, schema=schema, using="default", shards_num=2)
    return collection


def get_or_create_collection(
    name: str = "amzn_data",
    create_index: bool = True,
    index_data: dict = INDEX_DATA,
    load_data: bool = False,
):
    """Fetch a collection object or create one.

    Args:
        name (str, optional): Collection name. Defaults to "amzn_data".
        create_index (bool, optional): If True, create an index. Defaults to True.
        index_data (dict, optional): If create_index=True, this data is used to create the index.
        load_data (bool, optional): If True, insert data into the created collection. Defaults to False.

    Returns:
        Collection: Milvus collection
    """
    try:
        # Connect to the database
        connect_db()

        # Fetch the collection object by name
        collection = Collection(name)
    except (CollectionNotExistException, MilvusException, SchemaNotReadyException) as exception:
        print(exception)
        print("Creating collection...")

        # If the collection is not available, create it
        collection = create_collection(name=name)

        # Create the index if unavailable.
        # Here we provide the index_data (INDEX_DATA) defined above.
        if create_index and index_data:
            collection.create_index(**index_data)
        if load_data:
            # We need to pass the latest dataframe with the n_tokens and embeddings columns available.
            insert_data(collection_name=collection, dataframe=df)
    finally:
        collection.load()
    return collection


def insert_data(collection_name: Collection, dataframe):
    """Insert data into the database."""
    try:
        final_values = []
        index_list = list(range(len(dataframe["embeddings"])))
        emb_list = dataframe["embeddings"].to_list()
        metadata_list = [
            {
                "Category": row["Category"],
                "Price": row["Price"],
                "Brand": row["Brand"],
                "Stock Availibility": row["Stock Availibility"],
                "Image Urls": row["Image Urls"],
                "n_tokens": row["n_tokens"],
            }
            for _, row in dataframe.iterrows()
        ]

        final_values.append(index_list)
        final_values.append(emb_list)
        final_values.append(metadata_list)

        collection_name.insert(final_values)
    except Exception as exception:
        print(exception)
        print(format_exc())

Please note that in the above code, most of the functions are customized for the dataset we are using. For any other dataset that has different columns, please make the necessary modifications.

The official documentation suggests cosine similarity, but since OpenAI embeddings are normalized to unit length, the choice of distance function does not matter much: cosine similarity and L2 distance produce identical rankings.
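This equivalence is easy to verify: for unit vectors, squared L2 distance equals 2 minus twice the cosine similarity, so both metrics sort documents identically. A small self-contained check (with made-up vectors):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # For unit vectors, cosine similarity is just the dot product
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

q = normalize([0.3, 0.9, 0.1])
docs = [normalize(v) for v in ([0.2, 0.8, 0.2], [0.9, 0.1, 0.4], [0.1, 1.0, 0.0])]

# Identity: ||a - b||^2 == 2 - 2 * cos(a, b) for unit vectors
for d in docs:
    assert abs(l2(q, d) ** 2 - (2 - 2 * cosine(q, d))) < 1e-9

# Both metrics rank the documents the same way
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(q, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: l2(q, docs[i]))
assert by_cosine == by_l2
```

This is why the L2 metric in DEFAULT_INDEX_PARAMS above works fine even though the docs recommend cosine.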

Chatbot workflow

As visible in the image below, we will generate embeddings for the question and perform a semantic search in the vector database where we have stored embeddings for the dataset.

Once we get the feasible set of answers from the database, we pass it to the Completion method of OpenAI along with some instructions to generate the response in proper format with complete sentences.

[Image: chatbot workflow diagram (credit: cohere.ai)]
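The retrieval-then-completion step above can be sketched as a small prompt-building helper. The instruction wording, function name, and the engine shown in the comment are illustrative assumptions, not the exact production code:

```python
def build_prompt(question, contexts):
    """Combine retrieved product texts and the user question into one prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer the question using only the context below, in complete sentences. "
        "If the answer is not in the context, ask the user to rephrase or say it "
        "is out of scope.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Is the USB-C cable in stock?",
    ["Product Title: USB-C Cable; Stock Availability: In Stock"],
)
# The prompt would then be sent to the Completion API, e.g. (assumed parameters):
# openai.Completion.create(engine="text-davinci-003", prompt=prompt, max_tokens=150)
```

The contexts list would come from the Milvus similarity search over the question's embedding.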

The result

Based on the question, the chatbot will provide answers in human-like language. If the answer is unavailable, it will ask to rephrase the question or mention that it is out of scope.

Limitations

  • Due to how contexts are retrieved, the bot can only carry a conversation about one topic at a time. Asking it a question on a different topic amidst an ongoing chat will confuse it with the previous context, and it will no longer generate accurate results, although it can sound very convincing!
  • To overcome this, we have a reset chat option available.
  • Sometimes, it might generate answers that seem pretty convincing but could be incorrect.

TL;DR

OpenAI’s embedding chatbot uses advanced machine learning techniques to generate natural and contextually relevant responses to user input. The chatbot generates embeddings for user questions, performs a semantic search in a vector database, and uses OpenAI’s Completion method to generate responses.

While the chatbot’s responses are not yet indistinguishable from those of a human, they are advanced enough to provide value in many real-world scenarios. The technology has numerous applications in industries such as customer service, healthcare, and education, where having human-like conversations with machines can greatly improve efficiency and accessibility.

Conclusion

The development of OpenAI’s embedding chatbot is a significant advancement in the field of natural language processing and conversational AI. This technology has numerous applications in industries such as customer service, healthcare, and education, where having human-like conversations with machines can greatly improve efficiency and accessibility.

However, it is important to note that while the technology is highly advanced, there is still a long way to go in terms of making chatbots truly indistinguishable from humans in terms of conversational abilities. Nonetheless, OpenAI’s embedding chatbot represents a significant step forward in this direction and will continue to push the boundaries of what is possible with conversational AI.

Follow Simform Engineering to keep yourself updated with the latest trends in the technology horizon.
