How to Improve Search with Conversational AI

OpenAI Embeddings: A Midjourney Documentation Case Study

Ivan Campos
Sopmac AI
9 min read · Feb 9, 2023


The integration of ChatGPT into Bing by Microsoft is shaking up the search engine world and challenging Google’s dominance. In this post, we will explore the potential of conversational AI in search engines.

This article will show you how to build a conversational AI system using OpenAI and AWS Lambda:

  • You’ll scrape Midjourney’s documentation and create embeddings using the OpenAI API. The embeddings will be stored in a .csv file and ingested by an AWS Lambda function which will compare them against a search query passed into the function’s URL.
  • The results will be tested using curl and displayed in the command-line terminal.

Why?

Inspired by a tutorial from OpenAI and driven by a passion for Midjourney, I set out to prove the benefits of a conversational approach over traditional search engines, delivering a more intuitive and engaging experience.

[Image: C₈H₁₁NO₂ — @sopmacArt on Twitter. #midjourney]

Now, let’s dive into the exciting part: the code.

Jupyter Notebook

Note: Creating embeddings for all of Midjourney’s public docs cost $0.06, and the resulting .csv file is 8 MB.

tl;dr

This code uses the requests, BeautifulSoup, re, deque, pandas, tiktoken, openai, and numpy libraries, along with the openai.embeddings_utils module. It starts with a web crawling script that fetches the HTML content of each page, extracts the text content, and saves it to text files. The pandas library processes the text files into a DataFrame that stores each file name and its text content. The tiktoken library tokenizes the text and counts the tokens, and any text exceeding a maximum token count is split into chunks. The openai library generates embeddings for the text data, the numpy library converts each embedding from its string representation back into a numpy array, and the openai.embeddings_utils module calculates distances between embeddings using cosine similarity.

Comprehensive Code Walkthrough

The walkthrough begins with a web crawling script that starts at a specific URL (full_url = 'https://docs.midjourney.com/') and traverses the site by following the hyperlinks on each page. The script uses the requests library to fetch the HTML content of each page, the BeautifulSoup library to parse the HTML and extract the text content, and the re library for regular-expression matching. It also uses the deque data structure from the collections module as the queue of URLs to be crawled, alongside a set that tracks which URLs have already been seen. The script saves the text content of each page to a separate text file and stops the crawl if it encounters a page that requires JavaScript.
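A minimal sketch of that crawling loop, assuming the text/<domain>/ directory layout from OpenAI’s tutorial; the file-naming scheme and the JavaScript check string are illustrative:

import os
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

full_url = 'https://docs.midjourney.com/'
domain = urlparse(full_url).netloc

queue = deque([full_url])  # URLs waiting to be crawled
seen = {full_url}          # URLs already encountered
os.makedirs('text/' + domain, exist_ok=True)

while queue:
    url = queue.popleft()
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    text = soup.get_text()

    # Stop the crawl when a page requires JavaScript to render
    if 'You need to enable JavaScript' in text:
        break

    # Save the page text under a sanitized file name
    fname = 'text/' + domain + '/' + re.sub(r'[^A-Za-z0-9]', '_', url) + '.txt'
    with open(fname, 'w', encoding='utf-8') as f:
        f.write(text)

    # Queue unseen hyperlinks on the same domain
    for link in soup.find_all('a', href=True):
        next_url = urljoin(url, link['href'])
        if urlparse(next_url).netloc == domain and next_url not in seen:
            seen.add(next_url)
            queue.append(next_url)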

This code uses the pandas library to process the text files generated by the crawling script. It first creates a list, texts, then loops through all the files in the text directory, reading the content of each file and storing the file name and text as a tuple in the list. It then creates a pandas DataFrame df from the texts list, where each row represents one file, with columns fname and text for the file name and text content, respectively. The text column is modified to include the file name and to remove newlines. Finally, the DataFrame is saved to a CSV file named scraped.csv in the processed directory, and the first few rows are displayed using the head() method.
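Condensed, that step might look like this (the directory names follow the tutorial’s text/ and processed/ layout):

import os
import pandas as pd

# Collect (file name, file contents) tuples from the scraped text files
texts = []
for file in os.listdir('text/docs.midjourney.com/'):
    with open('text/docs.midjourney.com/' + file, encoding='utf-8') as f:
        texts.append((file.replace('.txt', ''), f.read()))

df = pd.DataFrame(texts, columns=['fname', 'text'])

# Prefix each text with its file name and strip newlines
df['text'] = df.fname + '. ' + df.text.str.replace('\n', ' ', regex=False)

os.makedirs('processed', exist_ok=True)
df.to_csv('processed/scraped.csv')
df.head()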

This code uses the tiktoken library to tokenize the text data stored in the scraped.csv file. The code first loads the cl100k_base tokenizer from the tiktoken library, which is designed to work with the ada-002 model. The code then reads the scraped.csv file into a pandas DataFrame df and renames the columns to title and text. The code then tokenizes the text in each row of the DataFrame using the encode method of the tokenizer and saves the number of tokens to a new column n_tokens in the DataFrame. Finally, the code generates a histogram to visualize the distribution of the number of tokens per row.
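A sketch of the tokenization step:

import pandas as pd
import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002
tokenizer = tiktoken.get_encoding('cl100k_base')

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Count the tokens in each row's text
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of tokens per row
df.n_tokens.hist()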

This code defines a function split_into_many that splits the text data in the pandas DataFrame into chunks of a maximum number of tokens, specified by the max_tokens parameter. The function first splits the text into sentences, then calculates the number of tokens for each sentence. The function then uses a loop to iterate through the sentences and tokens and builds up chunks of text until the number of tokens in the chunk reaches the maximum number of tokens. The code then uses the iterrows method to loop through the rows of the DataFrame and checks if the number of tokens for each row is greater than the max_tokens value. If the number of tokens is greater than the maximum, the code splits the text into chunks using the split_into_many function. If the number of tokens is less than the maximum, the code appends the text to a list of shortened texts. Finally, the code creates a new DataFrame from the shortened texts, calculates the number of tokens for each row, and generates a histogram to visualize the distribution of the number of tokens.
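A compact version of that chunking logic, reusing the tokenizer from the previous step and assuming a 500-token limit:

max_tokens = 500  # assumed per-chunk limit

def split_into_many(text, max_tokens=max_tokens):
    # Split into sentences and count the tokens in each one
    sentences = text.split('. ')
    n_tokens = [len(tokenizer.encode(' ' + s)) for s in sentences]

    chunks, chunk, tokens_so_far = [], [], 0
    for sentence, n in zip(sentences, n_tokens):
        # Close the current chunk before it would exceed the limit
        if tokens_so_far + n > max_tokens:
            chunks.append('. '.join(chunk) + '.')
            chunk, tokens_so_far = [], 0
        chunk.append(sentence)
        tokens_so_far += n + 1
    if chunk:
        chunks.append('. '.join(chunk) + '.')
    return chunks

shortened = []
for _, row in df.iterrows():
    if row['n_tokens'] > max_tokens:
        shortened += split_into_many(row['text'])
    else:
        shortened.append(row['text'])

df = pd.DataFrame(shortened, columns=['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
df.n_tokens.hist()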

This code uses the openai library to generate embeddings for the text data stored in the DataFrame. The code first sets the OpenAI API key, which is required to access the OpenAI API. The code then uses the apply method to generate an embedding for each row of the DataFrame using the Embedding.create method from the openai library. The input parameter specifies the text to generate the embedding for, and the engine parameter specifies the text-embedding model to use (in this case, the ada-002 model). The code then stores the embeddings in a new column embeddings in the DataFrame, and saves the DataFrame to a CSV file named embeddings.csv. Finally, the code displays the first few rows of the DataFrame using the head() method.
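A minimal sketch of the embedding step, using the pre-1.0 openai library this article is based on and assuming the API key is read from an environment variable:

import os
import openai

openai.api_key = os.environ['OPENAI_API_KEY']

# Request one embedding per chunk from text-embedding-ada-002
df['embeddings'] = df.text.apply(
    lambda x: openai.Embedding.create(
        input=x, engine='text-embedding-ada-002'
    )['data'][0]['embedding']
)

df.to_csv('processed/embeddings.csv')
df.head()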

This code imports the pandas and numpy libraries, and the distances_from_embeddings and cosine_similarity functions from the openai.embeddings_utils module. The code then reads the embeddings.csv file into a pandas DataFrame and converts the embeddings column from a string representation of a list to a list of numpy arrays. The code uses the apply method and the eval function to evaluate the string representation of the list, and the np.array function to convert the list to a numpy array. Finally, the code displays the first few rows of the DataFrame using the head() method.
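The load-and-convert step, condensed:

import numpy as np
import pandas as pd
from openai.embeddings_utils import cosine_similarity, distances_from_embeddings

df = pd.read_csv('processed/embeddings.csv', index_col=0)

# The CSV stores each embedding as the string form of a Python list;
# eval() recovers the list and np.array() converts it to a numpy array
df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)

df.head()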

AWS Lambda

Note: embeddings.csv was renamed to docs.midjourney.com.csv in S3.

If we were to run this function 10,000 times in a single month with an expected response time of 10 seconds per call, the total would be…FREE.

— see References below for cost calculation breakdown
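As a rough sanity check, assume a 1,024 MB memory allocation (an assumption; the actual configuration is in the References): 10,000 requests per month is well under the free tier’s 1,000,000 requests, and 10,000 calls × 10 s × 1 GB = 100,000 GB-seconds, well under the free tier’s 400,000 GB-seconds. Both figures stay inside the AWS Lambda free tier, hence $0.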

tl;dr

This code is an AWS Lambda function that uses the OpenAI API to answer questions based on text data stored in a CSV file in S3. It generates an embedding for the question, finds the stored texts most similar to it, builds a context from those texts, and then generates an answer to the question using the OpenAI Completion API.

Comprehensive Code Walkthrough

This code is a serverless function deployed on AWS Lambda that serves as an API endpoint for answering questions. The function is triggered by a REST API call with a question passed as a query parameter. It first sets the OpenAI API key from an environment variable, then retrieves the embeddings data from a CSV file stored in an Amazon S3 bucket and loads it into a pandas DataFrame. The function then calls answer_question, which embeds the question, finds the most similar texts in the DataFrame, concatenates them into a context string, and uses the OpenAI API to generate a response based on that context and the question. The code also includes helper functions to calculate cosine distance and distances from embeddings.
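A condensed sketch of that function, with the bucket name, prompt wording, and context-length limit all assumptions rather than the article’s exact values:

import os

import boto3
import numpy as np
import openai
import pandas as pd

openai.api_key = os.environ['OPENAI_API_KEY']

# Helpers: cosine distance between two vectors, and the distance from a
# query embedding to every stored embedding
def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def distances_from_embeddings(query_embedding, embeddings):
    return [cosine_distance(query_embedding, e) for e in embeddings]

def create_context(question, df, max_len=1800):
    # Embed the question and rank the stored texts by distance
    q_emb = openai.Embedding.create(
        input=question, engine='text-embedding-ada-002'
    )['data'][0]['embedding']
    df['distances'] = distances_from_embeddings(q_emb, df['embeddings'].values)

    # Concatenate the closest texts until the context budget is spent
    parts, cur_len = [], 0
    for _, row in df.sort_values('distances').iterrows():
        cur_len += row['n_tokens'] + 4
        if cur_len > max_len:
            break
        parts.append(row['text'])
    return '\n\n###\n\n'.join(parts)

def answer_question(question, df):
    context = create_context(question, df)
    response = openai.Completion.create(
        model='text-davinci-003',
        prompt=f'Answer the question based on the context below. If the '
               f'question cannot be answered from the context, say '
               f'"I don\'t know".\n\nContext: {context}\n\n---\n\n'
               f'Question: {question}\nAnswer:',
        temperature=0,
        max_tokens=150,
    )
    return response['choices'][0]['text'].strip()

def lambda_handler(event, context):
    question = event['queryStringParameters']['query']
    # Load the precomputed embeddings from S3
    obj = boto3.client('s3').get_object(
        Bucket='YOUR_BUCKET', Key='docs.midjourney.com.csv'
    )
    df = pd.read_csv(obj['Body'], index_col=0)
    df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)
    return {'statusCode': 200, 'body': answer_question(question, df)}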

Testing

We’ll use curl to test the search function by sending a request to the following URL: https://YOUR_FUNCTION_URL_ID.lambda-url.us-east-1.on.aws/?query=%22your%20query%22

curl Template

curl -G --data-urlencode "query=YOUR QUERY" \
  "https://YOUR_FUNCTION_URL_ID.lambda-url.us-east-1.on.aws/"

Next, we will compare results from a simple query between the search function and docs.midjourney.com.

TEST-1: “what is midjourney”

Midjourney is a communications technology incubator that provides image generation services to augment human creativity and foster social connection.

TEST-2: “HOW TO CHANGE ASPECT RATION”

Use Aspect Ratio Parameters Add --aspect <value>:<value>, or --ar <value>:<value> to the end of your prompt.

Note: notice how “ratio” was misspelled as “ration” and the query was still processed as expected.

TEST-3: “what is --ar”

--ar is a parameter that can be added to the end of a prompt to change the aspect ratio.

TEST-4: “where is the discord”

The Midjourney Discord can be found at https://discord.gg/midjourney.

TEST-5: “where is the twitter”

I don’t know.

Note: Both are correct responses, as Midjourney does not have a Twitter account.

TEST-6: “what is the pro plan”

The Pro Plan includes 30 hr/month of Fast GPU time, unlimited Relax GPU time, the ability to work solo in your Direct Messages, Stealth Mode, 12 concurrent Fast Jobs, 3 concurrent Relaxed Jobs, and 10 Jobs waiting in queue. Rate Images to Earn Free GPU Time is also included, as well as General Commercial Terms for usage rights.

TEST-7: “who owns the copyrights”

You own all Assets you create with the Services, subject to the license granted to Midjourney.

Conclusion

Microsoft’s integration of ChatGPT into Bing has brought a new perspective to the search engine wars, shaking things up by challenging Google’s dominance.

[Embedded media: Google’s response to Microsoft integrating ChatGPT into Bing]

This post highlighted the potential of conversational AI in search engines and showed that it can provide a more appealing and context-rich experience for users compared to traditional search engines.

While our cost-effective solution has demonstrated promising results, it is not without its limitations, such as longer response times (6–12 seconds) and the cost of using the text-davinci-003 model.

Nevertheless, this is an exciting time for the search engine industry and we are looking forward to seeing how it continues to evolve.

References

AWS Lambda Cost Calculations
