Create a Serverless Search Engine using the OpenAI Embeddings API

OpenAI’s Text Embedding Model on AWS Lambda

Ivan Campos
Sopmac AI
8 min read · Jan 28, 2023


Get ready to take your text analysis to the next level with code that uses the OpenAI Text Embedding model to transform words and phrases into embeddings, which enable advanced text analysis and comparison. The result is a semantic search that can be deployed immediately to AWS Lambda.

@SopmacArt on twitter

General Outline

  • The OpenAI Text Embedding Model
  • Embeddings & the OpenAI API
  • Working Code (Python / AWS Lambda Function)
  • Pricing for text-embedding-ada-002
  • Real-World Inspirations and Resources

The OpenAI Text Embedding Model

The OpenAI Text Embedding model is a machine learning model that converts text data into numerical vectors, known as embeddings.

The model is trained on a large corpus of text and tuned to produce high-quality representations of words and phrases.

OpenAI has several versions of text embedding models with different architectures, such as the text-embedding-ada-002 model, which we will code against today.

The OpenAI Embeddings API

The OpenAI Embeddings API allows for easy access to the powerful text embedding model created by OpenAI. This means that developers can easily integrate the model into their applications and perform advanced text analysis tasks without having to train their own model or manage the infrastructure required to run it.

Additionally, the API endpoint abstracts away the complexity of the underlying model, making it accessible to developers with varying levels of machine learning expertise.

Furthermore, the OpenAI Embeddings API provides access to state-of-the-art models built on the latest research in the field, a valuable resource for the many companies that lack the resources to maintain such models in-house.

Finally, the API also allows for easy scaling, which means companies can increase their usage as their needs grow.

Note: If you want to get started using the OpenAI API with JavaScript, check out:

OpenAI Embedding Use Cases

OpenAI Embeddings can be used in a variety of natural language processing tasks, such as:

  1. Text classification: The embeddings can be used to train machine learning models for text classification tasks, such as sentiment analysis and spam detection.
  2. Text clustering: The embeddings can be used to cluster text into groups of similar documents, which can be useful for tasks such as document summarization and topic modeling.
  3. Named entity recognition: The embeddings can be used to train models for identifying named entities in text, such as people, organizations, and locations.
  4. Text similarity: The embeddings can be used to determine the semantic similarity between two pieces of text. — ✨ THE USE CASE FOR OUR PYTHON CODE BELOW

These are just a few examples; many more are possible depending on your project and use case.

Working Code (Python — AWS Lambda Function)

The following code provides several key values:

  • Utilizes the power of OpenAI’s text embedding model to analyze large amounts of text data and find the most semantically similar text to a given search term, which would otherwise be time-consuming and prone to human error.
  • Returns the results in an easily accessible format, a JSON object, to be integrated into other systems.
  • Allows users to gain insights that they may not have noticed before.

AWS Lambda Function

Uncover hidden connections and find the most semantically similar text to your search term with just a CSV file and OpenAI’s text embedding model.

Sample Response

Below is a sample response using the search term “dunkin” against the OpenAI text embeddings created from words.csv. As you can see, the words ranked most similar to dunkin were donut, espresso, coffee, and milk:

{
"statusCode": 200,
"body": "{\"text\":{
\"26\":\"donut\",
\"19\":\"espresso\",
\"8\":\"coffee\",
\"10\":\"milk\",
\"22\":\"mocha\",
\"15\":\"latte\",
\"16\":\"cake\",
\"18\":\"cheeseburger\",
\"6\":\"crispy\",
\"13\":\"chocolate\"},
\"similarities\":{
\"26\":0.8761945096,
\"19\":0.8451595073,
\"8\":0.8363775793,
\"10\":0.8240465418,
\"22\":0.8208887793,
\"15\":0.8175186551,
\"16\":0.8134277628,
\"18\":0.8100024799,
\"6\":0.8084391458,
\"13\":0.8072646732
}
}"
}
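Note that the body is a JSON string rather than a nested object, and that the keys ("26", "19", and so on) are the row positions from the original dataframe. As a rough sketch of how a caller might unpack it, the snippet below parses an abbreviated copy of the sample response (first three results only). The function name ranked_results is my own and is not part of the Lambda code.

```python
import json

# Abbreviated copy of the sample response above (first three results only).
response = {
    "statusCode": 200,
    "body": json.dumps({
        "text": {"26": "donut", "19": "espresso", "8": "coffee"},
        "similarities": {"26": 0.8761945096, "19": 0.8451595073, "8": 0.8363775793},
    }),
}


def ranked_results(response):
    """Return (text, similarity) pairs sorted from most to least similar."""
    body = json.loads(response["body"])  # the body is a JSON string, not a dict
    pairs = [
        (body["text"][row_id], score)
        for row_id, score in body["similarities"].items()
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


results = ranked_results(response)
for text, score in results:
    print(f"{text}: {score:.4f}")
```

Sorting explicitly on the client side avoids relying on the key order of the JSON object.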

AWS Lambda Code Breakdown

Imported Python Libraries

import os
import io
import openai
import numpy as np
from numpy.linalg import norm
import pandas as pd

  • os: Interacts with the operating system. It provides a way to access environment variables (e.g. OPENAI_API_KEY).
  • io: Works with input and output streams. It provides a way to read and write data to various sources, such as files, in-memory buffers, and network connections.
  • openai: Integrates with the OpenAI API. It provides a way to access various models and functionality provided by OpenAI.
  • numpy: Manipulates numerical data. It provides a wide range of functionality for working with arrays, matrices, and other numerical data structures (e.g. numerical vectors aka embeddings).
  • numpy.linalg: A sublibrary of numpy that provides linear algebra functionality. It provides functions for working with matrices, solving systems of equations, and other linear algebra related tasks.
  • pandas: Works with data in a tabular format. It provides a DataFrame data structure, which is similar to a spreadsheet, and provides a wide range of functionality for working with data in this format.

AWS Lambda Layers

An AWS Lambda layer is a way to package and manage dependencies that can be shared across multiple AWS Lambda functions. A layer is a .zip file that contains libraries, a custom runtime, or other function dependencies.

AWS Lambda layers (numpy, pandas, and openai) used within our Python 3.9 code have been uploaded to the following repo:

AWS Lambda Handler (lambda_function.py)

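def lambda_handler(event, context):
    # Read the API key from the Lambda environment variables
    openai.api_key = os.getenv('OPENAI_API_KEY')

    # Load the words to search; the CSV must have a 'text' column
    df = pd.read_csv('words.csv')

    # Sample embedding call; the result is not used in the search below
    input_term = "the fox crossed the road"
    input_term_embeddings = get_embeddings_for_text(input_term)

    # One API call per row to embed every entry in the 'text' column
    df['embedding'] = df['text'].apply(lambda x: get_embeddings_for_text(x))

    # Round-trip through an in-memory CSV string instead of the file system
    output = df.to_csv(index=False)
    df = pd.read_csv(io.StringIO(output))
    # CSV stores each embedding as text, so parse it back into a numpy array
    df['embedding'] = df['embedding'].apply(eval).apply(np.array)

    # Embed the search term
    search_term = "dunkin"
    search_term_vector_embeddings = get_embeddings_for_text(search_term)

    # Score every row against the search term and keep the 10 best matches
    df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector_embeddings))
    df_top = df.sort_values("similarities", ascending=False).head(10)
    df_return = df_top[['text', 'similarities']].to_json()

    return {
        'statusCode': 200,
        'body': df_return
    }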

This function is an AWS Lambda function written in Python. It does the following:

  1. Sets the OpenAI API key using the environment variable ‘OPENAI_API_KEY’ (this is stored in the AWS Console under the Lambda function’s Configuration → Environment variables)
  2. Reads a CSV file ‘words.csv’ and loads it into a Pandas dataframe ‘df’
  3. Defines a variable ‘input_term’ with a string “the fox crossed the road” and uses the function ‘get_embeddings_for_text’ to obtain embeddings for this string.
  4. Applies the function ‘get_embeddings_for_text’ to each element of the ‘text’ column of the dataframe ‘df’ and stores the returned embeddings in a new column called ‘embedding’
  5. Serializes the dataframe to an in-memory CSV string (no file is written) and reads that string back into the dataframe ‘df’
  6. Converts the ‘embedding’ column from string to array
  7. Defines a variable ‘search_term’ with a string “dunkin” and uses the function ‘get_embeddings_for_text’ to obtain embeddings for this string.
  8. Applies the function ‘cosine_similarity’ on each element of the ‘embedding’ column using the ‘search_term_vector_embeddings’ variable and stores the returned similarity scores in a new column called ‘similarities’
  9. Sorts the dataframe by the ‘similarities’ column in descending order and keeps only the top 10 rows
  10. Selects only the ‘text’ and ‘similarities’ columns from the dataframe
  11. Converts the dataframe to json format
  12. Returns a dictionary containing status code 200 and the json data as the response body.
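Steps 5 and 6 can look odd at first: why convert the ‘embedding’ column at all? CSV stores every cell as text, so a list of floats comes back as a string and has to be parsed again. Here is a minimal sketch of that round trip. It uses the standard library csv module instead of pandas, and ast.literal_eval as a stricter stand-in for eval; both substitutions are mine, and the vectors are made-up toy values, not real embeddings.

```python
import ast
import csv
import io

# Toy rows: (text, embedding). Real ada-002 embeddings are far longer.
rows = [
    ("coffee", [0.12, -0.05, 0.33]),
    ("donut", [0.08, 0.41, -0.27]),
]

# Write to an in-memory buffer, so nothing touches the file system.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embedding"])
for text, embedding in rows:
    writer.writerow([text, embedding])  # the list is written as its str() form

# Read the CSV text back: every cell is now a plain string.
buffer.seek(0)
raw = list(csv.DictReader(buffer))
print(type(raw[0]["embedding"]).__name__)  # str

# Parse each string back into a list of floats.
restored = [(row["text"], ast.literal_eval(row["embedding"])) for row in raw]
print(restored[0])
```

ast.literal_eval only accepts Python literals, which makes it a safer choice than eval if the CSV content could ever come from an untrusted source.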

Helper Function: Get Embeddings from OpenAI API

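def get_embeddings_for_text(input_term):
    # openai.Embedding.create is the interface of the openai library before version 1.0
    input_vector = openai.Embedding.create(
        input=input_term,
        model="text-embedding-ada-002")
    # The response holds a list under 'data'; take the first item's vector
    input_vector_embeddings = input_vector['data'][0]['embedding']
    return input_vector_embeddings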

This function uses OpenAI’s Embeddings API to create an embedding for the input term.

The first statement creates an embedding for the input term using the “text-embedding-ada-002” model. This is done by calling the create method of the openai.Embedding class and passing the input term and model name as arguments.

The second statement extracts the embedding from the API response and assigns it to the variable input_vector_embeddings. It does so by accessing the 'data' key of the response object, taking the first element of that list (a dictionary), and reading its 'embedding' key.

Finally, the function returns the input_vector_embeddings.
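To make the indexing concrete without calling the API, here is a small sketch that runs the same lookup against a hand-written stand-in for the response. The dictionary only includes the fields the helper reads plus the model name, the vector values are invented, and extract_embedding is a hypothetical name used just for this illustration.

```python
# Hand-written stand-in for an Embeddings API response.
# The vector values are made up; real ada-002 vectors have 1536 dimensions.
fake_response = {
    "data": [
        {"index": 0, "embedding": [0.01, -0.02, 0.03]},
    ],
    "model": "text-embedding-ada-002",
}


def extract_embedding(response):
    """Mirror the lookup used in get_embeddings_for_text."""
    first_item = response["data"][0]  # one item per input string
    return first_item["embedding"]    # the vector as a list of floats


vector = extract_embedding(fake_response)
print(len(vector), vector)
```

Because the helper always reads index 0, it assumes a single input string per request.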

Helper Function: Cosine Similarity

def cosine_similarity(A, B):
    return np.dot(A, B) / (norm(A) * norm(B))

The function calculates the cosine similarity between the two input vectors.

Cosine similarity is a measure of similarity between two non-zero vectors: it is the cosine of the angle between them. It ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction).

The function uses the dot product of the two input vectors A and B and divides it by the product of the norm of A and the norm of B.

The dot product of two vectors is the sum of the products of their corresponding components, and the norm of a vector is the square root of the sum of the squares of its components.

The dot product of the two input vectors A and B is calculated using the numpy function np.dot(A, B), and the norm of each vector is calculated using the norm function imported from numpy.linalg.

Finally, the function returns the cosine similarity value as a scalar, which ranges from -1 to 1.
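If you want to sanity-check the formula by hand, the sketch below recomputes it in plain Python, with no numpy, on three small hand-picked vector pairs. This is only an illustration of the math described above, not a replacement for the numpy helper, which is the better fit for real embedding vectors.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction (one vector is a multiple of the other): close to 1
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])

# Perpendicular vectors: the dot product is 0, so the similarity is 0
perpendicular = cosine_similarity([1, 0], [0, 1])

# Opposite directions: close to -1
opposite = cosine_similarity([1, 2], [-1, -2])

print(same_direction, perpendicular, opposite)
```

Floating-point rounding means the first and last results may differ from 1 and -1 in the final decimal places.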

Pricing for text-embedding-ada-002

“Build advanced search, clustering, topic modeling, and classification functionality with our embeddings offering.”

text-embedding-ada-002

$0.0004 / 1K tokens

Source: https://openai.com/api/pricing/ (Jan. 2023)

If you recall from a previous post, 1K tokens is roughly 750 words.

Using the previous post’s math, we can calculate the following:

pricing for text-embedding-ada-002

Getting OpenAI embeddings for the entire King James Bible should only cost around 42 cents!
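The arithmetic behind that estimate can be sketched in a few lines. The price and the 750-words-per-1K-tokens rule of thumb come from this post; the word count of 783,137 is an assumption on my part (a commonly cited figure for the King James Bible), so treat the output as a ballpark rather than a quote.

```python
PRICE_PER_1K_TOKENS = 0.0004  # USD, text-embedding-ada-002 (Jan. 2023)
WORDS_PER_1K_TOKENS = 750     # rule of thumb quoted above
KJV_WORD_COUNT = 783_137      # assumption: commonly cited figure, not from this post


def embedding_cost(word_count):
    """Estimate the embedding cost in USD for a text of the given word count."""
    thousands_of_tokens = word_count / WORDS_PER_1K_TOKENS
    return thousands_of_tokens * PRICE_PER_1K_TOKENS


cost = embedding_cost(KJV_WORD_COUNT)
print(f"${cost:.2f}")  # $0.42
```

Actual token counts depend on the tokenizer and the text, so the real bill could land a little above or below this figure.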

Real World Example

Using the OpenAI text embeddings, we can already see real world applications like BibleGPT:

BibleGPT works as advertised, but I have requested confirmation on the pricing, as my math does not match the $4 estimate provided.

Conclusion

The future of OpenAI text embeddings looks promising as it continues to improve and advance in its ability to understand and generate human-like language. With the ongoing development of machine learning and natural language processing technologies, it is likely that text embeddings will become even more accurate and useful in a wide range of applications, including language translation, text summarization, and sentiment analysis.

As more data is added to the system (i.e. GPT-4), it will continue to learn and improve, becoming an even more powerful tool for businesses and individuals alike. Overall, the future of OpenAI text embeddings is exciting and holds great potential for shaping the way we interact with and understand language.

Credit to the Inspiration behind this Post

On a caffeine-infused trip down a YouTube rabbit hole, I was recommended the video below (Channel: Part Time Larry), which clearly explains and uses the OpenAI Embeddings API functionality for searching financial documents via a Google Colab Jupyter notebook:

Colab Jupyter Notebook

If you watch the video, you’ll notice that I had to make some changes to get it to work in AWS Lambda — primarily, avoiding file system writes, using an environment variable, and the creation of the two custom helper functions.

YouTube Channel: Part Time Larry
