Create a Serverless Search Engine using the OpenAI Embeddings API

OpenAI’s Text Embedding Model on AWS Lambda

Published in

Sopmac AI

8 min read · Jan 28, 2023

Get ready to take your text analysis to the next level with powerful code that uses the OpenAI Text Embedding model to transform your words and phrases into embeddings, which allow for advanced text analysis and comparisons. The result is a semantic search that can be deployed immediately to AWS Lambda.

General Outline

The OpenAI Text Embedding Model
Embeddings & the OpenAI API
Working Code (Python / AWS Lambda Function)
Pricing for text-embedding-ada-002
Real-World Inspirations and Resources

The OpenAI Text Embedding Model

The OpenAI Text Embedding model is a machine learning model that converts text data into numerical vectors, known as embeddings.

The model is trained on a large corpus of text and has been fine-tuned to produce high-quality representations of words and phrases.

OpenAI has several versions of text embedding models with different architectures, such as the text-embedding-ada-002 model, which we will code against today.

The OpenAI Embeddings API

The OpenAI Embeddings API allows for easy access to the powerful text embedding model created by OpenAI. This means that developers can easily integrate the model into their applications and perform advanced text analysis tasks without having to train their own model or manage the infrastructure required to run it.

Additionally, the API endpoint abstracts away the complexity of the underlying model, making it accessible to developers with varying levels of machine learning expertise.

Furthermore, the OpenAI Embeddings API gives access to state-of-the-art models built on the latest research in the field, which is a valuable resource for companies that would not have the resources to maintain such models in-house.

Finally, the API also allows for easy scaling, which means companies can increase their usage as their needs grow.

Note: If you want to get started using the OpenAI API with JavaScript, check out:

OpenAI API JavaScript Jumpstart

Start using the OpenAI API in your JavaScript projects today

medium.com

OpenAI Embedding Use Cases

OpenAI Embeddings can be used in a variety of natural language processing tasks, such as:

Text classification: The embeddings can be used to train machine learning models for text classification tasks, such as sentiment analysis and spam detection.
Text clustering: The embeddings can be used to cluster text into groups of similar documents, which can be useful for tasks such as document summarization and topic modeling.
Named entity recognition: The embeddings can be used to train models for identifying named entities in text, such as people, organizations, and locations.
Text similarity: The embeddings can be used to determine the semantic similarity between two pieces of text. — ✨ THE USE CASE FOR OUR PYTHON CODE BELOW ✨

These are just a few examples; the possibilities are endless, depending on your project and use case.

Working Code (Python — AWS Lambda Function)

The following code provides several key values:

Utilizes the power of OpenAI’s text embedding model to analyze large amounts of text data and find the most semantically similar text to a given search term, which would otherwise be time-consuming and prone to human error.
Returns the results in an easily accessible format, a JSON object, to be integrated into other systems.
Allows users to gain insights that they may not have noticed before.

AWS Lambda Function

Uncover hidden connections and find the most semantically similar text to your search term with just a CSV file and OpenAI’s text embedding model.

openai-text-embedding/words.csv at main · IvanCampos/openai-text-embedding

Uncover hidden connections and find the most semantically similar text to your search term with just a CSV file…

github.com

Sample Response

Below is a sample response using the search term “dunkin” against the OpenAI text embedding created from words.csv. As you can see, the words ranked most similar to dunkin were donut, espresso, coffee, and milk:

{
  "statusCode": 200,
  "body": "{\"text\":{
      \"26\":\"donut\",
      \"19\":\"espresso\",
      \"8\":\"coffee\",
      \"10\":\"milk\",
      \"22\":\"mocha\",
      \"15\":\"latte\",
      \"16\":\"cake\",
      \"18\":\"cheeseburger\",
      \"6\":\"crispy\",
      \"13\":\"chocolate\"},
      \"similarities\":{
          \"26\":0.8761945096,
          \"19\":0.8451595073,
          \"8\":0.8363775793,
          \"10\":0.8240465418,
          \"22\":0.8208887793,
          \"15\":0.8175186551,
          \"16\":0.8134277628,
          \"18\":0.8100024799,
          \"6\":0.8084391458,
          \"13\":0.8072646732
      }
  }"
}
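
Because the body is a JSON string nested inside the response, a caller has to decode it before reading the rankings. The snippet below is a minimal sketch of that step. The helper name ranked_matches and the two-row sample response are illustrative and not part of the original function.

```python
import json

def ranked_matches(response):
    """Decode the Lambda response body into (text, similarity) pairs, best first."""
    body = json.loads(response["body"])
    texts = body["text"]             # row index -> word
    scores = body["similarities"]    # row index -> cosine similarity
    pairs = [(texts[key], scores[key]) for key in texts]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Trimmed-down copy of the sample response above (first two rows only)
sample_response = {
    "statusCode": 200,
    "body": json.dumps({
        "text": {"26": "donut", "19": "espresso"},
        "similarities": {"26": 0.8761945096, "19": 0.8451595073},
    }),
}

matches = ranked_matches(sample_response)
print(matches)
```

The keys ("26", "19", ...) are the row positions from words.csv, so the same key links a word to its score.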

AWS Lambda Code Breakdown

Imported Python Libraries

import os
import io
import openai
import numpy as np
from numpy.linalg import norm
import pandas as pd

os: Interacts with the operating system. It provides a way to access environment variables (e.g. OPENAI_API_KEY).
io: Works with input and output streams. It provides a way to read and write data to various sources, such as files, in-memory buffers, and network connections.
openai: Integrates with the OpenAI API. It provides a way to access various models and functionality provided by OpenAI.
numpy: Manipulates numerical data. It provides a wide range of functionality for working with arrays, matrices, and other numerical data structures (e.g. numerical vectors aka embeddings).
numpy.linalg: A sublibrary of numpy that provides linear algebra functionality. It provides functions for working with matrices, solving systems of equations, and other linear algebra related tasks.
pandas: Works with data in a tabular format. It provides a DataFrame data structure, which is similar to a spreadsheet, and provides a wide range of functionality for working with data in this format.

AWS Lambda Layers

An AWS Lambda layer is a way to package and manage dependencies that can be shared across multiple AWS Lambda functions. A layer is a .zip file that contains libraries, a custom runtime, or other function dependencies.

AWS Lambda layers (numpy, pandas, and openai) used within our Python 3.9 code have been uploaded to the following repo:

GitHub - IvanCampos/openai-text-embedding: Uncover hidden connections and find the most…

Uncover hidden connections and find the most semantically similar text to your search term with just a CSV file…

github.com

AWS Lambda Handler (lambda_function.py)

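def lambda_handler(event, context):
    # Read the API key from the Lambda environment variable
    openai.api_key = os.getenv('OPENAI_API_KEY')

    # Load the words to search; the CSV must contain a 'text' column
    df = pd.read_csv('words.csv')
    
    # Example embedding call; the result is not used later in the handler
    input_term = "the fox crossed the road"
    input_term_embeddings = get_embeddings_for_text(input_term)
    
    # Embed every row of the 'text' column (one API call per row)
    df['embedding'] = df['text'].apply(lambda x:get_embeddings_for_text(x))
    
    # Round-trip through an in-memory CSV string instead of writing to disk,
    # then parse the stringified embeddings back into numpy arrays
    output = df.to_csv(index=False)
    df = pd.read_csv(io.StringIO(output))
    df['embedding'] = df['embedding'].apply(eval).apply(np.array)

    # Embed the search term
    search_term = "dunkin"
    search_term_vector_embeddings = get_embeddings_for_text(search_term)
    
    # Score every row against the search term and keep the 10 best matches
    df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector_embeddings))
    df_top = df.sort_values("similarities", ascending=False).head(10)
    df_return = df_top[['text', 'similarities']].to_json()

    return {
        'statusCode': 200,
        'body': df_return
    }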

This function is an AWS Lambda function written in Python. It does the following:

Sets the OpenAI API key using the environment variable ‘OPENAI_API_KEY’ (this is stored in the AWS Console under the Lambda function’s Configuration → Environment variables)
Reads a CSV file ‘words.csv’ and loads it into a Pandas dataframe ‘df’
Defines a variable ‘input_term’ with a string “the fox crossed the road” and uses the function ‘get_embeddings_for_text’ to obtain embeddings for this string.
Applies the function ‘get_embeddings_for_text’ to each element of the ‘text’ column of the dataframe ‘df’ and stores the returned embeddings in a new column called ‘embedding’
Writes the dataframe to an in-memory CSV string and reads that string back into a new dataframe ‘df’, which avoids writing to the file system
Converts the ‘embedding’ column from string to array
Defines a variable ‘search_term’ with a string “dunkin” and uses the function ‘get_embeddings_for_text’ to obtain embeddings for this string.
Applies the function ‘cosine_similarity’ on each element of the ‘embedding’ column using the ‘search_term_vector_embeddings’ variable and stores the returned similarity scores in a new column called ‘similarities’
Sorts the dataframe by the ‘similarities’ column in descending order and keeps only the top 10 rows
Selects only the ‘text’ and ‘similarities’ columns from the dataframe
Converts the dataframe to json format
Returns a dictionary containing status code 200 and the json data as the response body.
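
The CSV round trip in the steps above turns each embedding list into text, which is why the following step has to parse it back into numbers. Here is a minimal sketch of that effect using only the standard library: csv stands in for pandas, and ast.literal_eval is used as a safer substitute for eval. The row values are made up.

```python
import ast
import csv
import io

# One made-up row; real embeddings are far longer
row = {"text": "coffee", "embedding": [0.1, -0.2, 0.3]}

# Write the row to an in-memory CSV buffer (no file system access)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
writer.writerow(row)

# Read it back: the list now arrives as a plain string
buffer.seek(0)
restored = next(csv.DictReader(buffer))
as_text = restored["embedding"]

# Parse the string back into a list of floats
as_list = ast.literal_eval(as_text)
print(type(as_text).__name__, type(as_list).__name__)
```

ast.literal_eval only accepts Python literals, so it is a reasonable choice whenever the CSV content could come from an untrusted source.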

Helper Function: Get Embeddings from OpenAI API

def get_embeddings_for_text(input_term):
    input_vector = openai.Embedding.create(
        input = input_term,
        model="text-embedding-ada-002")
    input_vector_embeddings = input_vector['data'][0]['embedding']
    return input_vector_embeddings

This function uses OpenAI’s Embeddings API to create an embedding for the input term.

The first line of code creates an embedding for the input term using the “text-embedding-ada-002” model. This is done by calling the create method of the openai.Embedding class and passing the input term and model name as arguments.

The second line of code extracts the embedding from the response of the API call and assigns it to the variable input_vector_embeddings. This is done by accessing the 'data' key of the response object and taking the first element of that list, which is a dictionary. The 'embedding' key is then read from this dictionary.

Finally, the function returns the input_vector_embeddings.
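
To make the indexing concrete, here is a sketch that applies the same lookup to a hand-written response. The dictionary below is a hypothetical stand-in with made-up numbers, shaped like the response the helper reads from. extract_embedding is an illustrative name, not a function from the openai library.

```python
def extract_embedding(response):
    """Return the embedding vector from an Embeddings API style response."""
    return response["data"][0]["embedding"]

# Hypothetical response with made-up numbers;
# a real text-embedding-ada-002 vector has 1536 values
mock_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.0023, -0.0091, 0.0154]},
    ],
    "model": "text-embedding-ada-002",
    "usage": {"prompt_tokens": 1, "total_tokens": 1},
}

vector = extract_embedding(mock_response)
print(len(vector), vector[0])
```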

Helper Function: Cosine Similarity

def cosine_similarity(A, B):
    return np.dot(A,B)/(norm(A)*norm(B))

The function calculates the cosine similarity between the two input vectors.

Cosine similarity measures how similar two non-zero vectors of an inner product space are by taking the cosine of the angle between them. It ranges from -1 to 1.

The function uses the dot product of the two input vectors A and B and divides it by the product of the norm of A and the norm of B.

The dot product of two vectors is the sum of the products of their corresponding components, and the norm of a vector is the square root of the sum of the squares of its components.

The dot product of the two input vectors A and B is calculated using the numpy function np.dot(A, B), and the norm of a vector A is calculated using norm(A), which is imported from numpy.linalg.

Finally, the function returns the cosine similarity value as a scalar, which ranges from -1 to 1.
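
The same formula can be spelled out term by term without numpy. This is only a sketch for checking the arithmetic by hand, with tiny made-up vectors. The Lambda code itself uses the numpy version shown above.

```python
import math

def cosine_similarity(a, b):
    # Dot product: sum of the products of corresponding components
    dot = sum(x * y for x, y in zip(a, b))
    # Norm: square root of the sum of the squared components
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])            # 0
opposite = cosine_similarity([1, 2], [-1, -2])            # close to -1

print(same_direction, orthogonal, opposite)
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the final decimal places.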

Pricing for text-embedding-ada-002

“Build advanced search, clustering, topic modeling, and classification functionality with our embeddings offering.”
text-embedding-ada-002
$0.0004 / 1K tokens
Source: https://openai.com/api/pricing/ (Jan. 2023)

If you recall from a previous post, 1K tokens is roughly 750 words.

OpenAI API Pricing in Words per Dollar

Pricing Breakdown for Your Favorite AI’s Favorite API

medium.com

Using the previous post’s math, we can calculate the following:

Getting OpenAI embeddings for the entire King James Bible should only cost around 42 cents!
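
The arithmetic behind that figure can be sketched in a few lines. The price and the 750-words-per-1K-tokens ratio come from the numbers quoted above. The word count of 783,137 is an assumption (a commonly cited figure for the King James Bible) and does not come from this post.

```python
def estimate_embedding_cost(word_count, words_per_1k_tokens=750, price_per_1k_tokens=0.0004):
    """Rough embedding cost in dollars for a text of the given word count."""
    tokens_in_thousands = word_count / words_per_1k_tokens
    return tokens_in_thousands * price_per_1k_tokens

KJV_WORD_COUNT = 783_137  # assumption: commonly cited word count, not from the post

cost = estimate_embedding_cost(KJV_WORD_COUNT)
print(f"${cost:.2f}")  # $0.42
```

Actual token counts depend on the tokenizer, so treat the result as an estimate.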

Real World Example

Using the OpenAI text embeddings, we can already see real world applications like BibleGPT:

BibleGPT works as advertised, but I have requested confirmation on the pricing, as my math does not add up to the $4 estimate provided.

Conclusion

The future of OpenAI text embeddings looks promising as it continues to improve and advance in its ability to understand and generate human-like language. With the ongoing development of machine learning and natural language processing technologies, it is likely that text embeddings will become even more accurate and useful in a wide range of applications, including language translation, text summarization, and sentiment analysis.

As more data is added to the system (i.e. GPT-4), it will continue to learn and improve, becoming an even more powerful tool for businesses and individuals alike. Overall, the future of OpenAI text embeddings is exciting and holds great potential for shaping the way we interact with and understand language.

Credit to the Inspiration behind this Post

On a caffeine-infused trip down a YouTube rabbit hole, I was recommended the video below (Channel: Part Time Larry), which clearly explains and uses the OpenAI Embeddings API functionality for searching financial documents via a Google Colab Jupyter notebook:

Colab Jupyter Notebook

Google Colaboratory

colab.research.google.com

If you watch the video, you’ll notice that I had to make some changes to get it to work in AWS Lambda — primarily, avoiding file system writes, using an environment variable, and the creation of the two custom helper functions.

YouTube Channel: Part Time Larry

General Resources from OpenAI

OpenAI Cookbook

This Colab notebook (referenced above) is based on the Embeddings examples in the OpenAI Cookbook:

GitHub — openai/openai-cookbook: Examples and guides for using the OpenAI API

The OpenAI Cookbook shares example code for accomplishing common tasks with the OpenAI API. To run these examples…

github.com

OpenAI’s Most Recent Embedding Model Announcement

New and Improved Embedding Model

We are excited to announce a new embedding model which is significantly more capable, cost effective, and simpler to…

Official Documentation from OpenAI

OpenAI API

An API for accessing new AI models developed by OpenAI

Written by Ivan Campos