Demystifying Embeddings


Vector embeddings are a numerical representation of text in a high-dimensional space. Within that high-dimensional space, semantically similar text should be close together. So, in theory, the vectors for “king” and “queen” should be closer together than, say, the vectors for “king” and “banana”.

Various pretrained models exist for doing this, and writing the Python to convert a sentence into a vector is amazingly simple (thank you, sentence-transformers package!). The function below takes a list of strings and returns an array of embeddings, one per string.

An embedding (with this model at least) is a vector of size 384.

from sentence_transformers import SentenceTransformer

def generate_embeddings(texts):
    """Generate embeddings for a list of texts."""
    model = SentenceTransformer('all-MiniLM-L6-v2')
    return model.encode(texts)
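
As a quick sanity check, encoding a few words gives one 384-dimensional vector per input (a small sketch, assuming the function above is in scope):

embeddings = generate_embeddings(["king", "queen", "banana"])
print(embeddings.shape)  # (3, 384) -- one vector of length 384 per input string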

What does this vector mean? Well, it’s a location in a 384-dimensional space, of course. Yes, that didn’t help me much either. How can we interpret it?

One approach is to measure the “closeness” of any two embeddings. How do you find out how close two points are?

  • Euclidean distance — this is the straight line distance and simple to calculate.
  • Manhattan distance — the distance if you could only move along grid lines.
  • Cosine similarity — the cosine of the angle between the vectors (the closer to 1, the more aligned; -1 means they point in opposite directions)
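
Here’s a rough sketch of computing all three with SciPy (the word list is just an example, and it reuses generate_embeddings from above; note that SciPy’s cosine() is a distance, so we flip it to get similarity):

from scipy.spatial.distance import cosine, euclidean, cityblock

# Example words only -- any strings would do.
queen_vec, king_vec, banana_vec = generate_embeddings(["queen", "king", "banana"])

def compare(a, b):
    """Print the three closeness measures for a pair of embeddings."""
    print(f"Cosine Similarity: {1 - cosine(a, b):.3f}")   # cosine() is a distance, so invert it
    print(f"Euclidean Distance: {euclidean(a, b):.3f}")
    print(f"Manhattan Distance: {cityblock(a, b):.3f}")

compare(queen_vec, king_vec)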

If we calculate these for queen, king and banana, we see what we expect. The cosine similarity is higher for related terms, and the distances are smaller.

queen <-> king
Cosine Similarity: 0.681
Euclidean Distance: 0.799
Manhattan Distance: 12.253

queen <-> banana
Cosine Similarity: 0.397
Euclidean Distance: 1.099
Manhattan Distance: 17.292

king <-> banana
Cosine Similarity: 0.395
Euclidean Distance: 1.100
Manhattan Distance: 16.981

But even this isn’t a great way of understanding it. What does the magnitude of the difference mean? Is it a lot, or not much? To visualize it properly, we need to convert that high-dimensional space into something a bit more useful.

We’ll use PCA (principal component analysis) to do dimensionality reduction. Roughly speaking, PCA projects the data onto the directions that best capture the variation in the data.

from sklearn.decomposition import PCA

def reduce_dimensions(embeddings, n_components=3):
    """Project embeddings down to n_components dimensions with PCA."""
    pca = PCA(n_components=n_components)
    transformed_data = pca.fit_transform(embeddings)
    variance_ratios = pca.explained_variance_ratio_
    return transformed_data, variance_ratios

This reduces those 384-dimensional vectors down to small 3D vectors. We also return the variance ratios, which tell us how much of the variance the principal components capture. In my case, with [king, queen, orange, banana], the first two principal components capture about 80% of the variance. We can now plot these to get an idea of how separated they are.
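
A minimal plotting sketch, assuming the reduce_dimensions function above and matplotlib, might look like this:

import matplotlib.pyplot as plt

words = ["king", "queen", "orange", "banana"]          # example word list
reduced, variance_ratios = reduce_dimensions(generate_embeddings(words))

# Plot the first two principal components and label each point with its word.
fig, ax = plt.subplots()
ax.scatter(reduced[:, 0], reduced[:, 1])
for word, (x, y) in zip(words, reduced[:, :2]):
    ax.annotate(word, (x, y))
ax.set_xlabel(f"PC1 ({variance_ratios[0]:.0%} of variance)")
ax.set_ylabel(f"PC2 ({variance_ratios[1]:.0%} of variance)")
plt.show()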

Embeddings of some words

You shouldn’t try to draw any conclusions about what the axes represent; they are simply mathematical constructs that capture the variation in this particular set. Sometimes folks like to imagine that a particular dimension in an embedding corresponds to some human intuition about the difference between words. That’s incredibly unlikely!

So, how’s that useful?

Our first application of this is semantic search. How do embeddings help here? Well, we don’t just have to create embeddings of single words; we can throw entire sentences in. To illustrate this, let’s write some code to index every sentence in the complete works of William Shakespeare, plonk it in a vector database and allow you to find the most similar sentences.

This sounds like it’d be a huge amount of work, but thanks to a whole bunch of Python libraries it’s about 100 lines of code (!).

import os
import pickle

import faiss
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

class SentenceIndexer:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        nltk.download('punkt', quiet=True)
        self.model = SentenceTransformer(model_name)
        self.index = None
        self.sentences = []

    def load_and_split_text(self, file_path):
        """Read a text file and split it into sentences."""
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
        self.sentences = sent_tokenize(text)
        return self.sentences

    def create_embeddings(self):
        """Encode every sentence into an embedding vector."""
        embeddings = self.model.encode(self.sentences, show_progress_bar=True)
        return embeddings

    def build_index(self, embeddings):
        """Build a FAISS index over the embeddings using L2 distance."""
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(embeddings.astype(np.float32))

    def save_index(self, save_dir):
        """Persist the index and the sentences to disk."""
        os.makedirs(save_dir, exist_ok=True)
        faiss.write_index(self.index, os.path.join(save_dir, "sentence_index.faiss"))
        with open(os.path.join(save_dir, "sentences.pkl"), 'wb') as f:
            pickle.dump(self.sentences, f)

    def load_index(self, save_dir):
        """Load a previously saved index and its sentences."""
        self.index = faiss.read_index(os.path.join(save_dir, "sentence_index.faiss"))
        with open(os.path.join(save_dir, "sentences.pkl"), 'rb') as f:
            self.sentences = pickle.load(f)

    def search(self, query, k=5):
        """Return the k sentences whose embeddings are closest to the query's."""
        query_embedding = self.model.encode([query])
        distances, indices = self.index.search(query_embedding.astype(np.float32), k)
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            results.append({
                'sentence': self.sentences[idx],
                'distance': dist,
                'index': idx
            })
        return results

On my machine (without any fancy GPU acceleration), it takes about 30 seconds to build the index. Sometimes it’s amazing how fast computers are!
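
Wiring it all together looks roughly like this (shakespeare.txt and the index directory name are placeholders for wherever you keep the text and the saved index):

indexer = SentenceIndexer()
indexer.load_and_split_text("shakespeare.txt")   # placeholder path: the complete works as plain text
embeddings = indexer.create_embeddings()
indexer.build_index(embeddings)
indexer.save_index("shakespeare_index")          # example directory, so we can reload without re-encoding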

Once you’ve got the index, it’s easily searchable. When a user enters some text, you calculate the embedding vector and find the nearest neighbours to that. Here’s an example:

Query: Ouch, I've bumped my head.
Most similar sentences:
1. Distance: 0.92
Lord, how my head aches!
2. Distance: 1.01
I’ll scratch your heads.
3. Distance: 1.01
Scratch my head, Peaseblossom.
4. Distance: 1.04
I have a pain upon my forehead here.
5. Distance: 1.11
ne’er pull your hat upon your brows.

Cool — semantic search in a few lines of code!

There are plenty more applications of this. One popular one at the moment is Retrieval Augmented Generation (RAG), a technique for finding relevant context to help large language models give better answers. You feed documents into a vector database as above, and when a user makes a query, you find related documents and add them into the context for the LLM.
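
The retrieval half is exactly what we built above. A rough sketch, reusing SentenceIndexer and leaving the actual LLM call as a placeholder, might look like:

def answer_with_context(indexer, question, k=3):
    """Retrieve the k most similar sentences and fold them into an LLM prompt."""
    context = "\n".join(r['sentence'] for r in indexer.search(question, k))
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)   # placeholder: swap in whichever LLM client you're using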

Hopefully that’s demystified embeddings. And if it hasn’t, I’d encourage you to download some Python packages and play with it!
