AI for web devs part 1: Introduction to Vector Embeddings

Duncan Maina
3 min read · Aug 20, 2023


Vector embeddings have emerged as a cornerstone in the world of machine learning and natural language processing. They're used across a variety of CS disciplines, including natural language processing (NLP), recommender systems, search engines, and computer vision. Most people will come across this technology when using text models like ChatGPT, which uses embeddings under the hood.

Vector embeddings are so effective because they map the relationships between entities in data. In doing so, they capture the underlying meaning of the original content, enabling programs to recognize similarities and differences. In a nutshell, vector embeddings capture the relatedness of data.

How Embeddings Work

Let's say we'd like to capture how similar a set of words are to each other. We can give each word pair a score between 0 and 1.

A score of 0 means the two words have absolutely nothing in common and a score of 1 means the words are exactly the same.

Given the following set of words [Vehicle, Couch, Chevy, Chair, Car, Ford, Table], here's how we might score each word's relatedness to the first word, Vehicle.

Vehicle — Car => 0.95

Vehicle — Ford => 0.87

Vehicle — Chevy => 0.85

Vehicle — Chair => 0.20

Vehicle — Couch => 0.15

Vehicle — Table => 0.10

We can then take the scores for all of the words, group them, and plot them. Words that are similar end up closer to each other on the plot, and words that are not similar end up further apart.

Another way to describe this relatedness is as the distance between points on the graph. Think of it like a map, where closely related words are neighbors!
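To make that concrete, here's a small sketch in TypeScript. The 2D coordinates below are invented purely for illustration (real embeddings are learned by a model and have many more dimensions), but they show how "relatedness" becomes something we can compute: distance for how far apart two words sit on the plot, and cosine similarity for a score like the ones above.

```typescript
// Toy 2D "embeddings" -- these coordinates are invented purely for illustration.
type Vec2 = [number, number];

const embeddings: Record<string, Vec2> = {
  Vehicle: [0.90, 0.10],
  Car:     [0.88, 0.12],
  Ford:    [0.80, 0.15],
  Chevy:   [0.78, 0.16],
  Chair:   [0.12, 0.85],
  Couch:   [0.10, 0.80],
  Table:   [0.08, 0.90],
};

// Euclidean distance: how far apart two points are on the plot (small = related).
function distance(a: Vec2, b: Vec2): number {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Cosine similarity: a relatedness score (close to 1 = very similar).
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b));
}

console.log(distance(embeddings.Vehicle, embeddings.Car));   // small: neighbors on the map
console.log(distance(embeddings.Vehicle, embeddings.Table)); // large: far apart

console.log(cosineSimilarity(embeddings.Vehicle, embeddings.Car));   // ≈ 1: very similar
console.log(cosineSimilarity(embeddings.Vehicle, embeddings.Table)); // ≈ 0.2: not very similar
```

Either measure gives us a number we can sort by, which is all a program needs to answer "which words are most like Vehicle?"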

How are Vector Embeddings Generated?

Vector embeddings are generated by machine learning models that have been trained on relevant data; a model for text will have been trained on human language. A few examples of these models include:

  • Word2Vec: Represents text by mapping words to vectors based on semantic similarity.
  • GloVe: Represents text by mapping words to vectors using word co-occurrence statistics.
  • FastText: Like Word2Vec, but it also uses subword information (character n-grams), which helps it handle rare and misspelled words.
  • BERT: A newer deep learning method trained on vast amounts of text that produces contextual embeddings, where a word's vector depends on the sentence around it.
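In practice you rarely train one of these models yourself; you call one through a library or an API. Here's a rough sketch of what that looks like, assuming the official `openai` Node.js SDK (v4-style API) and an OPENAI_API_KEY environment variable; the model name is just one example. Whatever provider you use, the shape is the same: text goes in, an array of numbers comes out.

```typescript
// A minimal sketch: text in, vector of numbers out.
// Assumes the official `openai` Node.js SDK (v4-style API) and that
// OPENAI_API_KEY is set in the environment.
import OpenAI from "openai";

const openai = new OpenAI();

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-ada-002", // one example of an embedding model
    input: text,
  });
  return response.data[0].embedding; // the vector representing our text
}

embed("Vehicle").then((vector) => {
  console.log(vector.length); // the number of dimensions in this model's vectors
});
```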

You can take a look at the Word2Vec model by visiting https://projector.tensorflow.org. The website offers a visual representation, making it easier to understand how words are related to each other in vector space.

(Visualization of the Word2Vec 10K model)

Final Thoughts

As you can see, the Word2Vec model is a lot more complicated than our simple example. Whereas our example had just two dimensions, plotted on the x and y axes, Word2Vec has 200 dimensions! And as we'll see later in part 2 of this series, models can get even more complex. But the same principles apply.
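To underline that last point: nothing about the similarity math cares how many dimensions the vectors have. The 200-dimensional vectors below are just random placeholders, not real Word2Vec embeddings, but the same code produces a single relatedness score either way.

```typescript
// Same cosine similarity as in the earlier sketch -- it works for any number of dimensions.
const cosineSimilarity = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0) / (Math.hypot(...a) * Math.hypot(...b));

// Placeholder 200-dimensional vectors: random numbers standing in for real embeddings.
const first = Array.from({ length: 200 }, () => Math.random());
const second = Array.from({ length: 200 }, () => Math.random());

console.log(cosineSimilarity(first, second)); // still just one relatedness score
```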

In part 2 of this series, we'll explore how to generate embeddings using our own custom data, giving us the ability to customize the responses from LLMs like ChatGPT.
