Building a Generative AI Model with Markov Chains: “I love you” = [1,4,3]

Part 4

Context, context, and context… that is all we have been talking about in the previous two blogs. Understanding circumstances and background is absolutely necessary to make meaningful sentences. In linguistics, context refers to the surrounding words and phrases that influence the meaning of a word or statement.

In the last blog, we saw how the Viterbi algorithm could help us solve this issue. But for real-world applications, if we persist with Hidden Markov Models to build our LLMs, we lose out on expressiveness: our model cannot understand long-range dependencies, cannot capture broader context, and cannot handle complex language patterns. Further, scalability with this method is poor, and the data complexity of current industry use cases requires more powerful methods. Hence, we build complex machine learning models and neural networks for our LLMs. But these models do not understand words in text form, only in mathematical representation (such as a number, vector, matrix, etc.).

Let me give you a more intuitive understanding of why numbers/vectors are required to represent words. Consider sentiment analysis, where, given a sentence, our model needs to predict whether the sentence is positive or negative. Let’s assume that our model is already trained. Now, given the sentence “The Taj Mahal is beautiful”, we expect the model to predict “positive”, but this requires the word “beautiful” to be in the training set. If the model has never seen the word “beautiful” during training and has only seen the words “pretty”, “attractive” and “handsome”, then it won’t be able to assign any label to the sentence.

By using vectors to represent words, we can overcome this challenge. For instance, if “beautiful” is represented by the vector [10,29,67], the model can still identify it as positive by comparing it with similar vectors of words it has encountered before — such as “pretty” [9,28,70], “attractive” [10,25,66], and “handsome” [8,29,69]. This numerical representation helps the model generalize to new, similar words and make accurate predictions even when encountering previously unseen terms.
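To make this concrete, here is a minimal sketch of the idea, assuming the toy three-dimensional vectors from the example above (real embeddings are learned and have far more dimensions). It simply compares the unseen word’s vector with the vectors of known positive words using cosine similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means identical direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors for words seen during training, taken from the example above.
known_positive = {
    "pretty":     np.array([9, 28, 70]),
    "attractive": np.array([10, 25, 66]),
    "handsome":   np.array([8, 29, 69]),
}

unseen = np.array([10, 29, 67])  # "beautiful", never seen during training

# "beautiful" sits very close to the known positive words, so the model
# can still treat it as a positive signal.
for word, vec in known_positive.items():
    print(word, round(cosine_similarity(unseen, vec), 4))
```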

There are multiple ways to represent a word mathematically. We could use one-hot encoding, where we represent our data (words, in our case) as binary vectors. Say we have three words, cat, dog, and fish; then we represent cat as (1,0,0), dog as (0,1,0) and fish as (0,0,1). We could also use count-based vectorization, representing a phrase or sentence as a vector as large as our vocabulary, where each entry counts how often that word occurs in the sentence.
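Here is a small illustrative sketch of both representations with a toy vocabulary; the helpers one_hot and count_vector are made up for this example, and real systems use proper tokenizers.

```python
vocabulary = ["cat", "dog", "fish"]

def one_hot(word):
    """Binary vector with a 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocabulary]

def count_vector(sentence):
    """One slot per vocabulary word, holding how often it occurs in the sentence."""
    tokens = sentence.lower().split()
    return [tokens.count(w) for w in vocabulary]

print(one_hot("dog"))                    # [0, 1, 0]
print(count_vector("cat cat dog fish"))  # [2, 1, 1]
```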

Unfortunately, these methods are not efficient, because:

a) The size of the vector produced by these methods depends on the size of our vocabulary: the more words in the vocabulary, the larger the vector needed to represent a single word or sentence.

b) We cannot represent any new word that was never seen while training our model.

c) They do not take any “context” into account.

These are some of the main motivations behind the development of Word Embedding techniques for LLMs.

Word Embeddings are simply a method to represent a word or a phrase in the form of a number or vector (we are just trying to capture the meaning of the word in mathematical form). A good word embedding technique captures:

  1. Semantic Similarity
  2. Analogical Reasoning
  3. Contextual and Hierarchical Relationships
  4. Low dimensionality in representation (Dense Vector representation)
  5. Cross-linguistic properties

Here, we will dive deeper into answering: what makes a word embedding technique good?

Small but Mighty

There are two types of vectors to represent pieces of text: sparse vectors and dense vectors. Which is better?

“Kevin flew from Delhi to California” and “Kevin flew from California to Delhi” are different sentences with different meanings. But when represented as sparse vectors, we get a close or even perfect match between the two. Whyyyyy? Because sparse representations such as bag-of-words only count word occurrences and ignore word order, so they miss the actual semantic meaning of the sentences.
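A quick check (a rough sketch using plain whitespace tokenization, not a full vectorizer) confirms this: both sentences collapse to the exact same bag-of-words count vector.

```python
sent_a = "Kevin flew from Delhi to California"
sent_b = "Kevin flew from California to Delhi"

# Shared vocabulary across both sentences, sorted for a fixed ordering.
vocab = sorted(set(sent_a.lower().split()) | set(sent_b.lower().split()))

def bag_of_words(sentence):
    tokens = sentence.lower().split()
    return [tokens.count(w) for w in vocab]

print(bag_of_words(sent_a))
print(bag_of_words(sent_b))
print(bag_of_words(sent_a) == bag_of_words(sent_b))  # True: word order is lost
```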

Dense Vectors, on the other hand, capture the semantic meaning of words in a lower-dimensional space. They represent text with dense, information-rich vectors that encode contextual and semantic relationships between words. Thus, the dense vectors for the two sentences would be quite different, reflecting their distinct meanings.

Dense vectors are preferred over sparse vectors because:

  1. They are smaller in dimension
  2. More information-rich, and
  3. Computationally more efficient for processing.

This allows computers to perform calculations faster and more accurately, enhancing the effectiveness of natural language processing tasks.
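As one possible illustration (this uses the sentence-transformers library and the all-MiniLM-L6-v2 model as an example; neither is prescribed by this post), the dense embeddings of the two flight sentences are compact vectors that no longer collapse to the same representation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example model choice; any sentence embedding model would do for this sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Kevin flew from Delhi to California",
    "Kevin flew from California to Delhi",
]
emb_a, emb_b = model.encode(sentences)

# Dense embeddings are low-dimensional and sensitive to word order,
# so the two sentences no longer map to identical vectors.
cosine = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(emb_a.shape)               # (384,): far smaller than a vocabulary-sized sparse vector
print(round(float(cosine), 3))   # high, but not 1.0
```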

Comparison of sparse (left) and dense (right) vector representations, highlighting the efficiency and richness in capturing word meanings.

Word Neighbors

What does it mean for words to be similar? Words used in the same context, or around the same set of words, are called similar words. For instance, “dog” and “cat” are similar words, as they are used around words such as “pet” and “animal”. Even though “hat” and “mat” are spelled similarly to “cat”, they are not similar words. We need an embedding technique that acknowledges this “semantic similarity” among words.

So, if we were to plot these words on a graph using vectors or any numerical representation technique, then our words “dog” and “cat” must be close to each other, whereas “hat” and “cat” must be far away from each other.

Image depicting the semantic similarity between words, where closer nodes represent words with higher contextual and meaning-based similarity, illustrating relationships within the language model

In practice, semantic similarity is often measured with cosine similarity, which is the cosine of the angle between two word vectors: the higher the cosine value, the closer (more similar) the words are.
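Here is a minimal sketch of that calculation with hand-picked three-dimensional vectors for “dog”, “cat”, and “hat”; real embeddings are learned and typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-picked toy vectors purely for illustration.
dog = np.array([0.8, 0.6, 0.1])
cat = np.array([0.7, 0.7, 0.1])
hat = np.array([0.1, 0.2, 0.9])

print("dog vs cat:", round(cosine_similarity(dog, cat), 3))  # close to 1
print("cat vs hat:", round(cosine_similarity(cat, hat), 3))  # much lower
```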

Word Mathematics

What do you think Driver - Car + Aeroplane represents? It is a funky way of representing a “Pilot” in a math equation. By removing the Car from the Driver, you’re left with the concept of operating a vehicle without specifying the vehicle. Adding Aeroplane suggests transitioning to a different type of vehicle. Therefore, the role that operates an Aeroplane is a Pilot.

Good word embeddings should allow us to perform vector operations on words; in other words, we should be able to do math with words.

Image demonstrating how vector operations capture analogical reasoning and relationships between words in a semantic space

This operation leverages the fact that the embeddings capture semantic relationships, so the transformation from Driver to Pilot can be understood through vector arithmetic.

In practice, this technique of word mathematics helps our model find synonyms and antonyms of words. It also helps the model understand relationships between words, like occupation and studies, vehicle and size, etc. Advanced models like BERT and GPT-3 use contextual embeddings that adapt based on how words are used. By understanding context, these models can distinguish between the different meanings of a single word in various scenarios. Hence, we want our embedding method to have the ability to perform word mathematics.
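Below is a rough sketch of the Driver - Car + Aeroplane example with hand-made toy vectors (a real setup would use embeddings learned by a model such as Word2Vec): we build the query vector with simple arithmetic and pick the remaining word whose vector is most similar to it.

```python
import numpy as np

# Hand-made toy vectors; real embeddings would come from a trained model.
vectors = {
    "driver":    np.array([0.9, 0.1, 0.8]),
    "car":       np.array([0.9, 0.1, 0.1]),
    "aeroplane": np.array([0.1, 0.9, 0.1]),
    "pilot":     np.array([0.1, 0.9, 0.8]),
    "chef":      np.array([0.5, 0.2, 0.8]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Driver - Car + Aeroplane should land near Pilot.
query = vectors["driver"] - vectors["car"] + vectors["aeroplane"]

candidates = [w for w in vectors if w not in {"driver", "car", "aeroplane"}]
best = max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))
print(best)  # pilot
```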

Now that we have covered the fundamentals of effective embedding techniques, in the next blog I will bring you some popular word embedding methods used in practice. Till then, stay tuned. Welcome - Start + End = Bye!!!!

