The What and Why of Embeddings?

Hardik Bhati · Published in Nybles · Jun 27, 2020 · 7 min read

Learning a new language ain’t easy, is it? It is almost always a pain. In just the same way, I can’t expect my laptop to pick up what I speak that easily. The best we can do is teach our computers to establish relationships between the words we use.

“You shall know a word by the company it keeps” - J.R. Firth

Why do we need embeddings at all?

Take this little conversation, where I am trying to teach my computer to speak English (for simplicity, I’ll show you the conversation in English, since unlike me, I don’t think you are that comfortable with binary).

Intel Core i5- Hey buddy, why don’t you teach me your language!

Me- Umm, let’s start with this, I am gonna teach u some words. Alright?

Intel Core i5- Yups! That would be great.

Me- Well, take these: Apple -> [1 0 0 0], Blackberry -> [0 1 0 0], Raspberry -> [0 0 1 0], Django -> [0 0 0 1]

Intel Core i5- (Demonic laugh) Oh boy, just 4 fruits and that’s your language!

Me- Sorry, but there are 171,476 more, and with an average of 14 words per sentence, that comes to roughly 1.68e+73 possible word sequences for a single sentence. And btw, I didn’t mean the fruits; I meant the companies, plus a random web framework.

Intel Core i5- You do know you are horrible at this shit, right?
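To make the computer’s complaint concrete, here is a minimal sketch of the one-hot scheme from the conversation above (the toy vocabulary is just the four words we used). Every word gets a vector as long as the vocabulary, almost entirely zeros, and every pair of distinct words ends up exactly the same distance apart, so the representation carries no meaning at all:

```python
import numpy as np

# Toy vocabulary from the conversation above (a real one has ~170,000 entries)
vocab = ["apple", "blackberry", "raspberry", "django"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index and 0s elsewhere."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

apple = one_hot("apple", vocab)
blackberry = one_hot("blackberry", vocab)
raspberry = one_hot("raspberry", vocab)

# Every pair of distinct words is exactly the same distance apart (sqrt(2)),
# so "apple vs blackberry" looks no more related than "apple vs django".
print(np.linalg.norm(apple - blackberry))  # 1.414...
print(np.linalg.norm(apple - raspberry))   # 1.414...
```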

Who wouldn’t be ashamed after failing so horribly? Let’s try a better approach, one where what we tell the computer actually carries meaning instead of being rote memorization. Embeddings are the beacon of hope here: they power some amazing tasks such as writing a novel for us, voice recognition, face recognition, and an endless list of others. We will take a look at a few of them towards the end.

But what are embeddings?

Embeddings are low-dimensional, learned, continuous vector representations of the things we want a computer to understand.

Keeping the Doc away!!

Take, for example, this apple. From a practical viewpoint, how would you explain what an apple is to another person who has never seen one?

One feasible way is to use references to things the person has already seen: maybe tell him that it tastes a bit like a pear, and maybe show him a picture. Notice that without actually handing over an apple, you gave someone who has never had one a feel for what it is.

There is semantics to what we speak; we don’t just utter random words, we have chosen a set of names and there is meaning associated with them. In just the same way, when we pick a set of rules to represent a word, there should be some meaning behind the representation instead of randomly messing around.

Embeddings work the same way: we represent things by vectors. And there is a clever trick to capture similarity: the Euclidean distance between two vectors is a measure of how similar the two things are.

Say I choose to make an embedding for how good a fruit is. What I need to decide is its dimensionality, i.e. how many features I want to include. If I want an overly simple model of how good a fruit tastes, I might choose just 1 dimension (how good it tastes). So my embedding goes something like this: Apple -> [0.8], Mango -> [0.9999], Grapes -> [0.9], Watermelon -> [0.75]. But we can already feel that this does not always give a good picture; we know the size of the fruit also matters. So I decide to add another dimension to quantify that: Apple -> [0.8, 0.5], Mango -> [0.9999, 0.6], Grapes -> [0.9, 0.1], Watermelon -> [0.75, 0.9]. Similarly, others might want to add another dimension for popularity and another for availability.

As we can observe, there is no fixed embedding and no fixed choice of dimensionality; both can vary depending on one’s needs.
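As a quick sketch using the made-up fruit vectors from above, we can compute the Euclidean distances between these 2-dimensional embeddings and see which fruits come out as most similar:

```python
import numpy as np

# The toy 2-D fruit embeddings from above: [taste, size]
fruits = {
    "apple":      np.array([0.8,    0.5]),
    "mango":      np.array([0.9999, 0.6]),
    "grapes":     np.array([0.9,    0.1]),
    "watermelon": np.array([0.75,   0.9]),
}

# Euclidean distance between every pair: smaller means "more similar"
for a in fruits:
    for b in fruits:
        if a < b:
            d = np.linalg.norm(fruits[a] - fruits[b])
            print(f"{a:10s} <-> {b:10s} : {d:.3f}")
```

In this toy space, apple and mango turn out to be the closest pair, which matches the intuition we baked into the numbers.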

Is it just limited to Apples and Mangoes?

Not at all, embeddings are used in several fields of deep learning.

Word Embeddings- A word embedding is a learned representation of text where words with similar meanings get similar representations. Just like our fruit example, embeddings trained with word2vec are enormously helpful in NLP tasks: word completion, predicting the next word, the “Hey Google” Google Assistant, and the list goes on. These embeddings are powerful enough to capture genuine semantic relations between words. For example, if we compute King vector - Man vector + Woman vector, the nearest vector by Euclidean distance is the Queen vector.

There is no King and Queen here!
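Here is a minimal sketch of that famous analogy, assuming the gensim library and one of its downloadable pretrained word-vector models (the specific model name below is my assumption; any pretrained model containing these words will do). Note that gensim ranks neighbours by cosine similarity rather than raw Euclidean distance, but the idea is the same:

```python
import gensim.downloader as api

# Load a small pretrained word-vector model (assumed model name)
model = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= queen
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to rank "queen" (or something very close) first
```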

Face Embeddings- A face embedding model analyzes images and returns a numerical vector for each detected face, for example in a 1024-dimensional space. One can understand why we might need that many dimensions: our faces are not simple, and reducing the dimensionality much further might not leave a meaningful representation; it might not even be able to tell the difference between Robert Downey Jr. and me. You don’t want that, right?

Further, after training the model on celeb images and marking which of them are smiling and which are not, we can take the mean of the embedding vectors for the smiling faces and then, starting from the vector of any particular non-smiling face, move a little towards that mean to make the face smile, as in this picture.

Adding and Subtracting features from celebs to make them cringy
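A minimal sketch of that “move towards the smiling mean” arithmetic, with random vectors standing in for the embeddings a real pretrained face model would produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors standing in for real 1024-D face embeddings from a
# pretrained model; the arithmetic is what matters here, not the values.
smiling_embeddings = rng.normal(size=(50, 1024))   # faces labelled "smiling"
neutral_embeddings = rng.normal(size=(50, 1024))   # faces labelled "not smiling"

# Mean embedding of all the smiling faces
smiling_mean = smiling_embeddings.mean(axis=0)

# Take one non-smiling face and move it a little towards that mean
face_vec = neutral_embeddings[0]
alpha = 0.3                                        # arbitrary step size
edited_vec = face_vec + alpha * (smiling_mean - face_vec)

# In a real pipeline, edited_vec would be decoded back into an image
# to render the (slightly) smiling face.
print(edited_vec.shape)  # (1024,)
```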

Speaker Embeddings- We recognize the voices of the people we know; there is a pattern in each voice that we pick up on beautifully. But how do we tell a computer to identify a particular voice? Super easy, barely an inconvenience: we tell them by not telling them. They perfect the art by listening to hours of speech and learning to recognize the patterns in a voice. Given a recording, they reduce it to a lower-dimensional vector, and in that space one can clearly see that different recordings of the same person cluster together.

Different colored circles represent different speakers, and the triangles are the medians of all the recordings for a particular person

Just by measuring the distance from each speaker’s mean, one can design a fairly good speaker recognizer. The best part is that pre-trained models are freely available on the internet.
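Here is a minimal sketch of that nearest-centroid idea, again with random vectors standing in for the embeddings a real pretrained speaker model would produce:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for real speaker embeddings: each speaker's recordings cluster
# around their own (random) center.
speakers = {name: rng.normal(size=64) for name in ["alice", "bob", "carol"]}
enrolled = {
    name: center + 0.1 * rng.normal(size=(10, 64))   # 10 recordings per speaker
    for name, center in speakers.items()
}

# Enrollment: store the mean embedding (centroid) of each speaker's recordings
centroids = {name: recs.mean(axis=0) for name, recs in enrolled.items()}

def identify(embedding, centroids):
    """Return the enrolled speaker whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda name: np.linalg.norm(embedding - centroids[name]))

# A new recording from "bob" (again simulated)
new_recording = speakers["bob"] + 0.1 * rng.normal(size=64)
print(identify(new_recording, centroids))  # expected: "bob"
```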

But how to choose the number of dimensions for embeddings?

Think of it something like,

If the dimensionality is large, the vectors can shift around easily. For example, if an apple and a car happen to have the smallest Euclidean distance before training, they can drift apart without disturbing the other vectors around them. Whereas if things get congested, it becomes like a one-way street, where a lot of traffic has to adjust at the same time.

Traffic spreading like Corona Virus

If there are multiple lanes, there is less congestion, and hence training the model is easier.

Social Distancing

But just as we don’t build a 100-lane highway for 100 cars, we don’t choose a dimensionality so large that it starts eating up memory.

This can give rise to that pirate-sounding phenomenon, “the curse of dimensionality”. Having too many features often just means massively over-fitting your data. In layman’s terms, every observation in your dataset starts to appear roughly equidistant from all the others. Moreover, if you try to add a new celeb to your over-fitted celeb embedding, the new data point will force a huge sweep through the embedding space, since the model has to rework its understanding of the celeb faces. Here’s a fantastic blog post to shed some light on it.
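A quick way to see the “everything appears equidistant” effect for yourself, using uniformly random points as a stand-in for an embedding table:

```python
import numpy as np

rng = np.random.default_rng(2)

# For random points, watch the spread between the nearest and farthest
# neighbour shrink (relatively) as the dimensionality grows.
for dim in [2, 10, 100, 1000]:
    points = rng.uniform(size=(200, dim))
    dists = np.linalg.norm(points[0] - points[1:], axis=1)
    ratio = dists.max() / dists.min()
    print(f"dim={dim:5d}  farthest/nearest distance ratio = {ratio:.2f}")
```

As the dimensionality grows, the ratio between the farthest and nearest neighbour shrinks towards 1, i.e. all the points start to look about the same distance away.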

There is always a trade-off between time efficiency and memory efficiency; priorities matter. For word embeddings, a good rule of thumb for the number of dimensions is the 4th root of the number of categories.

Conclusions

Embeddings provide an easy way to work with discrete data. TensorBoard offers an effective way to visualize embeddings while the model is being trained. Try experimenting; it’s an awesome time to be alive.

It is quite easy to use word embeddings in Keras; I highly recommend checking out this link.
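For instance, here is a minimal sketch of a Keras Embedding layer that also applies the 4th-root rule of thumb mentioned above (the vocabulary size and the tiny model around the layer are made up for illustration):

```python
import numpy as np
from tensorflow import keras

vocab_size = 10000                                  # made-up vocabulary size
embedding_dim = int(round(vocab_size ** 0.25))      # 4th-root rule of thumb -> 10

model = keras.Sequential([
    # Maps each integer word index to a dense, trainable embedding_dim vector
    keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),    # e.g. a simple sentiment head
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# A batch of 2 "sentences", each a sequence of 14 word indices
dummy_batch = np.random.randint(0, vocab_size, size=(2, 14))
print(model(dummy_batch).shape)  # (2, 1)
```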

About Me

Undergrad at IIIT-A, AI enthusiast and a happy learner!
