# Neural Network Embeddings: from inception to simple

Whenever I encounter a machine learning problem that I can easily solve with a neural network, I jump at it. After all, nothing beats a morning standup other than a morning standup with half a dozen buzzwords. I am only slightly joking about the buzzword part. The reality is that there are always times when a neural network is an ineffective, obsolete or completely overkill solution, but today I’m going to tell you a little more about a lesser-known feature of our favorite topic. I am, of course, talking about our good friend the embedding, in the context of neural networks.

For those of you who have delved into the world of Natural Language Processing (NLP), embeddings will be quite familiar, and rightfully so: NLP is one of the more popular cases where embeddings take center stage in an experiment. It’s actually quite amazing that such a useful and well-thought-out feature has so little documentation in areas outside of NLP.

In this brief article, I am going to take you through a short example of the use of embeddings, this time in a deep neural network for recommendations using the Keras framework.

**m bed huh…?**

Much like everything in the data science world, embeddings are delightfully simple in a complicated fashion. The Keras documentation is a little sparse on the topic of embeddings, simply alluding to the fact that they turn positive integers into fixed-size dense vectors. What does that even mean, and what does it tell us about the usefulness of embeddings?

Embeddings are the best-kept secret for neural networks with multiple varying inputs. Put simply, embedding layers behave in a similar fashion to their cousins, the dense hidden layers. That is to say, the dense vectors mentioned above are learned weights, much like those of the dense hidden layers, but they are not the result of an activation function; each input integer simply selects its own vector of weights. The embedding weights are initialized randomly, so the first forward pass uses random vectors, and they are only updated during backpropagation. The idea behind this is that the model can group similar inputs based on the outcome of the forward pass through the model.
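To make the "lookup rather than activation" idea concrete, here is a minimal NumPy sketch. It is not Keras internals, just an illustration under assumed sizes (5 inputs, 3 dimensions) and an assumed uniform random initialization; `embed` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative sizes only: 5 distinct inputs, each mapped to a 3-dimensional vector.
n_items, embedding_dim = 5, 3

# Randomly initialized weights, as they would be before any training step.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(n_items, embedding_dim))

def embed(indices, matrix):
    """Look up one row of the weight matrix per integer index (no activation)."""
    return matrix[np.asarray(indices)]

item_ids = [2, 0, 2]
vectors = embed(item_ids, embedding_matrix)

# The same result as multiplying one-hot rows by the matrix, i.e. a dense
# layer without bias or activation applied to a one-hot input.
one_hot = np.eye(n_items)[item_ids]
same_as_dense = np.allclose(one_hot @ embedding_matrix, vectors)
```

During training, backpropagation would only adjust the rows that were actually looked up in a given batch, which is how inputs that lead to similar outcomes can drift towards similar vectors.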

In the above graphic, we can see that the inputs are normalized to positive integer values, which allows us to extract the embedding weights stored at each index. This is why we are required to specify the embedding size before the model is trained: the dimensionality is set to *n-inputs* of *x-dimensional* vectors. To put it simply, we have a multidimensional vector representation for each input into the model.
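As a rough sketch of that normalization step, raw identifiers can be mapped to contiguous integer indices before they reach the embedding layer. The listing IDs and the `build_index` helper below are made up for illustration; the 32 dimensions are an arbitrary choice for this sketch.

```python
def build_index(raw_ids):
    """Assign each distinct raw ID the next free integer, in order of first appearance."""
    index = {}
    for raw_id in raw_ids:
        if raw_id not in index:
            index[raw_id] = len(index)
    return index

# Hypothetical raw listing IDs as they might appear in interaction logs.
listing_ids = ["car-9001", "car-17", "car-9001", "car-404"]

listing_index = build_index(listing_ids)
encoded = [listing_index[raw_id] for raw_id in listing_ids]

# The embedding table must be sized up front: n-inputs rows of x-dimensional vectors.
n_inputs = len(listing_index)
embedding_dim = 32
embedding_shape = (n_inputs, embedding_dim)
```

Here the indices start at 0, so the table needs exactly as many rows as there are distinct IDs. Any ID that was not in the index at training time has no row, which is one reason the table size has to be fixed before training.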

Now you may be thinking, that’s great, we have some values from our inputs and all, but what does that mean? Well, in a nutshell, it means that neural networks are not the mystical *“black boxes”* that everyone claims them to be, or at least not entirely. Embeddings give us insight into the inner workings of our network, and with a little help from dimensionality reduction techniques such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbour Embedding (t-SNE), we can have an inside look into how the network is grouping various inputs. Essentially, we get a little peek into what our model is thinking and doing, an insider’s advantage if you will.
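For a feel of what that looks like in practice, the sketch below projects a set of embedding vectors down to two dimensions with a plain SVD-based PCA in NumPy. The random matrix is only a stand-in for trained embedding weights, and `pca_project` is a hypothetical helper rather than a library function.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their top principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Stand-in for trained weights: 100 inputs with 32-dimensional embeddings.
rng = np.random.default_rng(seed=1)
embeddings = rng.normal(size=(100, 32))

points_2d = pca_project(embeddings)
```

Each row of `points_2d` can then be drawn on a scatter plot. With real trained weights, inputs the network treats alike tend to land near each other, while the random stand-in used here will show no such structure.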

**How about a small example?**

We are going to keep this section short and sweet, as the focus of this article is on the embeddings themselves. This scenario is taken from work carried out in another series, in which we created a deep neural network using Keras to recommend cars to returning users. Given the main focus of the project, the use of embeddings was not immediately apparent and was, at first, overlooked. We then came to realize that a common shortfall of many recommendation algorithms is the *“cold-start”* problem.

In this simplified example, we reproduce a typical collaborative filtering approach by decoding matrices of users and vehicles (listings). Please keep in mind this is only an example and not a viable production model.

The above model would produce a very basic collaborative filtering result, but what happens if we add embeddings?
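A minimal Keras sketch of such an embedding-based collaborative filtering model might look like the following. It is an illustration only, not the model from the project: the sizes, layer names, sigmoid output and `build_model` helper are all assumptions made for this example.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sizes; real values depend on the number of users and listings.
N_USERS, N_LISTINGS, EMBEDDING_DIM = 100, 50, 32

def build_model(n_users=N_USERS, n_listings=N_LISTINGS, dim=EMBEDDING_DIM):
    """Score a (user, listing) pair by the dot product of their embeddings."""
    user_in = keras.Input(shape=(1,), dtype="int32", name="user_id")
    listing_in = keras.Input(shape=(1,), dtype="int32", name="listing_id")

    # Each ID selects one trainable 32-dimensional vector.
    user_vec = layers.Flatten()(
        layers.Embedding(n_users, dim, name="user_embedding")(user_in))
    listing_vec = layers.Flatten()(
        layers.Embedding(n_listings, dim, name="listing_embedding")(listing_in))

    # Dot product of the two vectors, squashed into an interaction probability.
    score = layers.Dot(axes=1)([user_vec, listing_vec])
    output = layers.Activation("sigmoid")(score)

    model = keras.Model([user_in, listing_in], output)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_model()

# The extractable part: one row of weights per listing index.
listing_vectors = model.get_layer("listing_embedding").get_weights()[0]
```

Until the model is fitted on real interaction data, `listing_vectors` only holds the random initial weights; the point here is the shape of the model and where the embedding matrix can be pulled out.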

Well, not much actually. The model behaves mostly the same, but we now have a small extractable section of the model. This means that we can put the model to good use and find similar vehicles. Using the vehicle IDs, we can look up a high-dimensional set of values in the embedding, in this case 32 dimensions. We can then use cosine similarity, Euclidean distance or any other vector similarity measure to find the closest matching listings.
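As a small illustration of the similarity lookup, here is a NumPy sketch that ranks listings by cosine similarity to a query listing. The four 2-dimensional vectors are toy values chosen so the result is easy to verify by hand (real embeddings would have 32 dimensions), and `most_similar` is a hypothetical helper name.

```python
import numpy as np

def most_similar(query_index, vectors, top_k=3):
    """Return (index, cosine similarity) pairs for the top_k nearest rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # never return the query itself
    best = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in best]

# Toy embedding matrix: one row per listing index.
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

neighbours = most_similar(0, vectors, top_k=2)
```

For listing 0, the nearest neighbour is listing 1 (almost the same direction), followed by listing 2 (orthogonal), while listing 3 points the opposite way and ranks last.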

The outcome of the above example is quite rudimentary. We are simply able to find listings that are related through user interactions, the idea of “users who looked at this also looked at…”. But what if we took it a step further and added a few features belonging to vehicles?

**So what about the real world?**

We had a brief and simple example of embedding user and vehicle (listing) features to replicate collaborative filtering, but at heycar, we developed something a little deeper than that:

So what are we looking at? Well, we have a model with our same old user and listing inputs, but now we also have a few more inputs belonging to listing features, such as price, location and make. We also have a slightly deeper model, because we removed the dot product of our two inputs and replaced it with a few more dense layers.
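The general shape of such a model can be sketched in Keras as below. To be clear, this is not the heycar production architecture: the choice to embed location and make as integer IDs, to feed price in as a single numeric value, the layer sizes and the `build_deep_model` helper are all assumptions for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_deep_model(n_users, n_listings, n_locations, n_makes, dim=32):
    """Multi-input model: embedded IDs plus a numeric feature, merged by dense layers."""

    def id_branch(name, vocab_size):
        # One integer input and one embedding table per categorical feature.
        inp = keras.Input(shape=(1,), dtype="int32", name=name)
        vec = layers.Flatten()(
            layers.Embedding(vocab_size, dim, name=f"{name}_embedding")(inp))
        return inp, vec

    user_in, user_vec = id_branch("user", n_users)
    listing_in, listing_vec = id_branch("listing", n_listings)
    location_in, location_vec = id_branch("location", n_locations)
    make_in, make_vec = id_branch("make", n_makes)

    # Price is assumed to be scaled beforehand and passed as a plain float.
    price_in = keras.Input(shape=(1,), dtype="float32", name="price")

    # Dense layers replace the dot product used in the simpler model.
    merged = layers.Concatenate()(
        [user_vec, listing_vec, location_vec, make_vec, price_in])
    hidden = layers.Dense(64, activation="relu")(merged)
    hidden = layers.Dense(32, activation="relu")(hidden)
    output = layers.Dense(1, activation="sigmoid")(hidden)

    return keras.Model(
        [user_in, listing_in, location_in, make_in, price_in], output)

deep_model = build_deep_model(n_users=100, n_listings=50, n_locations=20, n_makes=10)
listing_weights = deep_model.get_layer("listing_embedding").get_weights()[0]
```

The listing embedding can be extracted by name exactly as in the simpler model, so the same similarity lookup applies once the model has been trained.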

Using the same approach we discussed earlier, we extracted the embeddings, applied some dimensionality reduction and compared the results using cosine similarity. This gave us a highly scalable method to match similar listings.

For example, let’s say we look for cars similar to a 2018 Nissan Qashqai, with a price of €23,221 and located in Dresden, as seen below:

Now if we examine the inferred similar listings our top 3 results are:

The hard similarities are quite rational. The most interesting thing, however, is the location: it is not merely based on the provided location (Dresden), but is in fact inferred from users’ searches and their search range while on the site. The results include listings from within Dresden itself as well as from towns and cities around Dresden, all within one hour’s travel time and less than 100 km away.

The above model consistently produces similar results in a highly scalable production environment. We naturally tweaked the model and approach to match our changing infrastructure, but the end result is roughly the same.

# Food for thought

After bashing our heads against the topic of neural networks enough times, and particularly against the way in which dense hidden layers and their close cousins the embedding layers work, we began to unbox the inner workings of neural networks. We also discovered that by using embedding layers we can not only find useful correlations between different inputs, but also group inputs based on implied similarities deduced by the neural network itself.

**Thanks** for reading our article!

By the way, if you want to work with our highly skilled engineering and data science teams or any related teams, take a look at our careers page.