Building a Recommender System for Data Enrichment

Yaniv Goldfrid
Explorium.ai
Jun 24, 2021 · 8 min read

What are Recommender Systems?

You’ve seen it everywhere. Every big tech company now uses some kind of Recommender System in its platform. Facebook suggests friends, Netflix recommends movies, YouTube recommends videos, and Amazon recommends… well, everything. Why do they do it? Every case is different, but in general good recommendations mean a better user experience and new ways to keep users engaged and satisfied with the service. Happy users equal recurring users.

This is not an ad, it’s Netflix recommending TV shows :D

But should we implement a recommender system just because all the cool kids are doing it? We need to think about our business needs and determine whether a system like this will add value to our service. At Explorium we provide our users with thousands of data enrichment signals to enrich their predictive models and advanced analytics, but it can be difficult to find the most relevant signals among such a variety and volume of external data. To help organizations identify the data signals with the highest impact on their analyses and predictions, we built a Data Enrichment Recommender System. This article explains how that recommender system was designed and built.

Implementing a Recommender System

First, let’s talk about the theory behind Recommender Systems. There are many different approaches:

  1. Content-based: As the name states, these recommendations are based on the content of the product we are recommending. For example, if a user enjoys mystery novels, we are going to recommend more mystery novels to them.
  2. Collaborative Filtering: In this case, the user's preferences are stored (for example as ratings) and then we find similar users in terms of their preferences. That way if we know two users are alike, we can recommend a product one of them enjoyed to the other.
  3. Hybrid: A mix of the two, recommending products based on their content and also based on similar users’ preferences.

In this case, I used the Collaborative Filtering approach.

Let’s take the simplest example of the Collaborative Filtering method. Let’s say we have a matrix where rows are users and columns are books. In each cell, we have a value between 1 and 5 which represents the rating a user gave to a book. If the cell is blank it means the user didn’t read that book.

Here our goal is to predict the missing values in the matrix

What we want to do here is recommend books that people will enjoy. In other words, we should recommend a book to a person only if that person would give it a high score after reading it. But how can we know the score a person will give to a book? We need to predict it with the information we have. If this sounds like a Machine Learning task, it’s because it is.

Matrix Factorization to the rescue

The two most common techniques for a problem like this consist of decomposing the matrix into smaller matrices and then performing some extra calculations with them:

  • Singular Value Decomposition (SVD): Works with both negative and positive values and is useful for computing eigenvectors. It doesn’t handle missing values, so value imputation is needed, which might introduce extra noise. A classic application is PCA.
  • Non-negative Matrix Factorization (NMF): Works only with non-negative values. It handles sparse matrices well, since the assumption about missing values is built into the algorithm.

In our problem, we have a big sparse matrix and we need to predict the missing values. So naturally, it makes more sense to use the NMF technique.

Remember in elementary school when you learned about basic factorization? I’ll refresh your memory. It consists of finding the factors of a number such that when you multiply them, the result will be the original number. For example, if we factorize 15 we will get 3 and 5.

Now, matrix factorization follows the same principle. We need to find the factors of a matrix. The result will be two matrices such that when you multiply them, the result will be the original matrix.

We need to find the embedding matrices

The resulting matrices are called Embedding Matrices. The first one will be an embedding matrix for the Users and the second one an embedding matrix for the Books. An embedding is just a vector representation of something.

Once we have the embedding matrices, the problem is solved: we only need to store them, and whenever we want to predict a score we multiply the respective row and column.
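To make this concrete, here is a minimal sketch using scikit-learn’s NMF on a toy ratings matrix. One caveat: this implementation treats zeros as real ratings rather than masking them as missing values, which a production system would handle more carefully.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user x book rating matrix; 0 stands for "not rated"
# (a simplification -- scikit-learn's NMF reads zeros as real values).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Factorize R ~= W @ H with K = 2 latent factors.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(R)  # user embedding matrix, shape (n_users, K)
H = model.components_       # book embedding matrix, shape (K, n_books)

# Predict the score user 1 would give book 2: multiply the
# respective embedding row and column.
print(W[1] @ H[:, 2])

# Or reconstruct the whole matrix, missing cells included.
print(np.round(W @ H, 1))
```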

Now a more complex example: Data Enrichment

After understanding the User/Book example, we can apply the same concept to our business problem. But first, we need to define two things here:

Project: A CSV file (or multiple CSV files linked together), so you can simply think of it as a table: rows and columns.

Enrichment: Given a specific column, a data enrichment matches additional data according to the column type and appends it as extra columns to the original table.

So for example, a project can be a table of companies where the columns are company_name, website, address, etc. And an enrichment for the company_name can add some extra columns such as phone_number, business_owner, etc.

Companies project
Companies project after using a specific enrichment

Our “Users” will be the projects and our “Books” will be the actual data enrichments. We will recommend enrichments to a project by analyzing similar projects and seeing what worked for them.

But what should we use as the score? This is the first question we should ask ourselves. In our basic example, a human gave a book a rating between 1 and 5: a very clear way of knowing the score. But now we have projects and enrichments. Projects cannot do anything, they cannot talk, walk, or fall in love, and they certainly cannot give 5 stars to an enrichment even if it really improved their performance (selfish projects). So we created a “relevance” score: a count of how many features of an enrichment made it into a project’s best model. In other words, it measures how relevant an enrichment is to a project.

Relevance Score Table
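In code, the raw data behind this table can be pictured as a long list of (project, enrichment, relevance) triplets. Here is a tiny hypothetical example; the project and enrichment names are made up for illustration:

```python
import pandas as pd

# One row per (project, enrichment) pair that was actually tried;
# "relevance" counts how many of the enrichment's features made it
# into that project's best model. All names are hypothetical.
relevance = pd.DataFrame({
    "project":    ["churn_model", "churn_model", "lead_scoring"],
    "enrichment": ["osm_places", "firmographics", "web_traffic"],
    "relevance":  [3, 7, 5],
})
```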

Modeling

Once we have our data, we need to build our Machine Learning model. As you might suspect, this is a regression problem: we have the scores that many projects gave to many enrichments, and we want to predict the ones that are missing. There is a classic way of modeling this problem using a Neural Network, and that is what we did here:

Neural Network Model for Recommender Systems

There is one hyperparameter we need to tune: the size of the embedding (K). It could be any number, but usually a value between 20 and 50 gives good results. If we have N projects, our ProjectEmbedding layer will be of size N × K, and if we have M enrichments, our EnrichmentEmbedding layer will be of size M × K.

I built this Neural Network using TensorFlow, but you can do it with PyTorch as well; there is really no difference.
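As a rough illustration, here is what that architecture can look like in Keras. This is a minimal sketch, not the exact production model; N, M, and K below are placeholder sizes:

```python
import tensorflow as tf

N, M, K = 1000, 200, 32  # projects, enrichments, embedding size (placeholders)

project_in = tf.keras.Input(shape=(1,), name="project_id")
enrich_in = tf.keras.Input(shape=(1,), name="enrichment_id")

# One embedding table per entity; each row is a K-dimensional vector.
p_emb = tf.keras.layers.Embedding(N, K, name="ProjectEmbedding")(project_in)
e_emb = tf.keras.layers.Embedding(M, K, name="EnrichmentEmbedding")(enrich_in)

# The dot product of the two embeddings predicts the relevance score.
score = tf.keras.layers.Dot(axes=2)([p_emb, e_emb])
score = tf.keras.layers.Flatten()(score)

model = tf.keras.Model([project_in, enrich_in], score)
model.compile(optimizer="adam", loss="mse")
```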

After we create our model, we of course need to train it. I kept the training very straightforward: 100 epochs with early stopping on the validation loss to prevent overfitting.
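Continuing the sketch above, the training call might look like this; train_projects, train_enrichments, and train_scores are hypothetical arrays of integer ids and float relevance scores:

```python
# Stop once the validation loss stops improving, keeping the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

history = model.fit(
    [train_projects, train_enrichments], train_scores,
    validation_split=0.2, epochs=100, callbacks=[early_stop],
)
```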

Train and Validation MSE during training

Predicting

After training our model, the only thing remaining is to extract the Embedding Matrices (both for the projects and the enrichments) and use them for our predictions. If you are wondering what an Embedding Matrix looks like, here is one. Keep in mind it’s not very exciting; it’s just a bunch of numbers in a table. But we don’t like it because it looks cool, we like it because it’s useful for predicting scores (and thus, making recommendations).

Embedding Matrix for Enrichments
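With Keras, pulling those tables out of the trained model is a one-liner per layer. A sketch, continuing the model above and using the layer names assumed there:

```python
# Each Embedding layer stores its lookup table as its first weight array.
project_matrix = model.get_layer("ProjectEmbedding").get_weights()[0]    # (N, K)
enrich_matrix = model.get_layer("EnrichmentEmbedding").get_weights()[0]  # (M, K)
```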

We can do many things with the Embedding Matrices now. For example, we could run Principal Component Analysis (PCA) and try to cluster together Enrichments that are similar to each other. Or we can calculate the Cosine Similarity to numerically check which Enrichment is the most similar to a specific one.
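The cosine-similarity check, for instance, takes only a couple of lines; a sketch continuing with the enrich_matrix extracted above:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all enrichment embeddings.
sim = cosine_similarity(enrich_matrix)  # shape (M, M)

# The five nearest neighbours of enrichment 0, skipping itself at rank 0.
nearest = sim[0].argsort()[::-1][1:6]
```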

Here you can see I used TensorBoard to visualize both. On the PCA plot, all the OSM (OpenStreetMap) enrichments are clustered together, and their cosine similarity confirms they are very similar to each other.

PCA and Cosine Similarity on the Enrichment Embedding Matrix

As you can see, our model did a pretty good job finding similarities between enrichments that we know are supposed to be similar. Now we can be confident that it will perform well when actually recommending enrichments to a project. To do so we need to (a code sketch follows the list):

  1. Choose the Project we want to get enrichment recommendations for
  2. Find its respective vector in the Project Embedded Matrix
  3. Perform a matrix multiplication between the vector and the entire Enrichment Embedding Matrix
  4. Sort the results
  5. Choose the top 5 (or the number of recommendations you want)
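Put together, these five steps amount to a few lines of NumPy. A minimal sketch, assuming project_matrix and enrich_matrix are the embedding tables extracted earlier:

```python
import numpy as np

def recommend(project_idx, project_matrix, enrich_matrix, top_n=5):
    """Return the indices of the top_n recommended enrichments for one project."""
    # Steps 2-3: look up the project's vector and multiply it with
    # every enrichment embedding to get predicted relevance scores.
    scores = enrich_matrix @ project_matrix[project_idx]  # shape (M,)
    # Steps 4-5: sort descending and keep the best top_n.
    return np.argsort(scores)[::-1][:top_n]
```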

Conclusion

Recommender systems are incredibly useful for improving user experience and keeping users engaged, but that is not their only use. In this article, I showed how, even when recommendations are not aimed directly at users, we can apply the same concepts to solve specific business problems. With this custom implementation at Explorium, we save a lot of time, money, and effort during the platform’s data enrichment process. This can truly revolutionize the way we tackle these kinds of problems and take us to the next level of the Auto-ML world.

Yaniv Goldfrid is an expert data and machine learning developer and a member of Explorium’s data science team. Explorium offers the industry’s first automated External Data Platform for Advanced Analytics and Machine Learning. Explorium empowers data scientists and business leaders to drive decision-making by eliminating the barrier to acquiring and integrating the right external data and dramatically decreasing the time to superior predictive power. Learn more at www.explorium.ai
