In co-authorship with Egor Yurtaev
GitHub is one of the biggest software development platforms and the home for many popular open source projects. If you are curious about which projects are trending these days, you may find the Explore GitHub page useful. There you will find projects that are gaining popularity right now, repositories recently starred by the people you follow, and popular projects combined by a topic or programming language.
What you won’t find is project recommendations based on your previous activity. This is surprising because users give stars to many projects every day and this information can be easily used for producing personalized recommendations.
In this article, we would like to share our experience with building the recommender system for GitHub named GHRecommender from the idea to implementation.
Every GitHub user can give a star to a project she likes. Having the information about stars we can find similar users and recommend them to check out the projects that already got a star from users with matching tastes. And vice versa — we can find projects alike projects that a user starred and recommend them to her.
Thus, the idea can be formulated as follows: get the data about the users and stars, apply collaborative filtering algorithms to the data to build recommendations, and wrap everything into a web app in order to provide access to the recommendations for the end users.
GHTorrent is an awesome project that collects data received through the public GitHub REST API and makes it available as monthly MySQL dumps which can be found on the GHTorrent Downloads page.
Inside every dump, you can find a SQL file that describes the database schema and several files in the CSV format representing the content of the tables. As it was mentioned in the previous section, our approach was to use collaborative filtering algorithms to build recommendations. This implies we need to have the information about users and their tastes or, rephrasing it in the GitHub terms, users and projects they starred.
Luckily, the dumps have all the necessary information in the following files:
- watchers.csv contains the list of repository ids and user ids who starred them
- users.csv maps user ids from watchers.csv to user GitHub names
- projects.csv does the same for the projects.
Let’s take a look at the data we have. Here is the head of the watchers.csv file (the column names are added for convenience):
You can see that the project with the id=1 was starred by the users whose ids are 1, 2, 4, 6, 7, …
That’s a good start, but before we proceed with modeling, it is important to explore the data and clean it up if necessary.
An interesting question that immediately comes to mind is: “How many stars does a typical user give?” The histogram of the number of stars per user is shown below.
Hmm… This doesn’t look very informative. It seems like the majority of the users has a very small number of stars but some users are very active and have, approximately, 200k stars (wow!) Visualizing the dataset as a boxplot confirms this guess:
As we thought: one of the users has given more than 200k stars. Meet our hero — the user 4148. At the time we are writing the article, the page returns the 404 error. The runner-up by the number of stars has a self-explanatory name StarTheWorld and almost 46k of stars (the user’s page returns the 404 error too). There are also other outliers with more than 25k stars.
Let’s go back to the histogram and apply the logarithm transformation to the data to make the distribution of the per-user number of stars random variable more obvious:
It’s clear now that the variable follows the exponential distribution (finding its parameters can be a good exercise). The important observation is that approximately half of the users have starred less than 5 projects. This will come in useful when we start to train our model.
Let’s move to the repositories and take a quick look at the number of stars per repository. The boxplot is shown below:
Again, we have one noticeable outlier — the freeCodeCapm project with more than 200k stars!
The histogram of the log-transformed stars per repository variable given below demonstrates that the variable follows exponential distribution next to the stars per user variable but has a steeper descent. As you can see, only a fraction of the repositories has more than 10 stars:
In the end of this section, we would like to share an interesting observation. Some extremely popular repositories from the dataset have zero stars if you examine them on GitHub. We don’t believe that the data is corrupted. Instead, we suppose that this discrepancy is the result of the GitHub fight against cheat stars.
In order to understand what kind of data pre-processing should be done, we need to take a closer look at the collaborative filtering algorithm we used.
Many of the collaborative filtering algorithms are based on a user-item rating matrix factorization. Factorization allows us to infer hidden item parameters and user reaction to them. Knowing the parameters of an item the user hasn’t rated yet we can predict if she will like it based on her tastes.
In our case, we have an m × n matrix where each row represents a user and each column represents a repository. r_ij equals to 1 if the ith user has starred the jth repository.
The matrix R can be easily built using the watchers.csv file. But let’s recall that the number of stars per user is typically very small. What kind of information about the user’s tastes can we infer knowing just a few facts about she? Pretty much nothing. One simply can’t make an educated guess about your tastes knowing the only item you like.
At the same time, these “one-star” users can affect the predictive power of a model adding unnecessary noise. That’s why it’s a good idea to exclude the users we have little information about from the dataset. Our experiments showed that excluding users who rated less than 30 repositories gives reasonable results. The best thing we can do for the excluded users is to provide recommendations based on project popularity, and this is the way we did it.
Measuring Model Performance
Now let’s discuss a very important question about measuring the model performance. We have tried the following metrics:
- precision and recall
- root mean squared error (RMSE)
and find out that the precision and recall approach is not very helpful in our case.
But before we start, it’s worth mention about one very easy and effective method of measuring performance — building the recommendations for yourself and subjectively evaluating how good they are. Though it’s not a method you would like to put in your Ph.D. thesis, it makes it easy to troubleshoot problems on the early stage. For example, building recommendations for ourselves using the full dataset showed that the recommendations we got were not relevant. After removing users with a few stars we noticed a significant improvement.
Going back to the precision and recall metrics. In plain English, the precision of a binary classification model can be defined as the ratio of the number of true positive predictions to the number of the total predictions made. This can be written as:
So basically precision is how many times we hit the mark out of the number of tries.
Recall is the ratio of the number of true positive predictions to the total number of positive samples in the dataset. This can be written as:
In terms of our task, precision can be defined as the fraction of the repositories in recommendations that the user actually starred. Unfortunately, this metric doesn’t give us a lot because our goal is not to predict what a user already starred but what she likely will star. It may be the case when a project has parameters that make it a good candidate for been starred by a user and the only reason it hasn’t happened is she hasn’t seen this repository yet.
There is a modification that can make the precision metric useful in our case: if we could measure how many of recommended projects given to a user she eventually starred and divide this number by the number of recommendations, we would get the accurate model’s precision. Sadly, we don’t collect any feedback at the moment, so precision is not how we want to measure the model’s performance right now.
Recall, in respect of our task, is the ratio of a number of repositories in recommendations that a user already starred to the total number of repositories starred by the user. As in the case with precision, this metric is not that useful because the number of recommendations we query from the model is fixed and relatively small (we show 100 top recommendations) and the recall value is close to zero for users who starred many projects.
Taking into account the thoughts written above it was decided to use RMSE as the main model performance metric. In terms of our problem, RMSE can be written as:
In other words, we measure a mean squared error between the rating a user gave to a repository (0 and 1 in our case) and the predicted rating which will be close to 1 for the repositories the model thinks to suit a user well. This measure shows how closely our matrix decomposition approximates the original matrix R. Note, that w term equals 0 for all r_ij = 0.
Putting All Together — Modeling
As it was mentioned above, the recommender system we made is based on matrix factorization — more specifically on the alternating least squares algorithm (ALS). However, according to our experiments, any matrix factorization method will do the trick (SVD, NNMF).
ALS is available in many machine learning packages (see, for example, the implementation for Apache Spark). The algorithm aims to decompose the original m × n matrix to a product of two matrices of sizes m × k and n × k:
The matrices U and P represent users and projects respectively. To get the predicted rating of the jth project given by the ith user we need to multiply the ith row of U by the jth row of P. Cosine similarity of project features, in turn, can be used for finding related projects.
The parameter k defines the number of hidden project features we are trying to find. The value of k affects model performance and should be selected using cross-validation. The picture below shows the dependence of RMSE for the validation dataset on k. The value of k=12 minimizes the RMSE and was used in the final model.
The summary of the pipeline we have:
- Load the data from watchers.csv and delete all the users who have given less than 30 stars.
- Split the data into the train and validation sets.
- Choose the ALS parameter k using the RMSE metric and validation dataset.
- Train the final model on the full dataset using ALS with the found number of factors equals k.
Now, when we have the model we are happy with, it’s time to discuss how to make it accessible for end users, in other words how to build a web app around it.
Here is the list of features our backend had to provide:
- authorization through GitHub for obtaining user GitHub names
- REST API to send requests like “give me 100 recommendations for the user with the name X”
- collecting information about user subscriptions on updates
We wanted everything to be done fast and simple. According to our experience, the Django + Django REST framework stack meets our needs — tons of batteries and the good documentation.
In order to handle requests for recommendations correctly, we need to store some data received from GHTorrent. The reason is GHTorrent uses their own identifiers for users that are not the same as ones GitHub uses. So we need to keep the “user id ⭤ user GitHub name” mappings. This is also true for the repositories identifiers.
Having 20 and 64 million users and repositories respectively, we decided to try MongoDB with the new type of storage with compression to save some money on hardware.
There are two collections in MongoDB: users and projects.
The documents in users collection look as follows:
and indexed by the the field login to speed up recommendation requests handling.
The example of a document from the project collection is shown below:
As you can see, zlib compression, the method we used, makes disk usage almost 50% more effective. One of our concerns about the compression was query performance but benchmarks showed that the request processing time varies within the margin of error. More information about the influence of compression to the performance you can find here.
Summing up, MongoDB with compression gives a significant gain in the effectiveness of disk space usage. Another advantage is easiness of scaling — it will be very useful when the amount of data becomes so large that it will no longer fit on a single server.
Model in Action
There are two ways of using our model:
- Generate the recommendations for each user beforehand and store them in a database.
- Generate recommendations on demand.
The advantage of the first approach is that the model won’t be a bottleneck (the model we use can handle 30–300 recommendation requests per second). The main disadvantage is the amount of data we need to store in this case. There are 20 million of users. If we were to store 100 recommendation for each, we would end up storing almost 2 billion of records! By the way, only a fraction of these 20 million users will ever use our app, so most of this data will be useless. Last but not least — building this amount of recommendations takes time.
The second approach advantages and disadvantages are the mirrored advantages and disadvantages of the first one. But what we like about the on-demand recommendations is flexibility: we can request as many recommendations as we need and the model can be easily replaced with a new one.
Having weighed the pros and cons, we chose the second option. The model was packed into a Docker container and accessible via an RPC call.
You can get your GitHub recommendations on the GHRecommender website. Note that if you haven’t starred 30 repositories yet, you will get popular repositories as recommendations.
The GHRecommender sources are available here.