In 2020, the Machine Learning team at Hootsuite partnered with the Amplify team to create the Post Recommendation feature: a system that recommends social media posts to share based on the content a user has shared in the past. The goal was to increase the amount of content shared through Amplify and make it easier for our users to find engaging content. This post outlines the techniques used, the system architecture, and the feature's performance in production. Figure 1 below shows a screenshot of the user-facing design of the recommendation section.
Amplify is a Hootsuite product that allows companies to increase the reach of their content on social media by sharing it through the accounts of their employees. Employees have access to a dashboard of the company's content and can choose to share it to their own networks. One issue the Amplify team is trying to solve is that some companies produce a lot of content that may or may not be relevant to each user's tastes. You wouldn't want to share a post about, for example, stock information if your personal network expects you to always be posting articles about awesome rock bands such as Led Zeppelin or Foo Fighters.
A popular technique for content recommendation is collaborative filtering. It can be summarized with the following example: user A shares post 1, and user B shares posts 1 and 2. Since users A and B both shared post 1, it can be assumed that they have similar tastes, so it would make sense to recommend post 2 to user A. Real implementations of collaborative filtering are a little more sophisticated, involving different ways of scoring each post and of measuring how similar users are based on their posting habits, but that is the general gist.
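The toy example above can be sketched in a few lines of code. This is purely illustrative (the user and post names are hypothetical, and "similar" here just means "shares at least one post in common"):

```python
# Toy illustration of collaborative filtering (hypothetical data).
# Users who have shared a post in common are treated as similar.
shared = {
    "user_a": {"post_1"},
    "user_b": {"post_1", "post_2"},
}

def recommend(target, shared):
    """Recommend posts shared by similar users that `target` hasn't shared."""
    recs = set()
    for user, posts in shared.items():
        if user == target:
            continue
        # Any overlap in shared posts makes the two users "similar" here.
        if shared[target] & posts:
            recs |= posts - shared[target]
    return recs

print(recommend("user_a", shared))  # {'post_2'}
```

Real systems replace the set overlap with a proper similarity score and weight each candidate post accordingly.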
The specific type of collaborative filtering we chose for this project is Singular Value Decomposition (SVD). SVD gained popularity after it was used to win the $1 million Netflix recommendation algorithm competition. To read more about that competition and how this algorithm was adapted for collaborative filtering, check out this blog post. The implementation we used is from the surprise library.
Let's dive a little into how SVD works. At its core, SVD relies on the ratings users supply for items to predict how a user would rate a new item. Making a recommendation is then as simple as predicting how a user would rate all the items they have not interacted with (in this case, shared) and sorting that list to return to the user. To make this prediction, we first construct a matrix where each row represents a user and each column represents an item. The elements of this matrix are the ratings users have given to items. The diagram in Figure 2 below shows the rating matrix on the left side of the equation.
Through some linear algebra, this matrix can be decomposed into a user embedding matrix and a post embedding matrix. If you would like to see the details of how this works, please refer to this excellent article. What is important to understand is that this decomposition can take place even when values are missing from the ratings matrix. This means that, using the user and post embedding matrices obtained from the decomposition, a predicted rating for each user-item pair can be computed by multiplying the corresponding user row and post column together. Thus the missing values in the ratings matrix can be filled in. All that remains is to filter out posts that have already been shared and present the posts with the highest predicted ratings to the user as recommendations. This is how a basic recommendation system works!
The Amplify product does not currently have a way for users to rate the content they share. This presents a problem: in order to decompose the ratings matrix into a user matrix and a post matrix, you need some initial values filled in (analogous to training data in supervised problems). In fact, the more you have initially, the better you should expect the system to perform (though you can't have them all filled in, or there would be nothing for the system to predict). However, this problem can be overcome by using implicit ratings instead of explicit ratings.
So what is an implicit versus an explicit rating? Explicit ratings, like a 5-star rating given to items on Amazon, are provided intentionally by users. Implicit ratings are inferred from users' actions: for example, on Netflix, whether they watched a movie and for how long.
In our system, we use a read event and a share event from the user to form our implicit rating. We also hypothesize that incorporating whether the user has customized the post by adding their own text would improve our results; however, this has not yet been implemented.
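Combining read and share events into a single implicit rating might look like the following sketch. The event shape and the weights are assumptions for illustration, not our production values:

```python
# Sketch: derive an implicit rating per (user, post) pair from events.
# The weights below are hypothetical, not Hootsuite's actual values.
READ_WEIGHT = 0.3
SHARE_WEIGHT = 0.7

def implicit_rating(read: bool, shared: bool) -> float:
    """Combine read and share signals into a rating in [0, 1]."""
    return READ_WEIGHT * read + SHARE_WEIGHT * shared

events = [
    {"user": "a", "post": "p1", "read": True, "shared": True},
    {"user": "a", "post": "p2", "read": True, "shared": False},
    {"user": "b", "post": "p1", "read": False, "shared": True},  # shared without reading
]

ratings = {(e["user"], e["post"]): implicit_rating(e["read"], e["shared"])
           for e in events}
print(ratings)
```

These derived ratings are what fill the initial values of the ratings matrix described earlier.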
As a point of interest, we found that a large portion of our users share posts without reading them! See the pie graph in Figure 3.
Hopefully our recommendation engine will encourage users to read the posts before sharing them!
The system architecture for our recommendation engine is shown in Figure 4.
This is a similar design to previous systems our team has built (the Suggested Tag Service and the Suggested Reply Service), with the new addition of hosting the prediction endpoint on Kubernetes. Previously, prediction endpoints were hosted on AWS SageMaker; however, we decided to move off of it because it was restrictive and did not let us integrate with the tooling that Hootsuite develops for general use by our dev teams.
The flow is as follows: a user shares or reads an article on Amplify. This emits an event to our Kafka stream, which is ingested into S3 by an ingestion service. The data is processed and augmented by a Spark job that calls an external ID service to add necessary information, then written back to S3 as processed data. We then use AWS SageMaker to run a training job that takes this processed data and produces a model. The model is deployed in a Flask-based Kubernetes service from which the front end asks for predictions!
As Hootsuite is an international company, not all of our users share posts in the same language; in fact, some of the companies that are our customers post in multiple languages. We don't want our recommendation system to recommend posts outside a user's language. Thus, we use a Python language detection library (currently langdetect, though we are moving to fastText because it is, as its name suggests, fast) to build maps from post ids to their languages. We then only suggest posts in the same language as the posts a given user has shared or read in the past.
This section is a bit of a slog, but it is important when you are building a recommendation engine. Measuring a recommendation system's performance using accuracy is something of an anti-pattern: if you use accuracy, you assume the best system is one that recommends posts a user definitely would have shared. What is more desirable is to recommend posts that a user would not have thought of, but in which they still genuinely find value. Thus our team looked into a few different metrics.
Diversity — measures how narrow or wide the spectrum of recommended items is for a single user. A recommender that only recommends posts on a single topic is pretty narrow; one that recommends across multiple topics is more diverse.
Coverage — reflects the degree to which the generated recommendations cover the catalog of available items; wider coverage increases users' satisfaction. Coverage is not defined at the level of an individual user, but rather at the level of the system.
High diversity does not imply high coverage: if different users are all recommended the same diverse set of items, the average diversity of the system will be high, but the coverage will remain low. Diversity is measured per user's recommendations, while coverage is a system-level measurement.
Novelty — measures how new, original, or unusual the recommendations are for the user. In general, recommendations will mostly consist of popular items because (i) popular items have more data and (ii) popular items do well in offline and online evaluations.
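Catalog coverage, the metric we ended up tracking, is simple to compute. A minimal sketch with hypothetical data (which also shows why identical diverse lists for every user leave coverage low):

```python
# Sketch: catalog coverage — the fraction of available posts that appear
# in at least one user's recommendation list. Data is hypothetical.
def coverage(recommendations, catalog):
    """recommendations: dict mapping user -> list of recommended post ids."""
    recommended = set()
    for recs in recommendations.values():
        recommended.update(recs)
    return len(recommended & set(catalog)) / len(catalog)

recs = {
    "user_a": ["p1", "p2"],
    "user_b": ["p1", "p2"],  # same items for everyone -> low coverage
}
catalog = ["p1", "p2", "p3", "p4"]
print(coverage(recs, catalog))  # 0.5
```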
In the end, we chose to add coverage as a metric that we track and potentially optimize for. We dropped diversity since its computation takes too long to be practical in production, and we dropped novelty because, after discussing with our stakeholders, it was not a trait they desired.
We tried two methods to boost the aggregated coverage score. The first forces a random post into each member's recommendations. The randomness is controlled by a seed value and an accuracy threshold (how confident the model is in its predictions). A hyper-parameter search suggested that a threshold of 0.5 and a randomness seed of 42 yields an aggregated coverage score of around 0.77, which seemed reasonable for production.
The second method uses a formula to re-rank the list of recommended posts per member, giving more weight to posts outside the top_k recommendations and less weight to posts within the top_k. This method also provides a few tunable hyper-parameters; however, after experimentation, the results were not ideal and took a fairly long time to compute.
Therefore, we chose the first method of using a randomly inserted post to boost the recommendation coverage.
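The mechanics of this random insertion might look like the following sketch. The function, its arguments, and the confidence handling are assumptions for illustration; only the 0.5 threshold and the seed of 42 come from the text above:

```python
# Sketch: when the model's confidence in a member's lowest-ranked
# recommendation falls below a threshold, swap in a random post from
# outside their list to widen coverage. Names here are hypothetical.
import random

def inject_random_post(recs, catalog, confidence, threshold=0.5, seed=42):
    """recs: user -> posts sorted by descending model confidence.
    confidence: (user, post) -> model confidence in that recommendation."""
    rng = random.Random(seed)
    boosted = {}
    for user, posts in recs.items():
        posts = list(posts)
        outside = [p for p in catalog if p not in posts]
        if posts and outside and confidence.get((user, posts[-1]), 0.0) < threshold:
            posts[-1] = rng.choice(outside)
        boosted[user] = posts
    return boosted

recs = {"u1": ["p1", "p2"]}
confidence = {("u1", "p1"): 0.9, ("u1", "p2"): 0.2}
print(inject_random_post(recs, ["p1", "p2", "p3"], confidence))  # {'u1': ['p1', 'p3']}
```

Fixing the seed keeps the injection reproducible across runs, which makes the coverage score stable enough to compare during hyper-parameter search.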
System Monitoring and Detecting Crashes
In order to detect when performance is dropping, or when our inference service is not returning predictions, we built dashboards using Grafana. This allows us to detect when the system is returning values too slowly, or returning errors. We can then be alerted via Slack so we can take quick action.
The system retrains on a daily cadence. The ingestion service constantly streams data into the training buckets, so if we wanted to, we could decrease the cadence to an hour, which is the time it takes to train a new model in SageMaker. However, a daily retrain was deemed acceptable.
To measure how well this system works in production, we use a standard acceptance rate: how often a user shares a post when they see a recommendation from our system. This metric is shown in Figure 5, a graph we made using Mixpanel.
On average, recommended posts are shared roughly 28% of the time, which our team is fairly happy with. To improve results, we are looking at using Bayesian optimization on the model parameters and at incorporating different signals into our calculation of each user's implicit rating. We have also looked into A/B tests where we present a different number of suggestions for each variant. Currently the system decides how many suggestions to present based on its confidence level; however, drilling into our results with Mixpanel, we can see that users presented with only three recommendations have a higher success rate. This suggests an experiment where a random sub-sample of users is presented with a fixed number of recommendations. The results from this Mixpanel analysis are shown in Table 1.
Recommendation systems are ubiquitous on the internet. They help increase customers' usage of your business because they help customers find what they are looking for. Hopefully this blog post will help you if you are thinking of implementing your own.
With that being said, I would like to thank my teammates on the Machine Learning Team for working hard to deliver the Suggested Post project (Aman Bhatia, Imtiaz Jadavji, Celine Liu, Anthony Donohoe, Michael Xian, and Ana Ramdas), as well as the Amplify team.
If you have any questions about this project, please feel free to reach out to me on LinkedIn.