A recommender system (or simply ‘recommender’) is an algorithm that takes a large set of items and determines which of those to display to a user — think the Facebook News Feed, the Twitter timeline, Google News, or the YouTube homepage. Recommenders are necessary tools to help navigate the sheer volume of content produced each day, but their scale and rapid development can cause unintended consequences. Facebook’s algorithms have been blamed for radicalizing users, TikTok’s for inundating teens with eating-disorder videos, and Twitter’s for political bias.
Understanding how to mitigate such effects requires knowledge of how these systems work. At first glance this might seem impossible, because the algorithms used by each platform are proprietary. However, they share common principles. This post is based on public information found in company blog posts, academic papers written by platform employees, journalistic investigations and leaked documents. Each of these sources has limitations, but taken together they show repeating patterns of design and operation. While details can be scarce, the basic operation of these systems is not mysterious.
A Typical Recommender
Modern recommenders are pipelines with multiple stages. All recommenders start with the entire set of items available to be displayed — whether that’s every post from your Facebook friends, every news article published today, or every song on Spotify. This set is first filtered by moderation, where items belonging to certain undesirable categories are identified and removed. The remaining set is still intractably large, so a candidate generation algorithm selects a subset of items that are plausible candidates for recommendation. These candidates are ranked according to a primary metric, usually a measure of how likely the user is to engage with the item. The items are often partially re-ranked to improve secondary objectives such as diversity in the types of content recommended. Finally, the top items are shown to the user. This final set of items is called a slate.
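The stages described above can be sketched as a single function. This is only an illustrative outline: the function name, the stage callables, and the toy item records are all assumptions made for this example, not any platform's actual interface.

```python
from typing import Callable, Iterable

Item = dict  # hypothetical item record, e.g. {"id": 1, "banned": False, "score": 0.3}


def recommend(
    items: Iterable[Item],
    moderate: Callable[[list], list],
    generate_candidates: Callable[[list], list],
    score: Callable[[Item], float],
    rerank: Callable[[list], list],
    slate_size: int = 10,
) -> list:
    """Run the four stages in order and return the slate."""
    pool = moderate(list(items))                          # 1. remove undesirable items
    candidates = generate_candidates(pool)                # 2. narrow to plausible candidates
    ranked = sorted(candidates, key=score, reverse=True)  # 3. rank by the primary metric
    reranked = rerank(ranked)                             # 4. adjust for secondary objectives
    return reranked[:slate_size]                          # the slate shown to the user


# Toy usage with stand-in stages (all hypothetical).
toy_items = [{"id": i, "banned": i % 5 == 0, "score": i / 10} for i in range(1, 11)]
slate = recommend(
    toy_items,
    moderate=lambda xs: [x for x in xs if not x["banned"]],
    generate_candidates=lambda xs: xs[:6],
    score=lambda x: x["score"],
    rerank=lambda xs: xs,
    slate_size=3,
)
```

Passing each stage in as a callable mirrors the point that the stages are conceptually separate, even though real systems may blur the boundaries between them.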
The first stage is moderation, in which undesirable items are removed from the pool or flagged for special treatment.
The word “moderation” encompasses a complex process that determines what items are allowed on platforms, including policy-making, human content raters, automated classifiers, and an appeals process. All these steps influence what items users see, but only some operate within the core recommender pipeline. In this post, we use “moderation” to refer to the automated processes that remove items from the pool of content eligible for recommendation. Depending on the country, companies can be held liable for hosting content relating to a variety of issues such as copyright, defamation, CSAM or hate speech. Most platforms also have policies (such as those of Facebook, YouTube, or Twitter) to filter out content that is believed to cause harm, such as nudity, coordinated inauthentic behavior, or public health misinformation.
The bulk of moderation is performed by a series of automated filters designed to catch different categories of undesirable content. What happens to items caught in these filters will differ depending on the category of content and platform policy. Among other possibilities, they may be removed from the pool of items eligible for recommendation, or flagged for down-ranking in a later stage of the pipeline.
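A minimal sketch of this filter chain follows, assuming two made-up classifiers and a made-up policy table. The names `looks_like_spam`, `looks_like_clickbait`, and `FILTERS` are hypothetical, and real classifiers are machine learning models rather than string checks.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    item_id: int
    text: str
    flags: set = field(default_factory=set)  # labels carried to later stages


# Hypothetical classifiers: each returns True if the item falls in its category.
def looks_like_spam(item):
    return "buy now" in item.text.lower()


def looks_like_clickbait(item):
    return item.text.endswith("!!!")


# Hypothetical policy: what happens to items caught by each filter.
FILTERS = [
    ("spam", looks_like_spam, "remove"),
    ("clickbait", looks_like_clickbait, "flag"),
]


def moderate(items):
    """Return the items still eligible for recommendation, with flags attached."""
    eligible = []
    for item in items:
        removed = False
        for label, classifier, action in FILTERS:
            if classifier(item):
                if action == "remove":
                    removed = True  # dropped from the pool entirely
                    break
                item.flags.add(label)  # kept, but marked for down-ranking later
        if not removed:
            eligible.append(item)
    return eligible
```

The flags set is what lets a later stage down-rank an item instead of removing it.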
In the candidate generation stage, the full set of items available on the platform (potentially millions) is efficiently narrowed to a set of roughly 500 that are plausibly of interest to the user.
Some recommenders mostly choose items from people or groups a user follows. In these contexts, the candidate items are the posts that have been created by these sources since that user last logged in, along with items ranked highly in previous sessions that they haven’t yet seen. This is the case for the Twitter timeline and the Facebook News Feed.
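For follow-based feeds, candidate generation can be as simple as a filter over recent posts. The sketch below assumes a hypothetical post schema with `author` and `created_at` fields, and a separately maintained list of highly ranked items the user has not yet seen.

```python
from datetime import datetime


def follow_based_candidates(posts, followed, last_login, unseen_highly_ranked):
    """Posts from followed sources since the last login, plus carried-over items."""
    fresh = [
        p for p in posts
        if p["author"] in followed and p["created_at"] > last_login
    ]
    return fresh + list(unseen_highly_ranked)


# Toy usage (all values hypothetical).
posts = [
    {"id": 1, "author": "ann", "created_at": datetime(2024, 1, 2)},
    {"id": 2, "author": "bob", "created_at": datetime(2024, 1, 3)},
    {"id": 3, "author": "ann", "created_at": datetime(2023, 12, 30)},
]
carried_over = [{"id": 9, "author": "cat", "created_at": datetime(2023, 12, 31)}]
candidates = follow_based_candidates(
    posts, followed={"ann", "cat"}, last_login=datetime(2024, 1, 1),
    unseen_highly_ranked=carried_over,
)
```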
Other recommenders regularly show items from sources that users haven’t explicitly followed. Indeed, on some platforms (e.g. Google News, Netflix) there is no concept of “following”. In these contexts, candidates are typically chosen using a simpler, less accurate, but more computationally efficient version of the full algorithm used in the ranking stage. For example, if the ranking stage uses a large, computationally intensive model (e.g. a neural network) to predict the probability of a user engaging with the item, the candidate generation stage might work similarly, just with a much smaller model that can be applied to a larger set of items. This reduced model may have been trained to emulate the behavior of the full model used in the ranking stage as best it can. This is roughly how candidate generation is performed on the YouTube homepage and in the Explore view on Instagram.
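The two-stage idea can be illustrated with a cheap scorer that scans the whole catalog and a more expensive scorer applied only to the survivors. Everything here is an assumption for illustration: the dot product stands in for a small model, and `full_model` stands in for a large one. It is not a description of how YouTube or Instagram implement this.

```python
import heapq


def cheap_score(user_vec, item_vec):
    # Stand-in for a small model: a dot product of short embeddings.
    return sum(u * v for u, v in zip(user_vec, item_vec))


def generate_candidates(user_vec, catalog, k):
    """catalog maps item id -> embedding; returns the k ids with the best cheap score."""
    return heapq.nlargest(k, catalog, key=lambda i: cheap_score(user_vec, catalog[i]))


def rank(candidates, full_model):
    """Apply the expensive scorer only to the shortlisted candidates."""
    return sorted(candidates, key=full_model, reverse=True)
```

The cheap scorer only has to avoid discarding good items; the final ordering is left to the full model.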
In the ranking stage each item is assigned a number intended to capture the value of showing it to a particular user in a particular context. Every recommender serves multiple stakeholders, so this will be some combination of value to the user, the content provider, the platform, and society. The items are then sorted by their score from largest to smallest. We will use the term value model to describe the formula used to compute these scores, a name that is used by some platforms. Other platforms call this a scoring function.
In most platform recommenders, the value model is primarily a weighted sum of the predicted probabilities that the user will interact with the item in different ways, such as clicking, commenting, or sharing. These interactions are informally known as engagement.
For example, consider a feed or timeline on a social media platform. There are multiple ways a user can engage with an item. These include explicit inputs such as liking, commenting, and sharing, but also more implicit data such as whether a user clicks on links to specific domains and how much time they spend looking at an item (known as “dwell time”).
For any given user and any given post, the platform has a model that predicts Pr(like), Pr(comment), Pr(share) and so on, the probabilities that the user will engage with the post if it is shown to them. These probabilities are produced by machine learning models trained to predict how a particular user will interact with a particular item in a given context (see e.g. YouTube, Twitter). These models are trained on historical engagement data, and their objective is predictive accuracy.
The core value model is a weighted combination of these probabilities. In its simplest form, it is a sum in which each predicted probability is multiplied by a weight.
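As a sketch of that simplest form, using weight symbols $w$ that are notation assumed here, not taken from any platform:

$$
\text{score} = w_{\text{like}} \Pr(\text{like}) + w_{\text{comment}} \Pr(\text{comment}) + w_{\text{share}} \Pr(\text{share}) + \dots
$$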
The use of engagement terms in this value model is what is meant by the phrase “optimizing for engagement.” The weights in front of the probabilities are intended to capture the degree to which different types of engagement are valuable. The weights can be selected in a variety of ways, and may be:
- Skewed towards particular types of engagement (e.g. YouTube prioritizes watch time of videos, TikTok prioritizes retention and time spent using the app).
- Negative, if the corresponding type of engagement indicates disapproval (e.g. clicking “See Fewer Posts Like This” on Instagram).
- Personalized to each user (as in the Facebook News Feed).
- Chosen algorithmically to optimize a single overriding metric, such as retention, that isn’t optimized for directly (e.g. Google, LinkedIn).
- Regularly adjusted to respond to changes in priorities, or changes in the user interface that alter the significance of different types of engagement (as in the Instagram Explore view). For example, if the ‘like’ button in the user interface is made bigger, people are more likely to click on it and so it reduces in significance as a signal of value.
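To make the role of the weights concrete, here is a minimal sketch of an engagement-only value model. The engagement types, weights, and probabilities are invented for illustration. Note the negative weight on a disapproval signal.

```python
def value_model(probs, weights):
    """Weighted sum of predicted engagement probabilities."""
    return sum(weights[name] * probs[name] for name in weights)


# Hypothetical weights: sharing counts most, "see fewer" counts against the item.
weights = {"like": 1.0, "comment": 2.0, "share": 3.0, "see_fewer": -5.0}

# Hypothetical model outputs for one user and one post.
probs = {"like": 0.5, "comment": 0.1, "share": 0.05, "see_fewer": 0.02}

score = value_model(probs, weights)
```

Changing a weight changes the ranking without retraining any of the prediction models, which is one reason weights can be adjusted regularly.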
The value model typically includes additional terms that are not engagement predictions. That is, no real platform optimizes solely for engagement. For example, there might be additional terms added to boost or penalize items that were flagged during the moderation stage. These “integrity signals” might include probabilities that the item is low quality news or ad farm content. So the value model will combine weighted engagement predictions with weighted integrity signals.
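One way to write such an extended value model is shown below. The symbols and the two example integrity terms are assumptions for illustration; a weight on an integrity signal is negative when the signal should penalize the item.

$$
\text{score} = w_{\text{like}} \Pr(\text{like}) + w_{\text{comment}} \Pr(\text{comment}) + w_{\text{share}} \Pr(\text{share}) + \dots + w_{\text{lq}} \Pr(\text{low quality news}) + w_{\text{af}} \Pr(\text{ad farm}) + \dots
$$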
In real recommender systems the equation will be more complex, with a larger number of engagement types and integrity signals being included. But the basic structure seems to be standard practice in all the major news and social media recommenders about which information is publicly available. Scores of this sort are used to rank items in the Facebook News Feed, Twitter notifications, YouTube watch next suggestions, the Instagram Explore view, and on TikTok.
So far, this approach to ranking does not take into account the relationships between items in the ranked list: the position of each item is determined independently of the others. This can lead to the top-ranked items being too homogeneous. For example, they may all relate to the same political story, if that story causes outrage and thus high engagement.
Thus, a re-ranking stage is used to tweak the positions of the items in the ranked list to improve the quality of the recommendations in their final context. In this phase, items aren’t selected based on individual appeal, but on the appeal of the whole slate of items to the user, which depends on the overall mix. The re-ranking stage might aim to position complementary items near one another, prevent boredom by improving diversity in the item topics, promote fairness by improving diversity in the accounts presented, or counter popularity bias (the tendency of the ranking stage to unduly prioritize popular, “mainstream” items).
For example, the Instagram Explore view increases the diversity of accounts represented by adding a penalty factor to “successive posts from the same author or seed account”. YouTube has experimented with ensuring that at most n out of every m items can fall within a certain similarity distance of each other.
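A penalty of this kind can be sketched as a greedy re-ranker. This is an assumed, simplified variant and not Instagram's actual rule: each item already placed from an author multiplies the scores of that author's remaining items by `penalty`, and scores are assumed to be positive.

```python
def rerank_with_author_penalty(ranked, penalty=0.5):
    """ranked: list of (item_id, author, score) tuples with positive scores.

    Repeatedly picks the remaining item with the highest penalized score,
    where the penalty compounds with each item already placed from its author.
    """
    remaining = list(ranked)
    placed_per_author = {}
    result = []

    def adjusted(entry):
        _, author, score = entry
        return score * penalty ** placed_per_author.get(author, 0)

    while remaining:
        best = max(remaining, key=adjusted)
        remaining.remove(best)
        result.append(best)
        placed_per_author[best[1]] = placed_per_author.get(best[1], 0) + 1
    return result
```

With `penalty=1.0` the function returns the items in their original score order, so the penalty acts as a dial between pure ranking and stronger account diversity.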
The exact properties sought depend on the context. There are, for example, legitimate cases in which a user should be served similar items, such as during a breaking news event when information is continually arriving, on a music streaming service when the user has requested a particular genre, or when recommending groups a user might want to join.
Following the re-ranking stage, the top items in the re-ranked list are selected as the slate shown to the user. The ranked list of content may also be interleaved with targeted ads, which are selected by a different recommender system.
The Recommender Pipeline
The diagram below shows how items flow through a typical recommender pipeline. At left is every item eligible for recommendation. This could be a huge number of items, as on YouTube, or it could be only the posts from a user’s friends. Moderation is typically performed once, removing the same items for all users, while candidate generation, ranking, and re-ranking are personalized to a particular user and context.
Particular systems will differ in their specifics. For example, moderation (the removal of items) may be performed at multiple points throughout the pipeline, and adjacent stages may be performed in conjunction and not easily separated at the algorithmic level. But this description is accurate in its broad strokes. There is no need for recommender systems to be mysterious — they all work on the same basic principles.
Luke Thorburn was supported in part by UK Research and Innovation [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (safeandtrustedai.org), King’s College London.