What’s Right and What’s Wrong with Optimizing for Engagement
— Priyanjana Bengani, Jonathan Stray, Luke Thorburn
“Engagement” is a complicated, loaded word. Everyone from platforms to journalists to advertisers seeks it — even praises it — but the pursuit of engagement has been blamed for many social ills including addiction, polarization, radicalization, eating disorders, and misinformation. For recommender systems, optimizing for engagement means choosing content that users are likely to respond to, and most large recommender systems are built this way. While this can be a problematic strategy, user engagement often does signal genuine value, because people tend to give more attention to things that matter to them. It’s up to platform operators to balance this usefulness with the negative consequences of chasing engagement.
In this post we explore engagement in the context of recommender systems, the personalized content selection algorithms that power a wide variety of products including social media, news aggregators, music and video streaming services, and online shopping. We’ll try to unpack the advantages and disadvantages of using engagement in different ways, while also evaluating the imperfect alternatives. As we are likely to continue seeing engagement used in personalization for some time, we believe the most promising approach is to get better at differentiating between “good” and “bad” engagement.
Defining Engagement
“Engagement” has been used in software engineering, and interactive systems in general, since at least the 1980s. One 2008 review says that “successful technologies are not just usable; they engage users,” noting that an engaging system “push[es] the boundaries of user experience from merely perfunctory to pleasurable and memorable.” This is why product designers see engagement as positive, and why it is attractive for many types of systems including recommenders.
On the other hand, recommender systems are a type of media, and media organizations also strive for engagement. For journalists, “engagement happens when members of the public are responsive to newsrooms, and newsrooms are in turn responsive to members of the public.” Similarly, recommender systems that select engaging content have the advantage of being responsive to users. Measuring audience responsiveness is another matter, and a wide variety of metrics are used to represent engagement, including views, shares, time on site, comments, reactions, and bounce rate.
Engagement may be a sophisticated, holistic concept, but in practice it is usually monitored through online behavioral data. Engagement metrics can be signals of value to multiple stakeholders. Ideally, when users engage with something — an app, an article, a video — it’s because they get genuine value out of it. At the same time, businesses need engaged consumers to survive regardless of their business model: ad-supported products need to capture user attention, subscription products need to provide services that keep people coming back, and even non-profit organizations need to be able to tangibly illustrate their impact to members and funders (even the BBC optimizes for engagement). As critics from academics to journalists to governments have noted, engagement is not equivalent to value. Yet most recommender systems continue to be based on optimizing for engagement because it is a plausible proxy for user values in the absence of other easily observable data.
Before we can discuss what is good and bad about algorithmically selecting the most engaging content, we need a definition in the context of recommender systems. Putting all the above pieces together, we propose:
Engagement is a set of user behaviors, generated in the normal course of interaction with the platform, which are thought to correlate with value to the user, the platform, or other stakeholders.
This definition reflects the fact that engagement signals are chosen to be indicators of value, but won’t necessarily be fully aligned with all values of all stakeholders in all situations. It says that engagement is something users do in the normal course of use, as opposed to less common actions such as adjusting settings or taking surveys. It also implies that some signals of value can only be derived from non-behavioral data, which parallels well-known arguments from economics that a person’s preferences cannot be inferred solely from the choices they make. Finally, the definition takes a multistakeholder perspective, as recommenders simultaneously serve different groups.
Good vs. Bad Engagement
Engagement is a double-edged sword. It is necessary for economic sustainability and often indicates value, but blindly seeking engagement can also have negative consequences. This situation is somewhat analogous to “profit”: ideally, a company is profitable because customers are buying something they find genuinely valuable, but there are also deceptive or harmful ways to make money.
Optimizing for engagement can lead to showing users addictive content, which can cause self-control problems impairing other areas of one’s life. A recent study paid people not to use various apps, including recommender-driven social media, and observed long-term drops in use even after the payments ended, implying social media use is habit-forming for some people. One theory is that recommenders might be designed to choose items that provide a “little dopamine hit” like gambling or drugs, resulting in a kind of behavioral conditioning. Activities like binge-watching or doom-scrolling can mimic addictive behavior by creating unhealthy dependencies.
Optimizing for engagement might also lead to incentivizing divisive, extreme, or outrageous material. There is evidence that more extreme content leads to more engagement, and some researchers consider recommender-driven platforms angry by design. One large review of “moral contagion” found “each message is 12% more likely to be shared for each additional moral-emotional word.” Other studies have found that divisive and extreme material is more likely to drive engagement. Internal research at Facebook has found that “no matter where we draw the lines for what is allowed, as a piece of content gets close to that line, people will engage with it more on average — even when they tell us afterwards they don’t like the content.” However, platforms have intervened and implemented policies to reduce the reach of various kinds of engaging but objectionable content. While Twitter found that the mainstream political right wing “enjoyed higher amplification” than the left in six of the seven countries they studied, they also found that “far-left” and “far-right” political parties and elected officials were not algorithmically preferred (relative to a reverse-chronological feed).
A related problem is that optimizing for outrage-heavy and extreme content can leave the algorithm open to abuse by users who have a deep understanding of how the system functions and therefore can take steps to manipulate it through inauthentic activity. This type of activity has been adopted by a wide array of user types, from state actors to Instagram influencers.
If users are more likely to click on a particular type of content, most recommenders will serve them more of it. If the user then starts to prefer whatever they frequently see, this can create a feedback loop, producing “filter bubble” effects, polarization, or even radicalization that ultimately leads to violence. This possibility has been demonstrated in simulations (a toy version of the dynamic is sketched below), but observing it with real users on real platforms is difficult. For example, TikTok bots programmed to watch only sad videos were eventually recommended mostly depressive content, but it’s difficult to know from this type of experiment what happens to users in the real world. The best real-world evidence we have comes from case studies linking recommender-driven platforms with eating disorders and violent radicalization. It has also been argued that recommender systems cause society-wide polarization (as distinct from radicalization, which involves only a small number of people), but the evidence for this is mixed.
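The basic dynamic can be illustrated with a toy simulation. This is purely illustrative, not a model of any real platform: the topics, probabilities, and update rule are all assumptions. An engagement-optimizing recommender keeps showing the topic a user engages with most, and if exposure nudges the user’s preferences toward what they see, recommendations narrow over time.

```python
import random

# A toy feedback-loop simulation (illustrative only; the topics, probabilities,
# and update rule are assumptions, not a model of any real platform).

TOPICS = ["news", "sports", "music", "outrage"]
prefs = {t: 0.25 for t in TOPICS}   # the user starts out indifferent
EXPOSURE_EFFECT = 0.05              # how much seeing a topic shifts preference toward it

def recommend(prefs: dict) -> str:
    # Engagement-optimizing choice: show the topic the user is most likely to engage with.
    return max(prefs, key=prefs.get)

for _ in range(50):
    shown = recommend(prefs)
    if random.random() < prefs[shown]:       # the user engages with probability = preference
        prefs[shown] += EXPOSURE_EFFECT      # exposure nudges preference toward what was shown
        total = sum(prefs.values())
        prefs = {t: p / total for t, p in prefs.items()}  # renormalize to a distribution

print(prefs)  # typically ends up heavily concentrated on a single topic
```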
It is natural for users to react instinctively to the outrageous, the extreme, and the sensational. But a recommendation engine can misinterpret this activity, treating it as evidence that this is the content users value. Experiments have found that metrics based purely on immediate feedback signals are insufficient and do not represent long-term value to users. This problem is closely related to the difficulty of inferring preferences from behavior.
If not engagement, then…
Many users, creators, and analysts advocate for alternatives to engagement-based ranking, most notably a reverse chronological timeline. Both Twitter and Instagram moved away from a chronological feed to an algorithmic feed in 2016, which Instagram said was to prioritize “the moments we believe people will care about the most”. Because of user feedback both re-introduced the chronological timeline (Twitter in 2018, Instagram in 2022), though the default home feeds remain algorithmic.
Yet chronological feeds are not a reasonable replacement for recommender systems. First of all, they only work for platforms designed around “following” a small set of sources. A feed of all new items would be near useless for Spotify, Netflix, YouTube, Google News or Amazon. Even where chronological feeds make sense, they come with their own issues. Frequent, spammy posters have an advantage, and posts are quickly lost to time. Instagram claims that with the algorithmic feed, “people see 90 percent of posts from their ‘friends and family,’ compared to around 50 percent with the reverse chronological feed.” Instagram also found that users spend more time on the app with the algorithmic feed compared to the chronological feed, which is likely more profitable, and may or may not represent genuine value for the user. On the other hand, a Facebook experiment found that users spent more time on a chronological feed than an algorithmic feed, but clicks and shares went down and users complained of low quality content. This illustrates that there are many different kinds of engagement, and they won’t all go up or down together.
Meanwhile, Reddit relies on a slew of different ranking algorithms — “New”, “Hot”, “Top”, and “Rising” — each of which orders items slightly differently. “New” is reverse chronological, while “Hot” is the difference between upvotes and downvotes on a logarithmic scale (the first ten votes are weighted the same as the next hundred) plus a decaying bonus for recency. Voting is also a behavioral response under our definition, which means even this relatively straightforward approach is an example of engagement-driven algorithmic ranking. However, some types of engagement, such as votes and likes, come from controls designed to solicit explicit feedback, while other types, such as clicks and watch time, are more implicit. In general, both implicit and explicit engagement signals provide distinct and useful information to recommenders.
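To make that concrete, here is a sketch of the “Hot” score in Python, modeled on the formula from Reddit’s old open-source codebase; the exact constants and epoch may differ from what the site runs today.

```python
from datetime import datetime, timedelta, timezone
from math import log10

# Sketch of Reddit's "Hot" score, based on the formula in the old open-source
# codebase; constants and epoch are as published there and may not match current production.

def hot_score(ups: int, downs: int, posted_at: datetime) -> float:
    score = ups - downs
    # Logarithmic vote term: the first ten net votes count about as much as the next hundred.
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else (-1 if score < 0 else 0)
    # Recency term: newer posts get a larger bonus, so older posts sink over time.
    seconds = posted_at.timestamp() - 1134028003  # seconds since the codebase's fixed epoch
    return round(sign * order + seconds / 45000, 7)

# Example: a day-old post with +200 net votes vs. a fresh post with +20 net votes.
old = hot_score(220, 20, datetime.now(timezone.utc) - timedelta(days=1))
new = hot_score(25, 5, datetime.now(timezone.utc))
print(new > old)  # True: the recency bonus outweighs the older post's vote advantage
```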
Implicit signals can be ethically dubious when users have little control or understanding of the data being collected and how their interactions on a platform are interpreted. Many users are not aware that an algorithm exists at all. Despite the possible benefits of maximizing engagement, users might be uneasy about the specificity of recommendations, or they might be uncomfortable with the data collected or profiles created, which are not necessarily accurate representations. Instead of asking whether or not a platform optimizes for engagement, it can be more helpful to ask how explicit or implicit these engagement signals are, and provide users with more controls to allow them to tweak their online experience.
Finally, it is possible to engineer a content selection algorithm that evaluates each item based on pre-trained or hand-coded proxies for important characteristics such as credibility, informativeness, and perspective, without using data collected from users. The news aggregator The Factual ranks content without any sort of likes, clicks, or voting signals, though dropping engagement signals entirely is unlikely to work for other domains like social media and music recommendations. Still, not all optimization happens inside the algorithm: The Factual must tune its content selection by hand to create a product that keeps users coming back.
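As a rough illustration of that style of ranking, the sketch below scores items purely on content-side proxies, with no behavioral data at all. The feature names and weights are hypothetical, chosen for illustration rather than taken from The Factual.

```python
# A minimal sketch of engagement-free ranking: items are scored on content-side
# proxies rather than on user behavior. Feature names and weights are hypothetical.

QUALITY_WEIGHTS = {
    "source_credibility": 0.4,  # e.g., from a hand-maintained source ratings list
    "author_expertise":   0.2,  # e.g., the author's prior coverage of the topic
    "citation_density":   0.2,  # e.g., outbound links and quotes per paragraph
    "neutral_tone":       0.2,  # e.g., 1 minus the output of a pre-trained opinion classifier
}

def quality_score(features: dict) -> float:
    """Weighted sum of quality proxies; no clicks, likes, votes, or watch time."""
    return sum(w * features.get(name, 0.0) for name, w in QUALITY_WEIGHTS.items())

def rank(items: list[dict]) -> list[dict]:
    return sorted(items, key=lambda item: quality_score(item["features"]), reverse=True)
```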
Building On Engagement
Despite these issues, optimizing for engagement still forms the core of content recommendation on most platforms. While it can be important to offer alternatives, engagement is just too useful a signal of multi-stakeholder value to give up. Instead, it may be possible to get better at distinguishing “good” from “bad” engagement.
No real platform optimizes solely for engagement. Other factors play into content selection, including removing items that violate platform policies (often called integrity work), down-ranking certain kinds of objectionable content to minimize its spread, and implementing circuit-breakers to curtail the spread of viral content until a human reviewer can weigh in. It is also widely understood that prioritizing short-term engagement can reduce user retention in the long term. Sensationalized headlines, explicit images, or clickbait conspiracies might capture user attention, but can damage user trust or lead to unhealthy relationships with the system. The healthy use of engagement signals requires continuous product iteration, driven by ongoing assessments of how well the recommender is serving users and other stakeholders.
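One way to picture how these interventions sit alongside engagement optimization is a simplified ranking value model like the sketch below. The signal names, weights, and multipliers are illustrative assumptions, not any particular platform’s configuration.

```python
# A hedged sketch of combining engagement predictions with integrity interventions
# in a ranking value model. Signal names, weights, and multipliers are illustrative
# assumptions, not any specific platform's configuration.

ENGAGEMENT_WEIGHTS = {"p_click": 1.0, "p_share": 2.0, "p_long_watch": 3.0}

def rank_score(predictions: dict, integrity: dict) -> float:
    # Removal (integrity work): policy-violating content is never ranked.
    if integrity.get("violates_policy", False):
        return float("-inf")
    # Base score: weighted sum of predicted engagement probabilities.
    score = sum(w * predictions.get(s, 0.0) for s, w in ENGAGEMENT_WEIGHTS.items())
    # Down-ranking: reduce the reach of borderline or objectionable content.
    if integrity.get("borderline", False):
        score *= 0.5
    # Circuit-breaker: cap distribution of fast-spreading items until a human reviews them.
    if integrity.get("pending_viral_review", False):
        score = min(score, 0.1)
    return score
```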
Fundamentally, engagement is behavioral and can only imperfectly capture the emotional, cognitive, and life experiences of the user. Continuous user research including surveys, interviews, focus groups, and A/B tests can illuminate existing issues and reveal new ones. For example, Facebook’s data scientists found that the “angry” reaction was used more frequently on “problematic posts,” and was “being weaponized” by political figures. It may be useful to focus on specific subsets of users, such as users whose engagement levels have dropped off, or users exposed to a disproportionate amount of content later demoted or labeled. The insights from such research can be used to tune the content ranking value model to prioritize more reliable indicators of quality, whether behavioral or not.
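As a schematic example of what such tuning might look like, the sketch below zeroes out the weight of a signal that research links to problematic content, in the spirit of the “angry” reaction finding above. The weights and threshold are illustrative, not Facebook’s actual numbers.

```python
# Sketch of retuning a value model based on user research: if a signal (here, the
# "angry" reaction) is found to concentrate on problematic content, its weight can
# be reduced or zeroed out. Weights and threshold are illustrative only.

value_model = {"like": 1.0, "love": 1.0, "angry": 1.0, "share": 2.0, "comment": 1.5}

def retune(weights: dict, harm_correlation: dict, threshold: float = 0.5) -> dict:
    """Zero out any signal that research links strongly to problematic content."""
    return {
        signal: (0.0 if harm_correlation.get(signal, 0.0) > threshold else w)
        for signal, w in weights.items()
    }

# e.g., research finds "angry" reactions disproportionately on problematic posts:
# the "angry" weight drops to 0.0 while the other signals keep their weights.
print(retune(value_model, {"angry": 0.8}))
```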
Engagement, for all its flaws, will be with us for the foreseeable future. If a system has no engagement — no users, no new posts, no activity whatsoever — it might as well not exist. But there are choices other than “optimizing for engagement” and “not optimizing for engagement.” Careful use of engagement signals, including ongoing evaluation of what each type of engagement represents, can create better user experiences and greater value for all stakeholders, including content consumers, content creators, advertisers, the platform itself, and society at large.