Spotify’s song recommendation system: an ethical analysis

Nazem Aldroubi
Published in The Startup
Sep 18, 2019

Human-Computer Interaction (HCI) has been the center of focus for researchers since the introduction of computers into commercial markets. This field is built on observations of the ways humans interact with computers and is at the intersection of computer science, behavioral sciences, and interface design.

As artificial intelligence pervades our everyday programs, researchers at Microsoft have recognized the need for a consistent set of guidelines for analyzing HCI in AI-infused systems. These guidelines assess interfaces built on human-AI interaction along a range of characteristics, and offer foundations for clearer, human-centered interfaces that explain how decisions were made and give final authority to human users. Ultimately, this serves to evaluate AI-infused interfaces critically and to alleviate some of the risks associated with AI’s probabilistic, but not always accurate, behavior.

The 18 guidelines proposed by Microsoft researchers are divided into four interaction stages: Initially — During Interaction — When Wrong — Over Time. In this article, I will be using a guideline from each category and evaluating how Spotify’s AI song recommendation performs against them. I will be scoring the interface on a 5-point scale for each guideline (1 for clearly violated, 5 for clearly applied), and providing the reasoning behind my scores.

Initially — Make clear how well the system can do what it can do. Score: 4

Spotify does not offer an in-depth description of how its recommendation system works, but it does offer a concise explanation of what user behavior produced its recommendations. If you navigate your Spotify dashboard, you will encounter sections titled “Similar to X” or “Because you played Y”. This information can reduce users’ concerns and help them understand the interface better.

The “Similar to X” feature

During Interaction — Show contextually relevant information. Score: 5

When searching for a song title, the order of results varies from one user to another and from one location to another. Spotify combines song popularity, past preferences, and physical location to show relevant, contextually based results. For example, when searching for a song titled “Crazy”, one user could be suggested “Crazy Rap” by Afroman first, while another could get “Crazy” by Gnarls Barkley instead, depending on their different song libraries and browsing histories.

Different search outcomes depending on the context
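The blending of popularity with personal context can be sketched as a simple ranking function. This is purely illustrative: the signal names, weights, and scoring formula below are my own assumptions, not Spotify’s actual system.

```python
# Hypothetical sketch: rank search results by blending global popularity
# with per-user affinity. Weights and signals are illustrative only.
def rank_results(candidates, user_affinity, w_pop=0.4, w_user=0.6):
    """Return candidates sorted by a blended relevance score.

    candidates: list of (title, popularity) with popularity in [0, 1]
    user_affinity: dict mapping title -> affinity in [0, 1],
                   derived from the user's listening history
    """
    def score(item):
        title, popularity = item
        return w_pop * popularity + w_user * user_affinity.get(title, 0.0)
    return sorted(candidates, key=score, reverse=True)

# Two users searching "Crazy": same catalog, different histories.
catalog = [("Crazy Rap", 0.7), ("Crazy", 0.9)]
hiphop_fan = {"Crazy Rap": 0.9}
soul_fan = {"Crazy": 0.8}
print(rank_results(catalog, hiphop_fan)[0][0])  # "Crazy Rap"
print(rank_results(catalog, soul_fan)[0][0])    # "Crazy"
```

Even with identical catalogs, the personal-affinity term flips which result comes first, which is the behavior described above.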

When Wrong — Support efficient correction. Score: 3

“Discover Weekly” is a weekly playlist that the AI regenerates based on changes in the user’s listening behavior over the week. Users can like or unlike songs in it. When a song is unliked, the user is prompted to choose a reason, and the song is “banned” from the playlist until the user unblocks it. However, corrections only take effect in the following week’s playlist, based on the previous week’s preferences, which is a relatively long, inefficient correction cycle.

Unliking a song in Discover Weekly
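The “ban” behavior amounts to filtering unliked songs out of the next regeneration. The sketch below is an assumption about the mechanism, not Spotify’s implementation; it also shows why the correction feels slow — the filter is only applied when the playlist is rebuilt, once a week.

```python
# Illustrative sketch: banned songs are excluded from the next weekly
# regeneration until the user unblocks them. Names are hypothetical.
def regenerate_playlist(ranked_candidates, banned, size=30):
    """Build next week's playlist from ranked candidates, skipping banned songs."""
    return [song for song in ranked_candidates if song not in banned][:size]

candidates = ["song_a", "song_b", "song_c", "song_d"]
banned = {"song_b"}  # unliked this week; only takes effect at the next rebuild
print(regenerate_playlist(candidates, banned, size=3))  # ['song_a', 'song_c', 'song_d']
```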

Over Time — Remember recent interactions. Score: 5

Spotify offers various sections of its dashboard to remind users of their recent interactions with the interface. This feature is divided into two subsections: “Recently Played”, which displays recently played artists, playlists, and radio stations, and “Jump Back In”, which reminds the user of older interactions they may want to return to. I personally think the interface performs phenomenally in this regard, since it helps the user remember both recent and older interactions they might value.

Jump Back In feature (left), Recently Played feature (right)

To further explain how to apply these standards to human-AI interaction, and to contrast interfaces that apply and violate them, I will choose a different guideline from each of the four interaction stages. For each guideline, I will pick one interface that would score a 5 (clearly applied) and another that would score a 1 (clearly violated), and provide a justification for each.

Initially — Make clear what the system can do

Score: 5 — Apple’s Health App: The interface displays the parameters it uses (step count, screen time, etc.), and explains how they’re used to measure your overall health (physical activity, sleep quantity, etc.). This helps the user understand what the AI system can do and how it does it.

Score: 1 — Gmail Autocomplete: Google recently and silently introduced a handful of AI features to Gmail. Users were not notified of these changes or how they work; instead, they were surprised by auto-reply and autocomplete suggestions in their emails. While the feature is often helpful, the user has no idea where it comes from or how it obtains its information.

Apple Health App (left), Gmail Autocomplete (right)

During Interaction — Match relevant social norms

Score: 5 — Text Autocomplete: Autocomplete suggests words based on the user’s previous texting history and is usually neutral in its suggestions, its only bias being toward words the user has frequently used before. This keeps the user experience largely predictable given the user’s social and cultural context.

Score: 1 — Snapchat Ads: These are mostly based on the user’s browsing history and weigh recent interactions heavily, which results in many irrelevant ads. Many times, I have found myself shown ads tied to a single article I had recently browsed out of curiosity, not something I wanted covering my feed.

Apple Text Autocomplete (left), Snapchat Advertisements (right)

When Wrong — Scope services when in doubt

Score: 5 — Amazon Alexa: Alexa understandably performs better with clearer, louder, and more fluent English voices. However, whenever it is uncertain about what the user said, it provides 4–5 options for the user to choose from instead of committing to one, which reduces ambiguity and increases accuracy.

Score: 1 — Voice Texting: The AI here directly presents the words or phrases it thought you said. This becomes a problem for non-native, or even quiet, voices, since the system is not particularly accurate with them. Some systems allow users to edit the result, but they still jump directly to a single transcription without offering alternatives when in doubt.

Amazon Alexa options (left), Voice Texting (right)
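The contrast between the two systems comes down to what happens below a confidence threshold. The sketch below is a hypothetical model of that decision, with made-up thresholds and option counts; neither system’s internals are public.

```python
# Hypothetical sketch of "scope services when in doubt": when the top
# speech hypothesis falls below a confidence threshold, offer several
# candidates instead of committing to one.
def interpret(hypotheses, threshold=0.8, max_options=5):
    """hypotheses: list of (transcript, confidence), sorted by confidence."""
    top_text, top_conf = hypotheses[0]
    if top_conf >= threshold:
        # Confident enough: behave like voice texting and commit directly.
        return {"action": "commit", "choice": top_text}
    # In doubt: behave like Alexa and present alternatives to the user.
    return {"action": "ask", "options": [text for text, _ in hypotheses[:max_options]]}

print(interpret([("play jazz", 0.95)]))
print(interpret([("play jazz", 0.5), ("play chess", 0.3)]))
```

Voice texting effectively always takes the “commit” branch; the guideline asks that the “ask” branch exist at all.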

Over Time — Learn from user behavior

Score: 5 — Instagram Explore feature: The “Explore” tab contains posts and videos relevant to the user’s browsing, following, and liking history. The feature is very dynamic, in that you can sense it changing topics in correspondence with your changing interests. You can also swipe down at any time to refresh the page with newer posts that may fit your interactions better.

Score: 1 — Uber’s destination autofill: This feature gives users autofill suggestions based on their previous rides. However, the order of these suggestions is merely chronological and does not correspond to the user’s most visited locations, which arguably makes for a less personalized user experience.

Instagram Explore tab (left), Uber’s destination autofill (right)
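The difference between the two orderings is easy to make concrete. The sketch below is an illustrative comparison (the function names and ride data are mine, not Uber’s): a regular commuter’s most visited places can be buried behind a one-off trip under purely chronological ordering.

```python
# Sketch contrasting the two orderings: purely chronological (most recent
# destination first) versus frequency-weighted (most visited first).
from collections import Counter

def chronological(ride_history):
    """Most recent destinations first, deduplicated."""
    seen, out = set(), []
    for dest in reversed(ride_history):
        if dest not in seen:
            seen.add(dest)
            out.append(dest)
    return out

def by_frequency(ride_history):
    """Most frequently visited destinations first."""
    return [dest for dest, _ in Counter(ride_history).most_common()]

rides = ["home", "work", "home", "work", "home", "airport"]
print(chronological(rides))  # ['airport', 'home', 'work']
print(by_frequency(rides))   # ['home', 'work', 'airport']
```

A single airport trip jumps to the top of the chronological list even though “home” dominates the rider’s actual behavior, which is the shortcoming the score reflects.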
