How I Re-Created Duolingo’s Famous Notification Algorithm

Jacob Mazurkiewicz
13 min read · Apr 26, 2024


Table of Contents

  1. Multi-Armed Bandits
  2. Recovering Difference SoftMax Algorithm
  3. Training
  4. Testing
  5. Results
  6. Conclusion
  7. GitHub Code Implementation

As a user of Duolingo for over 5 years, I’ve gotten pretty used to their unique notification system that refuses to let me give up learning French (my pride is 100% wrapped up in my streak).

Whenever life gets busy and I decide that I can’t squeeze in a lesson, Duolingo reminds me at exactly the right time with motivation to get me off my butt. In fact, their algorithm works so well it’s become an internet meme.

Source: thelanguagenerds.com

Little did I know about the intricate algorithm behind the scenes, strategically pulling the right levers to send me notifications that Duolingo KNEW would motivate me to complete my daily lesson.

Discovering this left me initially shocked, yet deeply impressed by the meticulous thought and consideration poured into the seemingly minor act of sending app notifications.

I needed to discover how this magic, or rather science, worked. Join me on my quest to unravel the secrets of this ingenious computer science feat, as I delve deep into my implementation of the famous (infamous?) Duolingo notification algorithm that has successfully guilt-tripped me into keeping up with my daily lessons for years.

Multi-Armed Bandits

To understand how Duolingo sends out notifications, it is first important to grasp a famous computer science problem called the multi-armed bandit problem.

In this scenario, a gambler must decide which of several non-identical slot machines (arms) to play in a sequence of trials so as to maximize his reward [1].

The difficulty comes from a tradeoff known as exploration vs. exploitation: how much should you explore the different slot machines to find the best one, versus repeatedly play the best one you’ve found so far to maximize your earnings?

Exploring all the slot machines means minimizing short-term rewards, but exploiting early and often can mean you never find the best one.
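To make the tradeoff concrete, here is a toy sketch of one classic strategy, epsilon-greedy, which explores a random machine with probability epsilon and otherwise exploits the best machine found so far. This is my own illustrative example (the payout probabilities are made up) and not part of Duolingo’s algorithm:

import random

def epsilon_greedy(true_means, rounds=10_000, epsilon=0.1):
    """Toy epsilon-greedy bandit: explore a random arm with probability
    epsilon, otherwise exploit the arm with the best observed average."""
    counts = [0] * len(true_means)
    totals = [0.0] * len(true_means)
    total_reward = 0.0
    for _ in range(rounds):
        if random.random() < epsilon:
            arm = random.randrange(len(true_means))  # explore
        else:
            averages = [t / c if c else 0.0 for t, c in zip(totals, counts)]
            arm = averages.index(max(averages))      # exploit
        reward = 1.0 if random.random() < true_means[arm] else 0.0  # Bernoulli payout
        counts[arm] += 1
        totals[arm] += reward
        total_reward += reward
    return total_reward

# Three hypothetical slot machines with different payout rates
print(epsilon_greedy([0.2, 0.5, 0.7]))

With a small epsilon, most rounds exploit the current best machine while still occasionally sampling the others.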

Duolingo’s notifications represent a multi-armed bandit problem, where each type of notification is like a different arm, aiming not at gambling success but at prompting a user to complete a lesson within 2 hours of receiving the notification.

Duolingo’s challenge is determining which notification will most likely prompt a user to complete their lesson. However, they encounter unique challenges that complicate the classic bandit problem.

Sleeping Arms

In the original problem, all slot machines are available to the gambler at every round, but this is not the case for Duolingo notifications. Certain templates are available to users only under certain conditions.

For example, the notification “You’re on fire! Continue your 109-day Spanish streak on Duolingo” is only available to users who have an active streak of 3+ days [2].

An example of a Duolingo notification. Source: Duolingo

Because there is a unique set of notifications available to each user at a given time, figuring out which notification is universally the best proves more challenging.

At any given round, the best notification to send may simply not be an option, requiring the algorithm to adjust to these fluctuations. Yet, that’s not the only complication.

Recovering Arms

Have you ever noticed how Duolingo switches up the notifications it sends to you day after day? This is intentional, as sending the same notification over and over again can desensitize the user to it (and is frankly annoying).

So, even if the best notification is identified, it’s not necessarily a good idea to use it over and over again. Duolingo researchers Kevin Yancey and Burr Settles refer to this as a “recovering arm” phenomenon, where new templates, even if they aren’t actually better in the long run, boost short-term conversion rates.

The reason is a novelty bias: new, fresh notifications are more likely to spur a user to complete their lesson.

To address this in the algorithm, Duolingo implements a recency penalty that decreases the probability of a notification being sent if it was recently sent to the same user.

This way, they can keep notifications fresh and abstain from repetitive, spammy sending.

Now, enough background, let’s get into the algorithm.

Recovering Difference SoftMax Algorithm

Duolingo researchers dub their solution to these issues the Recovering Difference SoftMax Algorithm, which is a mouthful, so I’ll refer to it as the RDSA for short.

The RDSA works by assigning a “score” to each notification in every round (where a round is when a notification is sent to a user), selecting the notification with the highest likelihood of prompting a lesson completion, and then updating the scores based on the notification’s performance history. Its goal is to maximize the reward function by picking the most effective notification for each user. This reward function assigns a value of 1 if the user completes the lesson and 0 if they do not.

The initial score of a notification is calculated using the following formulas:

Mu Plus calculation

Where:

  1. μ̅+ is the average reward of a notification when it is the one sent to a user.
  2. t is the index of a round.
  3. Ha is the set of all historical rounds where the notification was eligible to be sent.
  4. at is the arm selected at round t.
  5. wt is the probability of the arm being chosen.
  6. rt is the reward function (either 0 or 1).
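Putting those definitions together, the formula is an importance-weighted average of the rewards over the rounds where the arm was actually sent. In LaTeX form (my reconstruction from these definitions, so treat the exact weighting as approximate):

\bar{\mu}^{+}_{a} = \frac{\sum_{t \in H_a} \mathbb{1}[a_t = a]\, r_t / w_t}{\sum_{t \in H_a} \mathbb{1}[a_t = a] / w_t}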

This gives us an approximation of how well notifications perform when they are selected.

However, we also need to account for how well other notifications perform when a template is eligible but not sent. In essence, do other arms perform better when a given arm is not chosen? To calculate this, we perform the following calculation:

Mu Minus calculation

Where:

  1. μ̅- is the average reward when a notification is eligible but not sent.
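This is the same importance-weighted average, restricted to rounds where the arm was eligible but a different arm was sent (again my approximate reconstruction from the definitions above):

\bar{\mu}^{-}_{a} = \frac{\sum_{t \in H_a} \mathbb{1}[a_t \neq a]\, r_t / w_t}{\sum_{t \in H_a} \mathbb{1}[a_t \neq a] / w_t}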

We can calculate the initial score for each notification by taking the relative difference of these two values:

Relative Difference Score Calculation

Where:

  1. S̅a is the score for notification a.
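Since the score is a relative difference of the two averages, it can be written as:

\bar{S}_{a} = \frac{\bar{\mu}^{+}_{a} - \bar{\mu}^{-}_{a}}{\bar{\mu}^{-}_{a}}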

Through this, Duolingo avoids bias towards notifications that historically tend to be eligible more often. For instance, notifications are more likely to be selected in rounds with fewer eligible templates. This calculation normalizes the data, providing an accurate assessment of each notification’s effectiveness in every round.

Recency Penalty

You might be wondering about that recency penalty discussed earlier and how we apply it to the score of an arm. Fortunately, the data contains a history of notifications sent to the user dating back up to 28 days.

With this history, we can penalize the notifications that have been recently sent, ensuring that the algorithm favors fresher templates for a given user.

The calculation is based on exponential decay, which has been found to mimic human memory [3]. Taking into account this recency effect, a new score for each notification can be calculated as follows:

New Score Calculated from Recency Penalty

Where:

  1. s*a,t is the new score of notification a at round t.
  2. da,t is the number of days since notification a was last sent to the user as of round t.
  3. γ and h are hyperparameters (determined to be 0.0017 and 15 respectively through grid searches and research into how long it takes for memory to fade) [2].
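With h acting as a half-life in days, the penalized score works out to roughly the following (my reconstruction of the decay formula from the description above):

s^{*}_{a,t} = \bar{S}_{a} - \gamma \cdot 2^{-d_{a,t}/h}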

By adjusting the scores of each arm depending on how recently they were last used, Duolingo can overcome the hurdle of spamming a user with the same notification.

Choosing an Arm

To select a notification for a given round, a SoftMax function is applied to the arm scores to get a probability distribution of all the eligible notifications. This new distribution is defined as a policy, denoted as follows:

New Policy Calculation

Where:

  1. π(a | t) is the probability of choosing an arm in the given round under the new policy.
  2. τ is a hyperparameter that controls the exploration factor of the algorithm (higher values mean more exploration).
  3. At is the set of all eligible notifications in a round.
  4. a is the notification being scored and a’ ranges over all eligible notifications in the round.
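In other words, each eligible notification’s probability is its exponentiated score divided by the sum over all eligible notifications:

\pi(a \mid t) = \frac{\exp(s^{*}_{a,t} / \tau)}{\sum_{a' \in A_t} \exp(s^{*}_{a',t} / \tau)}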

In each round, a notification is chosen based on this distribution. After observing whether the user completes their lesson, new scores for the arms are calculated, and the distribution is adjusted accordingly. Gradually, the most effective notifications will gain a higher probability in this policy, ultimately maximizing rewards.
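As a quick illustration, sampling an arm from such a policy in Python might look like this (my own sketch; the template names, scores, and τ value are hypothetical):

import math
import random

def choose_notification(scores, tau=0.1):
    """Sample one eligible notification from a softmax over its scores."""
    arms = list(scores)
    weights = [math.exp(scores[a] / tau) for a in arms]
    total = sum(weights)
    probs = [w / total for w in weights]  # the policy pi(a | t)
    return random.choices(arms, weights=probs, k=1)[0]

# Hypothetical recency-adjusted scores for three eligible templates
print(choose_notification({'streak': 0.04, 'reminder': 0.01, 'passive_aggressive': 0.02}))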

So, now that we covered the basics of the RDSA, let’s dive into how I implemented my own version!

Training (Learning Notification Scores)

In my implementation, I only re-created the offline experiments, as I did not have access to real-time feedback loops to deploy the algorithm. As a result, there are mathematical techniques, such as empirical Bayes estimation (used to control for small sample sizes), that Duolingo applied in their online experiments but that I did not need for an offline evaluation dataset.

The first phase was learning all the arm scores from the training data. To do this, I needed to read in the data provided by Duolingo.

The data arrived in .tar.gz format, a compressed archive format similar to zip that allows large files to be stored efficiently.

And this data was BIG, with over 200 million rows of notifications sent to users!

I first extracted the files on Windows 10 using the following command line prompts:

tar -xvzf filename.tar.gz

tar -xvzf C:\PATH\TO\SOURCE\filename.tar.gz -C C:\PATH\TO\DESTINATION

They were then extracted to Parquet files stored in my desired directory. Parquet is an open-source file format for columnar storage of large and complex datasets, known for its high-performance data compression and encoding support [4].

To convert the Parquet files into Pandas data frames in Python, I created the following class:

import pandas as pd
import pyarrow.parquet as pq


class DataReader:
    """This class reads in a parquet file and returns a formatted Pandas dataframe"""

    def __init__(self, file_path, chunk_size: int, max_elms: int):
        self.df = pd.DataFrame()
        self.chunk_size = chunk_size
        self.max_elms = max_elms
        self.parquet_file = pq.ParquetFile(file_path)

    def read(self):
        """This function converts parquet row batches into a single pandas dataframe"""
        counter = 0
        for batch in self.parquet_file.iter_batches(batch_size=self.chunk_size):
            self.df = pd.concat([self.df, batch.to_pandas()], axis=0)
            counter += self.chunk_size

            if counter > self.max_elms:
                break

        # Changing the completion column to 0 or 1 for the reward function
        mapper = {True: 1, False: 0}
        self.df['session_end_completed'] = self.df['session_end_completed'].map(mapper)

        return self.df.reset_index()

The class allowed me to work with parquet data within Python. With the data now accessible, I began my implementation.

Arm Score Algorithm

To calculate the arm scores, I one-hot encoded the “eligible_templates” column (containing a list of all eligible notifications in a round) using the MultiLabelBinarizer() from scikit-learn and converted the “session_end_completed” column (whether the user completed a lesson within 2 hours) to a binary 0 or 1 value to represent the reward function. This allowed me to more efficiently perform sums and searches.

from sklearn.preprocessing import MultiLabelBinarizer

# Setting data frame equal to the passed argument
df = train_df

# One-hot encoding the 'eligible_templates' column, which contains lists of templates
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('eligible_templates')),
                          columns=mlb.classes_,
                          index=df.index))

I also created the following dictionaries to perform my calculations:

  1. A dictionary to house indices of rounds where a notification was selected (for mu plus).
  2. A dictionary to house indices of rounds where a notification was eligible but not selected (for mu minus).
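Roughly, those two lookup dictionaries can be built from the one-hot eligibility columns and the selected template column (a simplified sketch; the dictionary names are my own):

# Indices of rounds where each arm was the one actually sent (for mu plus),
# and rounds where it was eligible but a different arm was sent (for mu minus)
selected_rounds = {}
eligible_not_selected_rounds = {}

for arm in unique_arms:
    eligible_mask = df[arm] == 1                       # one-hot eligibility column
    selected_mask = df['selected_template'] == arm
    selected_rounds[arm] = df.index[eligible_mask & selected_mask]
    eligible_not_selected_rounds[arm] = df.index[eligible_mask & ~selected_mask]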

Using these data structures, I was able to calculate mu plus for each notification by counting the rows where it was selected and the reward was 1, then dividing by the total number of historical rounds where it was selected.

for i in unique_arms:
    # Calculating mu plus for each arm: rounds with a completed lesson (reward of 1)
    # divided by all historical rounds where this arm was the one sent
    arm = df[df['selected_template'] == i]
    average_selected_reward = len(arm[arm['session_end_completed'] == 1])
    length_hist_rounds = len(arm)

    mu_plus = average_selected_reward / length_hist_rounds

I was also able to compute mu minus for all the rounds where a notification was eligible but not sent through the same method.

From these variables I could calculate the score for each notification and store it in a new arm score dictionary.
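Reusing the index dictionaries from the sketch above, mu minus and the relative-difference score follow the same pattern (again a simplified sketch of my approach, with the scores stored in an arm score dictionary keyed by template):

arm_score = {}
for arm_name in unique_arms:
    # Rounds where the arm was the one sent
    sent = df.loc[selected_rounds[arm_name]]
    mu_plus = (sent['session_end_completed'] == 1).sum() / len(sent)

    # Rounds where the arm was eligible but a different template was sent
    not_sent = df.loc[eligible_not_selected_rounds[arm_name]]
    mu_minus = (not_sent['session_end_completed'] == 1).sum() / len(not_sent)

    # Relative difference score for the arm
    arm_score[arm_name] = (mu_plus - mu_minus) / mu_minus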

Testing (Deploying the Arm Scores on New Data)

Once the arm scores were calculated, I put them into action with new data. Initially, Duolingo’s policy involved randomly selecting notifications from a uniform distribution, where each notification had an equal chance of being chosen in every round. I needed to see if their algorithm would perform better than this random selection.

Looping through each round, I updated the scores of all eligible arms by applying a recency penalty based on the “history” column. This adjustment reduced the probability of recently sent arms in the new distribution.

for index, row in df.iterrows():
    arm = row['selected_template']

    # Start each round from the learned base scores, then apply recency penalties
    decay_arm_score = dict(arm_score)

    # Getting days since each template was last sent from the history column (d)
    for i in row['history'][::-1]:
        days = i['n_days']
        # Exponential-decay penalty: gamma * 0.5^(days / h),
        # with gamma = 0.0017 and half-life h = 15 days
        decay_arm_score[i['template']] = (arm_score[i['template']] -
                                          0.0017 * (0.5 ** (days / 15)))

Following this, I computed the policy for the round using an Argmax function. Because exploration reduces reward in the short term, Duolingo recommended using Argmax instead of SoftMax in these offline evaluations.

for eligible in row['eligible_templates']:
    # Original policy: uniform random choice among eligible templates
    old_policy[eligible] = (1 / len(row['eligible_templates']))

    # Unnormalized softmax weight of the recency-adjusted score
    soft_max_prob[eligible] = math.exp(decay_arm_score[eligible] / explore)

# Calculate new policy using argmax (all probability on the top-scoring arm)
arg_max_arm = max(soft_max_prob, key=soft_max_prob.get)
new_policies = {arm: 1.0 if arm == arg_max_arm else 0.0 for arm in soft_max_prob}

# Ratio of policy weights (new policy over old policy) for each arm
diff = {key: new_policies[key] / old_policy[key] for key in new_policies}

Evaluating the New Policy

Because the data already contains a selected notification and a known reward, we cannot know for certain whether a different suggested notification would have performed better. In this offline evaluation, Duolingo used weighted importance sampling: each logged reward is weighted by the ratio of the arm’s probability under the new policy to its probability under the original policy, and these weighted rewards are then averaged.

If the estimated reward under the new policy is greater than the original, we have good evidence that the algorithm provides some benefit.

To compute these ratios, I used a dictionary comprehension to produce weights that I could multiply by the reward.

diff = {key: new_policies[key] / old_policy[key] for key in new_policies}

To assess the effectiveness of the new policy, we can compare the reward averages using this calculation:

Where:

  1. r̅π is the reward of the new policy.
  2. T is the number of rounds in the test set.
  3. π(a | t) is the new policy.
  4. bt(a | t) is the original policy.
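Written out (my reconstruction of the weighted importance sampling estimator from these definitions):

\bar{r}_{\pi} = \frac{\sum_{t=1}^{T} \frac{\pi(a_t \mid t)}{b_t(a_t \mid t)}\, r_t}{\sum_{t=1}^{T} \frac{\pi(a_t \mid t)}{b_t(a_t \mid t)}}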

By weighting the logged rewards with these probability ratios, we can obtain an estimate of the potential lift from the new policy.
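In code, that estimate boils down to a weighted average over the logged test rounds. Here is a minimal sketch, assuming hypothetical helpers new_policy_prob(row) and old_policy_prob(row) that return π(a | t) and bt(a | t) for the arm actually sent in each row:

def estimate_policy_reward(test_df, new_policy_prob, old_policy_prob):
    """Weighted importance sampling over logged rounds."""
    total_weighted_reward = 0.0
    total_weight = 0.0
    for _, row in test_df.iterrows():
        # Importance weight: pi(a | t) / b_t(a | t) for the logged arm
        w = new_policy_prob(row) / old_policy_prob(row)
        total_weighted_reward += w * row['session_end_completed']
        total_weight += w
    return total_weighted_reward / total_weight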

Results

After training on over 5,000,000 rows (my laptop had memory limitations that prevented assessing the entire dataset), I found the following scores for each notification:

Arm Scores after Training

These new scores would be updated each round in the test set, but they provide a better starting point than the baseline policy of equal probabilities for all notifications.

After applying this new policy to 5,000,000 rows of testing data, I was able to achieve a greater reward than the original policy using the off-policy evaluation method.

I was able to achieve a 1.6318% relative increase in reward with the new policy that included the bandit algorithm and recency penalty! This demonstrates the effectiveness of the algorithm in getting users to complete their lessons.

Policy Scores after 5,000,000 Rows of Test Data

I also incorporated Duolingo’s offline experiment of including the language of a notification as part of an arm. The reasoning is that different languages and cultures respond differently to varying notifications, so language should be considered when choosing the best one [2].

To achieve this, each notification + language combination was treated as a distinct arm. When controlling for language in this manner, the algorithm also performs well!

# Making each unique arm a tuple of (notification template, UI language)
lang = df['ui_language'].unique()

# Get unique arms from every template/language combination
unique_arms = []
for i in df['selected_template'].unique():
    for j in lang:
        unique_arms.append((i, j))

By incorporating clever mathematics and human psychology into its design, Duolingo has been able to choose better notifications to send to its users. I was able to replicate the results they produced in their offline evaluations with my own implementation of their novel RDSA!

Conclusion

While I could not cover every detail of the Duolingo paper or my implementation methods, I hope that this post helped shed some light on Duolingo’s deviously genius tactics that get us all to learn. It was honestly a blast to dissect an algorithm that has had such a large impact on my daily life!

Duolingo layers several new considerations on top of classic solutions to the multi-armed bandit problem, including a recency penalty to account for the novelty we humans adore.

With access to greater computer memory and data, I would like to perform some of the more complex implementations, like combining past user interactions with notifications and training/testing on more rows.

Future research into this optimization can include adding the device type, user demographic information like age, and temporal components into the arms or arm selection process. While these methods will certainly introduce more complexity, Duolingo researchers have suggested that hierarchical models and other implementations can incorporate this new data to produce even better algorithms [2].

To learn more about Duolingo’s novel solution, you can visit their research page and read the original paper “A Sleeping, Recovering Bandit Algorithm for Optimizing Recurring Notifications.”

You can also visit my GitHub Repository to see all the code for my implementation of this algorithm.

References

[1] Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 48–77.

[2] Yancey, K. P., & Settles, B. (2020). A sleeping, recovering bandit algorithm for optimizing recurring notifications. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. https://doi.org/10.1145/3394486.3403351

[3] Ebbinghaus, H. (1885). Memory: A contribution to experimental psychology. New York: Teachers College, Columbia University.

[4] Snowflake. (2024). What is Parquet? https://www.snowflake.com/guides/what-parquet


Jacob Mazurkiewicz

Data Science Student at Duquesne University. Data Science Intern, Undergraduate Computer Vision Researcher, and AI Ethics Fellow