Bootstrapping Reinforcement Learning

How we built reinforcement learning datasets with Human-AI Interaction at Digg

Suman Deb Roy
9 min read · Jul 20, 2017

There are 3 overarching ways to do machine learning. There's Supervised Learning ("this is a car" 🚗), Unsupervised Learning ("all these things look like cars" 🚗 🚙 🚓) and Reinforcement Learning ("I can drive from one place to another in this, so it could be a car" 🚙). The choice depends on the dataset you possess and the environment where the solution will run.

At Digg, massive amounts of media data flow in from all corners of the Internet, aggregated by our products such as Digg Reader and Digg Deeper. We see more than 7 million unique URLs every day, which must be analyzed for editorial, channel and notification purposes. One tool that assists in filtering is the Digg trending algorithm 📈, which finds popular stories from the aggregated web and presents them in a way that's accessible to users, via Digg Bot on Facebook or Slack and Alexa.

Because trending news or bot messages must be classified in real time, we previously built a topic classification algorithm that takes text from news articles or chat messages and predicts a suitable Digg tag for it. The Digg tag space comprises more than 60 topic tags, such as ⚖ law, ⚔ warfare, 🎬 movies, 🎮 gaming, 🔬 science, 🍻 booze, etc.
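For readers who want a concrete picture, here is a minimal sketch of a multi-class text classifier of this kind. It is not RIO's actual implementation; the example texts, tags and the choice of TF-IDF plus logistic regression are purely illustrative.

```python
# Minimal multi-class topic classifier sketch (illustrative only, not RIO):
# map article text to a topic tag using TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: (article text, tag)
train_texts = [
    "New electric sedan breaks range record",
    "Court rules on landmark privacy case",
]
train_tags = ["cars", "law"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_tags)

print(model.predict(["Jury selection begins in the antitrust trial"]))
```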

While most ML models attempt to mimic the training data, sometimes this isn't enough. There are scenarios where new strategies must be devised impromptu from new encounters. This is where Reinforcement Learning is useful. However, it also means you must possess a reinforcement dataset in the first place.

In the real world, classification tasks can get complex quickly. News is always changing, so pre-trained word distributions don't always relate to fresh news topics or a user's query intent. We want the algorithm to navigate a topic space but optimize in the face of randomness. Thus, one component in our prediction pipeline is a Reinforcement Learning (RL) module, meant to keep teaching the algorithm new topic patterns. 👻

This article describes how you can build your own reinforcement learning dataset using a Slack bot and some enthusiastic colleagues. In the process, you also get to automagically label your unlabelled data!

Emojis are reinforcements we use in everyday digital conversations

And the fun part is that we'll do all this with emojis 😉. But first, why RL?

Rehearsing vs. Improvising in Machine Learning

While supervised learning requires labeled datasets and unsupervised learning needs lots of data, reinforcement learning 🔁 requires active trainers to tell the algorithm what's right and what's wrong. This allows the algorithm to adjust and adapt for future predictions. It is akin to A/B testing or multi-armed bandit techniques, which improvise based on user clicks to determine the best policy, such as picking a better headline or an ideal layout to optimize CTR.
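To make the bandit analogy concrete, here is a minimal epsilon-greedy sketch for the headline-picking case. The variants, click probabilities and exploration rate are made up for illustration.

```python
import random

# Epsilon-greedy bandit sketch: pick a headline variant, observe a click
# (reward 1) or no click (0), and update that variant's running estimate.
headlines = ["Variant A", "Variant B", "Variant C"]
counts = [0, 0, 0]        # times each variant was shown
values = [0.0, 0.0, 0.0]  # running average click-through per variant
epsilon = 0.1             # exploration rate (assumed, tune for your traffic)

def choose():
    if random.random() < epsilon:                     # explore
        return random.randrange(len(headlines))
    return max(range(len(headlines)), key=lambda i: values[i])  # exploit

def update(i, reward):
    counts[i] += 1
    values[i] += (reward - values[i]) / counts[i]     # incremental mean

# Simulated feedback loop with made-up click probabilities.
true_ctr = [0.03, 0.06, 0.04]
for _ in range(10000):
    i = choose()
    update(i, 1 if random.random() < true_ctr[i] else 0)

print(headlines[max(range(3), key=lambda i: values[i])])  # usually "Variant B"
```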

RL usually outperforms purely supervised and unsupervised models in complex tasks. In such tasks, it's not enough to just rehearse all the scenarios laid out in training. The algorithm must improvise, gracefully. But training to improvise can take time. Unlike in the supervised case, RL does not directly receive training labels or directional information 😣. It has to actively try alternatives: select, generate, test, contrast, trial and error. So why endure all this additional work? What is the real benefit?

Digg's Reinforcement Learning uses emoji reactions to capture user feedback on predictions, which can be compiled into a reinforcement learning dataset. For the example story here, the algorithm predicts food 🍕, which is correct and is rewarded by 8 trainers. But 4 trainers also think the health tag is applicable.

The Real World is a Reinforcing Environment 🌍

For any machine learning solution, the algorithm's prediction fidelity depends on the distribution of incoming data after the model is built. If this distribution is stable, static, relatively simple and matches the training samples, it is sufficient to learn just once. But if the environment is dynamic, or exhibits or interacts with intelligence, the predictive model must keep learning at regular intervals.

What are some good examples of such environments? A self-driving 🚗. A truly conversational 🤖. Topic classifiers for 🗞. AIs for 🎮. Consider self-driving cars. They need to adjust to careless drivers or changing weather conditions. Conversational bots must hold user attention even if they can't anticipate whether a user's next message will be related to their service vertical. News finds a way to connect old words to new topics. And AlphaGo must outmaneuver the intelligence of a 9-dan Go master.

All of these systems have one common trait: they are bound to encounter "shifting" dynamics or semantics in their environment. They need to learn from recurrent experience to make robust predictions. Internally, RL methods involve adjusting millions of knobs (weights) to make the next prediction more accurate. The technical goal is to discover, using something called policy gradients, the correct actions to take based on feedback data, actions that are (hopefully) globally optimal. These actions are then internalized into a "policy" and embedded into the model.
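As a rough illustration of what "adjusting the knobs" means, here is a toy REINFORCE-style policy gradient update over topic tags. The feature sizes, learning rate and reward values are assumptions for illustration; this is not Digg's actual training code.

```python
import numpy as np

# Toy REINFORCE-style update: a softmax "policy" over topic tags whose weights
# (the "knobs") are nudged in the direction that makes rewarded tags more likely.
rng = np.random.default_rng(0)
n_features, n_tags = 50, 5          # assumed sizes for illustration
W = np.zeros((n_features, n_tags))  # policy weights
lr = 0.1                            # learning rate (assumed)

def policy(x):
    logits = x @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_step(x, reward):
    probs = policy(x)
    tag = rng.choice(n_tags, p=probs)        # sample an action (a predicted tag)
    grad_log = -probs                        # gradient of log pi(tag|x) w.r.t. logits
    grad_log[tag] += 1.0
    W[:] += lr * reward * np.outer(x, grad_log)  # scale the update by the reward
    return tag

x = rng.normal(size=n_features)      # a made-up article feature vector
tag = reinforce_step(x, reward=+1.0) # e.g. +1 for a correct tag, -1 for a wrong one
```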

The Surprisingly High 💰 of Acquiring Reinforcements

There are lots of RL tools, optimization techniques and platforms available online. However, they only do a good job of improving your prediction accuracy if you already have the right data. And therein lies the biggest caveat. RL research focuses heavily on accuracy and frameworks, yet

The hardest part in bootstrapping a reinforcement learning solution is deceptively simple: it's the acquisition of the individual reinforcements.

Algorithms like AlphaGo and Atari AI have access to millions of recorded game results. In real-world problems, though, such opportunities are rare. Imagine if you had a dataset of every action taken within a startup, and whether the startup eventually found success or not. You could conceivably come up with a reinforcement model for predicting startup success. But it's hard to imagine something like that captured in a single place, or even captured digitally at all. Building reinforcement learning datasets is hard.

Thus, you must actively go out and seek it. (1) One way to acquire reinforcements is as a side effect of a highly engaging product, in which users willingly ❤️ / 😡 things to indicate consensus. (2) The other option is crowdsourcing.

Crowd-based reinforcements or feedback can be effective, especially because they involve domain- and discipline-specific knowledge, together with awareness of social norms. A popular tool for this is Amazon's Mechanical Turk. However, even on such human-intelligence task platforms, most requests we see involve only classification, surveys, tagging and transcription; there is little around reinforcements 👀

A Platform for Reinforcement Acquisition ✌️

After exploring several ways to acquire reinforcements, we settled on developing a game on Slack. It's not a surprising choice, given that the best-known RL datasets (including the Atari and Go ones) come from gaming environments, i.e. some kind of sandbox 👾 where you can capture every single action. Feedback on these actions is the food for RL.

Building bots on Slack can be frictionless. The main flow we designed was to have our algorithm (called RIO) predict topic labels (i.e. tags like 🚗, 🎭, 🤖) for incoming news articles. We then pushed the predictions into a Slack channel periodically, together with a UI for trainers to provide feedback. The UI was simple, something we use in real life all the time: reinforcement through emoji reactions. All you need to begin are these three: 👍, 👎 and 😕.
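A minimal sketch of what that flow could look like with the slack_sdk Python client is below. The channel name, token variable and message format are assumptions for illustration, not our production bot code.

```python
import os
from slack_sdk import WebClient

# Sketch of the feedback flow: post a prediction to a training channel and
# pre-fill the three reaction emojis so trainers only have to click.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # assumed env var

def post_prediction(article_title, article_url, predicted_tag, channel="#rio-training"):
    msg = client.chat_postMessage(
        channel=channel,
        text=f"Predicted tag *{predicted_tag}* for <{article_url}|{article_title}>",
    )
    # Pre-fill 👍 / 👎 / 😕 so feedback is a single click, not a typed :emoji: code.
    for name in ("thumbsup", "thumbsdown", "confused"):
        client.reactions_add(channel=msg["channel"], name=name, timestamp=msg["ts"])
    return msg["ts"]  # keep the message timestamp to look up reactions later
```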

Since most of our employees use Slack every day, it could be an interesting break to spend a few minutes training the algorithm. When a trainer is unsure, or the article covers multiple topics that weren't predicted, the confused emoji 😕 is apt.

In order to settle on a topic/tag, RIO performs many actions on the natural-language text. These actions are stored in a prediction payload, which you can think of as data residue 🍪, indicating how a prediction was reached. This data residue is key not only for debugging, but also for the explainability of algorithmic results, enabling audit trails. Trainers don't see the whole payload in the UI, so their feedback is based only on the correctness of the final prediction.
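As an illustration, a prediction payload might look something like the dictionary below. The field names are hypothetical, not RIO's actual schema.

```python
# A hypothetical prediction payload, the "data residue" behind one prediction.
# Every field name here is illustrative.
payload = {
    "url": "https://example.com/story",
    "predicted_tag": "food",
    "confidence": 0.81,
    "candidate_tags": [("food", 0.81), ("health", 0.64)],
    "features_used": ["tfidf", "entities", "source_prior"],
    "model_version": "rio-2017-07",
    "timestamp": "2017-07-20T14:05:00Z",
}
```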

Initially, as an RL algorithm is learning, predictions will go wrong. In fact, we want the system to explore and exploit all the time, especially early on. One 👎 is sometimes more helpful in tuning the algorithm than consistent 👍 in some categories.

The bot automatically pre-fills the 3 reinforcement options, so trainers can just press 👎 if they disagree with the prediction instead of laboriously typing out :emoji: codes.

The Reinforcement Game

Gamification of the feedback task creates more incentive for participation. Every few hours, new stories would be pushed into the Slack room so trainers could provide feedback. The more feedback someone provided, the higher his or her score would be 🎓. What really made this process fun was a sense of competition over who could become the lead trainer!

Invoking the leaderboard was a simple Slack query.

Trainers could also retrieve their own score, which showed their position on the leaderboard.
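Scoring itself can be as simple as tallying reactions per trainer. The sketch below assumes one point per reaction, which is an illustrative rule rather than our exact scoring.

```python
from collections import Counter

# Minimal leaderboard sketch: one point per reaction a trainer leaves in a round.
feedback = [
    {"trainer": "alice", "emoji": "thumbsup"},
    {"trainer": "bob", "emoji": "pizza"},
    {"trainer": "alice", "emoji": "thumbsdown"},
]

scores = Counter(record["trainer"] for record in feedback)
for rank, (trainer, score) in enumerate(scores.most_common(), start=1):
    print(f"{rank}. {trainer}: {score}")
```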

A batch of reinforcements comprised what we called a "round", and each round lasted about a week. At the end of each round, the leaderboard was frozen. The lead trainer got a 🏆

This was very helpful for performance validation. But this UI doesn't solve a common problem companies face: the lack of ground-truth data. Datasets can be laborious to annotate and label. So how could we change the UI to automatically capture true labels for content?

The TIER Technique

During one of the game rounds, we noticed that trainers were adding the actual "tag" beside a wrong prediction to indicate ground truth.

A 👎 reinforcement only tells us whether the machine predicted inaccurately. Emojis label the category of the data instance.

Taking inspiration from this feedback, our solution for acquiring both ground truth and reinforcement in one UI rested on emojis that indicate a concept. We replaced the 👍 / 👎 feedback system with an emoji that indicated the actual tag of the story. We call this the Tag Illustrative Emoji Reaction (TIER) method for feedback.

In this design, the Slack bot just auto-populates the TIER emoji corresponding to the predicted topic. Trainers press this emoji to confirm the prediction was right, or add their own TIER emojis to indicate an incorrect prediction.

The TIER system achieves two things at once. First, it gives us reinforcement about whether the predicted topic was correct. But perhaps more valuably, it also captures the ground truth of the data instance. If the predicted topic is the only emoji in the feedback, the reinforcement algorithm understands it was correct overall 💯

However, if there are zero new votes on the predicted topic but many votes on a user-suggested emoji, the algorithm realizes what the true label is supposed to be. Here are more examples of topic classification and reinforcement feedback using the TIER technique:

Slack UI for Human-AI interaction using the TIER technique
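To make the logic concrete, here is a sketch of how TIER reactions on a message could be turned into a reinforcement record plus a ground-truth label. The emoji-to-tag mapping and the simple majority-vote rule are assumptions based on the description above, not RIO's exact implementation.

```python
# Illustrative emoji-name -> tag mapping; extend for your own tag space.
EMOJI_TO_TAG = {"pizza": "food", "hospital": "health", "microscope": "science"}

def interpret_reactions(predicted_tag, reaction_counts):
    """reaction_counts: dict mapping emoji name -> number of trainer votes."""
    votes = {EMOJI_TO_TAG.get(e, e): n for e, n in reaction_counts.items()}
    # Majority vote decides the ground-truth label (assumed tie-breaking rule).
    true_tag = max(votes, key=votes.get) if votes else predicted_tag
    reward = 1 if true_tag == predicted_tag else -1
    return {"label": true_tag, "reward": reward, "votes": votes}

# E.g. the food 🍕 prediction from earlier: 8 votes agree, 4 suggest health.
print(interpret_reactions("food", {"pizza": 8, "hospital": 4}))
# -> {'label': 'food', 'reward': 1, 'votes': {'food': 8, 'health': 4}}
```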

NLP is tricky, so we might encounter surprises in prediction all the time. But the goal of integrating reinforcements is to improve robustness and weed out the risks.

Effect on Accuracy

The original RIO algorithm was trained on 4 years of editor-tagged stories that appeared on the Digg front page, comprising more than a hundred thousand articles. Before the reinforcement game, the prediction accuracy was about 73% in categorizing the trending corpus.

As we kept getting reinforcement feedback, RIO was able to tune itself. The accuracy crept up as the rounds passed; most recently, RIO reached an accuracy of 96.5% on trending stories. In the coming months, we plan to release more performance results on benchmark classification datasets.

The chart on the left shows the accuracy improvement of RIO with each round. On the right is information on trainer voting patterns. The first two rounds used basic reactions, whereas the remaining rounds used TIER voting. Some of the steepest jumps in accuracy were in those TIER rounds of reinforcement.

Reinforcements and your ML problem

Designed carefully, RL algorithms can converge to the global optimum. TIER reinforcement is not limited to news articles and topic/tag prediction. It could be applicable to any joint validation-and-classification task, from validating automatic captions of images to judging the accuracy of audio transcriptions. The trainer only has to 👍, 👎 or add a relevant topic :emoji:, no matter what the tag space is, and it will serve as reinforcement.

TIER also has three advantages: (1) precise ground-truth knowledge for each data point is captured, (2) something like 🍕 or 🏈 or 🎥 is more expressive (and fun) than 👍 or 👎, and (3) you can quantify the reward value for reinforcement with the magnitude of the votes. Even if some of your test data is synthetic, the reinforcements will be organic!
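On point (3), one possible, purely illustrative way to turn vote counts into a scaled reward is to normalize the net agreement on the predicted tag:

```python
# One possible scaling (an assumption, not a prescribed formula): net agreement
# on the predicted tag, normalized to the range [-1, 1].
def scaled_reward(votes_for_predicted, votes_for_other_tags):
    total = votes_for_predicted + votes_for_other_tags
    if total == 0:
        return 0.0
    return (votes_for_predicted - votes_for_other_tags) / total

print(scaled_reward(8, 4))   # 0.33, mostly correct but with some dissent
print(scaled_reward(0, 10))  # -1.0, a strong signal that the prediction was wrong
```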

So, while it's a good plan to start with supervised learning, plan to improvise with reinforcement 🚀

Thanks to Eliza Bray, Editor at Digg, who originally suggested and created the TIER emojis for topic tags, enabling us to capture granular reinforcements 🙏
