Recommendation system behind Toutiao news App
Toutiao is the largest mobile news and content aggregator in China, with the Toutiao news app and the TikTok video app as its main products. Both apps have addictive, if somewhat controversial, newsfeed algorithms. Toutiao recently revealed how it does newsfeed recommendation, and here is my translation. It is not very well structured because it was originally a tech talk accompanied by slides.
A recommendation system can be described as a function that models a user’s satisfaction with a piece of content in a given context, formally:
Satisfaction=F(content, user, context)
The first factor is content. Toutiao is a compound platform with articles, videos, UGC short videos, Q&As and so on, each with its own features that need to be extracted for recommendation. The second factor is the user’s profile, including interests, occupation, age, gender and latent patterns inferred from the user’s behaviour. The third factor is context. In a mobile-centric era, users move constantly between office, home and holiday destinations, and their information preferences change accordingly.
Combining these three factors, the recommendation system estimates whether it’s appropriate to recommend an item to a user in the given context.
But how to evaluate its accuracy?
In a recommendation system, click-through rate, stay time, upvotes, comments and reposts are metrics that can be used for offline training and online verification. But these quantitative metrics are not enough for a large-scale platform with a huge user base. It’s important to introduce some other knobs:
- Frequency control of ads and special content
- Frequency control of vulgar content
- Suppress clickbait, low-quality and disgusting content
- Manually insert content at the top or middle of the feed, or adjust its priority
- Lower priority for low-level content producers
Content like Q&A is not just for reading; it also needs users’ contributions. We need to be careful about how we mix such content with the rest and control its frequency.
Going back to the formula above, this is a classic supervised learning problem. There are many ways to solve it, such as collaborative filtering, logistic regression (LR), deep neural networks (DNN), factorization machines (FM) and GBDT.
No single algorithm can handle every recommendation scenario. Facebook combined LR and GBDT a couple of years ago, and ensembles of LR and DNN are now common. Several Toutiao products share the same recommendation system, each with some tweaks for its own use case.
The recommendation model relies on several typical kinds of features:
- Correlation between a piece of content’s characteristics and the user’s interests. Explicit correlations include keywords, categories, sources and genres. Implicit correlations can be extracted from user vectors or item vectors produced by models like FM.
- Environmental features such as geolocation and time. These can be used as bias features or as a basis for building correlations.
- Hot trends. There are global, categorical, topic and keyword hot trends. Hot trends are very useful for solving the cold-start problem, when we have little information about the user.
- Collaborative features, which help avoid the situation where recommended content gets more and more concentrated. Collaborative filtering does not analyse each user’s history separately; it finds similarities between users based on their clicks, interests, topics, keywords or even implicit vectors. By finding similar users, it can expand the diversity of recommended content.
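To make the feature groups above more concrete, here is a minimal sketch of how they might be assembled into one sparse feature vector and scored with logistic regression. The field names (`keywords`, `interests`, `similar_user_ctr`), the feature naming scheme and the weights are all assumptions for illustration, not Toutiao’s actual schema.

```python
import math


def build_features(user, item, context, trending):
    """Combine the four feature groups into one sparse feature dict."""
    features = {}
    # Correlation: explicit overlap between item keywords and user interests.
    for keyword in item["keywords"] & user["interests"]:
        features[f"match_kw={keyword}"] = 1.0
    # Environmental: where and when the request happens.
    features[f"city={context['city']}"] = 1.0
    features[f"hour={context['hour']}"] = 1.0
    # Hot trend: a popularity score that is available even for new users.
    features["trend"] = trending.get(item["id"], 0.0)
    # Collaborative: how often similar users clicked this item (hypothetical field).
    features["similar_user_ctr"] = item.get("similar_user_ctr", 0.0)
    return features


def predict_click(weights, features, bias=0.0):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


user = {"interests": {"soccer", "phones"}}
item = {"id": "a1", "keywords": {"soccer", "transfer"}}
context = {"city": "Beijing", "hour": 21}
weights = {"match_kw=soccer": 1.0, "trend": 0.5}  # made-up weights
probability = predict_click(weights, build_features(user, item, context, {"a1": 2.0}))
```

A sparse dict keeps the sketch readable; a production system with billions of features would hash or index feature names instead.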
Most of our recommendation models are trained in realtime. Compared with batch training, realtime training uses fewer resources. It also responds faster to changes, which is essential for a content feed because users’ behaviour and feedback flow into the next round of recommendations more quickly. Our production system is based on a Storm cluster that processes action events such as clicks, impressions, favourites and shares.
We built a high-speed model parameter service in-house, because our data was growing so fast that existing open source solutions could not offer the availability and performance we needed. Our proprietary system includes many low-level optimisations designed specifically for our use case, and it also provides more sophisticated operations tools.
So far, Toutiao’s recommendation model has tens of billions of raw features and billions of feature vectors. In the training process, realtime events are first tracked and piped into a Kafka queue. A Storm cluster then consumes the data from Kafka and joins it with labels collected from our app to generate a new batch of training input. An online training process takes this input, updates the model parameters and publishes the changes to the production system. The main delay in this process is on the user side, since a user might not have time to browse the recommended content right away. Apart from that, the whole system is almost fully realtime.
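The online training step can be illustrated with a toy logistic regression that is updated one event at a time. This is only a sketch under simple assumptions: the event stream is a plain Python list standing in for the Kafka and Storm pipeline, and the learning rate and feature names are made up.

```python
import math
from collections import defaultdict


class OnlineLogisticRegression:
    """Logistic regression updated one event at a time with plain SGD."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)  # feature name -> weight

    def predict(self, features):
        z = sum(self.weights[name] * value for name, value in features.items())
        z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, clicked):
        # Gradient of the log loss for a single labelled example.
        error = self.predict(features) - (1.0 if clicked else 0.0)
        for name, value in features.items():
            self.weights[name] -= self.learning_rate * error * value


def consume(model, event_stream):
    """Stand-in for the streaming consumer: one update per labelled event."""
    for features, clicked in event_stream:
        model.update(features, clicked)
    return model


# Toy stream: sport impressions get clicked, finance impressions do not.
events = [({"category=sport": 1.0}, True), ({"category=finance": 1.0}, False)] * 50
model = consume(OnlineLogisticRegression(), events)
```

Because each event changes the weights immediately, the model reflects a user’s latest feedback without waiting for a nightly batch job.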
Toutiao has a huge volume of content, including tens of millions of videos. Our system cannot score all of it for each recommendation, so we need a recall strategy that selects only thousands of candidates. One key requirement for the recall strategy is speed: it needs to finish within 50 ms.
There are many ways to do candidate selection. Our main strategy is an inverted index. The index is maintained offline, with keys such as category, topic, entity and source. Within each key, items are sorted by trend, freshness, users’ actions and so on. During online recall, the top items under every key that matches the target user are selected as candidates. This efficiently filters content from a large base.
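The reversed (inverted) index recall described above can be sketched as follows. The key format (`category:sport`), the single precomputed `score` per item and the `per_key` cut-off are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def build_inverted_index(items):
    """Offline step: map each key to its items, best score first."""
    index = defaultdict(list)
    for item in items:
        for key in item["keys"]:
            index[key].append((item["score"], item["id"]))
    for postings in index.values():
        postings.sort(reverse=True)
    return index


def recall(index, user_keys, per_key=2, limit=1000):
    """Online step: take the top items under each key that matches the user."""
    best = {}
    for key in user_keys:
        for score, item_id in index.get(key, [])[:per_key]:
            best[item_id] = max(score, best.get(item_id, score))
    # Return at most `limit` distinct candidates, highest score first.
    return heapq.nlargest(limit, best, key=best.get)


items = [
    {"id": "a", "score": 0.9, "keys": ["category:sport", "entity:real_madrid"]},
    {"id": "b", "score": 0.5, "keys": ["category:sport"]},
    {"id": "c", "score": 0.7, "keys": ["category:sport"]},
    {"id": "d", "score": 0.8, "keys": ["category:finance"]},
]
index = build_inverted_index(items)
candidates = recall(index, ["category:sport", "entity:real_madrid"])
```

Since sorting happens offline, the online step only slices a few short lists, which is what makes a tight latency budget realistic.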
2. Content analysis
Content analysis consists of text processing, image processing and video processing. Toutiao started as a news service, so we’ll focus on text processing here. One important role of text processing is building users’ profiles. Without labels on text, we cannot know which labels users are interested in. For example, text processing must first determine that an article’s label is “Internet” before we can say that users who read it are interested in “Internet”.
Text labels can also be used directly as features. For example, we can simply recommend content about “Meizhu” to users who subscribed to the “Meizhu” label. Another use case: a user who is tired of the main content feed can switch to a specific channel (science, sport, entertainment, military). When he/she switches back to the main feed, our recommendations will be more accurate. Because we use a unified model for the main feed and sub-channels, and the search space in a sub-channel is smaller, we can provide more relevant content in sub-channels. It’s difficult to achieve high-quality recommendation with the main feed alone, so it’s important to have good sub-channels, which rely on text processing.
Above is an example from our system. Searching by article ID shows the article’s category, keywords, topics and entities. Of course, text processing is not a must-have for recommendation systems. For instance, in their early days Amazon, Walmart and even Netflix built their recommendations directly on collaborative filtering without text processing. However, for newsfeed products, most users only read content generated on the same day. Without text processing, it’s hard to solve the cold-start problem for new content, which is a common weakness of collaborative filtering.
There are two types of text labels. The first is explicit semantic labels added to an article. These labels are added manually, with clear meanings and pre-defined specifications. The second is implicit semantic labels derived from topics and keywords. Topic labels have no clear meaning of their own; each represents a certain distribution of words. Keyword labels are based on unified rules applied to a set of keyword features.
Text similarity is also very important. One common complaint in our user feedback is that similar articles are recommended repeatedly. This is hard to address because everyone has their own standard of repetition. For example, someone who read an article about “Real Madrid CF” yesterday might find another article about “Real Madrid CF” today repetitive, while a hardcore soccer fan wants to read every piece of news about their team. We have to analyse similar articles’ topics, writing styles and entities to adjust the recommendation strategy online.
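One way to picture the "everyone has their own standard of repetition" problem is a similarity filter with a per-user threshold. The sketch below uses Jaccard similarity over label sets, which is my own stand-in and not necessarily what Toutiao uses; the thresholds are made up.

```python
def jaccard(a, b):
    """Jaccard similarity between two label sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_repetitive(candidates, recently_read, threshold):
    """Drop candidates that are too similar to anything the user just read."""
    kept = []
    for article in candidates:
        if all(jaccard(article["labels"], seen["labels"]) < threshold
               for seen in recently_read):
            kept.append(article)
    return kept


recently_read = [{"id": "r1", "labels": {"real_madrid", "la_liga", "soccer"}}]
candidates = [
    {"id": "x", "labels": {"real_madrid", "soccer", "transfer"}},  # similarity 0.5
    {"id": "y", "labels": {"stocks", "finance"}},                  # similarity 0.0
]

# A casual reader gets a strict threshold, a hardcore fan a lenient one.
for_casual_reader = filter_repetitive(candidates, recently_read, threshold=0.4)
for_hardcore_fan = filter_repetitive(candidates, recently_read, threshold=0.9)
```

The same candidate list yields different feeds for the two users, which is the behaviour the paragraph above asks for.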
Similarly, the time and geolocation of the event in an article are also useful, since it doesn’t make sense to recommend an article about traffic problems in Wuhan to a user in Beijing. Lastly, text processing can also identify inappropriate articles such as vulgar, NSFW, advertorial and pep talk articles.
We have three types of explicit semantic labels: category, definition and entity, each with different levels and requirements.
- The objective of category labels is coverage. We want every article or video to have a category.
- Entity labels, on the other hand, don’t have to cover every article. They need to be precise, because the same name or phrase might refer to a different person or object in a different context.
- Definition labels are responsible for semantics that are relatively precise but also abstract.
Those three types of labels were our initial design, but we later merged category and definition into one type when we found that they could share the same technical solution.
Right now, implicit semantic features are already good enough for recommendation systems, while explicit labels require manual work and new entities and definitions are born every day. So why do we still spend resources on explicit labels when implicit semantic features are easier to get? One reason is that sub-channels and users’ label subscriptions need explicit labels. The quality of explicit labels is also the key measure of a company’s NLP capability.
Categories in Toutiao’s production system are generated with a hierarchical text classification algorithm. The root could have science, sport, finance and entertainment as its children. Sport is further classified into soccer, basketball, ping-pong, track and field, swimming and so on by its own classifier. Soccer is then divided into international soccer and Chinese soccer, and Chinese soccer is divided into CLO, CSL and National by its own classifier. Compared with a single classifier, a hierarchical classifier mitigates the data bias issue. It’s not a strict tree, though, as we sometimes need links between nodes in different parts of the hierarchy to improve recall. This architecture is a generic solution, but it needs adjusting for use cases of different complexity by adopting heterogeneous classifiers at each node. For example, some nodes use SVM, others use SVM plus CNN, and the rest use SVM + CNN + RNN.
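The hierarchy described above can be sketched as a tree in which every internal node owns its own classifier. In this toy version the per-node classifiers are simple keyword rules, a deliberate stand-in for the SVM, CNN and RNN models mentioned in the talk; the category names and keywords are made up.

```python
class Node:
    def __init__(self, name, classifier=None, children=None):
        self.name = name
        self.classifier = classifier    # callable(text) -> child name; None at a leaf
        self.children = children or {}  # child name -> Node


def keyword_classifier(rules):
    """Toy per-node classifier: pick the first label whose keywords appear."""
    def predict(text):
        words = set(text.lower().split())
        for label, keywords in rules.items():
            if words & keywords:
                return label
        return None
    return predict


def classify(node, text):
    """Walk down the hierarchy, asking each node's own classifier where to go."""
    path = [node.name]
    while node.classifier is not None:
        child_name = node.classifier(text)
        if child_name not in node.children:
            break  # no confident child: stop at the current level
        node = node.children[child_name]
        path.append(node.name)
    return path


sport = Node(
    "sport",
    keyword_classifier({"soccer": {"soccer", "goal"}, "basketball": {"basketball", "nba"}}),
    {"soccer": Node("soccer"), "basketball": Node("basketball")},
)
root = Node(
    "root",
    keyword_classifier({"sport": {"soccer", "basketball", "match"}, "finance": {"stock", "bank"}}),
    {"sport": sport, "finance": Node("finance")},
)
path = classify(root, "Real Madrid wins the soccer cup final")
```

Because each node only accepts a callable, different nodes can plug in different model types, which mirrors the heterogeneous setup described above.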
To identify entities, there are four steps:
- Tokenization and part-of-speech (PoS) tagging
- Candidate word selection
- Disambiguation
- Calculating similarity
In the first and second steps, we might need to join words using an external knowledge base, because some entities are a combination of several words that have to be stuck together. If there are multiple candidate entities after the first two steps, we use word vectors, topic distribution and word frequency to run a disambiguation algorithm, and then estimate their similarity to each entity.
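As a rough illustration of the disambiguation step, the sketch below picks the candidate entity whose profile is closest to the words around the mention, using cosine similarity over sparse word-count vectors. The entity profiles are invented, and real systems would combine word vectors, topic distributions and word frequencies as described above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context_vector, candidate_entities):
    """Return the candidate entity that best matches the article context."""
    return max(candidate_entities,
               key=lambda name: cosine(context_vector, candidate_entities[name]))


# Hypothetical profiles for two entities that share the surface form "Apple".
candidate_entities = {
    "Apple Inc.": {"iphone": 1.0, "mac": 1.0},
    "apple (fruit)": {"orchard": 1.0, "juice": 1.0},
}
context_vector = {"iphone": 1.0, "launch": 1.0}  # words around the mention
entity = disambiguate(context_vector, candidate_entities)
```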
3. User labelling
Content analysis and user labelling are the two foundations of a recommendation system. Content analysis leans more on machine learning, while user labelling is more challenging in terms of engineering.
At Toutiao, there are three types of user labels.
- Interest: category, topic, keyword, source, community, vertical
- Profile: gender, age, geolocation
- Behaviour pattern: for instance, only watches videos at night
Gender can be retrieved from the user’s social media account, and age is normally estimated from the user’s behaviour. Geolocation is gathered from the user’s mobile device and can later be used to infer the user’s home, office, business trip and vacation locations, which helps improve recommendation performance.
How do we get users’ interest labels? One simple method is to extract labels from the articles in a user’s reading history. But there are several tricks to getting it right.
- Noise cancelling: filter out articles with very short browsing time, which are likely to be clickbait.
- Penalty on trending articles.
- Time decay: older browsing history has lower weight than newer history.
- Penalty on display: if an article is recommended but not clicked, its related features such as category, keyword and source get lower weight.
- Global bias: L1 norm on the average click rate of each user type.
- Unsubscribes and dislikes.
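Several of the tricks above (noise cancelling, trending penalty, time decay, display penalty) can be combined into one scoring pass over a user’s history. This is a sketch only: the half-life, dwell-time cut-off, penalty size and the `popularity` field are all hypothetical values chosen for illustration.

```python
import math


def interest_scores(events, now, half_life_days=30.0,
                    min_dwell_seconds=5.0, display_penalty=0.1):
    """Accumulate a per-label interest score from a user's event history."""
    scores = {}
    for event in events:
        age_days = (now - event["timestamp"]) / 86400.0
        decay = 0.5 ** (age_days / half_life_days)  # time decay
        if event["clicked"]:
            if event["dwell_seconds"] < min_dwell_seconds:
                continue  # noise cancelling: probably clickbait
            # Penalty on trending articles: popular items say less about taste.
            delta = decay / (1.0 + math.log1p(event["popularity"]))
        else:
            delta = -display_penalty * decay  # shown but not clicked
        for label in event["labels"]:
            scores[label] = scores.get(label, 0.0) + delta
    return scores


now = 1_700_000_000
day = 86400
events = [
    {"timestamp": now, "clicked": True, "dwell_seconds": 60,
     "popularity": 0, "labels": ["soccer"]},
    {"timestamp": now - 30 * day, "clicked": True, "dwell_seconds": 60,
     "popularity": 0, "labels": ["soccer"]},
    {"timestamp": now, "clicked": True, "dwell_seconds": 2,
     "popularity": 0, "labels": ["gossip"]},
    {"timestamp": now, "clicked": False, "dwell_seconds": 0,
     "popularity": 0, "labels": ["finance"]},
]
scores = interest_scores(events, now)
```

Each event contributes independently, so the same function works whether events arrive in a daily batch or one at a time from a stream.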
Apart from the engineering difficulties above, user labelling is relatively simple. Our first version batch processed two months of activity history for the previous day’s active users on a Hadoop cluster.
However, with the rapid growth of our user base and an increasing number of user label models and other batch tasks, the daily computational burden got heavier and heavier. In 2014, batch processing millions of users’ labels could barely finish within one day. This not only blocked other tasks, but also increased the write load on our distributed storage and the latency of user label updates.
Facing those challenges, we deployed a Storm cluster to run streaming computation instead of batch computation. It saved 80% of CPU time and other resources while updating users’ labels in realtime. With only tens of servers it can process the event streams of tens of millions of users, and it has been running ever since. Note that some labels, such as gender and age, don’t need streaming updates, so batch processing is still retained for them.
After building a recommendation system, how do we evaluate its performance? “If you cannot measure it, you cannot optimise it” is a very wise saying, I think. There are a great many optimisations we can make to improve recommendation performance:
- Improve content base
- Update candidate selection strategy
- New features
- Upgrade system architecture
- Update model parameter
- Update rules
Evaluation is important because not all optimisations lead to positive results, offline or online. What we need are:
- A complete scheme of evaluation measurement
- A strong experiment platform
- Handy experiment analysis tools
We cannot rely on a single metric such as CTR or stay time. Over the past few years we have been trying to combine all metrics into a single one, but this is still in progress. For now, whether to deploy a new change to production is decided by a committee of our senior members.
A good scheme of measurement should adhere to several principles.
- First, it needs to consider both short-term and long-term metrics. My experience in e-commerce taught me that lots of gimmicks can stimulate users in the short term but provide no long-term benefit.
- Secondly, it needs to take both user KPIs and ecosystem KPIs into consideration. As a content platform, Toutiao should not only care about the returns of content creators and make them feel proud, but also satisfy content consumers, while balancing the interests of advertisers, which is another tradeoff to make.
- Lastly, measurement should mitigate the influence of collaborative effects. It’s hard to achieve complete isolation between control groups because of external factors.
The advantage of a strong experiment platform is that when multiple experiments are running, it can automatically allocate and recycle traffic for each experiment without manual intervention. This reduces experiment costs and speeds up algorithm iteration, which makes the whole optimisation process much faster.
To run an A/B test, we put users into buckets offline and assign buckets to experiments online, with an experiment label attached to each user. For example, to run an experiment on 10% of traffic, two experiment groups are generated: 5% of traffic runs the same strategy as production, and 5% runs the experimental strategy.
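The bucketing scheme can be sketched with a stable hash. The bucket count of 100, the SHA-256 hash and the experiment layout below are assumptions for illustration; the talk does not say how Toutiao computes buckets.

```python
import hashlib

NUM_BUCKETS = 100  # hypothetical bucket count, so one bucket is 1% of users


def bucket_of(user_id):
    """Map a user id to a stable bucket; this part can be precomputed offline."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign(user_id, experiment):
    """Return the experiment label attached to a user at request time."""
    bucket = bucket_of(user_id)
    if bucket in experiment["control"]:
        return "control"      # same strategy as production, but measured
    if bucket in experiment["treatment"]:
        return "treatment"    # experimental strategy
    return "production"


# A 10% experiment: 5 buckets keep the production strategy, 5 get the new one.
experiment = {"control": range(0, 5), "treatment": range(5, 10)}
labels = [assign(f"user-{i}", experiment) for i in range(20000)]
share_in_experiment = 1 - labels.count("production") / len(labels)
```

Keeping a separate control group that runs the production strategy means both groups are sampled and logged in exactly the same way, so the comparison is fair.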
We then collect and process the behaviour logs of users in the experiment buckets in semi-realtime and show the results on a dashboard. All engineers need to do is set the experiment’s traffic volume, time range, user filters and experiment ID. The experiment platform automatically handles data comparison, confidence intervals, summaries and optimisation suggestions.
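As an example of the kind of comparison such a platform might automate, here is a minimal sketch of a 95% confidence interval for the CTR difference between two groups, using the standard normal approximation for two proportions. The click and view counts are made up, and the talk does not specify which statistical method Toutiao uses.

```python
import math


def ctr_diff_interval(clicks_a, views_a, clicks_b, views_b, z=1.96):
    """Confidence interval for CTR(b) - CTR(a) under a normal approximation."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    standard_error = math.sqrt(p_a * (1 - p_a) / views_a + p_b * (1 - p_b) / views_b)
    diff = p_b - p_a
    return diff - z * standard_error, diff + z * standard_error


# Hypothetical counts: control CTR is 10%, treatment CTR is 12%.
low, high = ctr_diff_interval(1000, 10000, 1200, 10000)
significant = low > 0  # the whole interval is above zero
```

If the interval contains zero, the observed lift could plausibly be noise, and the experiment needs more traffic or more time.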
Still, an automated experiment platform is not enough. It can only process quantitative metrics, which are not the whole story. For important changes, we still need manual re-evaluation.
5. Content security
Since the beginning of Toutiao, content security has been our top priority. We had a dedicated team in charge of content auditing even when we only had 40 engineers across mobile app, backend and algorithm.
Nowadays, we have two main sources of content, PGC and UGC. PGC has a relatively low number of articles, so we push it to the majority of users once it passes risk auditing. UGC is first evaluated with a risk model and then released as a canary to a small volume of traffic to gather feedback. Both UGC and PGC content will be removed if negative feedback reaches a certain threshold.
Our risk model mainly consists of porn detection, cursing detection and vulgar content detection. Vulgar content detection is based on deep learning with a huge sample of both images and text, and it favours recall over precision. The same goes for cursing detection, which has 95% recall and 80% precision. If a user’s content is frequently flagged by our model, his/her content will also be penalised in recommendation.
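The recall versus precision tradeoff mentioned above comes down to where the decision threshold sits. The sketch below computes both metrics at two thresholds on a tiny made-up set of classifier scores; the scores, labels and thresholds are purely illustrative.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when everything scoring >= threshold is flagged."""
    flagged = [score >= threshold for score in scores]
    true_pos = sum(1 for f, bad in zip(flagged, labels) if f and bad)
    false_pos = sum(1 for f, bad in zip(flagged, labels) if f and not bad)
    false_neg = sum(1 for f, bad in zip(flagged, labels) if not f and bad)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return precision, recall


# Made-up classifier scores; True marks content that really is vulgar.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

strict = precision_recall(scores, labels, threshold=0.7)    # favours precision
lenient = precision_recall(scores, labels, threshold=0.35)  # favours recall
```

Lowering the threshold catches every bad item in this toy set at the cost of one false alarm, which is the direction a safety-focused detector would lean, with manual auditing absorbing the extra false positives.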
We also have a model to detect low-quality content such as fake news, defamation, mismatched titles, clickbait and so on. This kind of detection is really hard for machines, as it requires a large amount of feedback and external information. For now our model has neither high recall nor high precision, but by combining it with manual auditing we brought recall up to 95%. There is huge room for improvement in this area. Professor Li Hang in our AI lab is now working with the University of Michigan to set up a rumour detection platform.
In short, if you want to build a state-of-the-art content recommendation system, you can try out the following.
- Content features: NLP+labels
- User features: collaborative filtering + device/account info
- Candidate selection: offline inverted index
- Model: LR+DNN
- Data processing: Storm + Hadoop
As for the tricks and tweaks, every use case has its special needs. What works for successful big companies won’t necessarily work for us. But we can always learn from them.