Multi-stage ranking
Chaining classifiers has drawbacks:
- It makes the overall system hard to debug. ML systems are probabilistic, and chaining multiple classifiers makes it hard to trace which stage caused a failure.
- It is hard to maintain and iterate on. A simple data bug can lead to a cascade of failures, and releasing new versions of features can cause unintended changes for downstream consumers.
Yet multi-stage ranking is quite prevalent in ranking and recommendation use cases. In this post we take a deep dive into one such system, explore why it exists in the first place, and cover best practices for maintaining it.
Why does multi-stage ranking exist?
- Systems are often broken down into multiple stages: the stages provide a decoupled way of developing and optimizing for different objectives. In the same spirit, large-scale ranking/recommendation problems are broken down into a series of machine learning problems.
- Let’s take an example: a typical search ranking system has 3 stages.
  1. Retrieval (with a focus on recall).
  2. Relevance ranking (with a focus on optimizing for judged relevance).
  3. Click and diversity ranking (with a focus on optimizing for direct user feedback).
- At each stage, a separate ML system optimizes for a different metric, but together they work as a cascade of rankers.
- The ML systems at each stage also have different performance and latency constraints. Serving the retrieval model has to be low latency, since it is applied to a large number of documents per query. In contrast, final-stage rankers such as the click ranker are only applied to O(10²) documents for the final list, and can therefore be expensive.
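The cascade structure above can be sketched in a few lines of Python. The scorers here are toy stand-ins for the real models, and the names (`rank_cascade`, `cheap_retrieval`, etc.) are illustrative assumptions, not a real API:

```python
# Sketch of a multi-stage ranking cascade: each stage scores the surviving
# candidates and keeps the top-k, so the expensive models only ever see a
# small slice of the corpus.

def rank_cascade(query, candidates, stages):
    """stages: list of (score_fn, keep_top_k) pairs, cheapest first."""
    for score_fn, top_k in stages:
        scored = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
        candidates = scored[:top_k]
    return candidates

# Toy scorers standing in for the retrieval, relevance, and click models.
cheap_retrieval = lambda q, d: len(set(q.split()) & set(d.split()))
relevance_model = lambda q, d: float(q.split()[0] in d)
click_model     = lambda q, d: len(d)  # placeholder for p(click)

docs = ["red shoes sale", "blue shoes", "red hats", "garden tools"]
final = rank_cascade("red shoes", docs,
                     [(cheap_retrieval, 3), (relevance_model, 2), (click_model, 1)])
```

The key property is the funnel: the cheap scorer sees all four documents, while the expensive click model only scores the two survivors.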
Example systems:
Search ranking/recommendation:
Stage 1 (Retrieval model):
- Problem formulation: rank a list of candidates by a query-doc match prediction score, match_score = model(Q, D), where Q is the set of features generated from the user-provided query and D is the set of features generated from the candidate document.
- This runs on document index servers or on candidate generators. Retrieving thousands of documents for popular queries is not uncommon.
- Perf characteristics: low serving latency, low memory consumption.
- Training data here is usually human-judged relevance data.
- Choice of ranker: gradient boosted decision trees with a small number of high-quality predictive features for fast serving. The features chosen at this stage are generally cheap to compute and stable (no counter features or RPC calls to feature stores for aggregate stats).
- It is common practice to pair the GBDT ranker with a set of heuristic off-topic filters. The heuristics should be interpretable and debuggable.
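A minimal sketch of such a Stage 1 scorer, using scikit-learn's GBDT regressor. The three features and the synthetic labels are illustrative assumptions; in practice the features would be cheap query-doc signals (e.g. term overlap, doc length) and the labels human-judged relevance:

```python
# Stage 1 sketch: a small, shallow GBDT over a handful of cheap, stable
# (Q, D) features, so per-document scoring stays fast at retrieval time.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy (Q, D) feature matrix and noisy relevance labels that depend mostly
# on the first two features.
X_train = rng.random((500, 3))
y_train = X_train[:, 0] * 2 + X_train[:, 1] + rng.normal(0, 0.1, 500)

# Few shallow trees keep serving latency and memory low.
model = GradientBoostingRegressor(n_estimators=50, max_depth=3).fit(X_train, y_train)

def match_score(q_d_features):
    """Score one (Q, D) pair; higher means a better query-doc match."""
    return model.predict(np.atleast_2d(q_d_features))[0]
```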
Stage 2 (Relevance ranking model):
- Problem formulation: rank a list of candidates by their predicted relevance for a given query.
- The model at this stage is trained on logs of <query, doc> pairs with human-judged relevance as the label. This training data is limited and expensive to collect.
- The model runs on the 100s or 1000s of documents selected by Stage 1, so it still needs to be performant.
- Often the final ranking model used here is a stacked ensemble of multiple sub-models, each optimizing for a separate objective: document quality, author quality, relevance, newsy-ness, etc.
- The sub-models are trained on human judgments concerning only that attribute. For example, doc quality can be framed as a binary classification problem of whether a document is high quality, with labels provided by human raters following doc quality guidelines. Similarly for author quality.
- The final ranker here can be a simple MLP with a softmax layer predicting relevance. It consumes numeric outputs from sub-models such as doc quality, author quality, and query-doc text relevance, plus tensor outputs such as document and query embeddings. This is the cascade pattern.
- The final ranker should be trained on a held-out set of queries and documents that the sub-models have not seen. Otherwise, we run a high risk of overfitting.
- As a best practice, all numeric sub-model outputs should be calibrated. If they are not, updating a sub-model silently shifts the final ranker's input distribution, making sub-model updates very hard.
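A sketch of the Stage 2 cascade pattern under stated assumptions: a logistic-regression doc-quality sub-model (its `predict_proba` output is roughly calibrated), a stand-in text-relevance score, and a small MLP head trained on held-out data the sub-model never saw. The model choices and data are illustrative, not a prescribed architecture:

```python
# Stage 2 sketch: calibrated sub-model scores feed a small MLP final ranker.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Sub-model: doc quality framed as binary classification on rater labels.
Xq = rng.random((400, 4))
yq = (Xq[:, 0] > 0.5).astype(int)
doc_quality = LogisticRegression().fit(Xq, yq)

# Final ranker trained on a *held-out* set the sub-model has not seen.
Xh = rng.random((400, 4))
quality_score = doc_quality.predict_proba(Xh)[:, 1]  # calibrated sub-score
text_rel = rng.random(400)                           # stand-in relevance score
stacked = np.column_stack([quality_score, text_rel])
y_rel = (0.6 * quality_score + 0.4 * text_rel > 0.5).astype(int)

ranker = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
ranker.fit(stacked, y_rel)
```

Because the MLP sees only calibrated probabilities, swapping in doc quality v2 does not change the meaning of its input feature.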
Stage 3 (Click ranking model):
- Problem formulation: rank a list of candidates by their p(click) for a given query.
- This final-stage ranker produces p(click), the probability of user interaction with a document given the query. Note that click can be replaced with any other user event.
- We have multiple orders of magnitude more training data at this stage than for relevance ranking, so we can exploit the memorization capacity of deep neural networks. A common choice of model is a wide & deep network, which memorizes on the sparse categorical features and generalizes from the dense continuous features.
- Note that with so many embedding representations, the model is expensive to serve. But since it is a Stage 3 ranker dealing with only 10s–100s of candidates, the cost is manageable.
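A minimal numpy sketch of the wide & deep forward pass: the "wide" part is a linear model over sparse cross-feature IDs (memorization), the "deep" part an MLP over dense features (generalization). The weights here are random placeholders; in practice both parts are trained jointly on click logs, and the sizes are illustrative assumptions:

```python
# Wide & deep forward pass, sketched in numpy.
import numpy as np

rng = np.random.default_rng(0)

n_sparse, n_dense, hidden = 1000, 16, 32
w_wide = rng.normal(0, 0.01, n_sparse)      # one weight per sparse cross feature
W1 = rng.normal(0, 0.1, (n_dense, hidden))  # deep tower, layer 1
W2 = rng.normal(0, 0.1, hidden)             # deep tower, layer 2

def p_click(sparse_ids, dense_x):
    wide = w_wide[sparse_ids].sum()                 # sparse lookup, no matmul
    deep = np.maximum(dense_x @ W1, 0) @ W2         # ReLU MLP on dense features
    return 1 / (1 + np.exp(-(wide + deep)))         # joint sigmoid over both parts

score = p_click(sparse_ids=np.array([3, 42, 777]), dense_x=rng.random(n_dense))
```

The wide part's cost is just a few embedding-table lookups, while the deep part's matmuls are what makes this model expensive relative to the earlier stages.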
Best practices:
- [Fast quality testing] Have fast offline tests checking metrics for each model. For example, it should be easy to compare relevant metrics for author quality model v2 vs author quality model v1.
- [Data linting] Have data linting and feature distribution checks at each stage; otherwise a single bad feature can lead to a cascade of failures.
- [Feature versioning] Use versioned features for feature releases.
- [Calibration] If a model takes another model's output as a feature, that output must be calibrated.
- [A/B testing] All major model architecture changes should go through an A/B test.
- [Automated model refresh] Model monitoring and CI/CD help with periodic model refreshes on new data.
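As one concrete form of data linting, a per-feature distribution check can gate releases: flag a feature when its serving mean drifts too many training standard deviations from the training mean. The threshold and the z-score statistic are illustrative assumptions; real systems often use richer tests:

```python
# Hedged sketch of a feature distribution check for data linting.
import numpy as np

def drift_check(train_col, serve_col, z_threshold=3.0):
    """Return True if the serving mean drifts beyond z_threshold training stds."""
    mu, sigma = train_col.mean(), train_col.std() + 1e-9
    z = abs(serve_col.mean() - mu) / sigma
    return z > z_threshold  # True means: block the release and investigate

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
ok_serve = rng.normal(0.1, 1.0, 1_000)      # mild shift: passes
broken_serve = rng.normal(5.0, 1.0, 1_000)  # e.g. a unit change upstream: fails
```

Running such a check per feature at every stage boundary catches the single-feature bugs that would otherwise cascade through the rankers downstream.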
Takeaways:
- We saw how a multi-stage ranking setup breaks a complex problem like search ranking into multiple ML systems that can be iterated on in a decoupled fashion.
- We also saw how objectives, performance, and latency constraints influence the design of the stages and their models.
References:
- Wide & Deep Learning for Recommender Systems: https://arxiv.org/pdf/1606.07792.pdf