Better ML
Published in

Better ML

Multi stage ranking

Chaining classifiers has drawbacks like :

  • It makes overall system hard to debug. ML systems are probabilistic. Chaining multiple classifiers makes it hard to debug failures.
  • It is hard to maintain and iterate on. Simple data bugs can lead to a cascade of failures. Releasing new versions of features can lead to unintended changes and impact consumers.

Yet multi-stage ranking is quite prevalent in ranking/recommendation usecases. In this post we deep dive into one such system and explore why it exists in first place and best practices to maintain such a system.

Why multi stage ranking exists ?

  • Systems are often broken down in multiple stages. The stages provide a decoupled way of development and optimizing for different objectives. Similarly, large scale ranking/recommendation problems are broken down into series of machine learning problems.
  • Let’s take an examples : Typically a search ranking system will have 3 stages.
    1. Retrieval (with a focus on recall), 2. Relevance ranking (with a focus on optimizing for judged relevance), 3. Click and diversity ranking (with a focus on optimizing for direct user feedback). At each stage, there’s a separate ml system optimizing for a different metric but at the end they work as a cascade of rankers.
  • The ML systems at each stage also have different performance and latency constraints. Serving for ML model associated with retrieval has to be low latency since it has to applied to a large number of documents per query. Similarly, the final stage rankers such as click ranker is only applied to O(10²) documents for the final list and can be expensive.

Example systems :

Search ranking/recommendation :

Search ranking

Stage 1 (Retrieval model):

  • Problem formulation : Rank a list of candidates based on query — doc match prediction score. match_score = model(Q, D) where Q is features generated from user provided query and D is features generated from candidate document.
  • This runs on document index servers or on candidate generators. Thousands of documents being retrieved for popular queries is not uncommon.
  • Perf characteristics : Low serving latency, low memory consumption.
  • Training data here is usually human judged relevance data.
  • Choice of rankers : Gradient boosted decision trees with low number of high quality predictive features for fast serving. The features chosen at this stage are generally cheap to compute and stable. (No counter features / rpc calls to feature stores for aggregate stats etc).
  • It is a common practice to use a GBDT based ranker and a set of heuristic based off-topic filters. The heuristics should be interpretable and debuggable.

Stage 2 (Relevance ranking model) :

  • Problem formulation : Rank a list of candidates based on their predicted relevance for a given query.
  • The model used at this stage has been trained on logs of <query, doc> pair with label as human judged relevance. The training data is limited and expensive.
  • The model is run as a first stage ranker which deals with 100s or 1000s of documents selected by Stage 1. So, this still needs to performant.
  • Often the final ranking model used here is a stacked ensemble of multiple sub-models each optimizing for a separate objective : document quality, author quality, relevance, newsy-ness etc.
Example of a relevance ranker
  • The sub-models are trained on human judgement only affecting that attribute. For example, doc quality can be framed as a binary classification problem of classifying if documents are high quality or not. The labels are provided by human raters based on doc quality guidelines. Similarly for author quality.
  • The final ranker here can be simple MLP with a softmax layer predicting relevance. The ranker consumes numeric outputs from models such as doc quality, author quality, query doc text relevance, tensor outputs from document embedding, query embedding and so on. This is the cascade pattern.
  • The final ranker here should be trained on a held out set of query and documents which the sub-model rankers have not seen. Otherwise, we run a high chance of overfitting.
  • As a best practice all numeric outputs should be calibrated. If these are not calibrated, it will be very hard to update the sub-models.

Stage 3 (Click ranking model) :

  • Problem formulation : Rank a list of candidates based on their p(click) for a given query.
  • The final stage ranker which produces p(click) or probability of user interaction with a document based on query. Note that click can be replaced with any other user event.
  • We have multiple order of magnitude more data to train at this stage than relevance ranking. So, we can exploit memorization aspect of deep neural networks here. The choice of model here is wide and deep networks which can memorize on the sparse categorical features and generalize based on the dense continuous features.
  • Note with so many embedding representation in the model, the model is expensive to serve. Since it is a stage 3 ranker, it will deal with 10s — 100s of candidates and hence cost is manageable.
Wide & Deep Learning for Recommender Systems [1]

Best practices:

  • [Fast quality testing] Have fast offline tests checking metrics for each model. For example, it should be easy to compare relevant metrics for author quality model v2 vs author quality model v1.
  • [Data linting] Have data linting and feature distribution checks at each stage. Otherwise, a single feature can lead to a cascade of failures.
  • [Feature versioning] Use versioned features for feature releases.
  • [Calibration] If a model takes other model ouput as features, the model output must be calibrated.
  • All major model architecture changes should go through A/B test .
  • [Automated model refresh] Model monitoring and CI/CD will help in periodic model refresh on new data.


  • We saw how multi-stage ranking setup breaks up a complex problem like search ranking in multiple ml systems which can be iterated on in decoupled fashion.
  • How objectives, perf and latency influence our design of stages and models.


[1] Wide & Deep Learning for Recommender Systems :




Topics ranging from ml development, deployment and ops.

Recommended from Medium

What is a Citizen Data Scientist

How Will Developing Low-Income Neighbourhoods in Boston Affect NOx Levels?

Use Case: Pivot Tables in Excel

Teaching Sentiment Analysis in Middle School (Lesson Plan Included)

A proposal on opening Universities during COVID-19.

Why Should I Care About Data Scientists?

Should you Buy or Lease your Next Car? End-to-End Data Science Project

Assignment 15.02.2019

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Jaideep Ray

Jaideep Ray

Engineer | ML lifecycle | ML Evaluation |

More from Medium

Exploring the Interpretability — Explainability Spectrum

F-Score Metrics

The New School: How Machine Learning Levels Up from Traditional Modelling

Create Trans-E, Trans-H Graph Embeddings by using Pykeen for Custom Dataset