Our recent 2019 ICML workshop paper addresses an exploration-exploitation dilemma that every ML practitioner faces: should I exploit the one algorithm that performs best in the real world on average, or should I explore other algorithms that show promise in an offline setting?
Authors: Naman Shukla (deepair), Arinbjörn Kolbeinsson (Imperial College London), Lavanya Marla (UIUC), Kartik Yellepeddi (deepair)
Multiple machine learning and prediction models are often used for the same prediction or recommendation task. In our recent work, where we develop and deploy airline ancillary pricing models in an online setting, we found that among the multiple pricing models developed, no single model clearly dominates the others across all incoming customer requests. Thus, as algorithm designers, we face an exploration-exploitation dilemma. In this work, we introduce an adaptive meta-decision framework that uses Thompson sampling, a popular multi-armed bandit solution method, to route customer requests to the various pricing models based on their online performance. We show that this adaptive approach outperforms a uniformly random selection policy, improving expected revenue per offer by 43% and conversion score by 58% in an offline simulation.
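To make the routing idea concrete, here is a minimal sketch of Thompson sampling over a set of pricing models, treating each customer conversion as a Bernoulli reward with a Beta posterior per model. The model names, conversion rates, and reward structure are illustrative assumptions, not the paper's actual setup (which optimizes revenue per offer, not just conversions).

```python
import random


class ThompsonSamplingRouter:
    """Route each incoming request to one of several pricing models.

    Each model keeps a Beta(successes + 1, failures + 1) posterior over
    its conversion probability; requests go to the model whose posterior
    sample is highest (illustrative sketch, not the paper's exact method).
    """

    def __init__(self, model_names):
        self.stats = {name: {"success": 0, "failure": 0} for name in model_names}

    def select_model(self):
        # Draw one sample from each model's Beta posterior and
        # route the request to the model with the highest draw.
        draws = {
            name: random.betavariate(s["success"] + 1, s["failure"] + 1)
            for name, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, name, converted):
        # Binary reward: 1 if the customer accepted the offer, else 0.
        key = "success" if converted else "failure"
        self.stats[name][key] += 1


# Hypothetical usage: three pricing models with unknown conversion rates.
random.seed(0)
true_rates = {"model_A": 0.05, "model_B": 0.12, "model_C": 0.08}
router = ThompsonSamplingRouter(list(true_rates))
for _ in range(5000):
    chosen = router.select_model()
    router.update(chosen, random.random() < true_rates[chosen])

plays = {name: s["success"] + s["failure"] for name, s in router.stats.items()}
```

Over many requests, the posterior for the best-converting model concentrates, so it receives most of the traffic while the others are still occasionally explored.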