Sanjay Hariharan — Data Scientist — QuantumBlack
This is part two in our series exploring QuantumBlack’s NeurIPS 2019 expo session. Our previous article covered the session’s first half, detailing how practitioners can deploy Causal Inference to build models that place greater emphasis on cause and effect.
This article focuses on the session’s second half, describing how Reinforcement Learning can be similarly adopted to identify the best intervention among an array of potential actions.
What is Reinforcement Learning?
Reinforcement Learning is an area of Machine Learning concerned with identifying the best sequence of actions in a complex environment.
This environment is typically formulated as a Markov Decision Process (MDP), a classical formalisation of sequential decision making named for the Markov Property: a system’s future is determined only by its present state and not by the sequence of events that preceded it.
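To make the Markov Property concrete, a toy MDP can be written down explicitly. In the sketch below, the next state is sampled from a distribution that depends only on the present state and action, never on the history. All states, actions and probabilities are invented purely for illustration:

```python
import random

# A toy MDP with two states and two actions. The transition distribution
# depends only on the current (state, action) pair: the Markov Property.
TRANSITIONS = {
    ("sunny", "water"): [("sunny", 0.8), ("rainy", 0.2)],
    ("sunny", "wait"):  [("sunny", 0.5), ("rainy", 0.5)],
    ("rainy", "water"): [("rainy", 0.9), ("sunny", 0.1)],
    ("rainy", "wait"):  [("rainy", 0.6), ("sunny", 0.4)],
}

def step(state, action, rng=random):
    """Sample the next state given only the present state and action."""
    outcomes = TRANSITIONS[(state, action)]
    states, probs = zip(*outcomes)
    return rng.choices(states, weights=probs, k=1)[0]
```

Note that nothing about the past trajectory enters `step`; that locality is exactly what the MDP formalisation buys us.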
The schematic above illustrates this agent-environment interface, where actions influence not just immediate rewards, but also subsequent states and future rewards. Each element can be described as follows:
State: Represents the current condition of the environment, which can be encoded as a set of features that change over time.
Action: A set of finite choices, a set of continuous decisions, or a mixture of discrete and continuous options from which the agent must choose. An action should impact the environment in which the problem resides, and should ultimately influence a change in state and produce some form of reward.
Reward: A single number that measures the immediate feedback received after the last action. A crucial piece of the RL problem, as it defines the objective the algorithm will optimise in the long run.
Agent: The learner and decision maker that sees the current state and reward, takes an action that impacts the environment, and creates a new state and reward. It can also be defined as the process that generated the data, but in RL it is also the entity trying to learn the best set of possible actions.
Environment: The system in which the agent operates, and where the problem resides. Agent and environment interact continually, with the agent selecting actions and the environment responding to those actions and presenting new situations to the agent. The environment also gives rise to rewards.
Policy: The sequence of actions taken by the agent. RL seeks to identify the optimal policy, i.e. the sequence of actions that maximises our long-term reward. Not to be confused with the behaviour (or data) policy: the sequence of actions that generated the historical dataset used when running RL on retrospective data.
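The elements above can be tied together in a few lines of code. The sketch below runs the agent-environment loop: the agent observes a state, picks an action, and the environment returns a reward and a new state. The environment and policy here are invented stand-ins, not anything from an actual engagement:

```python
import random

class Environment:
    """A minimal stand-in environment: the state is a step counter, and
    the reward is +1 when the agent's action matches a hidden coin flip."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        reward = 1.0 if action == self.rng.randint(0, 1) else 0.0
        self.state += 1  # the environment moves to a new state
        return self.state, reward

def random_policy(state, rng):
    """A (deliberately naive) policy: choose an action uniformly at random."""
    return rng.randint(0, 1)

# The agent-environment loop: observe state, act, receive reward and next state.
env = Environment()
rng = random.Random(1)
state, total_reward = env.state, 0.0
for t in range(10):
    action = random_policy(state, rng)
    state, reward = env.step(action)
    total_reward += reward
```

RL algorithms differ in how they improve the policy from the `(state, action, reward)` stream this loop produces; the loop itself is common to all of them.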
Classification of RL Algorithms
Reinforcement Learning methods can be classified in the following ways:
Model-Based vs. Model-Free:
Model-based algorithms model the dynamics of the environment by learning the transition probability from the current state and action to the next state. As we mentioned in our previous article, causal inference can separately recover appropriate models for this set of methods or serve as simulation environments for policy evaluation.
Model-free methods, on the other hand, learn directly from experience. The underlying dynamics of the system are usually unknown; instead, rewards are collected through trial and error and the agent’s value estimates are updated in real time.
On-Policy vs. Off-Policy:
An on-policy agent learns directly from its own actions and policies, improving actions gradually. Its off-policy counterpart learns an optimal policy based on data obtained from another agent.
Online vs. Offline:
Online learning algorithms are trained as data is made available in real-time. These methods improve gradually and can shift over time as the incoming data changes. Offline methods, on the other hand, learn from data in a batch setting, and as such can be applied to static datasets. Either can be run alongside on/off-policy methods.
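The online/offline distinction can be illustrated with a much simpler estimation task than RL: updating an estimate incrementally as observations arrive versus computing it once from a static batch. The analogy below is ours, not part of the original session:

```python
def online_mean(stream):
    """Online: refine the estimate as each observation arrives,
    without storing the data."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n
    return mean

def offline_mean(batch):
    """Offline: compute once from a static dataset held in full."""
    return sum(batch) / len(batch)

data = [1.0, 2.0, 3.0, 4.0]
assert abs(online_mean(data) - offline_mean(data)) < 1e-12
```

Both reach the same answer here; the practical difference is that the online version can keep running on a live stream whose distribution may drift.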
Evaluation of Reinforcement Learning Methods:
As with traditional Machine Learning methods, evaluation of the final Reinforcement Learning policy is a crucial part of the model-building process. An important part of a data scientist’s role is to understand how confident we are in our model recommendations.
Any proposed policy must recommend a set of actions which an organisation can realistically implement. We may work in a domain where actions that deviate too far from current behaviour cannot be implemented. If our recommended actions are unrealistic, inappropriate actions could either be dropped from the action set or assigned a large negative reward. Furthermore, interpreting how our algorithm’s policy differs from current behaviour could be extremely powerful in encouraging and accelerating adoption throughout the wider organisation.
At QuantumBlack, we have found that evaluation of Reinforcement Learning methods is the most challenging aspect of this technique. In practice, it is very difficult to model the dynamics of the environment in most use cases, and as such we cannot accurately estimate how our policy would perform compared to the baseline. Though formal statistical methods have been developed to assess the quality of new policies when an environment model is not available, these methods rely on strong assumptions and are limited by their mathematical properties.
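One widely used such method is ordinary importance sampling, which reweights returns observed under the behaviour policy by how likely the target policy would have been to take the same actions. The sketch below assumes both policies expose action probabilities; it also requires the behaviour policy to place positive probability on every action the target policy might take, one of the strong assumptions mentioned above. All names are illustrative:

```python
def per_trajectory_is_estimate(trajectories, target_policy, behaviour_policy):
    """Ordinary importance sampling estimate of the target policy's
    average return, using trajectories collected under a behaviour
    policy. Policies map (state, action) -> probability; trajectories
    are lists of (state, action, reward) tuples."""
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for state, action, reward in traj:
            # Reweight by how the target policy differs from the behaviour policy.
            weight *= target_policy(state, action) / behaviour_policy(state, action)
            ret += reward
        estimates.append(weight * ret)
    return sum(estimates) / len(estimates)
```

The estimator is unbiased under its assumptions, but the products of probability ratios can have very high variance on long trajectories, which is one of the mathematical limitations alluded to above.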
Deploying Reinforcement Learning
Consider a product manufacturer which employs sales representatives to pitch their brand to stores around their region. These reps use a variety of marketing techniques, methods and channels depending on their operating region and target store. Using Reinforcement Learning, we can learn which marketing method a sales rep should employ when targeting a particular store.
Building an environment model:
Given the challenge of evaluating our model recommendations, we first attempt to build a model of the marketing environment dynamics. Specifically, our model attempts to capture the relationship between a sales rep applying a certain marketing technique and the store stocking the manufacturer’s product.
To do so, we constructed an Input-Output Hidden Markov Model to infer the underlying dynamics of the process.
We fitted this model in a Bayesian framework so that we could recover a generative model. This can serve as a simulation environment for evaluating any trained RL algorithm.
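Once a generative model is available, evaluating a candidate policy reduces to rolling it through the simulator many times and averaging the total reward. The sketch below assumes the simulator exposes `reset()` and `step(state, action)` methods; that interface, and every name in it, is an invention for illustration, not our actual API:

```python
import random

def evaluate_policy(policy, simulator, n_episodes=100, horizon=20, seed=0):
    """Estimate a policy's expected total reward by Monte Carlo rollouts
    through a generative model of the environment. `simulator` must
    provide reset() -> state and step(state, action) -> (state, reward)."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_episodes):
        state, total = simulator.reset(), 0.0
        for _ in range(horizon):
            action = policy(state, rng)
            state, reward = simulator.step(state, action)
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)
```

Running both the candidate policy and the current behaviour policy through the same simulator gives a like-for-like estimate of the uplift over the baseline.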
Finding the optimal policy:
We experimented with several different types of models, but ultimately settled on Fitted Q-Iteration, a variant of Q-Learning methods. This approach offers the best balance between accuracy, speed and interpretability.
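The core of Fitted Q-Iteration is a loop over a fixed batch of transitions: rebuild regression targets from the current Q estimate, then refit Q to those targets. In practice the "fit" step uses a supervised regressor such as a tree ensemble; the sketch below substitutes a simple per-pair average so it stays self-contained, and that substitution, along with the toy transitions, is this sketch's main shortcut:

```python
def fitted_q_iteration(transitions, actions, n_iterations=50, gamma=0.9):
    """Batch Fitted Q-Iteration over (state, action, reward, next_state)
    tuples. Each iteration builds one-step targets from the current Q,
    then "fits" Q to them (here: averaging targets per pair)."""
    Q = {}
    for _ in range(n_iterations):
        # Regression targets: immediate reward plus discounted best next value.
        targets = {}
        for s, a, r, s_next in transitions:
            best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
            targets.setdefault((s, a), []).append(r + gamma * best_next)
        Q = {key: sum(vals) / len(vals) for key, vals in targets.items()}
    return Q

# A two-state chain: taking "go" in s0 leads to s1, where "go" pays 1.
transitions = [("s0", "go", 0.0, "s1"), ("s1", "go", 1.0, "s1")]
Q = fitted_q_iteration(transitions, actions=["go"])
```

Because the whole batch is reused at every iteration, the method is sample-efficient and entirely offline, which is part of why it suited this retrospective dataset.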
In this particular example, we uncovered an interesting marketing strategy. Specifically, we had a metric called network rank, which measures the degree to which a store is connected to other stores through sales reps and customers. We found that our policy recommends increasing marketing activities overall for stores with a larger network rank, highlighting that these stores influence each other and are not completely independent as previously thought.
By deploying our recommended policy on the generative model of the environment, we are able to estimate that the new policy generates more long term sales compared to baseline — an exciting commercial prospect for the manufacturer.
RL is a highly useful technique for identifying the right strategy in a given state, helping data scientists deliver the right analytical interventions that directly or indirectly maximise a long-term reward.
If you are interested in further exploring how to drive effective interventions, we recommend you check out our latest open source offering, CausalNex. This software library provides a far more streamlined, end-to-end process and considers causality to avoid spurious conclusions.