Deep Reinforcement Learning for Optimal Stopping Problems

Nima H. Siboni
DeepMetis
Dec 30, 2020

Sequential Decision-Making Processes

Reinforcement learning (RL), and in particular deep RL, is a powerful tool for tackling sequential decision-making processes: processes in which, at each point in time, you have to choose between a set of actions (i.e. make a decision), and each action leads to a new state from which you have to make a decision again. Associated with each state and action is a reward or cost, and the total reward of the process is a (discounted) sum of all those intermediate rewards. The goal is to choose a sequence of actions such that the total reward is maximized (or the total cost is minimized). Examples of such sequential decision-making processes are abundant in logistics, economics, and beyond.

Let’s illustrate sequential decision-making with an everyday example. Imagine that you are at home and you want to get to work:

  • At each junction, you have to choose a road, and each road takes you to a new junction. Finally, after a number of choices, you would arrive at the destination.
  • At each junction, depending on the road you take, you get a reward. The reward is closely tied to the objective of the process: it could be negative (for example, the cost of the consumed fuel) or positive (for example, a positive reward for reaching your destination).

Using reinforcement learning, you can find the optimal path: e.g. the quickest path, the shortest path, the most energy-efficient path, etc.

Optimal Stopping Problems

In this article, we focus on a category of sequential decision-making processes that is less explored in previous posts: optimal stopping problems.

The characteristic difference between optimal stopping problems and other decision-making processes is the following: one of the actions we can choose ends the process, and the goal is to figure out when to take that particular action such that the reward is maximized (or the cost is minimized). Examples of optimal stopping problems can be found in economics, financial mathematics, etc. Let’s clarify this type of problem with a key example: the secretary problem.

The Secretary Problem


Let’s imagine that there is an open position in your company and a known number of candidates for the job; obviously, the goal is to choose the best candidate. This seems like an easy decision to make if we know how good the candidates are, but there is a twist in the formulation of the secretary problem:

  • For each candidate that you interview, you have to either (i) accept the candidate, which ends the process immediately, or (ii) reject the candidate and proceed to the next one, losing your chance of hiring that person forever.
  • If you do not accept anybody, the last candidate is automatically hired.

It is also assumed that:

  • You interview the candidates one by one (let’s say from a list), and
  • There is no correlation between a candidate’s position in the list and their qualifications.

This problem falls into the category of optimal stopping problems: choosing the action of “accepting” terminates the process, and our goal is to find when and under which conditions we should take this action.

What is the optimal strategy for this problem? In other words, when is the right time to stop the process so that we maximize our chance of hiring the best candidate? Since the candidates appear in random order, no strategy can guarantee hiring the best candidate. Nevertheless, there are ways to increase our chances! Indeed, we show below that by following the optimal strategy, on average one can hire a candidate in the top 20% (in a typical hiring scenario with 15–20 candidates).

The Optimal Strategy

As a key representative of optimal stopping problems, the secretary problem has been tackled with different approaches. Here, we state the optimal strategy (without the proof of optimality) and give an intuition for why it could be the optimal solution.

The optimal strategy for the secretary problem follows the 1/e law of best choice. Assuming that the total number of applicants is N:

  • You reject the first n = N/e ≈ 0.37 N applicants, irrespective of their scores, and only keep track of the maximum score observed (a minimal sketch of this rule in code follows this list).
  • For each subsequent applicant, if the applicant scores higher than the noted maximum, you accept them; otherwise you reject them.
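
To make the rule concrete, here is a minimal sketch in Python of a single hiring process that follows the cutoff strategy. The function name and setup are illustrative, and we assume uniformly distributed scores:

```python
import math
import random

def run_secretary(scores, cutoff):
    """Reject the first `cutoff` candidates, then accept the first one
    who beats the best score seen during the rejection phase.
    If nobody is accepted, the last candidate is hired automatically."""
    best_seen = max(scores[:cutoff], default=float("-inf"))
    for score in scores[cutoff:-1]:
        if score > best_seen:
            return score              # accept and stop the process
    return scores[-1]                 # reached the end: last candidate is hired

N = 20
cutoff = round(N / math.e)            # the 1/e rule: reject ~0.37 * N candidates
scores = [random.random() for _ in range(N)]
print(f"hired: {run_secretary(scores, cutoff):.3f}, best: {max(scores):.3f}")
```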

You can find the proof of the optimality (under even more general conditions) in Ref. [1], and a detailed construction of the optimal solution in Ref. [2].

Intuitively, we can understand why such a method could be optimal. Imagine that you are the recruiter. At the beginning of the process, you do not know anything about the scores of the applicants. If the first applicant is evaluated and gets a score of, say, 5.9, that does not help much with decision making, because we do not know whether this is a good score or a bad one: if scores lie in the range [0, 6], we have a good candidate; if the range is [0, 100], one can argue that a candidate with a score of 5.9 is not a good one.

To figure out what scores are available on the market (i.e. to gather some information about the range of the applicants), we should explore the market. This can be done by interviewing some candidates (say, n out of the N candidates). Of course, we have to reject all of them, otherwise we cannot get to the n-th person. This way, we gather some information about the distribution of the candidates’ scores, which lets us judge how good or bad the next candidates are.

An important question is: what is the best value of n if we have N candidates in total? There are two competing factors. On one hand, the larger n is, the more accurate our information about the score distribution becomes. On the other hand, for large values of n, few candidates remain to choose from, which reduces the chance of finding the best candidate. One can show that the optimal value of n is N/e.
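
This trade-off can also be checked numerically. Under the standard assumptions of the secretary problem, the probability that the cutoff strategy hires the overall best candidate has a closed form, and a few lines of Python suffice to evaluate it and locate its maximum (the function name here is ours):

```python
def p_best(cutoff, N):
    """Probability of hiring the overall best candidate when the first
    `cutoff` candidates are rejected and the first subsequent candidate
    beating all of them is accepted."""
    if cutoff == 0:
        return 1.0 / N                      # accept the first candidate blindly
    return (cutoff / N) * sum(1.0 / (k - 1) for k in range(cutoff + 1, N + 1))

N = 20
best_cutoff = max(range(N), key=lambda r: p_best(r, N))
print(best_cutoff, round(p_best(best_cutoff, N), 3))   # 7 (close to 20/e), ~0.38
```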

Finding the Optimal Strategy Using Monte Carlo Simulations

In this part, we test the performance of the strategy explained above for different stopping points. First, we focus on the objective of maximizing the chance of finding the best candidate; later, we change the objective to maximizing the average score of the hired candidates.

To find the optimal performance (for each objective), we follow a typical Monte Carlo evaluation scheme: for each stopping point, we simulate many (random) instances of the hiring process and measure the performance. Then, by comparing the performance for different stopping points, we find the optimal stopping point and the corresponding performance.

When the objective is to find the best candidate in each hiring, the performance is simply the number of hiring processes that resulted in hiring the best candidate, divided by the total number of hirings.
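
As a sketch, the evaluation loop could look like the following, reusing `run_secretary` from the earlier snippet (the exact setup we used may differ):

```python
import random

def mc_success_rate(N, cutoff, n_trials=20_000):
    """Estimate the probability that the cutoff strategy hires the
    single best candidate, by simulating many random hiring processes."""
    hits = 0
    for _ in range(n_trials):
        scores = [random.random() for _ in range(N)]
        hits += run_secretary(scores, cutoff) == max(scores)
    return hits / n_trials

N = 20
for cutoff in range(N):
    print(cutoff, round(mc_success_rate(N, cutoff), 3))
```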

Here is the performance as a function of the stopping point for a hiring scenario with 20 candidates.

The result of the Monte Carlo simulation; the probability of finding the best candidate as a function of the stopping point.

Let’s examine some features of the above results before going further.

  • The best performance is achieved for stopping points 6–7; this is in line with the 1/e law (20/e ≈ 7.4).
  • The worst performance is achieved for the largest stopping point, i.e. when we never decide and the last applicant is hired. Intuitively, this is not surprising: we did nothing to make a better decision, so the probability of ending up with the best candidate is 1/N, as in a completely random choice.

It is very interesting to point out that the optimal stopping point changes as we change the objective of the optimization. Here, we turn to our second objective, i.e. maximizing the expected score of the hiree. The optimal stopping point shifts as shown in the graph below, and interestingly the optimal performance corresponds to hiring candidates in the top 20%!

The result of the Monte Carlo simulation; the average score of the hiree as a function of the stopping point.

Using Deep Reinforcement Learning for the Secretary Problem

Here, we tackle the secretary problem using an RL approach. We consider the variation of the secretary problem alluded to in the previous section: the goal is to maximize the average score of the hired candidates (and not the chance of finding the best candidate). A solution to the original secretary problem can be found in Ref. [3].

Here, we skip a detailed introduction to deep RL approaches (comprehensive introductory posts can be found in Refs. [4–6]) and just introduce the basic concepts and terminology, without getting into the mathematical details.

(Hand-waving) Introduction to RL Approach or “Where Does RL Come into Play?”

The general picture of an RL approach consists of two entities: an agent and an environment. The agent, which is immersed in the environment, makes an observation and takes an action based on that observation. Depending on the chosen action, the environment rewards the agent with an “immediate” reward and also changes the agent’s state. Now that the agent is in a new state, it takes its next decision based on what it observes, and the process goes on. The goal of the agent is to maximize the (discounted) sum of all the immediate rewards; in other words, to find the best action from each state such that the total reward gained over time is maximized.
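
In code, this interaction loop is usually written as follows (shown here with the classic OpenAI Gym API and a random agent as a placeholder; the secretary environment itself is introduced later):

```python
import gym

env = gym.make("CartPole-v1")            # any Gym environment works here
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # an untrained agent: random actions
    obs, reward, done, info = env.step(action)
    total_reward += reward               # the quantity the agent tries to maximize
```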

Given the nature of the problems RL deals with, one can clearly see that RL is a suitable tool for sequential decision-making processes. It actually has proven to be an effective tool for this type of problem, especially when embedded in the deep learning framework.

Deep RL Approach or “What Is Deep in Deep RL?”

In a deep RL approach, roughly speaking, decisions are taken using deep neural networks. One can think of the network as the brain of the agent: the observation is fed to the network, and the network outputs the agent’s action. By integrating deep neural networks (DNNs) into the agent, one increases the capabilities of the RL framework: the agent benefits from the strong interpolation and generalization power of DNNs, which opens the door to new applications, especially when the states are high-dimensional (for example, when the inputs are images).

Schematic representation of a deep neural network in a deep RL framework [3]. The neural network takes an image as its input and outputs an action.

In a typical Deep RL solution, initially, the neural network is untrained and has random weights and biases (similar to the supervised learning approach).

Then the agent is immersed in a (simulated) environment where it takes sequential decisions based on its neural network. Of course, at this stage, the decisions the agent takes are not the best ones, but the agent “learns from experience”. The rigorous mathematical algorithms behind this learning process are beyond the scope of this article; in general, though, one can say that the agent trains its neural network to make better decisions by favoring the decisions that led to successful experiences and disfavoring those that led to failures (hence “reinforcement” learning).

Although we do not dive deep into RL learning algorithms, we use our setup of the secretary problem as an example of the choices needed to build an RL framework. To set up a deep RL solution, one should make a number of decisions:

  • The state — the state represents the information that determines the situation the agent is in so far. Choosing what goes into the state requires good insight into the problem; generally, irrelevant information (for example, the weather in the case of the secretary problem) should not make it into this list. For our problem, we chose the state to be a tuple containing how many candidates have been interviewed so far, the score of the best candidate so far, and the score of the current candidate.
  • The action space — this is simply the list of the possible actions that the agent can take from each state. In the secretary problem, the agent can either hire or reject.
  • The reward — the immediate reward that the agent gets. In our case, it is zero for each rejection and the candidate’s score when that candidate is accepted. Unlike the original secretary problem, this reward system is Markovian.
  • The architecture of the neural network — there is no clear guideline for this decision, just as with architecture design in supervised learning. Nevertheless, experience from supervised learning about which networks are capable of learning which kinds of features can be transferred to RL (an illustrative architecture is sketched after this list).
  • The RL algorithm — the algorithm with which the agent learns from the experiences it collects. The choice is partially dictated by the type of state and action information (for example, whether they are discrete or continuous), by how computationally expensive the simulations are (sample efficiency), by whether the agent can learn from the experiences of others (on/off-policy learning), etc. For our problem, we chose the DDQN approach.
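
For concreteness, the network for our setup could be as simple as a small fully connected model that maps the 3-dimensional state to one Q-value per action; the architecture below is illustrative (PyTorch), not necessarily the one used in the project:

```python
import torch.nn as nn

# Maps the state (candidates seen so far, best score so far, current score)
# to a Q-value for each action (reject, accept), as used by DQN-style
# methods such as DDQN.
q_network = nn.Sequential(
    nn.Linear(3, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)
```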

For this project, we have developed an environment which is compatible with OpenAI Gym environments.
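
Below is a minimal sketch of what such an environment might look like, following the state, action, and reward choices listed above (class and attribute names are ours; the actual implementation is linked at the end of the post):

```python
import gym
import numpy as np
from gym import spaces

class SecretaryEnv(gym.Env):
    """Secretary problem as a Gym environment (illustrative sketch)."""

    def __init__(self, n_candidates=20):
        self.n = n_candidates
        # state: (fraction of candidates seen, best score so far, current score)
        self.observation_space = spaces.Box(0.0, 1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)   # 0: reject, 1: accept

    def _obs(self):
        return np.array(
            [self.i / self.n, self.best_so_far, self.scores[self.i]],
            dtype=np.float32,
        )

    def reset(self):
        self.scores = np.random.uniform(0.0, 1.0, size=self.n)
        self.i = 0
        self.best_so_far = 0.0
        return self._obs()

    def step(self, action):
        current = float(self.scores[self.i])
        if action == 1 or self.i == self.n - 1:
            # accepting (or running out of candidates) ends the episode;
            # the reward is the score of the hired candidate
            return self._obs(), current, True, {}
        # rejecting yields zero reward and moves on to the next candidate
        self.best_so_far = max(self.best_so_far, current)
        self.i += 1
        return self._obs(), 0.0, False, {}
```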

Results

Here we show the improvement of the agent’s performance, as the training rounds proceed. The candidates have random scores uniformly distributed between 0 and 1.

The performance of the agent as a function of the learning iterations. For each learning iteration, 50 independent tests are performed and the average is plotted (the gray line). The purple line is the moving average of the original data.

In the beginning, the agent takes actions completely at random, and its performance (i.e. the average score of the hired applicants) is 0.5. This is expected, as it is the average score over all applicants and the agent hires randomly. As the training progresses, the agent picks better candidates, as reflected by the increase in performance. At the end of the training, the agent hires candidates with an average score larger than 0.9 (for the case of 20 candidates).

Remarkably, the final performance of the agent is better than 0.8, which was the maximum performance we obtained using the Monte Carlo simulation. This can be traced back to the fact that during training the agent learns not only when to stop, but also the range of the candidates available on the market; the latter is information that is assumed to be unavailable when deriving the optimal policy analytically.

The environment and also the RL solutions are available for further experimentation:

  • The secretary environment, here, and
  • DDQN solution for the modified secretary problem, here.

If you want to learn more about our work at DeepMetis check out our blog or contact us at hi@deepmetis.com

Nima is an active researcher and Machine Learning Team Lead at DeepMetis, an impact-driven AI and Quantum Computing research firm based in Berlin.
