How Multi-Armed-Bandits can help us in the fight against Covid-19 — Part 1

Pavlos Mitsoulis · Analytics Vidhya · Mar 25, 2020


I’m an AWS ML Hero, and I’ve been thinking about how Reinforcement Learning and AWS can help distribute Covid-19 test kits more effectively and efficiently. Why? Identifying as many infected people as possible appears to be one of the keys to minimizing or stopping the spread of Covid-19. It also appears to be one of the reasons why South Korea managed to flatten the curve.

Source: http://91-divoc.com/pages/covid-visualization/

So, let’s first define what Multi-Armed-Bandits (MABs) are. A MAB is a Reinforcement Learning problem: you are faced repeatedly with a choice among k different options, actions or arms. After each choice/decision you receive a numerical reward, drawn from a stationary probability distribution that is associated with the chosen arm and unknown to you. Your objective is to maximize the expected total reward over some time period, or a number of discrete time steps.
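In Sutton and Barto’s notation, the (unknown) value of an arm a is its expected reward:

q_*(a) = \mathbb{E}[R_t \mid A_t = a]

The learner maintains running estimates Q_t(a) that should converge towards q_*(a), while accumulating as much reward as possible along the way.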

Some Multi-Armed-Bandit Examples:

  • A doctor choosing between experimental treatments for a series of seriously ill patients (from Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto)
  • An ad system choosing which ad to show on which website
  • A recommender system choosing which articles to show on the home page

Given all this information, I’m pretty sure you have already started shaping a Covid-19 MAB problem definition in your head. Here’s the problem definition:

  • There are M areas, similar to the boroughs (Camden, Islington, Westminster, etc.) in London, for example.
  • We assume that, given the current lockdowns, people don’t move between areas/boroughs, or only a very small percentage of them have to.
  • There is a central factory serving these areas that manufactures Covid-19 test kits on an ongoing basis.
  • Whenever a new test kit is produced, a choice has to be made about which area/borough it should be delivered to.
  • So, the areas are the arms of this MAB problem, sending a test kit to an area is a pull of that arm, and the objective is to maximize the number of true positives.

I know that this problem doesn’t align with all the theoretical assumptions of MABs, but the main purpose of this blog post is to spark new ideas!

The main assumption of this problem definition is that each area/borough has a stationary probability distribution of true positives. In reality, it probably doesn’t, but we’ll address this issue in a later part!

Let’s try out a simple MAB solution, Epsilon-Greedy or E-Greedy. Here’s the pseudocode (adapted from Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto):

ε <- user_input()                  // in range (0.0, 1.0); the user defines this
Initialize, for a = 1 to k:        // k is the number of arms
    Q(a) <- 0                      // estimated reward of arm a
    N(a) <- 0                      // number of times arm a has been chosen
Loop forever:
    random_number <- random()      // uniform random number in [0.0, 1.0)
    if random_number < ε:          // exploration part
        A <- a random arm a
    else:                          // exploitation part
        A <- argmax_a Q(a)         // choose the arm with the highest Q(a) so far
    R <- bandit(A)                 // receive the reward for the chosen arm
    N(A) <- N(A) + 1
    Q(A) <- Q(A) + (1/N(A)) * [R - Q(A)]
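The last update line is just an incremental way of maintaining the sample mean of the rewards observed for each arm, so no reward history needs to be stored:

Q_{n+1} = Q_n + \frac{1}{n}\,(R_n - Q_n) = \frac{1}{n}\sum_{i=1}^{n} R_i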

First of all, let’s implement the EGreedy algorithm.

import numpy as np


class EGreedy:
    """
    Implementation of the EGreedy algorithm as described in Chapter 2 of the book:
    Reinforcement Learning: An Introduction (2nd Edition)
    Richard S. Sutton and Andrew G. Barto
    """
    def __init__(self, k, epsilon=0.1):
        """
        Constructor of EGreedy
        :param k: [int], number of arms. 0-based indexing.
        :param epsilon: [float, default=0.1], epsilon value in range (0.0, 1.0) for exploration
        """
        self.k = k
        self.epsilon = epsilon
        self.rewards = np.zeros(k)            # estimated reward Q(a) per arm
        self.steps = np.zeros(k, dtype=int)   # number of times N(a) each arm was chosen

    def reset(self):
        self.rewards = np.zeros(self.k)
        self.steps = np.zeros(self.k, dtype=int)

    def choose(self):
        random_number = np.random.uniform(0.0, 1.0)
        if random_number < self.epsilon:
            # exploration: choose a random arm
            return int(np.random.choice(self.k))
        # exploitation: choose the arm with the highest estimated reward so far
        return int(np.argmax(self.rewards))

    def feedback(self, arm_id, reward):
        # incremental sample-mean update: Q(a) <- Q(a) + (1/N(a)) * (R - Q(a))
        self.steps[arm_id] += 1
        self.rewards[arm_id] += (1 / self.steps[arm_id]) * (reward - self.rewards[arm_id])
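To get a feel for the API, here’s a quick usage example (the values are hypothetical, just to show the calls):

egreedy = EGreedy(k=5, epsilon=0.1)
arm = egreedy.choose()       # index of the arm/area to send the next kit to
egreedy.feedback(arm, 1)     # report that this kit found an infected person (reward 1)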

Then, we have to define the stationary probability distribution of infected people per area/borough. I’ll use a Beta distribution for each area; its parameters a and b reflect the ratio of infected to non-infected people. Let’s assume that there are M=5 areas:

  • Area 1 has a Beta(a=10, b=2)
  • Area 2 has a Beta(a=4, b=2)
  • Area 3 has a Beta(a=2, b=2)
  • Area 4 has a Beta(a=2, b=4)
  • Area 5 has a Beta(a=2, b=6)

As you can see from the distributions plotted below, Area 1 has the highest probability of finding infected people, followed by Area 2, and so on.

[Plots: Infection probability distributions for Areas 1–5]
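If you want to reproduce these plots yourself, here’s a minimal sketch using scipy.stats and matplotlib (my own code, not part of the original repo):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# (a, b) parameters of the Beta distribution of each of the 5 areas, as listed above
params = [(10, 2), (4, 2), (2, 2), (2, 4), (2, 6)]

x = np.linspace(0.0, 1.0, 200)
for i, (a, b) in enumerate(params, start=1):
    plt.plot(x, beta.pdf(x, a, b), label=f'Area {i}: Beta(a={a}, b={b})')
plt.xlabel('Probability that a tested person is infected')
plt.ylabel('Density')
plt.legend()
plt.show()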

The next thing we need to implement is the simulation part. Here’s the pseudocode:

egreedy = initialize_egreedy()
for i = 1 to num_test_kits:
    area <- egreedy.choose()          // choose an area to send this kit to
    beta_distro <- find_distro(area)  // find the Beta distribution of this area
    random <- random()                // uniform random number in [0.0, 1.0)
    if random < beta_distro.draw():
        reward = 1                    // the kit found an infected person
    else:
        reward = 0                    // the kit was used on a non-infected person
    egreedy.feedback(area, reward)

As you can see, every newly produced test kit is sent to an area and used on one person in that area. Then, a number is sampled from the Beta distribution of that area to determine whether the test kit was used on an infected person. We assume that the test kit is perfect (no false positives and no false negatives).
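Here’s a minimal, runnable sketch of that simulation loop in Python, assuming the EGreedy class above; the simulate helper and the AREA_DISTROS constant are my own names, not from the original repo:

import numpy as np

# (a, b) Beta parameters of the 5 areas defined earlier
AREA_DISTROS = [(10, 2), (4, 2), (2, 2), (2, 4), (2, 6)]

def simulate(egreedy, num_test_kits=5000, seed=42):
    rng = np.random.default_rng(seed)
    kits_per_area = np.zeros(len(AREA_DISTROS), dtype=int)
    total_reward = 0
    for _ in range(num_test_kits):
        area = egreedy.choose()                   # choose an area for this kit
        a, b = AREA_DISTROS[area]
        p_infected = rng.beta(a, b)               # sample this area's infection probability
        reward = 1 if rng.uniform() < p_infected else 0  # 1 = kit found an infected person
        egreedy.feedback(area, reward)
        kits_per_area[area] += 1
        total_reward += reward
    return kits_per_area, total_reward

egreedy = EGreedy(k=len(AREA_DISTROS), epsilon=0.1)
kits_per_area, total_reward = simulate(egreedy)
print('Kits per area:', kits_per_area)
print('Expected reward rate:', total_reward / 5000)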

The expected reward rate chart after running the simulation for 5,000 test kits is:

and the distribution of test kits between the areas is:

  • Area 1: 4594
  • Area 2: 92
  • Area 3: 114
  • Area 4: 102
  • Area 5: 98

This simple online MAB algorithm distributed most of the test kits to the area with the most infected people, but it didn’t distribute them “fairly” among the rest of the areas. Still, it’s a big improvement over doing nothing!

Stay tuned! In the next part, I’ll show you how to deploy this algorithm on AWS and how to improve the way test kits are distributed across the areas.

PS: GitHub repo: https://github.com/pm3310/mab-covid19
