# Reinforcement Learning to eradicate malaria with AI

This year, for the first time in its history, KDD Cup is running a Reinforcement Learning competition in partnership with IBM Africa, University of Oxford and Hexagon-ML for a humanitarian cause. For this challenge, we are crowdsourcing the application of AI and Reinforcement Learning to solve a complex problem that has the potential to save the lives of many millions of people.

The problem is to *support the efforts to fight malaria*, a life-threatening mosquito-borne disease that killed 435K people worldwide in 2017 alone (1). In fact, about 3.2B people, nearly half the world’s population, are at risk of contracting malaria, according to the World Health Organization (WHO).

Today, most malaria cases come from Sub-Saharan Africa. In past years, prior to wide-scale malaria prevention efforts, malaria was the number one killer of children in this region, killing 1.2M people a year, all from a mosquito bite. Today, *insecticide-treated nets (ITNs)* have become the primary method of malaria prevention, because the Anopheles mosquito only bites after nine o’clock at night, when most kids are in bed. Once a mosquito lands on the net, it dies because of the insecticide, which disrupts the reproductive cycle.

Many countries in the Sub-Saharan Africa region rely heavily on external funding for malaria control and prevention. According to WHO, $3.1B was made available for fighting malaria in 2017. However, in recent years, investments in this area have started to level off. At the same time, after a period in which the number of malaria cases steadily decreased, the trend has reversed: according to WHO, there were 219M cases in 2017, compared to 217M cases in 2016. Therefore, effective use of limited resources is critical to getting malaria under control.

In addition to ITNs, other malaria prevention policies include *indoor residual spraying (IRS)*, vector larvicide in bodies of water, and malaria vaccination. However, the space of possible policies for malaria prevention is daunting, and human decision makers cannot explore it efficiently without adequate decision support tools. The IBM Africa research team has taken strides toward malaria eradication by developing a world-class simulation environment for distributing bed nets and repellents. The goal is to develop a custom agent that helps identify the policies with the best rewards in that simulation environment.

**Reinforcement Learning**

Reinforcement learning is a subfield of machine learning that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.

[Source: Dr. Gosavi] RL is generally formulated and solved as a Markov decision problem (MDP). An MDP has the following elements: (1) the state of the system, (2) actions, (3) transition probabilities, (4) transition rewards, (5) a policy, and (6) a performance metric.

**State:** The “state” of a system is a parameter or a set of parameters that can be used to describe a system. For example, the geographical coordinates of a robot can be used to describe its “state.” A system whose state changes with time is called a dynamic system. Thus, a moving robot is a dynamic system.

Another example of a dynamic system is the queue that forms in a supermarket in front of the counter. Imagine that the state of the queuing system is defined by the number of people in the queue. Then it should be clear that the state fluctuates with time, and hence the queue is a dynamic system. Note that the transition from one state to another in an MDP is usually random.

Consider a queue in which there is one server and one waiting line. In this queue, the state x, defined by the number of people in the queue, transitions to x + 1 with some probability and to x − 1 with the remaining probability. The former type of transition occurs when a new customer arrives, while the latter event occurs when one customer departs from the system because of service completion.
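The queue above can be sketched as a simple random walk on the queue length. This is only an illustration: the arrival probability `p_arrival` and the rule that an empty queue can only grow are assumptions made here, not part of the original description.

```python
import random

def simulate_queue(steps, p_arrival=0.5, start=0, seed=0):
    """Simulate the queue length x over a number of random transitions.

    p_arrival is an assumed, illustrative probability that the next event
    is an arrival (x -> x + 1); otherwise a customer departs (x -> x - 1).
    """
    rng = random.Random(seed)
    x = start
    history = [x]
    for _ in range(steps):
        # Assumption: nobody can depart from an empty queue.
        if x == 0 or rng.random() < p_arrival:
            x += 1  # a new customer arrives
        else:
            x -= 1  # a customer leaves after service completion
        history.append(x)
    return history

history = simulate_queue(steps=20)
print(history)
```

Each entry of `history` is one state of the system, and consecutive entries always differ by exactly one customer.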

**Actions**: Now, usually, the motion of the robot can be controlled, and in fact we are interested in controlling it in an optimal manner. Assume that the robot can move in discrete steps, and that after every step the robot takes, it can go North, go South, go East, or go West. These four options are called actions or controls allowed for the robot.

For the queuing system discussed above, an action could be as follows: when the number of customers in a line exceeds some preset number (say, 10), the remaining customers are diverted to a new counter that is opened. Hence, two actions for this system can be described as: (1) open a new counter, and (2) do not open a new counter.

**Transition Probability**: Assume that action a is selected in state i. Let the next state be j. Let p(i, a, j) denote the probability of going from state i to state j under the influence of action a in one step. This quantity is also called the *transition probability*. If an MDP has 3 states and 2 actions, there are 9 transition probabilities per action.

**Immediate Rewards**: Usually, the system receives an immediate reward (which could be positive or negative) when it transitions from one state to another. This is denoted by r(i, a, j).
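As a minimal sketch, the quantities p(i, a, j) and r(i, a, j) can be stored as nested lists indexed by action, current state, and next state. The MDP below has 3 states and 2 actions, and every number in it is made up purely for illustration.

```python
# Illustrative MDP with 3 states and 2 actions; all numbers are made up.
# P[a][i][j] holds p(i, a, j) and R[a][i][j] holds r(i, a, j).
P = [
    [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]],
    [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.25, 0.25, 0.5]],
]
R = [
    [[1.0, 0.0, -1.0],
     [0.0, 2.0, 0.0],
     [-1.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0],
     [0.0, 0.0, 3.0],
     [2.0, -2.0, 0.0]],
]

def rows_sum_to_one(P, tol=1e-9):
    """Check that, for each action and state, the next-state probabilities sum to 1."""
    return all(abs(sum(row) - 1.0) < tol for matrix in P for row in matrix)

def expected_reward(i, a):
    """Expected immediate reward of taking action a in state i."""
    return sum(P[a][i][j] * R[a][i][j] for j in range(len(P[a][i])))

print(rows_sum_to_one(P))
print(expected_reward(1, 1))
```

Each action owns a 3 x 3 table, which is where the 9 transition probabilities per action come from.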

**Policy:** The policy defines the action to be chosen in every state visited by the system. Note that in some states, no actions are to be chosen. States in which decisions are to be made, i.e., actions are to be chosen, are called decision-making states.

**Performance Metric:** Associated with any given policy is a so-called performance metric, by which the performance of the policy is judged. Our goal is to select the policy that has the best performance metric.
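To make the link between a policy and its performance metric concrete, here is a rough sketch that estimates the average reward per transition of two fixed policies by simulation. Average reward is one common choice of metric, and the two-state MDP and all of its numbers are hypothetical.

```python
import random

# Hypothetical two-state, two-action MDP (numbers are illustrative only).
# P[(i, a)] lists the probabilities of moving to state 0 and to state 1;
# R[(i, a)] lists the matching immediate rewards r(i, a, j).
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
R = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 2.0],
     (1, 0): [0.0, 1.0], (1, 1): [-1.0, 3.0]}

def average_reward(policy, steps=10_000, seed=0):
    """Estimate the average reward per transition earned by `policy`."""
    rng = random.Random(seed)
    state, total = 0, 0.0
    for _ in range(steps):
        action = policy[state]  # the policy maps each state to an action
        next_state = 0 if rng.random() < P[(state, action)][0] else 1
        total += R[(state, action)][next_state]
        state = next_state
    return total / steps

# Two candidate policies: always pick action 0, or always pick action 1.
policies = {"always 0": {0: 0, 1: 0}, "always 1": {0: 1, 1: 1}}
scores = {name: average_reward(pi) for name, pi in policies.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Selecting the policy with the highest estimated score is the simplest form of policy search; with only two states and two actions there are just four deterministic policies to compare.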

**Time of transition**: We will assume for the MDP that the time of transition is unity (1), which means it is the same for every transition. Clearly, 1 here does not have to mean 1 hour, minute, or second; it is some fixed quantity chosen by the analyst.

Further, reinforcement learning can be used to solve the following types of problems:

A very good starting introduction to reinforcement learning can be found here.

**KDD Cup 2019 Competition**

For the KDD Cup 2019 competition, the goal is to do a **policy search**. The leaderboard for the **first phase** is based on the **median rewards** obtained by the policy.

The KDD Cup competition hosts (IBM Research & Hexagon-ML team) define the State, Action, Reward and Policy for the malaria modeling environment as follows:

## State

Observations for the challenge models occur over a 5-year timeframe, and each year of this timeframe may be considered a State of the system, with the possibility to take one Action for each State. Note that this temporal State transition is fixed and therefore does not depend on the Action taken.

𝑆∈{1,2,3,4,5}

## Action

Consider Actions as a combination of only two possible interventions, i.e. *insecticide spraying (IRS) and distributing bed nets (ITN)*, based on our model description.

𝑎ᴵᵀᴺ∈[0,1] and 𝑎ᴵᴿˢ∈[0,1]

Action values between 0 and 1 describe the coverage range of the intervention for a simulated human population.

𝐴ₛ=[𝑎ᴵᵀᴺ,𝑎ᴵᴿˢ]
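In code, an Action can be held as a two-element list of coverage values. The helper below is a hypothetical sketch, not part of the competition API; it only enforces the [0, 1] range described above.

```python
from typing import List

def make_action(itn: float, irs: float) -> List[float]:
    """Build an action [a_ITN, a_IRS], checking both coverages lie in [0, 1]."""
    for name, value in (("ITN", itn), ("IRS", irs)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} coverage must be in [0, 1], got {value}")
    return [itn, irs]

# Example: 60% bed-net coverage and 25% spraying coverage.
action = make_action(itn=0.6, irs=0.25)
print(action)
```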

## Reward

A reward function determines a stochastic Reward for a Policy over the entire episode. This function measures the health outcomes per unit cost of the interventions implemented in the Policy. So that a higher Reward corresponds to a better Policy, and the Reward can be maximized, we negate this value.

𝑅𝜋∈(−∞,∞)

## Policy

Therefore a Policy (π) for this challenge consists of a temporal sequence of Actions, as illustrated in the figure below.

**Time of transition**

The time of transition for this problem is fixed to 1 year as shown in the diagram above.
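Putting these definitions together, a Policy can be represented as a list of five Actions, one per year, and scored by the median of its stochastic Rewards over repeated episodes. The sketch below uses plain random search and a made-up `mock_reward` function as a stand-in for the simulator. It is not the competition environment, and random search is used here only because it is short; the competition tutorials cover Genetic Algorithms and Policy Gradients.

```python
import random
import statistics

N_YEARS = 5  # one Action per State (year), as in the challenge description

def random_policy(rng):
    """Sample a policy: one [a_ITN, a_IRS] action for each of the 5 years."""
    return [[rng.random(), rng.random()] for _ in range(N_YEARS)]

def mock_reward(policy, rng):
    """Made-up stand-in for the simulator: a noisy reward for a whole episode.

    NOT the competition environment; it only mimics a stochastic episodic reward.
    """
    score = sum(itn * (1.0 - itn) + irs * (1.0 - irs) for itn, irs in policy)
    return score + rng.gauss(0.0, 0.1)

def median_reward(policy, rng, runs=11):
    """Score a policy by the median reward over repeated episodes."""
    return statistics.median(mock_reward(policy, rng) for _ in range(runs))

def random_search(n_candidates=200, seed=0):
    """Keep the sampled policy with the highest median reward."""
    rng = random.Random(seed)
    best_policy, best_score = None, float("-inf")
    for _ in range(n_candidates):
        policy = random_policy(rng)
        score = median_reward(policy, rng)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score

best_policy, best_score = random_search()
print(best_policy, best_score)
```

Using the median rather than a single run makes the comparison between policies less sensitive to the noise in any one episode.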

Since data-science contests based on reinforcement learning are relatively new, a few very good tutorials with code using Genetic Algorithms, a Vanilla Policy Gradient Agent, and other approaches are provided for the competition.

In conclusion, this year for the KDD Cup we want to tackle the complex problem of finding the optimal malaria prevention policy by applying AI and Machine Learning techniques, and more specifically by implementing Reinforcement Learning approaches to model this problem and help optimize the process towards minimum cost and human labor.

References

- https://www.kdd.org/kdd2019/kdd-cup
- https://www.who.int/malaria/en/
- https://www.who.int/news-room/detail/19-11-2018-who-and-partners-launch-new-country-led-response-to-put-stalled-malaria-control-efforts-back-on-track
- https://www.ibm.com/blogs/think/2018/02/ai-malaria/
- https://arxiv.org/abs/1712.00428
- https://web.mst.edu/~gosavia/tutorial.pdf
- https://compete.hexagon-ml.com/tutorial/
- https://compete.hexagon-ml.com/tutorial/kdd-cuphumanities-track-tutorial-genetic-algorithm/
- https://compete.hexagon-ml.com/tutorial/kdd-cuphumanities-track-tutorial-policy-gradients/

*Author’s Note:*

*Thanks to my KDD Cup 2019 Co-Chairs (Iryna Skrypnyk, Wenjun Zhou) and Vitaly Doban for contributing to this post. The KDD Cup 2019 consists of three separate tracks: a regular track, an AutoML track, and this Humanity RL track.*

***Taposh Dutta Roy** leads the Innovation Team of Decision Support at Kaiser Permanente. These are his thoughts based on industry analysis. These thoughts and recommendations are not those of Kaiser Permanente, and Kaiser Permanente is not responsible for the content. If you have questions, Mr. Dutta Roy can be reached via **LinkedIn**.*

***Wenjun Zhou** is an Associate Professor and the Roy & Audrey Fancher Faculty Research Fellow at the University of Tennessee Knoxville. Her general research interests are data mining, business analytics, and statistical computing.*

***Iryna Skrypnyk** leads AI-backed Clinical Data Science projects across different therapeutic areas, including Oncology, Immunology & Inflammation, Rare Diseases, and Vaccines, in tight collaboration with Research and Business Units. She is responsible for implementing innovative AI technologies and external scientific research collaborations in the IDEA AI Lab, Global Real World Evidence at Pfizer.*