Reinforcement Learning

Mehul Ved
7 min read · Jul 13, 2018


Reinforcement learning is a type of machine learning concerned with learning what to do, i.e., how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

Actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.

Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.


Reinforcement Learning Introduction: Details

A full mathematical specification of the reinforcement learning problem is in terms of optimal control of Markov decision processes.

The basic idea is to capture the most important aspects of the real problem facing a learning agent that interacts with its environment to achieve a goal.

Such an agent must have three aspects:

  1. Sensation: sense the state of the environment to some extent
  2. Action: take actions that affect the state
  3. Goal: must have goal(s) relating to the state of the environment

Exploration vs. Exploitation

One of the challenges that arise in reinforcement learning is the trade-off between exploration and exploitation.

To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward, i.e., it has to exploit what it already knows. But to discover such actions in the first place, it has to try actions that it has not selected before, i.e., it has to explore in order to make better action selections in the future.

The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear best.

Examples:

  • An expert chess player makes a move. The choice is informed both by planning, anticipating possible replies and counter-replies, and by immediate judgments of the desirability of positions and moves.
  • An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs.
  • A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the re-charger in the past.

Reinforcement Learning: Building an Elementary Model

The four principal elements of a reinforcement learning system are:

  1. a policy, which defines the learning agent’s way of behaving at a given time. A policy is a mapping from perceived states of the environment to actions to be taken when in those states.
  2. a reward function, which defines the goal in a reinforcement learning problem. It maps each perceived state (or state–action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.
  3. a value function, which specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
  4. a model of the environment, which mimics the behavior of the environment. A toy sketch of these four elements is given below.
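
To make these four elements concrete, here is a toy illustration: a two-state, two-action problem in which each element is represented by a plain Python dictionary. All names and numbers are made up purely for illustration.

```python
import random

# Toy problem: two states and two actions, purely for illustration.
states = ["A", "B"]
actions = ["left", "right"]

# 1. Policy: maps each state to a probability distribution over actions.
policy = {"A": {"left": 0.5, "right": 0.5},
          "B": {"left": 0.9, "right": 0.1}}

# 2. Reward function: maps a (state, action) pair to an immediate reward.
reward = {("A", "left"): 0.0, ("A", "right"): 1.0,
          ("B", "left"): 2.0, ("B", "right"): -1.0}

# 3. Value function: long-run desirability of each state (initialised to 0 here).
value = {s: 0.0 for s in states}

# 4. Model: predicts the next state for each (state, action) pair.
model = {("A", "left"): "A", ("A", "right"): "B",
         ("B", "left"): "A", ("B", "right"): "B"}

# One simulated step using all four elements together.
state = "A"
action = random.choices(actions, weights=[policy[state][a] for a in actions])[0]
print(action, reward[(state, action)], model[(state, action)])
```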

Bandit Problems

Consider the following learning problem:

  • You are faced repeatedly with a choice among n different options.
  • After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on your choice.
  • Your objective is to maximize the expected total reward.
Multi-armed Bandit Problem

This is the original form of the n-armed bandit problem, so named by analogy to a slot machine.
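
As a rough sketch of this setup, the snippet below simulates a stationary n-armed bandit whose arms pay Gaussian rewards around fixed hidden means. The class name NArmedBandit and the choice of reward distribution are illustrative assumptions, not a standard API.

```python
import random

class NArmedBandit:
    """A stationary n-armed bandit: each arm pays a noisy reward
    centred on a fixed, hidden mean value."""

    def __init__(self, n_arms=10, seed=0):
        self.rng = random.Random(seed)
        # Hidden true value of each arm, drawn once and then kept fixed.
        self.true_values = [self.rng.gauss(0.0, 1.0) for _ in range(n_arms)]

    def pull(self, arm):
        # Reward = the arm's true value plus unit-variance Gaussian noise.
        return self.rng.gauss(self.true_values[arm], 1.0)

bandit = NArmedBandit(n_arms=5)
print([round(bandit.pull(a), 2) for a in range(5)])
```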

Bandit Problems: Solution

  • Each action has an expected or mean reward given that it is selected; call this the value of that action. If you knew the value of each action, then it would be trivial to solve the n-armed bandit problem: you would always select the action with the highest value.
  • If you maintain estimates of the action values, then at any time there is at least one action whose estimated value is greatest. We call this the greedy action.
  • If you select a greedy action, you are exploiting your current knowledge of the values of the actions. If instead you select one of the non-greedy actions, then you are exploring because this enables you to improve your estimate of the non-greedy action’s value.

Bandit: Exploration vs. Exploitation

  • Exploitation is the right thing to do to maximize the expected reward on one play.
  • Exploration may produce the greater total reward in the long run.
  • For example, suppose the greedy action’s value is known with certainty, while other actions are estimated to be nearly as good but with substantial uncertainty, so that at least one of them is probably better than the greedy action.
  • If you have many plays yet to make, then it may be better to explore the non-greedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run.

ϵ-Greedy Strategy:

  • Estimate the value of each action as the long-term average Q(a) = (r_1 + r_2 + … + r_k)/k, where k is the number of times action a has been selected.
  • The greedy strategy would be always to choose the action a with the largest Q(a). This is likely to be sub-optimal, since it does no exploration.
  • Instead, we choose the greedy action most of the time but, once in a while, with a small probability ϵ, we choose one of the other actions at random. A minimal sketch of this strategy is given below.
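
Here is a minimal sketch of ϵ-greedy action selection on a simple stationary bandit whose arms pay Gaussian rewards around given true values. The function name, the reward model and the parameter choices are all illustrative assumptions.

```python
import random

def epsilon_greedy(true_values, epsilon=0.1, n_plays=1000, seed=0):
    """Sample-average ϵ-greedy on a stationary bandit whose arms pay
    Gaussian rewards centred on the given true values."""
    rng = random.Random(seed)
    n = len(true_values)
    counts = [0] * n    # k: how many times each action has been chosen
    q = [0.0] * n       # Q(a): running average reward for each action
    total = 0.0

    for _ in range(n_plays):
        if rng.random() < epsilon:
            a = rng.randrange(n)                   # explore: random action
        else:
            a = max(range(n), key=lambda i: q[i])  # exploit: greedy action
        r = rng.gauss(true_values[a], 1.0)         # noisy reward for this play
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]             # incremental average update
        total += r
    return q, total

q, total = epsilon_greedy([0.2, 0.8, 0.5])
print([round(v, 2) for v in q], round(total, 1))
```

With ϵ = 0.1 the agent exploits its current estimates 90% of the time and picks a random action the remaining 10%, so its estimates for all arms keep improving.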

SoftMax Strategy:

  • An improvement of the ϵ-Greedy strategy is the SoftMax strategy.
  • The SoftMax strategy varies the action probabilities as a graded function of the estimated values. The greedy action is still given the highest selection probability, but the other actions are ranked and weighted according to their value estimates. These are called SoftMax action selection rules.
  • The most common SoftMax method uses a Gibbs, or Boltzmann, distribution to determine how likely each action is to be selected, as sketched below.
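
Here is a small sketch of Boltzmann (SoftMax) action selection, assuming we already have action-value estimates Q(a) and a temperature parameter τ that controls how close to greedy the selection is. The function name and parameters are illustrative.

```python
import math
import random

def softmax_action(q_values, temperature=0.5, rng=random):
    """Pick an action with probability proportional to exp(Q(a)/τ),
    i.e. according to a Gibbs/Boltzmann distribution."""
    # Subtract the maximum before exponentiating, for numerical stability.
    m = max(q_values)
    prefs = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return rng.choices(range(len(q_values)), weights=probs)[0]

# Action 1 has the highest estimate, so it is selected most often,
# but actions 0 and 2 are still picked in proportion to their estimates.
counts = [0, 0, 0]
for _ in range(1000):
    counts[softmax_action([0.2, 0.8, 0.5])] += 1
print(counts)
```

A high temperature makes all actions nearly equally likely; as the temperature approaches zero, the rule becomes almost purely greedy.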

General Formulation: Reinforcement Learning

  • The learner and decision maker is called the agent.
  • The thing it interacts with, comprising everything outside the agent, is called the environment.
  • The agent and the environment interact continually, the agent selecting actions and the environment responding to those actions and presenting new situations to the agent.
  • The environment also gives rise to rewards: special numerical values that the agent tries to maximize over time.

General Formulation: State-Action

  • The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, …
  • At each time step t, the agent receives some representation of the environment’s state, s_t ∈ S, where S is the set of possible states.
  • The agent selects an action, a_t ∈ A, on the basis of the perceived state s_t, where A is the set of actions available in state s_t.
  • One time step later, in part as a consequence of its action, the agent receives a numerical reward, r_(t+1) ∈ R, and finds itself in a new state s_(t+1). A toy version of this interaction loop is sketched below.
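
As a toy illustration of this interface (the environment, its dynamics and its rewards are entirely made up for the example), the loop below steps an agent through states, actions and rewards in exactly the s_t, a_t, r_(t+1) pattern described above.

```python
import random

class ToyEnv:
    """A made-up environment: the state is just a step counter,
    and the reward is a noisy function of the chosen action."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1                      # next state s_(t+1)
        reward = random.gauss(action, 1.0)   # reward r_(t+1) depends on the action
        done = self.state >= 5               # the episode ends after 5 steps
        return self.state, reward, done

env = ToyEnv()
s = env.reset()
done = False
while not done:
    a = random.choice([0, 1])                # a_t chosen by a (random) policy
    s, r, done = env.step(a)                 # environment returns s_(t+1), r_(t+1)
    print(f"s={s}, a={a}, r={r:.2f}")
```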

General Formulation: Policy

  • At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent’s policy and is denoted π_t, where π_t(s,a) is the probability that a_t = a if s_t = s. A small example of such a mapping is sketched after this list.
  • Reinforcement learning methods specify how the agent changes its policy as a result of its experience.
  • The agent’s goal is to maximize the total amount of reward it receives over the long run.
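
Here is a small illustration of a stochastic policy written out as a table of action probabilities, reusing the recharging-robot example from earlier. The state and action names are, of course, made up.

```python
import random

# A tabular stochastic policy: for each state s, pi[s][a] gives the
# probability π(s, a) of taking action a when in state s.
pi = {
    "low_battery":  {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def sample_action(policy, state, rng=random):
    actions, probs = zip(*policy[state].items())
    return rng.choices(actions, weights=probs)[0]

print(sample_action(pi, "low_battery"))   # usually "recharge"
print(sample_action(pi, "high_battery"))  # usually "search"
```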

General Formulation: Framework

  • This framework is abstract and flexible and can be applied to many different problems in many different ways.
  • For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision-making and acting.
  • The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to expand production or hire/fire employees.
  • Similarly, the states can take a wide variety of forms. They can be completely determined by low-level sensations, such as direct sensor readings, or they can be more high-level and abstract, such as symbolic descriptions of objects in a room.

General Formulation: Returns

  • Returns can be computed over a finite horizon T, as r_1 + r_2 + … + r_T. Tasks of this kind are called episodic tasks.
  • If the agent–environment interaction goes on perpetually, it is called a continuing task. The total reward in this case may be infinite, so we usually consider the discounted total reward, using a discount factor θ with 0 ≤ θ < 1: r_1 + θr_2 + θ²r_3 + … + θ^(n-1)r_n + … A short sketch of this computation follows the list.
  • In general, both episodic and continuing tasks can be written in the same format by adopting the convention that the discount factor for episodic task rewards is 1.
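
Here is a short sketch of computing a return under this convention, with θ as the discount factor. The function is purely illustrative, not taken from any library.

```python
def discounted_return(rewards, discount=0.9):
    """Return r_1 + θ·r_2 + θ²·r_3 + … for a finite list of rewards.
    With discount=1.0 this reduces to the plain sum used for episodic tasks."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (discount ** k) * r
    return total

print(discounted_return([1, 1, 1, 1], discount=1.0))  # episodic: 4.0
print(discounted_return([1, 1, 1, 1], discount=0.9))  # discounted: ≈ 3.44
```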

Summary

  • Reinforcement learning is about learning from interaction how to behave in order to achieve a goal.
  • The reinforcement learning agent and its environment interact over a sequence of discrete time steps.
  • The agent-environment interface defines a task: the actions are the choices made by the agent; the states are the basis for making the choices; and the rewards are the basis for evaluating the choices.
  • Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known.
  • A policy is a stochastic rule which selects actions as a function of states.
  • The agent’s goal is to maximize the total reward received over time.

Thanks for reading through this blog post. Any suggestions for further improving it are most welcome.
