Introduction to Reinforcement Learning

What is Reinforcement Learning?

AI Club @IIITB · Sep 1, 2018

Most people have heard of the more common branches of Machine Learning, namely Supervised Learning and Unsupervised Learning. Well, there is one more — Reinforcement Learning.

The idea behind Reinforcement Learning is that the agent learns by interacting with the environment and receiving rewards (or punishments) as feedback. The concept is simple. For instance, consider a child who is given a glass of very hot milk. He attempts to gulp it down but immediately realizes that it is far too hot to gulp, so he starts sipping it instead. In this case, he learned by actually trying out gulping (an action) and received negative feedback for it (he burned his tongue). This is how humans learn, and reinforcement learning builds models that learn by the same method.

Recent successes — RL agents beating human video game players, DeepMind’s AlphaGo defeating a Go grandmaster, and bipedal agents learning to walk in simulation — have all contributed to the general sense of enthusiasm about the field.

Terminology in Reinforcement Learning

Let us look at some terms used in RL:

Agent: A system embedded in an environment. It performs actions to change the state of the environment. Examples include mobile robots, software agents, and industrial controllers.

Environment: The external system that an agent is “embedded” in.

MDP (Markov Decision Process): A probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. Essentially, the outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).
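
Formally, this Markov property can be written as P(Sₜ₊₁ | Sₜ, Aₜ) = P(Sₜ₊₁ | S₀, A₀, S₁, A₁, …, Sₜ, Aₜ): once the current state and action are known, the earlier history adds no further information about where the system goes next.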

State: A state can be viewed as a summary of the history of the system that determines its future evolution.

Reward: A scalar value which represents the degree to which a state or action is desirable. Reward functions can be used to specify a wide range of planning goals (e.g. by penalizing every non-goal state).

Discount factor: A scalar value between 0 and 1 which determines the present value of future rewards. If the discount factor is 0, the agent is concerned only with maximizing immediate rewards. As the discount factor approaches 1, the agent takes more future rewards into account. Algorithms which discount future rewards include Q-learning and TD(λ).
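
Concretely, with discount factor γ the return from time t is Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …. Here is a minimal Python sketch; the reward sequence below is invented purely for illustration:

def discounted_return(rewards, gamma):
    # G = r_1 + gamma * r_2 + gamma**2 * r_3 + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A reward of 10 arriving three steps in the future is worth
# 0.9**3 * 10 = 7.29 today, so the total return here is 8.29.
print(discounted_return([1, 0, 0, 10], gamma=0.9))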

Policy: The decision-making function (control strategy) of the agent, which represents a mapping from situations (states) to actions.

Value Function: A mapping from states to real numbers, where the value of a state represents the long-term reward achieved starting from that state and executing a particular policy. The key distinguishing feature of RL methods is that they learn policies indirectly, by instead learning value functions. RL methods can be contrasted with direct optimization methods, such as genetic algorithms (GA), which attempt to search the policy space directly.
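
In symbols, the value of a state s under a policy π is Vπ(s) = Eπ[Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + … | Sₜ = s], i.e. the expected discounted return obtained by starting in s and following π thereafter.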

TD (temporal difference) algorithms: A class of learning methods, based on the idea of comparing temporally successive predictions. Possibly the single most fundamental idea in all of reinforcement learning.
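
As a concrete instance, the TD(0) update compares the current estimate V(s) with the temporally successive prediction r + γV(s′) and nudges V(s) toward it. A minimal sketch in Python, assuming a tabular value function stored in a dict and an illustrative step size alpha:

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # TD error: bootstrapped target (r + gamma * V[s_next]) minus current estimate
    td_error = r + gamma * V[s_next] - V[s]
    # Move the estimate a small step toward the target
    V[s] += alpha * td_error
    return V

Repeated over many observed transitions, these small corrections make successive predictions consistent with one another, which is exactly the comparison the definition above refers to.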

Episodic and Continuous tasks: In the case of episodic tasks, we have a starting point and an ending point (a terminal state), which together form an episode (a sequence of states, actions, rewards, and next states). In the case of continuous tasks, by contrast, there is no terminal state: the agent keeps interacting with the environment and learns to choose the best actions as it goes.

Markov Decision Processes (MDP):

An MDP consists of the following components (a small worked example follows the list):

• A finite set of states. (S)

• A set of actions available in each state. (A)

• Transition probabilities between states. (P)

• Rewards associated with each transition. (R)

• A discount factor between 0 and 1. (γ)
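
Putting these components together, a toy MDP can be written out explicitly. The two-state example below is a pure illustration (all states, actions, probabilities, and rewards are made up), with a few sweeps of value iteration to compute state values:

# Hypothetical MDP: S = {"cool", "hot"}, A = {"wait", "work"}, gamma = 0.9.
# transitions[state][action] is a list of (probability, next_state, reward),
# encoding both P and R from the list above.
transitions = {
    "cool": {
        "wait": [(1.0, "cool", 0.0)],
        "work": [(0.8, "cool", 2.0), (0.2, "hot", 2.0)],
    },
    "hot": {
        "wait": [(0.5, "cool", 0.0), (0.5, "hot", 0.0)],
        "work": [(1.0, "hot", -1.0)],
    },
}
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup.
V = {s: 0.0 for s in transitions}
for _ in range(100):
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in acts.values()
        )
        for s, acts in transitions.items()
    }
print(V)  # approximate optimal value of each state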

Reinforcement Learning compared to other forms of Machine Learning

Supervised Learning

In Supervised Learning, the computer system (model) is fed many labelled examples of a given item and is expected to learn what those examples have in common, so that it can recognize other examples of that item which it has not seen yet. Supervised Learning problems can be further subdivided into Classification and Regression problems.

Unsupervised Learning

Unsupervised Learning is where you only have input data and no corresponding output variables. The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about it. Unsupervised Learning algorithms can be further subdivided into Clustering and Association problems.

Reinforcement Learning

As explained above, Reinforcement Learning is a type of machine learning in which the agent is told only whether it has made a correct or a wrong decision. With enough iterations, a reinforcement learning system will eventually be able to predict the correct outcomes and therefore make the ‘right’ decision.

References:

• Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction.

• David Silver, Reinforcement Learning lectures ([UCL] COMPM050/COMPGI13).

Author of the article:

Nikunj Gupta: https://www.linkedin.com/in/nikunj-gupta97/

AI Club @IIITB
We are a bunch of Deep Learning enthusiasts, with the sole goal in mind: To Learn.