Reinforcement Learning Mimics Human Learning
Read on to understand how Reinforcement Learning is influenced by human learning.
What will you learn here?
- The difference between Supervised Learning, Unsupervised Learning, and Reinforcement Learning
- How Reinforcement Learning mimics human behavior
- The different components of Reinforcement Learning (RL) and how they interact
- Applications of Reinforcement Learning (RL) in real-world scenarios
This article is adapted from and inspired by Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.
What is the difference between Supervised Learning, Unsupervised Learning, and Reinforcement Learning?
Supervised Learning algorithms learn from a labeled dataset. The labels in the dataset provide the answer for each input.
The supervised algorithm's objective is to find the function f that maps the input data x to the labels y. The supervised algorithm then extrapolates, or generalizes, its responses to correctly identify variations of input data not present in the training dataset.
Example: Image classification, where the input data contains images of different flowers and the labels give each flower's name.
A supervised image classification algorithm associates features of the images, such as the color of the flower, the number of petals, and the shape of the flower, with the flower's name. When a new image of a flower the model was trained on, even with small variations, is sent as input to the supervised model, it will predict the image's label.
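The idea of mapping labeled inputs to outputs can be sketched with a tiny 1-nearest-neighbor classifier. The flower names and feature values below are made-up illustrative data, not a real dataset:

```python
# A minimal sketch of supervised learning: a 1-nearest-neighbor classifier
# that maps flower feature vectors to flower names. All values are
# hypothetical, chosen only to illustrate the input -> label mapping.

def predict(train_x, train_y, query):
    """Return the label of the training point closest to `query`."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[best]

# Features: (petal count, petal length in cm) -- illustrative numbers only.
train_x = [(5, 2.0), (5, 2.2), (8, 4.5), (8, 4.8)]
train_y = ["daisy", "daisy", "dahlia", "dahlia"]

# A new flower with small variations is still classified correctly.
print(predict(train_x, train_y, (5, 2.1)))  # -> daisy
```

The classifier generalizes in the weak sense described above: a query slightly different from any training example still lands near the right label.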
Unsupervised Learning identifies hidden patterns in collections of unlabeled data.
The unsupervised algorithm is given a dataset without any labels or desired outcomes. Its objective is to find structure in the data by extracting the useful features present in the input.
In other words, the unsupervised algorithm's objective is to find the function f that automatically identifies and extracts useful features from the input data x.
Example: Clustering, where the input data contains a set of handwritten digits and the unsupervised learning algorithm clusters the images based on their structure.
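Finding structure without labels can be sketched with a tiny k-means clustering loop. The 1-D points below are made-up stand-ins for real feature values:

```python
# A minimal sketch of unsupervised learning: k-means clustering (k = 2)
# on unlabeled 1-D points. No labels are given; the algorithm discovers
# the two groups on its own. The points are illustrative values only.

def kmeans(points, k=2, iters=10):
    centers = points[:k]                      # naive initialization
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centers, clusters = kmeans(points)
print(sorted(round(c, 1) for c in centers))  # -> [1.0, 10.0]
```

The two centers settle on the two natural groups in the data, even though no "correct answer" was ever provided.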
Reinforcement Learning is learning what to do in sequential decision-making problems.
RL learns how to map situations in an environment to actions so as to maximize a numerical reward signal. The learner must discover which actions yield the most reward by trying them, through exploration and exploitation.
Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.
In RL, an agent (the learner) has to learn to make decisions (actions) in an environment by responding to situations (observations) so as to maximize cumulative reward.
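The observation-action-reward loop can be sketched with a toy environment. The `CounterEnv` class and its goal are hypothetical, invented only to show the interaction cycle:

```python
# A minimal sketch of the RL interaction loop: the agent observes the
# environment's state, takes an action, and receives a numerical reward.
# The environment here is hypothetical: move an integer counter to 5.

class CounterEnv:
    """Toy environment: the state is an integer; the goal is to reach 5."""
    def __init__(self):
        self.state = 0

    def step(self, action):            # action is +1 or -1
        self.state += action
        reward = 1 if self.state == 5 else 0
        done = self.state == 5
        return self.state, reward, done

env = CounterEnv()
observation, total_reward, done = env.state, 0, False
while not done:
    action = +1 if observation < 5 else -1   # a hand-coded policy
    observation, reward, done = env.step(action)
    total_reward += reward
print(total_reward)  # -> 1
```

A real RL agent would learn the policy rather than have it hand-coded, but the loop itself, observe, act, receive reward, is the same.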
The diagram below summarizes Supervised, Unsupervised, and Reinforcement Learning.
Intuitive explanation of Reinforcement Learning and its elements
We can understand Reinforcement Learning by looking at how we learn different actions and acquire different behaviors as we learn new skills.
You are new to baking and want to try baking your favorite cake. You follow a recipe video and mix all the ingredients, but you are unsure if the batter's consistency is right. You do your best to match what is shown in the video, then bake the cake at the given time and temperature. The cake is good but does not turn out as soft as you expected.
In the example above:
- You are the Agent, the learner trying to learn how to bake a delicious, soft cake.
- The cooking process is the Environment.
- The taste and texture of the cake is your Reward; the reward signal for learning to bake is producing a delicious, soft cake.
- The Policy is the recipe you follow and the tweaks you make to it, including the batter's consistency and how you bake the cake.
- Your sense of which tweaks, such as adjusting the batter's consistency, will result in a softer, more delicious cake after baking is your Value Function.
- The batter's consistency, the baking time and temperature, and all the steps of your recipe form the Model. The recipe (Model) you follow allows inferences about how the cooking and baking (Environment) will behave.
Let’s dive deeper into Reinforcement Learning.
The objective of RL is for the Agent (learner) to learn good behavior, acquiring new behaviors and skills incrementally by interacting with the Environment through trial and error.
The Agent does not require complete knowledge or control of the Environment. The Agent has to deal with the exploration/exploitation dilemma while learning.
What are Exploration and Exploitation in RL?
Exploration is about obtaining information about the environment, while exploitation is about maximizing the expected return given the current knowledge.
Exploration is similar to trying different batter consistencies and different oven temperature settings. When you explore, you may get good or bad rewards, but all learning opportunities are valuable for baking a good cake in the long run.
Hence, the Agent should explore only when the learning opportunities are valuable enough for the future to compensate for what exploitation can provide now. Exploitation is pursuing the most promising strategy, based on the experience gathered so far through exploration, to get the best rewards.
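One common way to balance this trade-off is an epsilon-greedy strategy: explore a random action with small probability epsilon, otherwise exploit the best-known action. Below is a hedged sketch on a two-armed bandit; the reward probabilities (0.2 and 0.8) are made-up illustrative values:

```python
import random

# An epsilon-greedy agent on a two-armed bandit. Each arm pays reward 1
# with a hidden probability; the agent must discover the better arm by
# mixing exploration (random arm) with exploitation (best arm so far).

random.seed(0)
true_probs = [0.2, 0.8]          # hidden from the agent
estimates = [0.0, 0.0]           # agent's estimated value of each arm
counts = [0, 0]
epsilon = 0.1                    # fraction of steps spent exploring

for _ in range(2000):
    if random.random() < epsilon:
        arm = random.randrange(2)              # explore: random arm
    else:
        arm = estimates.index(max(estimates))  # exploit: best arm so far
    reward = 1 if random.random() < true_probs[arm] else 0
    counts[arm] += 1
    # Incremental average: running estimate of each arm's expected reward.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates.index(max(estimates)))  # the arm the agent settled on
```

With epsilon = 0 the agent can get stuck forever on whichever arm pays off first; the occasional exploratory pull is what lets it discover that the other arm is better.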
Elements of Reinforcement Learning
Reinforcement Learning is based on the reward hypothesis and can be applied to any sequential decision-making problem where learning from experience is possible.
- Agent: The Agent learns by interacting with the Environment. Using the experience gathered through exploration and exploitation, the Agent tries to maximize cumulative reward. The Agent must be able to sense the state of its Environment and take actions that affect that state, but it may not need complete knowledge of the Environment.
- Environment: The Agent is placed in an Environment and needs to learn good behavior within it. The Environment is the task the Agent is trying to learn so as to maximize cumulative reward. Environments can be deterministic or stochastic.
- Reward: A reward is a scalar feedback signal indicating how well the Agent is doing in the Environment. The Agent's job is to maximize cumulative reward. A reward may be delayed: it may be better to sacrifice immediate reward and explore the Environment to gain more reward in the long term.
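The idea of delayed reward is usually made precise with the discounted return, G = r0 + gamma*r1 + gamma^2*r2 + ..., where gamma in [0, 1) trades off immediate against future reward. A short sketch, with a made-up reward sequence:

```python
# How delayed rewards combine into one cumulative (discounted) return:
#   G = r0 + gamma*r1 + gamma^2*r2 + ...
# The reward sequences below are invented for illustration.

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):     # fold from the last reward backwards
        g = r + gamma * g
    return g

# Sacrificing immediate reward (0 now) for a large delayed reward (10)
# still yields a high return when gamma is close to 1:
print(discounted_return([0, 0, 10]))   # ~ 8.1
print(discounted_return([1, 0, 0]))    # ~ 1.0
```

With gamma near 1 the agent is far-sighted and will trade small immediate rewards for larger delayed ones; with gamma near 0 it becomes myopic.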
- Policy: A policy defines how the Agent selects actions to maximize the cumulative reward it seeks. It maps perceived states of the Environment to the actions to take in those states, like a stimulus-response association.
- Value Function: The value function specifies what is good in the long run. The value of a state is the total amount of reward the Agent can expect to accumulate over the future, starting from that state. A value function represents how good each state (or state-action pair) is and provides an estimate of how good it is for the Agent to be in a given state.
- Model: A model is the Agent's representation of the Environment. It mimics the Environment's behavior and allows inferences to be made about how the Environment will respond when the Agent takes an action in a given state.
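These elements can be seen working together in tabular Q-learning, sketched here on a hypothetical 1-D corridor (states 0 to 4, goal at state 4). The table Q plays the role of the value function, and the greedy choice over Q is the policy; the environment's dynamics are coded directly rather than learned as a model:

```python
import random

# Tabular Q-learning on a toy corridor. States 0..4; action 0 moves
# left, action 1 moves right; reaching state 4 pays reward 1 and ends
# the episode. Everything here is a hypothetical illustration.

random.seed(1)
n_states, actions = 5, [0, 1]
Q = [[0.0, 0.0] for _ in range(n_states)]   # value per (state, action)
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    state = 0
    while state != 4:
        if random.random() < epsilon:
            action = random.choice(actions)           # explore
        else:
            action = Q[state].index(max(Q[state]))    # exploit (policy)
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1 if next_state == 4 else 0
        # Q-learning update: nudge Q toward reward + discounted best
        # future value from the next state.
        Q[state][action] += alpha * (
            reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([Q[s].index(max(Q[s])) for s in range(4)])  # greedy action per state
```

After training, the greedy policy derived from Q moves right in every state, which is the optimal behavior in this corridor.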
Real-World Applications of Reinforcement Learning
- Stock trading: The stock market is the environment, the changing stock prices are the observations, and the RL agent takes actions to buy, sell, or hold a stock. The profit or loss made while trading is the reward.
- Robotics: A robot is an Agent that perceives its environment through sensors and changes the state of the Environment through actuators. The robot learns its task through exploration and exploitation, maximizing the reward for performing the task well.
- Online recommendation: The Agent learns from the interactions between users and items in a recommendation system and takes both immediate and long-term rewards into account when recommending items to a user.
- Optimization problems, such as bid optimization, supply chain optimization, and energy consumption optimization. In supply chain optimization, for example, optimal decisions must be made about how much raw material to order, how many products to produce, and how much stock to hold in the warehouse. Seasonality, demand, market trends, and so on form the stochastic environment, and the RL reward is the profit made each quarter.