Crash Course: Reinforcement Learning

Allen Wang
Published in The Startup
7 min read · Nov 16, 2020


A short high-level introduction (without all the complicated math) to Reinforcement Learning

What is it like to train a dog to sit? Well, your dog may initially be completely untrained and have no idea what to do.

You might tell him to sit, and the dog might start barking. You scold him and tell him to sit again, but this time he starts wagging his tail. Once again, you scold him. You keep telling your dog to sit, and finally, on the 27th try, he sits! You give him a treat and a word of praise.

As you keep up this cycle of scolding and praising, your dog eventually learns to sit when you tell him to. Voila! You have just demonstrated reinforcement learning with your dog!

What exactly is Reinforcement Learning?

Reinforcement learning is essentially where an agent is placed in an environment and is able to obtain rewards by performing certain actions. The agent’s only goal is to maximize the amount of reward it can get.

In the dog training example, your dog is the agent, and the rewards are your words of praise and treats.

Reinforcement learning works by trial and error: the agent learns from its own actions and experiences (which can be very tedious, as the dog training shows).
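
To make the trial-and-error loop concrete, here is a minimal Python sketch of the dog example. The action names, the +1/-1 rewards, and the rule of repeating whatever has scored best so far are all assumptions made for illustration; they are not taken from any library.

```python
import random

# Hypothetical actions the "dog" (the agent) can try.
ACTIONS = ["bark", "wag_tail", "sit"]

def reward_for(action):
    """Assumed reward: a treat (+1) for sitting, a scolding (-1) otherwise."""
    return 1 if action == "sit" else -1

def train_dog(num_trials=50, seed=0):
    rng = random.Random(seed)
    totals = {a: 0 for a in ACTIONS}  # running reward total per action
    history = []
    for _ in range(num_trials):
        best = max(totals.values())
        # Pick among the actions that currently look best (ties broken at random).
        action = rng.choice([a for a in ACTIONS if totals[a] == best])
        reward = reward_for(action)
        totals[action] += reward      # learn from the feedback
        history.append((action, reward))
    return totals, history

totals, history = train_dog()
```

Scolded actions drop below zero and are not repeated, so after a few tries the dog settles on sitting and keeps collecting treats.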

Reinforcement Learning vs Supervised/Unsupervised Learning

Reinforcement learning is a subset of machine learning, as are supervised and unsupervised learning.

The main difference between reinforcement learning and supervised learning is sequential decision making. In supervised learning, a prediction does not affect which inputs the model sees next, whereas in reinforcement learning, each new input depends on the actions the agent took before.

The main difference between reinforcement learning and unsupervised learning is their goals. In unsupervised learning, the main goal is to find structure within a given dataset, whereas in reinforcement learning, the goal is to find a course of action that would maximize the total reward for the agent.

All three fall under the broad umbrella of machine learning, in which agents learn from data.

Markov Decision Process

The Markov Decision Process (MDP) is the mathematical framework that describes the environment of a reinforcement learning model. In reinforcement learning models, the next state depends only on the present state and the action taken, not on the full history; this is the Markov property. Reinforcement learning is a technique that attempts to learn an MDP from experience and find its optimal policy.

Common Terminology:

  • State is essentially what the agent observes in its environment at a certain moment
  • Actions are the possible moves that the agent can perform in the environment
  • The Reward is the feedback the agent receives after an action; it is positive for a desirable result and can be negative for an undesirable one
  • Discount is an optional factor that determines the importance of future rewards relative to immediate ones; it can range from 0 to 1
  • The Value of a state is the expected long-term return from that state (future rewards may be discounted)
  • Policy is the strategy that the agent employs to determine the next action. The optimal policy is the one that maximizes the reward the agent can expect to receive.
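
To see how the discount shapes the long-term return, here is a small Python sketch. The function name `discounted_return` and the sample reward lists are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward t steps ahead is weighted by gamma**t."""
    total = 0.0
    # Work backwards so each step adds its reward plus the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

print(discounted_return([1, 1, 100], gamma=1.0))  # 102.0 (no discounting)
print(discounted_return([1, 1, 100], gamma=0.5))  # 26.5  (future counts for less)
print(discounted_return([1, 1, 100], gamma=0.0))  # 1.0   (only the immediate reward)
```

A discount close to 1 makes the agent care about distant rewards, while a discount close to 0 makes it short-sighted.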

Maze Game Example

Let’s say the agent is a robot and the environment is a maze. The state would be the position of the robot at any point in time. It would use a policy to determine which path to take. The actions would be going left, right, up, or down. We could award the robot +1 point for hitting an empty square, -1 point for hitting a wall, and +100 points for reaching the exit.

At first, the robot starts off with no experience at all, so it has completely random movements. However, as it starts to learn the values of each state, it begins to become smarter and smarter, finally completing the maze.
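
As a rough sketch, the maze's states, actions, and rewards might be encoded like this in Python. The 3x3 layout, the `step` function, and the choice to treat the edge of the grid as a wall are assumptions for illustration only.

```python
# Hypothetical maze: S = start, . = empty square, # = wall, E = exit.
MAZE = [
    "S.#",
    ".#.",
    "..E",
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    outside = not (0 <= new_row < len(MAZE) and 0 <= new_col < len(MAZE[0]))
    if outside or MAZE[new_row][new_col] == "#":
        return state, -1, False              # hit a wall: stay put, -1 point
    if MAZE[new_row][new_col] == "E":
        return (new_row, new_col), 100, True  # reached the exit: +100 points
    return (new_row, new_col), 1, False       # moved onto an empty square: +1 point

print(step((0, 0), "down"))   # ((1, 0), 1, False)
print(step((2, 1), "right"))  # ((2, 2), 100, True)
```

The state is the robot's (row, column) position, and `done` signals that the exit was reached.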

Exploitation vs Exploration

Let’s revisit the maze example. Let’s pretend that the robot initially chooses to go to the right, thus assigning a value of 1 to the white space to the right of it. Then it goes on and ends the episode. When the robot starts again, it now knows that the state of the space to the right has a value of 1, while the state of the space on top has a value of 0. Thus, it will always go right no matter what!

This brings us to the choice of exploitation vs exploration. The robot never actually had a chance to explore all the other options; it simply chose the state that had the highest value. This is known as the Greedy Policy, where the agent always picks the option with the highest value.

One option here is to “exploit” the agent’s previous knowledge, in which it always picks the state that will give it the highest reward. The other option would be to “explore” the other states to see if they would potentially give a higher reward.

Depending on the situation, both have their advantages. If you need to minimize losses, exploitation works well because it relies on previous experience of which states gave rewards. If you simply want to find the best possible method and are running simulations with no such restrictions, exploration is best.

Normally, a combination of both exploitation and exploration is used, depending on the problem that needs to be solved (the balance can be changed as the project progresses).
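
One common way to combine the two is an epsilon-greedy rule: explore with a small probability, exploit otherwise. The article does not name a specific rule, so treat this Python sketch (including the name `epsilon_greedy` and the example values) as one possible approach.

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """values: dict mapping each option to its current estimated value."""
    if rng.random() < epsilon:
        return rng.choice(list(values))   # explore: pick any option at random
    return max(values, key=values.get)    # exploit: pick the best-known option

# Values the robot learned after its first episode in the maze example.
learned = {"right": 1.0, "up": 0.0}
print(epsilon_greedy(learned, epsilon=0.0))  # always "right" (pure exploitation)
```

Setting `epsilon` to 0 gives the greedy policy, setting it to 1 gives pure exploration, and lowering it over time shifts the agent from exploring to exploiting.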

Episodic vs Continuous

A reinforcement learning model can be either episodic or continuous.

Episodic simply means that there is a “terminal” condition for the game, whether that be winning or losing. An example of a program that would require episodic reinforcement learning is the game Pong. In Pong, the simulation resets every time the agent wins or loses the game (first to 11 points).

Continuous means that there is no end condition for the game, and that the model will just keep on running until stopped. For instance, a reinforcement learning model applied to the stock market would keep on going until it is manually terminated.

Monte Carlo vs Temporal Difference Learning Methods

In the maze example, the reinforcement learning model could have used either the Monte Carlo or the Temporal Difference method.

The Monte Carlo method waits until the end of the episode, then checks the cumulative reward that it has received. It then calculates and updates the expected reward for each state at the end.
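
A minimal Python sketch of a Monte Carlo update might look like the following. It assumes an every-visit variant with a constant step size `alpha`; the function name and the episode format are hypothetical.

```python
def monte_carlo_update(values, episode, alpha=0.1, gamma=0.9):
    """Update state values once a whole episode has finished.

    episode: list of (state, reward) pairs, where reward is what the agent
    received after acting in that state.
    """
    future_return = 0.0
    # Walk backwards so each state sees the total reward that followed it.
    for state, reward in reversed(episode):
        future_return = reward + gamma * future_return
        old = values.get(state, 0.0)
        # Nudge the old estimate toward the return actually observed.
        values[state] = old + alpha * (future_return - old)
    return values

values = monte_carlo_update({}, [("A", 1), ("B", 100)], alpha=0.5, gamma=1.0)
print(values)  # {'B': 50.0, 'A': 50.5}
```

Nothing is updated until the episode is over, which is why this method needs episodes that actually end.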

The Temporal Difference method updates the value of each state after each time step instead of at the very end. Thus, it continually receives feedback from rewards and updates its estimate of each state's value.
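
For comparison, here is a sketch of the simplest Temporal Difference update, often called TD(0). The function name, the step size `alpha`, and the example states are assumptions for illustration.

```python
def td0_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.9):
    """Update one state's value right after a single time step."""
    # A finished episode has no future, so the next value counts as zero.
    next_value = 0.0 if done else values.get(next_state, 0.0)
    target = reward + gamma * next_value   # reward now + current guess of the future
    old = values.get(state, 0.0)
    values[state] = old + alpha * (target - old)
    return values[state]

values = {"B": 10.0}
print(td0_update(values, "A", 1, "B", done=False, alpha=0.5, gamma=1.0))  # 5.5
```

The update uses the current guess for the next state instead of waiting for the final outcome, so learning can happen during an episode and even in tasks that never end.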

Different Approaches to Reinforcement Learning

Value Based

In value-based reinforcement learning, we want to find the optimal value function.

The value function is a function that tells us the amount of reward that an agent can expect to get in the future from each state (that is what we were using in the maze example). With this kind of learning, the agent will always move to the state that has the highest value.

In this example, the agent will start at -7, then continue on to -6, -5, and so forth until it reaches the goal, as those states have the largest values. Once you find the optimal value function, the optimal policy can be derived from it.
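
Deriving a policy from a value function can be sketched in a few lines of Python. This assumes we know which state each action leads to; the `neighbors` table, the corridor layout, and the values below are made up for illustration.

```python
def greedy_policy(values, neighbors):
    """For each state, pick the action that leads to the highest-valued state.

    neighbors: dict mapping each state to {action: next_state}.
    """
    policy = {}
    for state, moves in neighbors.items():
        if moves:  # terminal states have no moves
            policy[state] = max(moves, key=lambda a: values[moves[a]])
    return policy

# Hypothetical corridor: states 0..3, with the goal at state 3.
values = {0: -3, 1: -2, 2: -1, 3: 0}
neighbors = {
    0: {"right": 1},
    1: {"left": 0, "right": 2},
    2: {"left": 1, "right": 3},
    3: {},
}
print(greedy_policy(values, neighbors))  # {0: 'right', 1: 'right', 2: 'right'}
```

Because the values increase toward the goal, following the highest-valued neighbor walks the agent straight to it.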

Policy Based

In policy-based reinforcement learning, the agent is essentially told where to go by the policy function. The policy function is a function that tells the agent which decision to make in each state.

The policy function usually starts off random, with a value function that corresponds to it. The agent then finds a new value function and uses it to improve its policy. It keeps going until it finds the optimal policy and value function.
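
This evaluate-then-improve cycle resembles what is usually called policy iteration. The Python sketch below runs it on a tiny corridor; the layout, the -1 reward per move, the discount of 0.9, and all names are assumptions, and it relies on knowing the environment's transitions in advance.

```python
# Hypothetical corridor: states 0, 1, 2 and a terminal goal at state 3.
TRANSITIONS = {
    0: {"left": 0, "right": 1},
    1: {"left": 0, "right": 2},
    2: {"left": 1, "right": 3},
}
GOAL = 3
STEP_REWARD = -1   # every move costs 1 point, so shorter paths score higher
GAMMA = 0.9

def evaluate(policy, sweeps=100):
    """Estimate each state's value when the agent follows `policy`."""
    values = {state: 0.0 for state in TRANSITIONS}
    values[GOAL] = 0.0
    for _ in range(sweeps):
        for state in TRANSITIONS:
            next_state = TRANSITIONS[state][policy[state]]
            values[state] = STEP_REWARD + GAMMA * values[next_state]
    return values

def improve(values):
    """Build a new policy that acts greedily with respect to `values`."""
    return {
        state: max(moves, key=lambda a: STEP_REWARD + GAMMA * values[moves[a]])
        for state, moves in TRANSITIONS.items()
    }

def policy_iteration(initial_policy):
    policy = dict(initial_policy)
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:   # no change: the policy is stable
            return policy, values
        policy = new_policy

# Start from a deliberately bad policy that always walks away from the goal.
policy, values = policy_iteration({0: "left", 1: "left", 2: "left"})
print(policy)  # {0: 'right', 1: 'right', 2: 'right'}
```

Each round makes the policy a little better, starting with the state next to the goal and working backwards.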

As shown, the policy function essentially tells the agent the best direction to go.


So that’s it! You just completed a high-level overview of the components of Reinforcement Learning. Just to recap:

  • RL is essentially where an agent is placed in an environment and is able to obtain rewards by performing certain actions.
  • RL is different from supervised/unsupervised learning
  • RL attempts to find the optimal policy of the Markov Decision Process
  • Exploitation picks the best choice based on what is known, while exploration picks a random choice that may not be considered optimal at the moment
  • Episodic means that the RL model has a designated win/loss, while continuous means that the RL model will keep running until manually stopped
  • The Monte Carlo method receives the reward and updates the value function at the end of the episode, while TD makes guesses to improve the value function at the end of each time step
  • Value based finds the optimal value function and then derives the optimal policy from it, while policy based continuously updates its policy function to find the optimal policy and value function

Hope you learned something from this article!

Feel free to reach out to me on Instagram, LinkedIn, or email!




Hey! I'm an 18 year old that is passionate about emerging tech, especially AI! Nice to have you here :)