Reinforcement Learning: Introduction to Q Learning


Reinforcement learning is an area of machine learning in which an agent learns by interacting with its surrounding environment. A well-known method for learning to behave optimally is Q-learning. In Q-learning, the agent learns a value for every state-action pair and stores these values in a table. Over time, once the agent has interacted with the environment enough, it can find the optimal action for whatever state it is in. Q-learning is guaranteed to converge to an optimal solution, provided every state-action pair keeps being visited and the learning rate is decayed appropriately. See more details in this paper.


A few terms come up constantly in reinforcement learning:

Agent: The learner that tries to learn optimal behaviour in an environment.

Environment: The world the agent interacts with. It moves the agent from one state to another when the agent takes an action, and it gives the agent an immediate reward for that action. Transitions can be stochastic: the same state-action pair does not always lead to the same next state. However, the probability of reaching a particular next state from a given state-action pair stays fixed.

State: The situation the agent is currently in within the environment.

Action: A move the agent can make in the environment.

Reward: The feedback the agent receives from the environment for taking an action. A reward can be positive or negative (a punishment).

Q value: A number that indicates how good or bad a state-action pair is, that is, how good it is for the agent to take this action in this state.

Q table: The table that stores the Q value of every state-action pair.

Policy: The policy tells the agent which action to take in a state, so it can be viewed as a map from states to actions. After successful training, the final policy always picks the action with the highest Q value in the current state.

Discount factor: Future rewards are scaled down by the discount factor, gamma, a number between 0 and 1. A real-world analogy: $1000 received 10 years from now is worth much less than $1000 received today.
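
To make these terms concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two states, the two actions, the Q values, and the helper names greedy_policy and discounted_value are hypothetical and do not come from any library.

```python
# Hypothetical toy problem: the state and action names are illustrative only.
states = ["cold", "warm"]
actions = ["wait", "heat"]

# Q table: one Q value per state-action pair (values made up for the example).
q_table = {
    ("cold", "wait"): -1.0,
    ("cold", "heat"): 2.0,
    ("warm", "wait"): 1.5,
    ("warm", "heat"): 0.5,
}


def greedy_policy(q_table, state, actions):
    """Policy: map a state to the action with the highest Q value."""
    return max(actions, key=lambda action: q_table[(state, action)])


def discounted_value(reward, gamma, steps):
    """Present value of a reward received `steps` time steps from now."""
    return reward * gamma ** steps


for state in states:
    print(state, "->", greedy_policy(q_table, state, actions))

# With an assumed gamma of 0.9, a reward of 1000 received 10 steps from now
# is worth roughly a third of the same reward received immediately.
print(round(discounted_value(1000, 0.9, 10), 2))
```

A dictionary keyed by (state, action) is only one possible layout for a Q table; a 2D array indexed by state and action would work just as well.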


The typical learning loop for an agent is as follows.

1. The agent initializes its Q table with an arbitrary small value for each state-action pair.

2. The agent starts in some state of the environment.

3. The agent chooses an action using the Q table. It usually picks the action with the highest Q value in its current state. (Note that early in learning it chooses almost at random, because it has little or no knowledge yet.)

4. The environment reacts to the agent’s action and tells the agent what its next state is and what immediate reward it gets for the action.

5. The agent receives that reward and uses the Bellman equation to compute the new Q value of this state-action pair. Then it updates this value in the Q table.

6. Repeat steps 3 to 5 until the Q table has converged.

Once the Q table has converged, the agent can always choose the best action.
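
The six steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the five-state corridor environment, the step, choose_action and train functions, and the values of alpha, gamma and epsilon are all made up for the example. The toy environment is deterministic to keep it short, the Q table starts at zero, and the "mostly greedy, sometimes random" choice in step 3 is implemented with an epsilon-greedy rule, which is one common option among several.

```python
import random

# Assumed toy environment: a corridor of 5 states (0..4). The agent starts
# in state 0; reaching state 4 gives reward 1 and ends the episode.
N_STATES = 5
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1
ACTIONS = [LEFT, RIGHT]


def step(state, action):
    """Environment: return (next_state, reward, done) for a state-action pair.

    Deterministic for simplicity; a real environment may be stochastic.
    """
    if action == RIGHT:
        next_state = min(state + 1, GOAL)
    else:
        next_state = max(state - 1, 0)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_table[state])
    # Break ties at random so an all-zero table does not always pick LEFT.
    return rng.choice([a for a in ACTIONS if q_table[state][a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialise the Q table (zeros here) for every state-action pair.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False  # Step 2: start in state 0
        while not done:
            action = choose_action(q_table, state, epsilon, rng)  # Step 3
            next_state, reward, done = step(state, action)  # Step 4
            # Step 5: Bellman update. A terminal state has no future value.
            future = 0.0 if done else max(q_table[next_state])
            target = reward + gamma * future
            q_table[state][action] = (
                (1 - alpha) * q_table[state][action] + alpha * target
            )
            state = next_state  # Step 6: repeat from the new state
    return q_table


q_table = train()
for state in range(GOAL):
    best_action = max(ACTIONS, key=lambda a: q_table[state][a])
    print(state, "RIGHT" if best_action == RIGHT else "LEFT", q_table[state])
```

After training, the greedy action in every non-terminal state of this corridor is RIGHT, which is the shortest path to the reward. A fixed number of episodes stands in for a real convergence check here.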

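The update in step 5 uses the Bellman equation below:

Q’(St, At) = (1 - alpha) * Q(St, At) + alpha * (reward + gamma * max_a Q(St+1, a))

Here alpha is the learning rate, gamma is the discount factor, and the maximum is taken over all actions a available in the next state St+1.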
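
As a quick worked example, the snippet below applies one such update in Python. Note that the Q-learning update uses the largest Q value over the actions available in the next state. The function name q_update and all of the numbers are assumptions chosen only for illustration.

```python
def q_update(q_current, reward, next_q_values, alpha, gamma):
    """Return the updated Q value for one state-action pair."""
    best_next = max(next_q_values)  # best Q value available in the next state
    return (1 - alpha) * q_current + alpha * (reward + gamma * best_next)


# Assumed example numbers, chosen only for illustration.
new_q = q_update(q_current=2.0, reward=1.0, next_q_values=[0.5, 3.0],
                 alpha=0.1, gamma=0.9)
print(round(new_q, 2))  # prints 2.17
```

By hand: 0.9 * 2.0 + 0.1 * (1.0 + 0.9 * 3.0) = 1.8 + 0.37 = 2.17. The old value moves a small step (set by alpha) towards the new estimate of 3.7.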


Now that you have the fundamentals of reinforcement learning, do you want to practise in code? Start with OpenAI Gym, which offers many environments in which to train your agent.


Video Resources for RL — David Silver’s Lectures

A great book on reinforcement learning: Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto