Q-Network Reinforcement Learning Model

Sayan Mondal · Published in Analytics Vidhya · 5 min read · Dec 2, 2020

Q-learning is an off-policy reinforcement learning algorithm that seeks to find the best action to take given the current state; in that sense it is a greedy approach. It is considered off-policy because the Q-learning function learns from actions taken outside the current policy, such as random actions, so it does not need to follow the policy it is evaluating. More specifically, Q-learning seeks to learn a policy that maximizes the total reward. Let’s go ahead and develop our first strategy.

What is Q-learning?

Q-learning is a model-free reinforcement learning algorithm that learns the quality of actions, telling an agent what action to take under what circumstances. It does not require a model of the environment (hence “model-free”), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly random policy. “Q” names the function that the algorithm computes: the expected reward for an action taken in a given state.

What’s ‘Q’?

The ‘Q’ in Q-learning stands for quality. Quality in this case represents how useful a given action is in gaining some future reward.

Now, consider the ideal case: assume we already know the expected reward for every action at each step. How do we choose an action in this case? Quite simply, we choose the sequence of actions that will eventually generate the highest reward. This cumulative reward is often referred to as the Q-value (an abbreviation of Quality Value), and we can formalize our strategy mathematically as:
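Q(s, a) = r(s, a) + γ · max_a′ Q(s′, a′)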

The above equation states that the Q-value yielded from being at state s and selecting action a is the immediate reward received, r(s, a), plus the highest Q-value possible from state s′ (the state we end up in after taking action a from state s). We receive the highest Q-value from s′ by choosing the action that maximizes the Q-value. We also introduce γ, usually called the discount factor, which controls the importance of long-term rewards versus the immediate one.

This equation is known as the Bellman equation; (Bellman et al.) and (Peng et al.) provide a comprehensive explanation of its mathematical derivation. This elegant equation is quite powerful and will be very useful to us thanks to two important characteristics:

  1. While we still retain the Markov state assumption, the recursive nature of the Bellman equation allows rewards from future states to propagate to far-off past states.
  2. There is no need to know the true Q-values when we start out; since the equation is recursive, we can begin with a guess, and it will eventually converge to the real values.

Let’s say we know the expected reward of every action at every step. This would essentially be a cheat sheet for the agent! Our agent would know exactly which action to perform.

It would perform the sequence of actions that eventually generates the maximum total reward. This total reward is also called the Q-value, and we can formalize our strategy as:
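Q(s, a) = r(s, a) + γ · max_a′ Q(s′, a′)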

The above equation states that the Q-value yielded from being at state s and performing action a is the immediate reward r(s, a) plus the highest Q-value possible from the next state s′. Gamma here is the discount factor, which controls the contribution of rewards further in the future.

Q(s′, a) in turn depends on Q(s″, a), which will carry a coefficient of gamma squared. So the Q-value depends on the Q-values of future states, as shown here:
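Unrolling the recursion along the chosen actions makes the discounting explicit:

Q(s, a) = r(s, a) + γ · Q(s′, a′)
        = r(s, a) + γ · [ r(s′, a′) + γ · Q(s″, a″) ]
        = r(s, a) + γ · r(s′, a′) + γ² · Q(s″, a″) = …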

Adjusting the value of gamma will diminish or increase the contribution of future rewards.
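For example, with γ = 0.9 a reward received ten steps ahead is weighted by 0.9^10 ≈ 0.35, while with γ = 0.5 it is weighted by only about 0.001; the closer γ is to 1, the more far-sighted the agent becomes.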

Since this is a recursive equation, we can start by making arbitrary assumptions for all the Q-values. With experience, the estimates will converge to the optimal values, and hence to the optimal policy. In practical situations, this is implemented as an update:
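Q(s, a) ← Q(s, a) + α · [ r(s, a) + γ · max_a′ Q(s′, a′) − Q(s, a) ]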

where alpha (α) is the learning rate or step size. It simply determines the extent to which newly acquired information overrides old information.
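As a concrete illustration, here is a minimal sketch of this tabular update in Python/NumPy. The environment size, hyperparameters, and function name are illustrative assumptions, not something specified in the article:

```python
import numpy as np

# Illustrative sizes for a small, discrete environment.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99            # learning rate (step size) and discount factor

# Start from arbitrary Q-value estimates (here: all zeros).
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """Apply one Q-learning update for the observed transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # r(s, a) + γ · max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target
```

Repeatedly applying this update while the agent interacts with the environment lets the table converge toward the optimal Q-values.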

In deep Q-learning, we use a neural network to approximate the Q-value function. The state is given as the input, and the Q-values of all possible actions are generated as the output. The key difference from plain Q-learning is that the lookup table of Q-values is replaced by this network.
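The article does not name a framework, so the sketch below assumes PyTorch; the layer sizes, hidden width, and variable names are illustrative only. It shows a network that takes a state vector and outputs one Q-value per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per possible action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 4-dimensional state and 2 possible actions (illustrative sizes).
q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.randn(1, 4))    # shape (1, 2): one Q-value per action
action = q_values.argmax(dim=1)        # greedy action selection
```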

So, the steps involved in reinforcement learning using deep Q-networks (DQNs) are as follows:

  1. All past experience is stored by the agent in memory (a replay buffer).
  2. The next action is determined by the maximum output of the Q-network.
  3. The loss function here is the mean squared error between the predicted Q-value and the target Q-value, Q*. This is basically a regression problem. However, we do not know the true target value in advance, since we are dealing with a reinforcement learning problem. Going back to the Q-value update equation derived from the Bellman equation (shown above), the target is r(s, a) + γ · max_a′ Q(s′, a′).

  4. That target term is the value the network regresses toward. One can argue that the network is predicting its own target, but since r is the unbiased, true reward, the network updates its gradients using backpropagation and eventually converges. A training-step sketch follows this list.
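Putting the four steps together, one training step might look roughly like the sketch below. It reuses the illustrative q_net from above and assumes a simple replay buffer of (state, action, reward, next_state, done) tuples; this is an assumed outline, not the article’s own code:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

replay_buffer = deque(maxlen=10_000)   # step 1: store past experience
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size: int = 32):
    """One gradient step on the MSE between predicted Q(s, a) and the target r + γ·max_a' Q(s', a')."""
    if len(replay_buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = zip(*random.sample(replay_buffer, batch_size))
    states      = torch.tensor(states,      dtype=torch.float32)
    actions     = torch.tensor(actions,     dtype=torch.int64)
    rewards     = torch.tensor(rewards,     dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones       = torch.tensor(dones,       dtype=torch.float32)

    # Predicted Q-value of the action that was actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target (step 4); no gradient flows through it, and terminal states are masked out.
    with torch.no_grad():
        q_target = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

    loss = F.mse_loss(q_pred, q_target)  # step 3: regression toward the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```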


An avid Reader, Full Stack Application Developer, Data Science Enthusiast, and NLP specialist. Write me at sayanmondal2098@gmail.com.