Vishaal Saravanan
Published in Analytics Vidhya · 6 min read · May 27, 2020


Apart from supervised and unsupervised learning algorithms, one of the most intriguing and widely practiced domains of artificial intelligence in recent years is reinforcement learning. From building bots that play modern-day games against humans to coordinating operations-research problems in mathematical computing, reinforcement learning plays a vital role alongside the other learning paradigms.

Unlike supervised learning, where labeled (structured) data is presented so the algorithm can learn to take optimal actions, reinforcement learning focuses on building up knowledge through experience: the agent starts with zero prior knowledge and steps its way up until it has learned enough to take the highest-reward action.

Q-Learning is a reinforcement learning algorithm in which the agent learns a policy telling it which action to take under a given circumstance. Framed as a Markov Decision Process (MDP), the Q-Learning algorithm finds an optimal policy that maximizes the reward collected from the current state through the successive states, so that playing the suitable action at each step yields the maximum net total reward.

(Note: each action carries a small negative cost, so to maximize the net reward the algorithm looks for the optimal path with the fewest steps.)

The Q-Learning algorithm can be put together in the following straightforward steps:

Step 1 (Initialization):

For all states s and actions a, the Q-values are initialized to zero: Q(s, a) = 0. Then:

(a) We start in the initial state s0,

(b) We play a random action,

(c) We reach the first state s1.

Step 2 (The Algorithm Itself): Then, for each t >= 1, we repeat the steps below for a certain number of epochs (here we run 1000 cycles):

(a) We select a random state st from our possible states.

(b) We play an arbitrary action at that can lead to a next possible state, i.e., such that R(st, at) > 0.

(c) We reach the next state s(t+1), and we get the reward R(st, at).

(d) We compute the temporal difference (T.D):

TD(st, at) = R(st, at) + gamma * max_a Q(s(t+1), a) - Q(st, at)

(e) We update the Q-value by applying the Bellman equation (a compact Python sketch of one such iteration follows these steps):

Q(st, at) := Q(st, at) + alpha * TD(st, at)
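
To make these steps concrete, here is a minimal runnable sketch of a single training iteration. The small reward matrix, the hyperparameter values, and the variable names are illustrative assumptions, not the article's exact code.

import numpy as np

# Illustrative setup: a small reward matrix R (rows = states, columns = actions),
# a Q-table of zeros, and assumed hyperparameters gamma and alpha.
R = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Q = np.zeros_like(R, dtype=float)
gamma, alpha = 0.75, 0.9

# One iteration of steps (a)-(e); in practice this is repeated for ~1000 epochs.
current_state = np.random.randint(0, R.shape[0])            # (a) random state st
playable_actions = [a for a in range(R.shape[1])
                    if R[current_state, a] > 0]              # (b) actions with R(st, at) > 0
action = np.random.choice(playable_actions)                  # play one of them at random
next_state = action                                          # (c) reach s(t+1); here the action is the next state's index
TD = (R[current_state, action]
      + gamma * np.max(Q[next_state, :])
      - Q[current_state, action])                            # (d) temporal difference
Q[current_state, action] += alpha * TD                       # (e) Bellman update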

Now let us discuss a simple problem statement for which we will then use the Q-Learning algorithm in Python.

Let A be the initial state where we start our robot; the ultimate goal is to reach state D with the minimum number of actions (thus finding the shortest route).

Now let us discuss the reward matrix for the given states. For every physically possible route the robot can take from one state to another, we assign a reward of one (1); for all other transitions, we assign a reward of zero (0). (Since reaching the goal state D is the robot's objective, we later assign D a much higher reward to put it at the top of its priorities.)
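
For example, if the only physically possible move out of A were to B, then (on an illustrative four-state map A, B, C, D) the row of the reward matrix for A would read 0, 1, 0, 0: a one in B's column and zeros everywhere else.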

While implementing Q-Learning, we come across a term called the temporal difference (T.D).
The T.D is what lets the Q-Learning algorithm update its own estimates. In a more straightforward sense, the T.D is the reward obtained from moving from one state to the next, plus the discounting factor times the maximum Q-value of the next state over all its actions, minus the Q-value of the current state and action.
Finally, we update the Q-value of the current state-action pair by adding the learning rate (alpha) times the temporal difference to its previous value.
Applying only this fraction of the temporal difference at each update helps keep the Q-values from exploding.
The hyperparameters, the discounting factor and the learning rate, are generally obtained by trial and error.
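
As a quick illustration with assumed numbers: suppose R(st, at) = 1, gamma = 0.75, alpha = 0.9, the best Q-value in the next state is 2.0, and the current Q-value is 1.2. Then T.D = 1 + 0.75 * 2.0 - 1.2 = 1.3, and the updated Q-value is 1.2 + 0.9 * 1.3 = 2.37.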

Implementation in Python:
Like most other machine learning programs, we start by importing NumPy as np.

Then we predefine two parameters:
Gamma (discounting factor): the factor by which future rewards are discounted at each step, so nearer rewards count for more than distant ones.
Alpha (learning rate): the rate at which the algorithm learns after each cycle, i.e., how much of the temporal difference is applied to the Q-value.
Here a cycle refers to each instance where the algorithm moves from the initial state to the final state.
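
A minimal sketch of this setup follows; the exact values from the original notebook are not reproduced in the text, so gamma = 0.75 and alpha = 0.9 below are assumed, commonly used starting points.

import numpy as np

# Hyperparameters (assumed values; both are typically tuned by trial and error).
gamma = 0.75  # discounting factor: weight given to future rewards versus immediate ones
alpha = 0.9   # learning rate: fraction of the temporal difference applied at each update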

Then we define a dictionary mapping the states to indexes, and an array of actions.

Then we define the reward matrix for the possible actions.

Since we will need a dictionary mapping indexes back to states later in the code, when we read the learned route back as state names, we create a new dictionary for storing that inverse mapping.
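
Continuing the sketch, and assuming for illustration a four-state map A, B, C, D where only neighbouring states are directly reachable (the article's actual map may differ):

# States mapped to indexes.
location_to_state = {'A': 0, 'B': 1, 'C': 2, 'D': 3}

# Actions: moving into the state with the corresponding index.
actions = [0, 1, 2, 3]

# Reward matrix: rows are current states, columns are next states;
# 1 marks a physically possible move, 0 everything else.
R = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

# Inverse dictionary: indexes mapped back to states, used when reading off the route.
state_to_location = {index: location for location, index in location_to_state.items()}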

Now we directly apply the algorithm discussed above (a sketch of the resulting function follows this list).

(a) We define a Python function route that takes the initial state and the final goal state as arguments.

(b) We make a copy of the reward matrix so that the goal-specific reward can be set explicitly on the copy rather than hardcoded into the original.

(c) We get the index of the goal state in the ending_state variable.

(d) We assign the goal state a maximum reward of 1000.

(e) We initialize the Q-values as a NumPy array of zeros.

(f) For training, we use a for loop running for 1000 cycles. In each cycle a random current state is picked from the range of available states; every action whose entry in the reward matrix for that state is greater than zero is appended to the playable_actions list; a random next state is then chosen from those actions, and the corresponding temporal difference and Q-value update are computed.

(g) Finally, starting from the given location, the next state with the maximum Q-value is found using NumPy's argmax function and appended to the route list, whose first element is the starting location passed as an argument, until the goal is reached.

(h) The route list gets returned by the route function, and we pass the starting and ending points as arguments to get the result from our self-made Q-Learning algorithm.
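
Putting steps (a) through (h) together, a minimal sketch of the function might look like this; it continues from the assumed four-state setup above, the variable names (R_new, route_list, and so on) are illustrative, and the author's exact code is in the repository linked at the end.

def route(starting_location, ending_location):
    # (b) Copy the reward matrix so the goal-specific reward is set on the copy, not hardcoded.
    R_new = np.copy(R)
    # (c) Index of the goal state.
    ending_state = location_to_state[ending_location]
    # (d) Give the goal state a maximum reward of 1000 so it becomes the top priority.
    R_new[ending_state, ending_state] = 1000
    # (e) Q-values start as a NumPy array of zeros.
    Q = np.zeros((len(location_to_state), len(location_to_state)))
    # (f) Training: 1000 cycles of random exploration and Bellman updates.
    for _ in range(1000):
        current_state = np.random.randint(0, len(location_to_state))
        playable_actions = [j for j in range(len(location_to_state))
                            if R_new[current_state, j] > 0]
        next_state = np.random.choice(playable_actions)
        TD = (R_new[current_state, next_state]
              + gamma * np.max(Q[next_state, :])
              - Q[current_state, next_state])
        Q[current_state, next_state] += alpha * TD
    # (g)/(h) Follow the highest Q-values from the start to the goal and return the route.
    route_list = [starting_location]
    next_location = starting_location
    while next_location != ending_location:
        current_state = location_to_state[next_location]
        next_state = int(np.argmax(Q[current_state, :]))
        next_location = state_to_location[next_state]
        route_list.append(next_location)
    return route_list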

Output

The final function looks like the sketch above. Called with a starting and an ending state, it returns the learned route as a list of state names.
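
For example, with the illustrative four-state map assumed above:

print(route('A', 'D'))
# expected output for that map: ['A', 'B', 'C', 'D']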

Hurray, we have just implemented our first Q-Learning algorithm in Python; it's as simple as that. Now try applying the same Q-Learning algorithm to more complicated real-life problems and create fantastic solutions to the challenges the market demands.

Head to my GitHub repo to find the complete code: https://github.com/vishaalsaravanan/Q_Learning/blob/master/q_learning.py
