The Hello World of Reinforcement Learning
--
This challenge reveals a lot about reinforcement learning (RL) and how it can solve problems by letting the machine learn on its own, going "an inch wide and a mile deep" on a single task rather than solving it just for the sake of solving it. That is what I'm going to do here.
Understanding the Problem
We have a building with five indoor rooms, numbered 0 to 4, and the area outside the building, numbered 5. Only rooms 1 and 4 have doors leading directly to room 5, the outside. Our agent (the computer) is placed in a room at random and must learn the quickest way out to room no. 5, maximising the reward it collects along the chosen path.
Sounds simple, right? For a machine, however, it is not trivial. Keep in mind that this problem is almost too simple to need reinforcement learning at all, but a simple problem is exactly what we want for demonstrating the idea.
Reward System
We assign points to the paths so that the machine can tell which ones are more valuable: doors that lead straight outside are worth 100 points.
The nodes represent the room numbers, while the arrows represent the paths followed.
The same diagram can now be expressed as a matrix, which the machine can interpret.
- -1: non-existent path
- 0: path not leading straight outside
- 100: the path leading straight outside
The Formula
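In the simplified form used throughout this example, the Q-learning update fills in one entry of the Q matrix at a time:

Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)]

Here gamma is the discount factor, which controls how much future rewards count compared to the immediate one.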
Matrix Form
Q is the matrix the machine learns, while R is the reward matrix describing the environment. The Q matrix starts out as all zeros, and the machine fills in its values according to the formula above as it explores; you will see this in the code.
The values are fed into the Q matrix in this manner.
Code
Importing Numerical Libraries into the notebook
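NumPy is the only numerical library this example needs:

```python
import numpy as np
```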
Creating R matrix
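A minimal sketch of how the R matrix can be built with NumPy; the rows and columns are the six states (rooms 0 to 5), and the entries match the output shown below:

```python
# Reward matrix: -1 = no path, 0 = path, 100 = path leading straight outside
R = np.matrix([[-1, -1, -1, -1,  0,  -1],
               [-1, -1, -1,  0, -1, 100],
               [-1, -1, -1,  0, -1,  -1],
               [-1,  0,  0, -1,  0,  -1],
               [-1,  0,  0, -1, -1, 100],
               [-1,  0, -1, -1,  0, 100]])
R
```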
matrix([[ -1,  -1,  -1,  -1,   0,  -1],
        [ -1,  -1,  -1,   0,  -1, 100],
        [ -1,  -1,  -1,   0,  -1,  -1],
        [ -1,   0,   0,  -1,   0,  -1],
        [ -1,   0,   0,  -1,  -1, 100],
        [ -1,   0,  -1,  -1,   0, 100]])
Creating Q matrix
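The Q matrix starts as a 6 × 6 matrix of zeros, one row and one column per state:

```python
# Q matrix: what the agent has learned so far, initially all zeros
Q = np.matrix(np.zeros([6, 6]))
Q
```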
matrix([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
Assuming Gamma Parameter
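The text does not state the value explicitly, but the trained Q matrix below implies a discount factor of 0.8 (for example, 80 = 0.8 × 100 and 64 = 0.8 × 80):

```python
# Discount factor: how much future rewards count compared to immediate ones
gamma = 0.8
```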
Assuming the agent starts in room no. 1.
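In code, that is simply:

```python
# The agent's starting room for the first update
initial_state = 1
```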
Creating the function available_actions: since the initial state is 1, we check row no. 1 of the R matrix for values ≥ 0, because those entries represent the rooms we can travel to.
Storing the result in the available_act variable.
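A sketch of the available_actions function and the call that stores its result:

```python
def available_actions(state):
    current_state_row = R[state,]
    # Columns with a reward >= 0 are rooms reachable from this state
    av_act = np.where(current_state_row >= 0)[1]
    return av_act

# Actions available from the initial state (room 1)
available_act = available_actions(initial_state)
```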
Defining a function sample_next_action, which randomly picks one of the available actions and stores it in next_action. The action variable will hold the next action to take.
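A sketch of sample_next_action:

```python
def sample_next_action(available_actions_range):
    # Pick one of the currently available actions at random
    next_action = int(np.random.choice(available_actions_range))
    return next_action

action = sample_next_action(available_act)
```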
Here we apply the Q formula from above: max_index finds the action(s) with the highest Q value in the state we land in, and the update statement then writes the new value into the Q matrix.
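A sketch of that step wrapped in a function (the name update is mine):

```python
def update(current_state, action, gamma):
    # Best Q value reachable from the state we land in
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]
    if max_index.shape[0] > 1:
        # Several equally good follow-up actions: break the tie at random
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    max_value = Q[action, max_index]
    # The formula from above: immediate reward plus discounted future value
    Q[current_state, action] = R[current_state, action] + gamma * max_value

update(initial_state, action, gamma)
```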
The more the algorithm is trained, the better it performs. Here training runs for 10,000 iterations before returning the result.
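A sketch of the training loop, assuming the 10,000 iterations stated above:

```python
# Training: drop the agent into a random room and apply one update, 10,000 times
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)
```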
Normalisation: the raw Q values keep growing during training, so we scale them relative to the largest value to make them easier to read.
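One way to do this is to scale every entry by the largest value and express it as a percentage:

```python
# Normalise the learned values relative to the largest entry
print("Trained Q matrix:")
print(Q / np.max(Q) * 100)
```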
Trained Q matrix:
[[  0.    0.    0.    0.   80.    0. ]
 [  0.    0.    0.   64.    0.  100. ]
 [  0.    0.    0.   64.    0.    0. ]
 [  0.   80.   51.2   0.   80.    0. ]
 [  0.   80.   51.2   0.    0.  100. ]
 [  0.   80.    0.    0.   80.  100. ]]
Executing a loop on our trained model with the current state set to 2 (the starting room of the run below), following the highest Q value at each step until the agent reaches room 5.
Printing the path the machine is going to take.
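A sketch of the testing loop and the final print; the names steps and next_step_index are mine:

```python
# Testing: start in room 2 and always move to the neighbour with the highest Q value
current_state = 2
steps = [current_state]

while current_state != 5:
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
    steps.append(next_step_index)
    current_state = next_step_index

print("Selected path:", steps)
```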
Selected path: [2, 3, 1, 5]
This solves the problem of navigating from inside the building to the outside. It is a simple problem! But the principle behind it, while still in its early phases of development, is already being applied to large projects around the world with some success.
Thank you for taking the time to read my post; I hope it has inspired you to think about what a machine can accomplish with its processing power.
I would suggest you read this article: http://karpathy.github.io/2016/05/31/rl/