Autonomous Navigation using Deep Reinforcement Learning (Image Source: https://waypointrobotics.com/blog/what-autonomous-robots/)

Autonomous Navigation using Deep Reinforcement Learning

Pranay Kumar
CodeX

--

Implementation of Deep Q-Learning Algorithms to solve Banana-Collector Unity ML-Agent Navigation Problem Statement

A brief introduction to the Problem Statement

Using a simplified version of the Banana Collector Unity ML-Agent environment, the objective of the project is to train an agent to navigate and collect only yellow bananas in a large, square world. A reward of +1 is provided for collecting a yellow banana, and a reward (i.e. penalty) of -1 is provided for collecting a blue banana. Thus, the goal of the agent is to collect as many yellow bananas as possible while avoiding blue bananas. The agent’s observation space is 37-dimensional and the agent’s action space is 4-dimensional (move forward, move backward, turn left, and turn right). The task is episodic, and in order to solve the environment, the agent must get an average score of +13 over 100 consecutive episodes.
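
Interaction with the environment follows the usual agent-environment loop. The sketch below is a minimal illustration, assuming the unityagents package from the Udacity DRLND repository and a locally downloaded Banana build (the file_name path is a placeholder, not the exact path used in the notebooks):

from unityagents import UnityEnvironment
import numpy as np

# Path is a placeholder; point it at the downloaded Banana environment build
env = UnityEnvironment(file_name="Banana_Windows_x86_64/Banana.exe")
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=True)[brain_name]
state = env_info.vector_observations[0]   # 37-dimensional observation
score = 0
while True:
    action = np.random.randint(4)         # random action, i.e. an untrained agent
    env_info = env.step(action)[brain_name]
    state = env_info.vector_observations[0]
    reward = env_info.rewards[0]          # +1 for yellow bananas, -1 for blue bananas
    done = env_info.local_done[0]
    score += reward
    if done:
        break
print("Episode score:", score)
env.close()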

Reinforcement Learning Algorithms Implemented :

a) Vanilla Deep Q-Learning

b) Double Deep Q-Learning

c) Double Deep Q-Learning with Prioritized Experience Replay

Results Showcase :

An Untrained Agent navigating at random in the environment

A Trained Agent navigating with the goal of collecting yellow bananas and avoiding blue bananas

State Space :

The observations are in a 37-dimensional continuous space corresponding to 35 dimensions of ray-based perception of objects around the agent’s forward direction and 2 dimensions of velocity. The 35 ray-perception dimensions are broken down as 7 rays projecting from the agent at the following angles (and returned in the same order): [20, 90, 160, 45, 135, 70, 110], where 90 is directly in front of the agent. Each ray is 5-dimensional and is projected onto the scene. If it encounters one of the four detectable objects (i.e. yellow banana, wall, blue banana, agent), the value at that position in the array is set to 1. Finally, there is a distance measure which is a fraction of the ray length, so each ray is [Yellow Banana, Wall, Blue Banana, Agent, Distance]. The velocity of the agent is two-dimensional: left/right velocity and forward/backward velocity. The observation space is fully observable because it includes all the necessary information regarding the type of obstacle, the distance to the obstacle, and the agent’s velocity. As a result, the observations need not be augmented to make them fully observable, and the incoming observations can be used directly as the state representation.
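
For intuition only, the 35 ray entries can be viewed as a 7 x 5 table with one row per ray in the order listed above. The snippet below is a sketch under the assumption that the rays come first in the flat observation vector and the two velocity components come last; the helper name split_observation is purely illustrative:

import numpy as np

ray_angles = [20, 90, 160, 45, 135, 70, 110]   # 90 degrees = straight ahead
ray_entries = ["Yellow Banana", "Wall", "Blue Banana", "Agent", "Distance"]

def split_observation(obs):
    obs = np.asarray(obs)
    rays = obs[:35].reshape(7, 5)   # one row per ray: [Yellow, Wall, Blue, Agent, Distance]
    velocity = obs[35:]             # [left/right velocity, forward/backward velocity]
    return rays, velocity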

Action Space :

The action space is 4-dimensional. The four discrete actions correspond to:

a) 0 : move forward

b) 1 : move backward

c) 2 : turn left

d) 3 : turn right

Solution Criteria :

The environment is considered solved when the agent gets an average score of +13 over 100 consecutive episodes.
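
A common way to track this criterion, shown below as an illustrative sketch (the function name is_solved is not from the project code), is to keep the last 100 episode scores in a rolling window and compare their mean against the target:

from collections import deque
import numpy as np

scores_window = deque(maxlen=100)   # keeps only the 100 most recent episode scores

def is_solved(episode_score, target=13.0):
    scores_window.append(episode_score)
    return len(scores_window) == 100 and np.mean(scores_window) >= target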

Relevant Concepts

This section provides a theoretical background describing current work in this area as well as concepts and techniques used in this work.

a) Reinforcement Learning : Reinforcement Learning has become very popular with recent breakthroughs such as AlphaGo and the mastery of Atari 2600 games. Reinforcement Learning (RL) is a framework for learning a policy that maximizes an agent’s long-term reward by interacting with the environment. A policy maps situations (states) to actions. The agent receives an immediate short-term reward after each state transition. The long-term reward of a state is specified by a value function: the value of a state roughly corresponds to the total reward an agent can accumulate starting from that state. The action-value function, corresponding to the long-term reward after taking action a in state s, is commonly referred to as the Q-value and forms the basis of the most widely used RL technique, called Q-Learning.

b) Temporal Difference Learning : Temporal Difference (TD) learning is a central idea in modern RL and works by updating estimates of the action-value function based on other estimates. This means the agent does not have to wait for the actual cumulative reward at the end of an episode to update its estimates; it can learn from every single action.

c) Q-Learning : Q-learning is an off-policy Temporal Difference (TD) Control algorithm. Off-policy methods evaluate or improve a policy that differs from the policy used to make decisions. These decisions can thus be made by a human expert or random policy, generating (state, action, reward, new state) entries to learn an optimal policy from.

Q-learning learns a function Q that approximates the optimal action-value function. It does this by randomly initializing Q and then generating actions using a policy derived from Q, such as epsilon-greedy. An epsilon-greedy policy chooses the action with the highest Q-value with probability 1 - ε and a random action with (low) probability ε, so a larger ε (epsilon) promotes more exploration. With each newly generated (state St, action At, reward Rt+1, new state St+1) tuple, Q is updated using the rule

Q(St, At) ← Q(St, At) + α [ Rt+1 + γ max_a Q(St+1, a) − Q(St, At) ]

This update rule essentially states that the current estimate must be updated using the received immediate reward plus a discounted estimate of the maximum action-value of the new state. It is important to note that the update is done immediately after performing the action, using an estimate instead of waiting for the true cumulative reward, which is TD learning in action. The learning rate α decides how much to alter the current estimate, and the discount rate γ decides how important future rewards (the estimated action-value) are compared to the immediate reward.

d) Experience Replay : Experience Replay is a mechanism to store previous experiences (St, At, Rt+1, St+1) in a fixed-size buffer. Minibatches are then randomly sampled, added to the current time step’s experience, and used to incrementally train the neural network. This method counters catastrophic forgetting, makes more efficient use of data by training on it multiple times, and exhibits better convergence behaviour.
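
A minimal replay buffer along these lines can be built from a fixed-size deque and uniform random sampling. The sketch below is illustrative and not the exact code in the agent files of this project:

import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, buffer_size=int(1e5), batch_size=64):
        self.memory = deque(maxlen=buffer_size)   # oldest experiences are dropped automatically
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # uniform random minibatch of stored experiences
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)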

e) Fixed Q Targets : In the Q-learning algorithm with a function approximator, the TD target also depends on the network parameters w that are being learnt/updated, and this can lead to instabilities. To address this, a separate target network with an identical architecture but different weights is used, and the weights of this target network are updated every few steps to match the local network that is being trained continuously.
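
The target-network weights can either be copied every few steps or blended in gradually. The soft-update sketch below interpolates the target parameters toward the local ones (using the TAU value listed in the hyperparameters later); it assumes both networks are PyTorch modules with identical architectures:

# theta_target = tau * theta_local + (1 - tau) * theta_target
def soft_update(local_model, target_model, tau=1e-3):
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)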

Description of the Learning Algorithms used

  1. Deep Q-Learning Algorithm : In modern Q-learning, the function Q is estimated using a neural network that takes a state as input and outputs the predicted Q-values for each possible action. It is commonly denoted as Q(S, A, θ), where θ denotes the network’s weights. The actual policy used for control can subsequently be derived from Q by estimating the Q-values for each action given the current state and applying an epsilon-greedy policy. Deep Q-learning simply means using multilayer feedforward neural networks, or even Convolutional Neural Networks (CNNs), to handle raw pixel input.
  2. Double Deep Q-Learning Algorithm : Deep Q-Learning is based on the Q-learning algorithm with a deep neural network as the function approximator. However, one issue Q-learning suffers from is overestimation of the TD target in its update equation: the expected value of the maximum of the estimated Q-values is greater than or equal to the maximum of their expected values, so Q-learning ends up overestimating the Q-values, thereby degrading learning efficiency. To address this, double Q-learning keeps two separate Q-estimates; the greedy action is selected with one and its value is evaluated with the other. In the tabular version, which estimate plays which role is decided at random at each time step; in Double DQN, the local (online) network selects the action and the target network evaluates it, as sketched after this list.
  3. Prioritized Experience Replay with Double Deep Q-Learning Algorithm : For memory replay, the agent collects tuples of (state, reward, next_state, action, done) and reuses them for future learning. In case of prioritized replay the agent has to assign priority to each tuple, corresponding to their contribution to learning. After that, these tuples are reused based on their priorities, leading to more efficient learning.
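
The following is a hedged sketch of how the Double DQN target differs from the vanilla DQN target, assuming PyTorch tensors for a sampled minibatch (rewards, next_states, dones of shape (batch, 1)) and the two networks qnetwork_local / qnetwork_target from the fixed-Q-targets setup; the function names are illustrative:

import torch

def dqn_target(qnetwork_target, rewards, next_states, dones, gamma=0.99):
    # max over the target network's own estimates (prone to overestimation)
    q_next = qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
    return rewards + gamma * q_next * (1 - dones)

def double_dqn_target(qnetwork_local, qnetwork_target, rewards, next_states, dones, gamma=0.99):
    # the local network selects the greedy action, the target network evaluates it
    best_actions = qnetwork_local(next_states).detach().argmax(1, keepdim=True)
    q_next = qnetwork_target(next_states).detach().gather(1, best_actions)
    return rewards + gamma * q_next * (1 - dones)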

Two new parameters are introduced for this implementation :

a) ALPHA : the prioritization exponent, which controls how much uniform random sampling is mixed back in to avoid overfitting to only the highest-priority experience samples. A value of 1 for ALPHA corresponds to sampling purely by priority, while a value of 0 corresponds to sampling experiences uniformly at random.

b) BETA : the importance-sampling weights exponent, which determines how strongly the importance-sampling weights correct the Q-network updates for the bias introduced by prioritized sampling. The value of BETA can be gradually increased over training, so that the correction is given more weight during the later stages when the model is finally converging to the expected result (see the sketch below).
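
The two exponents enter the computation roughly as follows. This is an illustrative NumPy sketch rather than the exact code in PriorityExpReplay_DoubleDeepQN_Agent.py, and the function names are made up for the example:

import numpy as np

def sampling_probabilities(priorities, alpha=0.6):
    # alpha = 0 -> uniform sampling, alpha = 1 -> fully prioritized sampling
    scaled = np.asarray(priorities) ** alpha
    return scaled / scaled.sum()

def importance_weights(probs, indices, beta=0.4):
    # corrects the update for the bias introduced by non-uniform sampling
    n = len(probs)
    weights = (n * probs[indices]) ** (-beta)
    return weights / weights.max()   # normalize so the weights only scale the loss down

# Example usage: draw a minibatch of indices according to the priorities
# probs = sampling_probabilities(priorities)
# indices = np.random.choice(len(probs), size=64, p=probs)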

Neural Net Architecture Used:

A multilayer feed-forward neural network architecture was used, with 2 hidden layers of 64 neurons each and a ReLU (Rectified Linear Unit) activation applied to each hidden layer. In one implementation I also tried explicitly initializing the weights to see whether the model learns faster, but did not find much difference compared with the results achieved without weight initialization of the layers. I also tried decaying the learning rate in a modified implementation to achieve quicker results, without any significant improvement in the training of the model.
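
A minimal PyTorch sketch consistent with this description is given below; the actual architecture lives in NN_Model.py / DDQN_NN_Model.py, so treat the class below as an assumption-level reconstruction rather than the repository's exact code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    def __init__(self, state_size=37, action_size=4, hidden_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_units)
        self.fc2 = nn.Linear(hidden_units, hidden_units)
        self.fc3 = nn.Linear(hidden_units, action_size)   # one Q-value per action

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)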

Besides this, in the Double Deep Q-Net implementation the rewards obtained during training were clipped to the range of -1 to 1 to remove outliers.

Hyperparameters Used:

  1. Number of Episodes : 5000
  2. Max_Timesteps : 1000
  3. Eps_start : 1 (Beginning Epsilon value used in e-greedy policy)
  4. Eps_End : 0.01 (Lower Limit Epsilon value used in e-greedy policy)
  5. Eps_Decay : 0.995 (factor by which Epsilon value gets reduced)
  6. BUFFER_SIZE : int(1e5)
  7. BATCH_SIZE : 64
  8. GAMMA : 0.99 (Discount Rate)
  9. TAU : 1e-3 (for soft update of target parameters)
  10. Target Goal Score : greater than or equal to 16
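
These values fit together in the usual epsilon-greedy training loop. The skeleton below is a sketch that assumes the env / brain_name objects from the earlier interaction snippet and an agent object exposing act() and step() methods; the actual loop lives in the training notebooks:

eps_start, eps_end, eps_decay = 1.0, 0.01, 0.995
n_episodes, max_t = 5000, 1000

eps = eps_start
for i_episode in range(1, n_episodes + 1):
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]
    score = 0
    for t in range(max_t):
        action = agent.act(state, eps)          # epsilon-greedy action from the local Q-network
        env_info = env.step(action)[brain_name]
        next_state = env_info.vector_observations[0]
        reward = env_info.rewards[0]
        done = env_info.local_done[0]
        agent.step(state, action, reward, next_state, done)   # store experience and learn
        state, score = next_state, score + reward
        if done:
            break
    eps = max(eps_end, eps_decay * eps)         # decay exploration after every episode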

Extra Parameters for Prioritized Experience Replay Implementation :

  1. ALPHA : 0.6 (Prioritization Exponent)
  2. INIT_BETA : 0.4 (Importance Sampling Exponent)

Plot of Rewards per Episode

  1. Deep Q-learning Algorithm Results (Best Result achieved using Learning Rate 1e-4) :

A score of +16 achieved in 964 episodes using Learning Rate 1e-4

2. Double Deep Q-learning Algorithm Results (Best Result achieved by Clipping the Rewards in the range of -1 to 1) :

A score of +16 achieved in 816 episodes

3. Prioritized Experience Replay with Double Deep Q-learning Algorithm Results :

A score of +16 achieved in 732 episodes

CONCLUSION

As is evident from the results above, of the 3 algorithms implemented, Prioritized Experience Replay with Double Deep Q-Net performs the best, converging to the goal score in just 732 episodes, whereas the Double Deep Q-Net algorithm converges to the goal score in 816 episodes and the plain Deep Q-Net implementation takes the maximum number of episodes, i.e. 964, to get to the target score of +16.

The best results were achieved with a learning rate of 1e-4 in all 3 implementations, and in the case of the Double Deep Q-Net implementation the rewards obtained during training were clipped to the range of -1 to 1 to remove outliers.

Installation Instructions to setup the Project:

1) Setting Up Python Environment :

a) Download and install Anaconda 3 (latest version 5.3) from this link (https://www.anaconda.com/download/) for the specific operating system and architecture (64-bit or 32-bit) being used, for Python 3.6+.

b) Create (and activate) a new environment with Python 3.6:
Open the Anaconda Prompt and then execute the commands given below

Linux or Mac:
conda create --name drlnd python=3.6
source activate drlnd

Windows:
conda create --name drlnd python=3.6
activate drlnd

c) Minimal Installation of the OpenAI Gym Environment
Below are the instructions to do a minimal install of gym :
git clone https://github.com/openai/gym.git
cd gym
pip install -e .

A minimal install of the packaged version can be done directly from PyPI: pip install gym

d) Clone the repository (https://github.com/udacity/deep-reinforcement-learning.git) and navigate to the python/ folder.
Then, install several dependencies by executing the below commands in Anaconda Prompt Shell :

git clone https://github.com/udacity/deep-reinforcement-learning.git
cd deep-reinforcement-learning/python
pip install . (or pip install [all] )

e) Create an IPython kernel for the drlnd environment :
python -m ipykernel install --user --name drlnd --display-name "drlnd"

f) Before running code in a notebook, change the kernel to match the drlnd environment by using the drop-down Kernel menu.

2) Install Unity ML-Agents associated libraries/modules:

Clone the GitHub repository (https://github.com/Unity-Technologies/ml-agents.git) and install the required libraries by running the below mentioned commands in the Anaconda Prompt :
git clone https://github.com/Unity-Technologies/ml-agents.git
cd ml-agents/ml-agents (navigate inside the ml-agents subfolder)
pip install . or (pip install [all]) (install the required modules)

3) Download the Unity Environment :

a) For this project, Unity itself does not need to be installed because a ready-made built environment has already been provided; it can be downloaded from one of the links below as per the operating system being used:
Linux: https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Linux.zip
Mac OSX: https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana.app.zip
Windows (32-bit): https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Windows_x86.zip
Windows (64-bit): https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Windows_x86_64.zip

Place the downloaded file in the p1_navigation/ as well as python/ folder in the DRLND GitHub repository, and unzip (or decompress) the file.
b)(For AWS) If the agent is to be trained on AWS (and a virtual screen is not enabled), then please use this link (https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Linux_NoVis.zip) to obtain the "headless" version of the environment. Watching the agent during training is not possible without enabling a virtual screen.(To watch the agent, follow the instructions to enable a virtual screen (https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-on-Amazon-Web-Service.md)
and then download the environment for the Linux operating system above.)

Details of running the Code to Train the Agent / Test the Already Trained Agents :

  1. First of all, clone this repository (https://github.com/PranayKr/Deep_Reinforcement_Learning_Projects.git) on the local system.
  2. Also clone the repository (https://github.com/udacity/deep-reinforcement-learning.git) mentioned previously on the local system.
  3. Now place all the Source code files and pretrained model weights present in this cloned GitHub Repo inside the python/ folder of the Deep-RL cloned repository folder.
  4. Next place the folder containing the downloaded unity environment file for Windows (64-bit) OS inside the python/ folder of the Deep-RL cloned repository folder.
  5. Open Anaconda prompt shell window and navigate inside the python/ folder in the Deep-RL cloned repository folder.
  6. Run the command “jupyter notebook” from the Anaconda prompt shell window to open the jupyter notebook web-app tool in the browser from where any of the provided training and testing source codes present in notebooks(.ipynb files) can be opened.
  7. Before running/executing code in a notebook, change the kernel (IPython Kernel created for drlnd environment) to match the drlnd environment by using the drop-down Kernel menu.
  8. The source code present in the provided training and testing notebooks(.ipynb files) can also be collated in respective new python files(.py files) and then executed directly from the Anaconda prompt shell window using the command “python <filename.py>”.

NOTE:

  1. All the cells can be executed at once by choosing the option (Restart and Run All) in the Kernel tab.
  2. Please change the name of the (*.pth) file where the model weights get saved during training, to avoid overwriting the already existing pre-trained model weights provided with the same filename.

a) Vanilla Deep Q-net algorithm Training / Testing Details (Files Used) :

For Training : Open any of the below mentioned Jupyter Notebooks and execute all the cells

1) DeepQ-Net_Navigation_Solution-LR_(1e-4).ipynb (using Learning Rate Hyperparameter val : 1e-4)
2) DeepQ-Net_Navigation_Solution-LR_(5e-4).ipynb (using Learning Rate Hyperparameter val : 5e-4)
3) DeepQ-Net_Navigation_Solution-LR_(5e-5).ipynb (using Learning Rate Hyperparameter val : 5e-5)

Neural Net Model Architecture file Used : NN_Model.py
The Unity Agent file used : DeepQN_Agent.py

For Testing : open the Jupyter Notebook file "DeepQNet_Test.ipynb" and run the code to test the results obtained using Pre-trained model weights

Pretrained Model Weights provided : 1)DQN_Checkpoint.pth
2)DQN_Checkpoint_2.pth
3)DQN_Checkpoint_3.pth

b) Double Deep Q-net algorithm Training / Testing Details (Files Used) :

For Training : Open any of the below mentioned Jupyter Notebooks and execute all the cells

1) DoubleDeepQ-Net_Navigation_Solution.ipynb
2) DoubleDeepQ-Net_Navigation_Solution2.ipynb
3) DoubleDeepQ-Net_Navigation_Solution-RewardsClipped.ipynb
4) DoubleDeepQ-Net_Navigation_Solution-RewardsClipped-LRDecay.ipynb

Neural Net Model Architecture file Used : DDQN_NN_Model.py
The Unity Agent file used :
1) DoubleDeepQN_Agent.py
2) DoubleDeepQN_Agent_WeightsInitialized.py

For Testing : open the Jupyter Notebook file
"DoubleDeepQ-Net_Test.ipynb" and run the code to test the results obtained using Pre-trained model weights

Pretrained Model Weights provided :
1)DoubleDQN_Checkpoint_1.pth
2)DoubleDQN_Checkpoint_1_RewardsClipped.pth
3)DoubleDQN_Checkpoint_2.pth

c) Double Deep Q-net with Priority Experience Replay algorithm Training / Testing Details (Files Used) :

For Training : Open the below mentioned Jupyter Notebook and execute all the cells
PrioritizedExpReplaay_DoubleDeepQ-Net_Navigation_Solution.ipynb

Neural Net Model Architecture file Used : DDQN_NN_Model.py
The Unity Agent file used : PriorityExpReplay_DoubleDeepQN_Agent.py

For Testing : open the Jupyter Notebook file "PriorityExp_DoubleDeepQ-Net_Test.ipynb" and run the code to test the results obtained using Pre-trained model weights.

Pretrained Model Weights provided : PriorityExpDoubleDQN_Checkpoint_1.pth

NOTE :

  1. This article is also published in LinkedIn ( https://www.linkedin.com/pulse/implementation-deep-reinforcement-learning-algorithms-pranay-kumar )
  2. If you wish to connect drop me a message over LinkedIn ( https://www.linkedin.com/in/pranay-kumar-02b35524/ ) or email at pranay.scorpio9@gmail.com
