Stretched Exponential Decay function for Epsilon Greedy Algorithm
While working on a Reinforcement Learning (RL) project, I was looking for a decay function that would give the RL agent the following characteristics:
- More dwell time for exploration in the initial part of the episodes
- More exploitation, with occasional random exploration, at the end of the episodes (a quasi-deterministic policy)
- A smooth gradient while switching from exploration to exploitation
While there were several resources on the web, I was not able to find a close match to the function I was looking for, so I ended up concocting a decay function on my own. I also learnt that this sort of decay function is called Stretched Exponential Decay.
Expression for Stretched Exponential Decay
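Reading the expression off the code below (with $E$ denoting EPISODES, the total number of training episodes), the function can be written as:

```latex
\varepsilon(t) \;=\; 1.1 \;-\; \left( \frac{1}{\cosh\!\left( e^{-\frac{t - A E}{B E}} \right)} \;+\; \frac{C\,t}{E} \right)
```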
In Python, the code looks like this:

    import math
    import numpy as np

    EPISODES = 10_000  # total number of training episodes (set to suit your project)

    A = 0.5
    B = 0.1
    C = 0.1

    def epsilon(time):
        standardized_time = (time - A * EPISODES) / (B * EPISODES)
        cosh = np.cosh(math.exp(-standardized_time))
        epsilon = 1.1 - (1 / cosh + time * C / EPISODES)
        return epsilon
Here EPISODES is the number of iterations for which we will be training the RL agent. There are also three hyperparameters: A, B and C. We will look into these in a moment.
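As a quick sanity check, and assuming a concrete value of EPISODES = 10000 purely for illustration, the function starts above 1 and ends near zero:

```python
import math
import numpy as np

EPISODES = 10_000  # assumed value, purely for illustration
A, B, C = 0.5, 0.1, 0.1

def epsilon(time):
    standardized_time = (time - A * EPISODES) / (B * EPISODES)
    return 1.1 - (1 / np.cosh(math.exp(-standardized_time)) + time * C / EPISODES)

print(epsilon(0))             # ~1.1: forces pure exploration early on
print(epsilon(EPISODES - 1))  # ~0.00003: near-deterministic exploitation late
```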
For the hyperparameter settings above, the decay function looks like this:
The left tail of the graph has epsilon values above 1, which, when combined with the Epsilon Greedy algorithm, forces the agent to explore more.
The right tail of the graph has epsilon values close to zero, which helps the agent exhibit quasi-deterministic behavior. This means the agent will exploit more in the later part of the episodes, but it can still explore at random. Imagine deploying an RL agent to play against human opponents: if the agent always chose the same best action, its moves could always be predicted. This decay function can be deployed for those situations as well.
There is a transition portion between the left and right tails of the graph that smoothly shifts the agent's behavior from exploration to exploitation.
The code to check the shape of the decay function:

    import matplotlib.pyplot as plt

    new_time = list(range(0, EPISODES))
    y = [epsilon(time) for time in new_time]
    plt.plot(new_time, y)
    plt.ylabel('Epsilon')
    plt.title('Stretched Exponential Decay Function')
    plt.show()
Hyperparameters for decay function
The parameter A decides where we would like the agent to spend more time: on exploration or on exploitation. For values of A below 0.5, the agent spends less time exploring and more time exploiting. For values of A above 0.5, you can expect the agent to explore more.
Decay function for A=0.3: the left tail of the graph has shortened, so the agent will explore for a shorter duration.
Decay function for A=0.7: the left tail of the graph has lengthened, so the agent will explore for a longer duration.
The parameter B decides the slope of the transition region between the exploration and exploitation zones.
With B set to 0.3, the slope becomes close to 45 degrees. Personally, I opt for B=0.1.
The parameter C controls the steepness of the left and right tails of the graph. The higher the value of C, the steeper the left and right tails become. Here as well, I prefer C=0.1.
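To see the effect of A numerically rather than from a plot, one can count how many episodes epsilon stays at or above 1 (the pure-exploration zone). A sketch, again assuming EPISODES = 10000; the exploration_length helper is my own illustration, not part of the original:

```python
import math
import numpy as np

EPISODES = 10_000  # assumed value, purely for illustration
B, C = 0.1, 0.1

def epsilon(time, A):
    standardized_time = (time - A * EPISODES) / (B * EPISODES)
    with np.errstate(over='ignore'):  # cosh overflows to inf for very negative
        sech = 1 / np.cosh(math.exp(-standardized_time))  # inputs; 1/inf == 0
    return 1.1 - (sech + time * C / EPISODES)

def exploration_length(A):
    """Number of episodes during which epsilon stays at or above 1."""
    return sum(1 for t in range(EPISODES) if epsilon(t, A) >= 1.0)

print(exploration_length(0.3))  # shorter left tail: fewer pure-exploration episodes
print(exploration_length(0.7))  # longer left tail: more pure-exploration episodes
```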
Deployment of decay function in Epsilon Greedy Algorithm
The code for the Epsilon Greedy algorithm is as follows:

    def epsilon_greedy(state, time):
        z = np.random.random()  # provides a number less than 1
        state = Q_state(state)  # state is provided by the environment
        if z > epsilon(time):
            # Exploitation: for smaller epsilon values, the agent is forced to
            # choose the best possible action, i.e. the action corresponding
            # to the max Q-value of the current state
            action = <write your code to choose the best possible action>
        else:
            # Exploration: for larger epsilon values (close to 1), the agent
            # is forced to explore by choosing a random action
            action = <write your code to choose a random action>
        return action
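For reference, here is one hypothetical way to fill in the placeholders, with a Q-table kept as a dict of NumPy arrays. Q_state, N_ACTIONS and the Q dict are illustrative assumptions; substitute your environment's own representation:

```python
import math
import numpy as np

EPISODES = 10_000          # assumed number of training episodes
A, B, C = 0.5, 0.1, 0.1
N_ACTIONS = 4              # hypothetical size of the action space

# Hypothetical Q-table: maps a hashable state key to an array of Q-values
Q = {}

def Q_state(state):
    """Hypothetical helper that converts the raw state to a hashable key."""
    return tuple(state)

def epsilon(time):
    standardized_time = (time - A * EPISODES) / (B * EPISODES)
    return 1.1 - (1 / np.cosh(math.exp(-standardized_time)) + time * C / EPISODES)

def epsilon_greedy(state, time):
    z = np.random.random()               # uniform number in [0, 1)
    key = Q_state(state)
    q_values = Q.setdefault(key, np.zeros(N_ACTIONS))
    if z > epsilon(time):
        # Exploitation: action with the max Q-value for the current state
        action = int(np.argmax(q_values))
    else:
        # Exploration: random action
        action = int(np.random.randint(N_ACTIONS))
    return action
```

Early in training epsilon(time) is above 1, so z > epsilon(time) is never true and the agent always explores; late in training epsilon is near zero, so the exploitation branch dominates while random exploration still occurs occasionally.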