Stretched Exponential Decay Function for the Epsilon Greedy Algorithm

subbiah natarajan
Published in Analytics Vidhya · May 3, 2020

While working on a Reinforcement Learning (RL) project, I was looking for a decay function that would give the RL agent the following characteristics:

  1. More dwell time for exploration during the initial part of the episodes
  2. More exploitation, with occasional random exploration, towards the end of the episodes (quasi-deterministic policy)
  3. A smooth gradient while switching from exploration to exploitation

While there were several resources on the web, I was not able to find a close match to the function I was looking for, so I ended up concocting a decay function of my own. I also learnt that this sort of decay function is called Stretched Exponential Decay.

Expression for Stretched Exponential Decay
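Reconstructed from the Python implementation below, the expression is:

epsilon(time) = 1.1 - ( 1 / cosh( exp( -(time - A*EPISODES) / (B*EPISODES) ) ) + C*time / EPISODES )

where time is the current episode index and EPISODES is the total number of training episodes.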

In Python, the code looks like this:

import math
import numpy as np

EPISODES = 100_000   # total number of training episodes (see the plot below)
A = 0.5
B = 0.1
C = 0.1

def epsilon(time):
    # shift and scale the episode index by the hyperparameters A and B
    standardized_time = (time - A * EPISODES) / (B * EPISODES)
    cosh = np.cosh(math.exp(-standardized_time))
    epsilon = 1.1 - (1 / cosh + (time * C / EPISODES))
    return epsilon

Here EPISODES is the number of iterations for which we will be training the RL agent. There are also three hyperparameters, A, B and C; we will look into them in a moment.
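A quick sanity check (using EPISODES = 100,000 from the plot below and the epsilon function defined above) shows the three regimes the function is meant to produce; the values in the comments are approximate:

print(round(epsilon(0), 3))        # ~1.1   -> pure exploration at the start
print(round(epsilon(50_000), 3))   # ~0.402 -> transition region around the midpoint
print(round(epsilon(99_999), 3))   # ~0.0   -> quasi-deterministic near the end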

For the hyperparameter settings above, the decay function looks like this:

EPISODES = 100,000, A = 0.5, B = 0.1, C = 0.1

The left tail of the graph has epsilon values above 1 which, when combined with the Epsilon Greedy Algorithm, force the agent to explore on every step.

The right tail of the graph has epsilon values close to zero, which helps the agent exhibit quasi-deterministic behavior. This means the agent will be exploiting more in the later part of the episodes, but it can still explore at random. Imagine deploying an RL agent to play against human opponents: if the agent always chose the same best action, its moves could always be guessed. So this decay function can be deployed for those situations as well.

There is a transition portion between the left and right tails of the graph that smoothly shifts the agent's behavior from exploration to exploitation.
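One way to quantify how long the agent dwells in each regime is to count the episodes falling in each zone. This is only a sketch: it reuses the epsilon function and EPISODES from above, and the 1.0 and 0.1 cut-offs are illustrative thresholds rather than part of the original formulation:

eps_values = [epsilon(t) for t in range(EPISODES)]
always_explore = sum(e >= 1.0 for e in eps_values)   # epsilon >= 1: every action is random
quasi_greedy = sum(e <= 0.1 for e in eps_values)     # epsilon <= 0.1: mostly the best action
transition = EPISODES - always_explore - quasi_greedy
print(always_explore / EPISODES, transition / EPISODES, quasi_greedy / EPISODES)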

The code to check the shape of the decay function:

import matplotlib.pyplot as plt

new_time = list(range(0, EPISODES))
y = [epsilon(time) for time in new_time]
plt.plot(new_time, y)
plt.ylabel('Epsilon')
plt.title('Stretched Exponential Decay Function')
plt.show()

Hyperparameters of the decay function

The parameter A decides where we would like the agent to spend more time: on exploration or on exploitation. For values of A below 0.5, the agent spends less time exploring and more time exploiting. For values of A above 0.5, you can expect the agent to explore more.

Decay function for A=0.3: the left tail of the graph has shortened, so the agent explores for a relatively shorter duration.

For A=0.3, the left tail is shortened

Decay function for A=0.7: the left tail of the graph has lengthened, so the agent explores for a longer duration.

For A=0.7, the left tail has lengthened
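To see the effect of A directly, the curves for a few values can be overlaid. This sketch reuses the imports, EPISODES and plotting setup from above; epsilon_for is just a hypothetical helper that takes the hyperparameters as arguments. (NumPy may emit an overflow warning from cosh deep in the left tail, but the curve is still correct since 1/cosh simply goes to zero there.)

def epsilon_for(time, a, b=0.1, c=0.1):
    # same decay function as above, with the hyperparameters passed in explicitly
    standardized_time = (time - a * EPISODES) / (b * EPISODES)
    return 1.1 - (1 / np.cosh(math.exp(-standardized_time)) + time * c / EPISODES)

for a in (0.3, 0.5, 0.7):
    plt.plot(range(EPISODES), [epsilon_for(t, a) for t in range(EPISODES)], label=f'A={a}')
plt.legend()
plt.show()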

The parameter B decides the slope of the transition region between the exploration and exploitation zones.

With B set to 0.3, the slope becomes close to 45 degrees. Personally, I opt for B=0.1.

With B=0.3, the transition portion has a gradient of about -45 degrees
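The effect of B on the transition slope can also be checked numerically with a finite difference around the midpoint of training (a rough sketch, reusing the hypothetical epsilon_for helper from the A example above):

mid = int(0.5 * EPISODES)
for b in (0.1, 0.3):
    # finite-difference slope of the curve around the midpoint of training
    slope = (epsilon_for(mid + 1_000, 0.5, b) - epsilon_for(mid - 1_000, 0.5, b)) / 2_000
    print(b, slope)   # the smaller B is, the steeper (more negative) the transition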

The parameter C controls the steepness of the left and right tails of the graph: the higher the value of C, the steeper the tails. Here as well, I prefer to use C=0.1.

With C=0.7, the left and right tails become steeper
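In the tails the cosh term is essentially constant, so the curve there is dominated by the linear C*time/EPISODES term and its slope is roughly -C/EPISODES. A small check, again using the hypothetical epsilon_for helper:

for c in (0.1, 0.7):
    # slope of the left tail, where the cosh term is still essentially zero
    left_tail_slope = (epsilon_for(10_000, 0.5, 0.1, c) - epsilon_for(0, 0.5, 0.1, c)) / 10_000
    print(c, left_tail_slope)   # roughly -c / EPISODES, so a larger C means steeper tails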

Deployment of the decay function in the Epsilon Greedy Algorithm

The code for the Epsilon Greedy Algorithm is as follows:

def epsilon_greedy(state, time):
    z = np.random.random()    # random number in [0, 1)
    state = Q_state(state)    # state is provided by the environment
    if z > epsilon(time):
        # for smaller epsilon values, the agent is forced to choose the best possible action
        # Exploitation: the action corresponding to the max Q-value of the current state
        action = <write your code to choose the best possible action>
    else:
        # for larger epsilon values (close to 1), the agent is forced to explore by choosing a random action
        # Exploration: randomly choosing an action
        action = <write your code to choose a random action>
    return action
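For completeness, here is one possible way to fill in the two placeholders, assuming a simple tabular agent. ACTIONS, Q and the pass-through Q_state below are hypothetical stand-ins, not part of the original post:

from collections import defaultdict

# Hypothetical tabular setup: ACTIONS lists the available actions and Q holds one array of
# action values per state, filled in during training.
ACTIONS = [0, 1, 2, 3]
Q = defaultdict(lambda: np.zeros(len(ACTIONS)))

def Q_state(state):
    # stand-in for the author's state helper; here it simply passes the state through
    return state

def epsilon_greedy(state, time):
    z = np.random.random()
    state = Q_state(state)
    if z > epsilon(time):
        # Exploitation: the action with the maximum Q-value for the current state
        action = ACTIONS[int(np.argmax(Q[state]))]
    else:
        # Exploration: a uniformly random action
        action = int(np.random.choice(ACTIONS))
    return action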


subbiah natarajan · Analytics Vidhya

A Mechanical Engineer by profession and an AI/Machine Learning practitioner. I wish to combine the best of both worlds to solve the toughest engineering problems.