What is the exploration–exploitation dilemma?

In reinforcement learning, and in general, humans or agents need to explore the environment in order to make decisions and choose the best option among alternatives. If we do not know the alternatives, we cannot choose the best one; that is why exploration is a must if we want to optimize our selection. Exploitation means using the sources or alternatives we already have to pick the best option from that limited, already-selected set. When we or our agents explore, we trade something off in order to exploit a better environment later. This trade-off shows up in games and in many other setups, and in general there are different problems of how to reach maximum gain in an environment. We can even view human civilization as an exploration and exploitation problem.

For a given purpose, how many resources do we need, and how do we maximize them to meet our needs? That is the main problem. However, if we do not know the other choices and alternatives, or cannot create them, we cannot reach the optimal solution. Creativity may be nature's best method for reaching an optimal solution with minimum exploration.

Depending on the environment, we may need more exploration, and in many cases the explore function provides a good gain for our agent. That is why we must first structure the problem and understand the environment that surrounds us. Games and many other problems are generally competitive ones, meaning the agents should compare their gain with the gain of other agents. We can relate this to entropy in physics: wars, fights and similar conflicts can be seen as consequences of entropy. However, when groups of people fight, information gain rises by the end of time t, so we can argue it is beneficial in terms of reducing the loss caused by entropy. That is why we need to formulate the benefit of such functions in the environment mathematically.

Simple Competitive Game

In this game, we compare two agents according to their total gain at the end of time t. If one agent's gain is bigger than the other's, we add a score to that agent.

First, I formulate a simple formula for exploring and exploiting.

Total Gain = Exploit * t1 + t2 * (Exploit + Explore * ExploreFactor) - ExploreCost

According to this formula, we use our resources to exploit through a linear function, which gives us an "Exploit" gain over time. When we explore something, we add a beneficial increase to Exploit, at the cost of exploring. Based on this, we try to find which strategy is more beneficial for our agents.
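As a minimal sketch, the formula can be written as a small Python function; the argument names mirror the terms above, with t1 the time spent only exploiting, t2 the time after exploring, and explore_cost the one-off cost of exploring:

def total_gain(exploit, explore, explore_factor, t1, t2, explore_cost):
    # Gain from pure exploitation during t1, plus the boosted
    # exploitation during t2, minus the one-off cost of exploring.
    return exploit * t1 + t2 * (exploit + explore * explore_factor) - explore_cost

# Example: exploit=10, explore=30, factor=0.3, 5 steps before exploring, 15 after
print(total_gain(10, 30, 0.3, 5, 15, 20))  # 50 + 15*19 - 20 = 315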

We can see this game as two students who build projects, learn from their environment and get points from a supervisor. With that picture in mind, let us look at the code I wrote.


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


class agent_1:
    gain = 0
    exploit = 10
    explore = 30
    exploreFactor = 0.3
    score = 0


class agent_2:
    gain = 0
    exploit = 20
    explore = 30
    exploreFactor = 0.3
    score = 0


agent_1 = agent_1()
agent_2 = agent_2()
agent_1_gainList = []
agent_2_gainList = []
time = []

for i in range(20):
    # Both agents exploit at every time step.
    agent_1.gain += agent_1.exploit
    agent_2.gain += agent_2.exploit
    agent_1_gainList.append(agent_1.gain)
    agent_2_gainList.append(agent_2.gain)
    time.append(i)

    # Every 3 steps agent_1 explores, which raises its exploit rate.
    if i % 3 == 0:
        agent_1.exploit += agent_1.explore * agent_1.exploreFactor

    # Every 5 steps the supervisor compares the agents and awards a point.
    if i % 5 == 0:
        if agent_1.gain > agent_2.gain:
            agent_1.score += 1
        elif agent_2.gain > agent_1.gain:
            agent_2.score += 1

print(agent_1.score, agent_2.score)

Agent_1GainTable = {'time': time, 'gain': agent_1_gainList}
Agent_2GainTable = {'time': time, 'gain': agent_2_gainList}
df_1 = pd.DataFrame.from_dict(Agent_1GainTable)
df_2 = pd.DataFrame.from_dict(Agent_2GainTable)

sns.set(style="whitegrid")
ax = sns.barplot(x="time", y="gain", data=df_1)
plt.show()
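To see agent_2's gain as well, the same plotting call can simply be repeated on df_2 in a new figure:

# Plot agent_2's gain the same way, on a separate figure.
plt.figure()
ax2 = sns.barplot(x="time", y="gain", data=df_2)
plt.show()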

So every 3 time steps, agent 1 explores the environment. When we explore by learning or by drawing on a beneficial source, that learning tends to be helpful, which is why we add some value to the exploit variable.

In the end we can plot the gain of agent_2 and the gain of agent_1 over time (the bar charts produced by the code above).

With 50 iterations we reach agent_1: 9 and agent_2: 1.

With 2 iterations we reach agent_1: 0 and agent_2: 1.

Although exploring gives us a nice increment, over a short time horizon the other agent can still be more successful. By analogy, even if one agent is more sophisticated, the less advanced one can be more successful over a short period of time.

The Explore Function

We can give the two agents two different explore functions. One way to measure the benefit of exploration is to measure how the explore-gain data changes over time. We can predict this data with a machine learning algorithm or a time series model.
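As a rough sketch of that idea: if we record how much each call to the explore function added to exploit, we can fit a simple trend to that history and use the predicted next benefit to decide whether exploring is still worth its cost. The helper below is hypothetical, with numpy's polyfit standing in for a real time-series or machine learning model:

import numpy as np

def predicted_explore_benefit(benefit_history):
    # Fit a linear trend to past explore benefits and extrapolate one
    # step ahead; a real model could be ARIMA, a regression, or any
    # other predictor.
    if len(benefit_history) < 2:
        return benefit_history[-1] if benefit_history else 0.0
    t = np.arange(len(benefit_history))
    slope, intercept = np.polyfit(t, benefit_history, 1)
    return slope * len(benefit_history) + intercept

# Example: benefits measured after the last few explorations
history = [9.0, 9.5, 11.0, 12.0]
print(predicted_explore_benefit(history))  # ~13.0, so exploring still pays off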

The functional body looks like this:

import pandas as pd
import seaborn as sns
import math


class agent_1:
    gain = 0
    exploit = 10
    explore = 30
    exploreFactor = 0.3
    score = 0

    def explorefunction(self):
        # Polynomial explore benefit: 30**3 - 28*30**2 - 250 = 1550 per call.
        self.exploit += self.explore ** 3 - 28 * self.explore ** 2 - 250


class agent_2:
    gain = 0
    exploit = 10
    explore = 30
    exploreFactor = 0.3
    score = 0

    def explorefunction(self):
        # Exponential-minus-power explore "benefit": exp(30) - 30**15 is a
        # huge negative number, so exploring actually hurts this agent.
        self.exploit += math.exp(self.explore) - self.explore ** 15


agent_1 = agent_1()
agent_2 = agent_2()
agent_1_gainList = []
agent_2_gainList = []
time = []

for i in range(100):
    agent_1.gain += agent_1.exploit
    agent_2.gain += agent_2.exploit
    agent_1_gainList.append(agent_1.gain)
    agent_2_gainList.append(agent_2.gain)
    time.append(i)

    # Every 3 steps both agents explore, each with its own explore function.
    if i % 3 == 0:
        agent_1.explorefunction()
        agent_2.explorefunction()

    # Every 5 steps the supervisor compares total gains and awards a point.
    if i % 5 == 0:
        if agent_1.gain > agent_2.gain:
            agent_1.score += 1
        elif agent_2.gain > agent_1.gain:
            agent_2.score += 1

print(agent_1.score, agent_2.score)

The score is agent_1: 19, agent_2: 0.

Depending on the explore functions, either agent 1 or agent 2 wins.

Determining the Explore Function

The hard part of modeling the total gain structure is that we do not know the general environment, and we need a functional body that determines how exploration adds to the total gain. In this problem, we should be able to model the information sources. For example, an agent must understand the true need of the supervisor, or transform this information gain into the supervisor's reward.

For instance, when we learn to walk, the supervisor is nature, the physical laws themselves. When we walk properly, nature rewards us. Indeed, we should be able to distinguish the correct way of walking from the incorrect one.

Thus, the supervisor in games or in nature is generally the mathematical structure of the game or of nature itself. We should be able to model the supervisor and transform it into a usable structure in our model. As said, this is not an easy task, and it can vary with every movement and environment.

If we can model the structure of the information source, we should do so. Otherwise we must explore with no information.
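As a purely hypothetical illustration of "transforming the supervisor into a usable structure": the supervisor can be modeled as a reward function the agent queries, and the explore step only raises exploit when the supervisor rewards the result. The names and the threshold below are made up for the sketch; they reuse the agent attributes from the code above:

def supervisor_reward(project_quality):
    # Hypothetical supervisor from the two-students analogy: its "true need"
    # is modeled as a simple quality threshold.
    return 1.0 if project_quality >= 0.7 else 0.0

def explore_with_supervisor(agent, project_quality):
    # Exploration is only converted into exploit gain when the supervisor
    # rewards it.
    reward = supervisor_reward(project_quality)
    agent.exploit += reward * agent.explore * agent.exploreFactor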

Exploration Space

For example, suppose we have an exploration space. The first thing to do is to configure the space of the agent and the goal; this can be done by a supervisor, which is another issue in itself. Once we have configured the exploration space of the environment, we should reach our purpose or goal in the minimum number of states with reinforcement learning.

For example, a pathway to the goal could look like this:

Agent → A(A¹ → A² → A⁴ → A³) → B(B² → B³ → B¹ → B⁴) → C(C …) → Goal

We have several problems here.

Naturally, there can be paths to the goal that jump from A to B or from A to C, and there can be states that lead from B to D, and so on.

Some clusters may not lie fully inside another cluster. Thus, we could classify these cluster types according to their behaviour.
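One way to make this concrete is to store the exploration space as a plain adjacency dictionary, so that jumps between clusters are just extra edges, and the "minimum number of states" goal becomes a shortest-path search. The node names and the A-to-C jump below are hypothetical, chosen to match the pathway above:

from collections import deque

# Exploration space as an adjacency dictionary; the edge from A3 to C1
# is a cross-cluster jump of the kind described above.
space = {
    "Agent": ["A1"],
    "A1": ["A2"], "A2": ["A4"], "A4": ["A3"],
    "A3": ["B2", "C1"],
    "B2": ["B3"], "B3": ["B1"], "B1": ["B4"],
    "B4": ["C1"],
    "C1": ["Goal"],
    "Goal": [],
}

def shortest_path(graph, start, goal):
    # Breadth-first search: reaches the goal in the minimum number of states.
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])

print(shortest_path(space, "Agent", "Goal"))
# ['Agent', 'A1', 'A2', 'A4', 'A3', 'C1', 'Goal'] -- the A-to-C jump is shorter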

Information Gain in Graphs

We can say that, when we travel over the nodes in these clusters, the degree of a node can serve as the reward for the search function. If we look at 3 nodes, we select the one with the highest degree to search on, because this node naturally carries more information than the others. We want to maximize our information as fast as possible. Thus, we select the most informative nodes, and theoretically, when we explore this way, we can find the goal node fastest in naturally structured graphs.
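A minimal sketch of that degree-greedy rule, assuming the graph is again an adjacency dictionary (as in the previous sketch) and taking a node's degree as the number of its outgoing edges:

def degree_greedy_search(graph, start, goal, max_steps=50):
    # At each node, look at up to 3 unvisited neighbours and move to the one
    # with the highest degree, i.e. the one that promises the most information.
    current, visited, path = start, {start}, [start]
    for _ in range(max_steps):
        if current == goal:
            return path
        candidates = [n for n in graph[current] if n not in visited][:3]
        if not candidates:
            return path  # dead end: no unvisited neighbour left
        current = max(candidates, key=lambda n: len(graph[n]))
        visited.add(current)
        path.append(current)
    return path

The rule is greedy, so the visited set, the dead-end check and max_steps only keep it from looping; a fuller treatment would combine the degree heuristic with backtracking or a proper search strategy.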