Reinforcement Learning and Game Theory - An Intuitive Understanding

Nirmal S
Published in Analytics Vidhya · Oct 16, 2019 · 7 min read

In recent years, machine learning and deep learning techniques have shown remarkable performance in fields such as speech processing, forecasting, computer vision, machine translation, prediction, and robotics. An overview of the various machine learning concepts is shown below:

Machine learning concepts at a glance (source)

1. Reinforcement Learning

What makes RL unique? Reinforcement learning enables machines to learn to choose actions that are on par with, or in some cases far beyond, human-level decision making, using only a set of defined parameters.

Reinforcement learning (RL) is goal-oriented learning, in which an agent is trained in an environment to reach a goal by choosing the best possible actions. For every action, a positive or negative reward can be defined. These rewards serve as feedback, from which the model can learn from its mistakes and continue to improve with each iteration.

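To make this loop concrete, here is a minimal sketch of the agent-environment interaction, assuming an OpenAI Gym style environment (the environment id and the purely random agent are illustrative choices):

```python
import gym  # assumes the classic Gym API: reset() -> state, step() -> (state, reward, done, info)

env = gym.make("FrozenLake-v0")      # illustrative environment choice
state = env.reset()                  # the agent observes the initial state
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()              # a random agent, just to show the loop
    state, reward, done, info = env.step(action)    # the environment returns feedback
    total_reward += reward                          # positive/negative rewards accumulate

print("episode return:", total_reward)
```

In a real agent, the random action choice would be replaced by a learned policy that improves as rewards accumulate.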

1.1 Platforms

Platforms are used to model RL environments, which are usually finite, stationary spaces in which agents are trained. The horizon of the agent in the environment can be finite or infinite, depending on the requirement, but the environment is usually modeled as a stochastic finite state machine in which the agents observe states and take actions to maximize their rewards.

A comparison of frequently used open-source RL platforms is given below. For detailed information, check out Overview of open source RL platforms.

Comparison of various platforms for training RL models (source)

1.2 Parameters

The necessary parameters to consider when training an RL model are as follows (a minimal sketch follows the list):

Agent: single or multiple agents
State: S
Actions: A(S), A
Model: T(s,a,s’) ~ Pr(s’|s,a)
Rewards: R(s), R(s,a), R(s,a,s’)
Policy: π(s) = a, π*(s)
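As a concrete illustration, these ingredients can be written down as plain Python data; the 3-state chain below is a made-up toy example used only to show the shapes of T, R and π:

```python
states = ["s0", "s1", "s2"]
actions = ["left", "right"]

# Model T(s, a, s') ~ Pr(s'|s, a): transition probabilities
T = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
    ("s2", "left"):  {"s1": 1.0},
    ("s2", "right"): {"s2": 1.0},
}

# Rewards R(s): reaching s2 is the goal
R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}

# A hand-written policy π(s) = a
policy = {"s0": "right", "s1": "right", "s2": "right"}
```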

1.3 RL-API

Some of the frequently used interfaces in RL are as follows (a minimal sketch follows the list):

Planner: model (T, R) → policy
Learner: transitions (s,a,r,s’) → policy
Modeler: transitions → model
Simulator: model → transitions
RL-based Planner: model → Simulator → Learner → policy
Model-based Learner: transitions → Modeler → Planner → policy
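A minimal sketch of these interfaces as Python type hints; the class and method names are illustrative choices, not a standard API:

```python
from typing import Callable, Iterable, Protocol, Tuple

State, Action = int, int
Transition = Tuple[State, Action, float, State]   # (s, a, r, s')
Policy = Callable[[State], Action]                # π(s) = a

class Model(Protocol):
    def T(self, s: State, a: Action, s_next: State) -> float: ...  # Pr(s'|s, a)
    def R(self, s: State, a: Action) -> float: ...

class Planner(Protocol):
    def plan(self, model: Model) -> Policy: ...                          # model -> policy

class Learner(Protocol):
    def learn(self, transitions: Iterable[Transition]) -> Policy: ...    # transitions -> policy

class Modeler(Protocol):
    def fit(self, transitions: Iterable[Transition]) -> Model: ...       # transitions -> model

class Simulator(Protocol):
    def sample(self, model: Model, n: int) -> Iterable[Transition]: ...  # model -> transitions
```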

2. Reinforcement Learning Algorithms

2.1 Markov Decision Process

A Markov decision process (MDP) is a decision-making framework that provides an agent in a finite environment with the best possible action in each state, in order to maximize its rewards. An MDP usually considers only the present state; if past behavior needs to be taken into account for future actions, that information has to be incorporated into the current state.

To solve an MDP, we make the following two assumptions:
● Infinite horizon in a stationary world (note: finite-horizon cases are also possible, but this article only considers the infinite case)
● Utility of sequences, i.e. a preference ordering over state sequences (this helps to choose the best policy; see the formula below)
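Concretely, the utility of a state sequence is usually taken to be the discounted sum of rewards,

U(s₀, s₁, s₂, …) = R(s₀) + 𝛾R(s₁) + 𝛾²R(s₂) + … = Σₜ 𝛾ᵗ R(sₜ), with 0 ≤ 𝛾 < 1,

so that the infinite sum stays finite and near-term rewards count for more.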

2.1.1 Value Function

The utility (value) of a state can be calculated using the Bellman equation,

U(s) = R(s) + 𝛾 · maxₐ Σ_s’ T(s,a,s’) U(s’),

where 𝛾 is the discount factor that turns an infinite sequence of rewards into a finite value. Value iteration can then be carried out as follows (a minimal sketch follows the list):
● Start with arbitrary utilities
● Update each state’s utility from its neighbors using the Bellman update
● Repeat until convergence
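A minimal value-iteration sketch in Python, assuming a tabular MDP in the dict format sketched in section 1.2 (known T and R); it is written for clarity rather than efficiency:

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in states}               # start with arbitrary utilities
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: U(s) = R(s) + γ max_a Σ_s' T(s,a,s') U(s')
            best = max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                       for a in actions)
            new_u = R[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:                         # repeat until convergence
            return U
```

Once U has converged, the optimal policy follows by taking, in each state, the action with the highest expected utility.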

2.1.2 Policy Iteration

The pseudo code for Policy iteration is as follows:

(Policy iteration pseudocode: source)
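A minimal Python sketch of the evaluate/improve loop, again using the dict-based MDP format from section 1.2 (illustrative, not optimized):

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-6):
    policy = {s: actions[0] for s in states}    # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # 1. Policy evaluation: compute the utility of the current policy
        while True:
            delta = 0.0
            for s in states:
                new_u = R[s] + gamma * sum(p * U[s2]
                                           for s2, p in T[(s, policy[s])].items())
                delta = max(delta, abs(new_u - U[s]))
                U[s] = new_u
            if delta < eval_tol:
                break
        # 2. Policy improvement: act greedily with respect to U
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: sum(p * U[s2]
                                                    for s2, p in T[(s, a)].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                               # the policy no longer changes
            return policy, U
```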

2.1.3 MDP Approach

The three frequently used MDP approaches are:
● Policy search: state → policy → action
● Value based: state → utility → value
● Model based: (s,a) → (transition, reward) → (s’,r)

These three approaches can be related as follows: transitions can be used to learn a model, solving the Bellman equations turns the model into a utility (value) function, and taking the argmax over actions turns the utility function into a policy.

2.2 Q-Learning

Q-learning is based on the action-value function Q(s, a), which was introduced to further optimize the agent’s choices: it lets us compute the optimal utility and optimal policy without needing to learn the transition probabilities or the reward function. Q(s, a) is defined as the value of arriving in state s, taking action a, and proceeding optimally thereafter.

The pseudo code for Q-learning is as follows (where α is the learning rate):

(Q-learning pseudocode: source)
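A minimal tabular Q-learning sketch; it assumes a classic Gym-style environment (reset() returning the state, step() returning (state, reward, done, info)) with discrete, hashable states, and the hyperparameter values are illustrative:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(s, a)] -> estimated value
    n_actions = env.action_space.n
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy action selection (see section 2.2.1)
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            s2, r, done, info = env.step(a)
            # Q-learning update: nudge Q(s,a) toward r + γ max_a' Q(s',a')
            target = r + gamma * max(Q[(s2, a_)] for a_ in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```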

2.2.1 Ɛ-Greedy Exploration

Ɛ-greedy exploration is a way of selecting actions at random with probability Ɛ, and greedily (the best-known action) otherwise. It uses Greedy in the Limit with Infinite Exploration (GLIE) to decay Ɛ over time. This captures the fundamental trade-off in RL between exploration and exploitation and helps reduce sub-optimal regret.

So, over time, both the optimal Q (learn, i.e. exploration) and the optimal π (use, i.e. exploitation) are obtained. Exploration-exploitation problems such as slot machines (multi-armed bandits) are the best examples for understanding Ɛ-greedy exploration.
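A minimal sketch of Ɛ-greedy selection with a GLIE-style decay, where Ɛ shrinks as a state is visited more often (the 1/N(s) schedule and the variable names are illustrative choices):

```python
import random

def epsilon_greedy(q_values, state, n_actions, visit_counts):
    visit_counts[state] = visit_counts.get(state, 0) + 1
    epsilon = 1.0 / visit_counts[state]        # GLIE: ε decays toward 0 over time
    if random.random() < epsilon:
        return random.randrange(n_actions)     # explore: random action
    # exploit: best-known action for this state
    return max(range(n_actions), key=lambda a: q_values.get((state, a), 0.0))
```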

2.3 Game Theory

Game theory is generally defined as the mathematics of conflict and is used in various fields such as economics, psychology, AI, and sociology. In game theory with respect to RL, a policy corresponds to a strategy: a mapping from every possible state to an action, for one of the players of the game.

The types of games in multi-agent RL (MARL) are:
● Static Games: players are independent and make simultaneous decisions
● Stage Games: the rules depend on the specific stage of the game
● Repeated Games: a game played repeatedly in sequence

The basic game for understanding game theory is the 2-player, zero-sum, finite, deterministic game of perfect information. Further insight can be obtained by relaxing these properties: non-deterministic outcomes, hidden information, non-zero-sum payoffs. Andrew Moore’s materials cover a wide range of games that provide insight into various game theory concepts.

2.3.1 Nash Equilibrium

Consider n players in a game with strategies s1, s2, …, sn. The strategies are said to be in a Nash equilibrium (NE) if and only if each player’s strategy is the best response to the strategies of all the other players. The three fundamental theorems for NE are (a small sketch follows the list):
● In an n-player pure-strategy game, if the iterated elimination of strictly dominated strategies eliminates all but one combination, then that combination is the unique NE.
● Any NE will survive the iterated elimination of strictly dominated strategies.
● If n is finite and each s(i) is finite, there exists at least one (possibly mixed) NE.
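A small sketch that checks whether a pure-strategy profile of a 2-player bimatrix game is a Nash equilibrium; the prisoner’s dilemma payoffs below are a standard textbook example:

```python
def is_nash(payoff1, payoff2, row, col):
    # no profitable unilateral deviation for player 1 (rows) ...
    best_row = all(payoff1[row][col] >= payoff1[r][col] for r in range(len(payoff1)))
    # ... and none for player 2 (columns)
    best_col = all(payoff2[row][col] >= payoff2[row][c] for c in range(len(payoff2[0])))
    return best_row and best_col

# Prisoner's dilemma payoffs: action 0 = cooperate, 1 = defect
P1 = [[-1, -9], [0, -6]]   # player 1's payoffs
P2 = [[-1, 0], [-9, -6]]   # player 2's payoffs

print(is_nash(P1, P2, 1, 1))   # (defect, defect) -> True: the unique NE
print(is_nash(P1, P2, 0, 0))   # (cooperate, cooperate) -> False
```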

2.3.2 IPD Strategies

Tit for Tat (TFT) and Grim Trigger are two of the most famous strategies for dealing with the iterated prisoner’s dilemma (a repeated game); both are sketched in code after the lists below.

The action of a player following TFT is decided using the following strategy:
● Cooperate on the first round
● Copy the opponent’s previous move thereafter

The action of a player following Grim Trigger is as follows:
● Continue to cooperate with the other players
● If any other player defects even once, defect forever thereafter
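A minimal sketch of both strategies as Python functions (“C” = cooperate, “D” = defect; the history-based interface is an illustrative choice):

```python
def tit_for_tat(my_history, opp_history):
    if not opp_history:            # cooperate on the first round
        return "C"
    return opp_history[-1]         # then copy the opponent's previous move

def grim_trigger(my_history, opp_history):
    if "D" in opp_history:         # if the opponent has ever defected ...
        return "D"                 # ... defect forever
    return "C"                     # otherwise keep cooperating

def play(strategy1, strategy2, rounds=5):
    h1, h2 = [], []
    for _ in range(rounds):
        m1, m2 = strategy1(h1, h2), strategy2(h2, h1)
        h1.append(m1)
        h2.append(m2)
    return h1, h2

print(play(tit_for_tat, grim_trigger))   # both cooperate on every round
```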

2.3.3 Minmax Profile

● The minmax profile is the pair of payoffs, one for each player, where each payoff is what that player can achieve while defending itself against a malicious, randomizing adversary.
● So it is computed as in zero-sum games, where each player tries to minimize the other player’s reward.
● The minmax profile is defined with pure strategies; in the mixed-strategy case, the corresponding payoffs are called security-level payoffs (a small pure-strategy sketch follows this list).
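A small sketch of pure-strategy security levels (the minmax profile) for a 2-player bimatrix game; the payoff matrices are made-up illustrative numbers:

```python
def security_level(payoff):
    # the best payoff a row player can guarantee when the opponent then
    # responds with the column that hurts them the most
    return max(min(row) for row in payoff)

P1 = [[2, 0], [0, 1]]   # player 1's payoffs (rows = player 1's actions)
P2 = [[1, 0], [0, 2]]   # player 2's payoffs (columns = player 2's actions)

minmax_profile = (
    security_level(P1),                            # player 1's guaranteed payoff
    security_level([list(c) for c in zip(*P2)]),   # player 2's (transpose so rows = her actions)
)
print(minmax_profile)   # (0, 0) for these payoffs
```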

2.3.4 Folk Theorem

● The general idea of the Folk theorem is that “in repeated games, the possibility of retaliation opens the door for cooperation.”
● In game theory, the term refers to a particular result: it describes the set of payoffs that can result from Nash strategies in repeated games.
● From the payoffs we can obtain the feasible region (average payoffs of joint strategies) and the acceptable region (payoffs preferable to the minmax profile).
● Any feasible payoff profile that strictly dominates the minmax/security level can be realized as a Nash equilibrium payoff with a sufficiently large discount factor, because a profile that strictly dominates the minmax profile can be used as a threat.

2.3.5 Subgame perfect

● A strategy is said to be subgame perfect if the player always chooses the best response, independent of the history.
● In simple words, subgame perfection rules out implausible threats and continues to choose the best action at every point in the game.
● TFT and Grim Trigger form Nash equilibria, but they are not subgame perfect.

2.3.6 Pavlov strategy

Pavlov is yet another IPD strategy, in which players cooperate if they agreed on the previous round and defect if they disagreed (a small sketch follows the list below). Pavlov is both a Nash equilibrium and subgame perfect. The computational folk theorem (CFT) can be used to build Pavlov-like machines, and it can construct a subgame-perfect Nash equilibrium for any two-player game in polynomial time. The CFT construction yields one of the following:
● Pavlov if possible
● Zero-sum like (2-player)
● At least one player improves
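A minimal Pavlov sketch in the same history-based style as the earlier strategies: cooperate whenever the two players agreed (made the same move) on the previous round, defect whenever they disagreed:

```python
def pavlov(my_history, opp_history):
    if not my_history:                      # cooperate on the first round
        return "C"
    if my_history[-1] == opp_history[-1]:   # agreed last round -> cooperate
        return "C"
    return "D"                              # disagreed last round -> defect
```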

3. Acknowledgment

This article is a basic overview of RL. Currently, I am working on an RL tutorial series that will give readers a brief understanding of the various parameters and techniques in RL, with practical implementations in Python. The tutorial series is being created based on my understanding of the RL materials from Georgia Tech, Practical RL by the National Research University Higher School of Economics, and various blogs from experts in the field (I have mentioned the links for the same).
