Introduction to Reinforcement Learning Using Unity ML Agents-Part 1

Ankush k Singal · Published in AI Artistry · Aug 27, 2023 · 5 min read

Overview Of Reinforcement Learning

Reinforcement learning is a type of machine learning in which an agent learns to make decisions by interacting with its environment. The agent receives feedback in the form of rewards and punishments, and it uses this feedback to learn which actions are more likely to lead to rewards.

Reinforcement Learning Components (a minimal interaction-loop sketch follows this list):

  • Agent: The agent is the entity that is learning to behave in the environment.
  • Environment: The environment receives the agent’s action, returns feedback in the form of a reward, and transitions to a new state based on that action. Part of the new state is then passed back to the agent as its next observation/state.
  • Action: An action is something that the agent can do in the environment.
  • Reward: A reward is a signal that the agent receives after taking an action. Rewards can be positive, negative, or neutral.
  • Policy: A policy is a function that tells the agent what action to take in a given state.
  • State: A state is a description of the agent’s current situation in the environment.
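Putting these components together, the interaction can be written as a simple loop: the agent observes a state, chooses an action with its policy, and the environment returns a reward and the next state. Below is a minimal, hypothetical sketch of that loop in Python; the ToyEnvironment and random policy are made up for illustration and are not part of Unity ML-Agents.

import random

# A toy environment: the state is an integer, and the episode ends after 10 steps.
class ToyEnvironment:
    def reset(self):
        self.state, self.steps = 0, 0
        return self.state  # initial observation/state

    def step(self, action):
        # Apply the action, then return (next_state, reward, done).
        self.state += action
        self.steps += 1
        reward = 1.0 if self.state == 3 else 0.0
        done = self.steps >= 10
        return self.state, reward, done

# A toy policy: map the current state to an action (here, chosen at random).
def policy(state):
    return random.choice([-1, 0, 1])

env = ToyEnvironment()
state = env.reset()
total_reward = 0.0
done = False
while not done:
    action = policy(state)                  # agent chooses an action
    state, reward, done = env.step(action)  # environment returns feedback
    total_reward += reward                  # accumulate the reward signal
print("episode return:", total_reward)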

Reinforcement Learning Algorithms

A non-exhaustive, but useful taxonomy of algorithms in modern RL (https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html)

Deep reinforcement learning extends state-based methods by using neural networks as high-dimensional function approximators: instead of enumerating discrete states, it works with continuous distributions over states and tries to drive the loss function toward a global minimum. This is the approach taken by algorithms such as DQN, double deep Q-network (DDQN), dueling DQN, actor-critic (AC), proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), trust region policy optimization (TRPO), and soft actor-critic (SAC).

Before discussing the Unity ML-Agents Toolkit, let us understand the fundamentals of state-based RL.

Markov Processes
The field of reinforcement learning is based on the formalism of Markov processes. Before we dive deep into learning (behavior optimization) algorithms, we need to have a good grasp of this fundamental theoretical construct. In this section, we will go over:

  • Markov chains
  • Markov reward processes
  • Markov decision processes

Markov chains: A Markov chain is described by a set of states and a transition probability, defined as the probability of moving to state S_(t+1) given the state S_t at the prior time step. The next state depends only on the current state, not on the earlier history (the Markov property).

An example of an episodic Markov chain with one end state depicted by way of a square box
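To make the definition concrete, here is a small illustrative sketch (not from the article) that samples a trajectory from a two-state Markov chain; the transition matrix values are assumptions chosen to match the example used later.

import numpy as np

rng = np.random.default_rng(0)

# Rows are current states, columns are next states; each row sums to 1.
transition_mat = np.array([[0.7, 0.3],
                           [0.2, 0.8]])

state = 0                 # start in state 0
trajectory = [state]
for _ in range(10):
    # Sample the next state using the row of probabilities for the current state.
    state = int(rng.choice(2, p=transition_mat[state]))
    trajectory.append(state)

print(trajectory)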

Markov Reward Process: The Markov reward process (MRP) is defined by the tuple (S, P, R, γ), where S is the set of states, P is the state-transition probability, R is the reward function (with R_s the reward for state s), and γ is the discount factor, which will be covered in the coming sections.

The state reward R_s is the expected reward from state s, averaged over all the possible states one can transition to from s. This reward is associated with being in state S_t; by convention, it is said to be received after the agent leaves the state, and is hence denoted R_(t+1).
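As a hedged sketch (my own illustration, not code from the article), the state values of a small MRP can be computed in closed form from the Bellman equation v = R + γPv, i.e. v = (I − γP)⁻¹R; the reward vector and discount factor below are assumed values.

import numpy as np

transition_mat = np.array([[0.7, 0.3],
                           [0.2, 0.8]])   # state-transition probabilities P
rewards = np.array([1.0, 0.0])            # assumed expected reward R_s for each state
gamma = 0.9                               # assumed discount factor

# Solve v = R + gamma * P v, i.e. (I - gamma * P) v = R
values = np.linalg.solve(np.eye(2) - gamma * transition_mat, rewards)
print(values)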

Markov Decision Process: An MDP extends the reward process by adding the concept of “action.” In an MRP the agent has no control over the outcome; everything is governed by the environment. Under the MDP regime, however, the agent can choose actions based on the current state/observation, and it can learn to take the actions that maximize the cumulative reward, i.e., the total return G_t.
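The total return is the discounted sum of future rewards, G_t = R_(t+1) + γR_(t+2) + γ²R_(t+3) + …; here is a quick illustrative calculation (my own example, with an assumed reward sequence and discount factor).

import numpy as np

rewards = np.array([0.0, 0.0, 1.0, 0.0, 2.0])  # assumed rewards R_(t+1), R_(t+2), ...
gamma = 0.9                                    # assumed discount factor

# Discounted return: G_t = sum over k of gamma^k * R_(t+k+1)
discounts = gamma ** np.arange(len(rewards))
G_t = np.sum(discounts * rewards)
print(G_t)  # 0.9**2 * 1.0 + 0.9**4 * 2.0 = 2.1222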

Within this section, our attention will be directed towards the process of producing transitional states that bridge various decisions. Additionally, we’ll delve into the development of simulations within the Unity Engine using these transitions. A step-by-step guide will be provided on how the enumeration of states and the utilization of Hidden Markov Models (HMMs) can support an agent in charting an optimal path within a Unity environment, thus enabling the agent to achieve its rewards effectively.

Here is some example code to illustrate how transition probabilities propagate over repeated turns:

import numpy as np

# Row-stochastic transition matrix: rows are current states, columns are next states.
transition_mat = np.array([[0.7, 0.3],
                           [0.2, 0.8]])

# Initial value assigned to each of the two states.
initial_values = np.array([1.0, 0.5])

# Transitioning for 3 turns
transition_mat_3 = np.linalg.matrix_power(transition_mat, 3)

# Transitioning for 10 turns
transition_mat_10 = np.linalg.matrix_power(transition_mat, 10)

# Transitioning for 35 turns
transition_mat_35 = np.linalg.matrix_power(transition_mat, 35)

# Output estimation of the values after a single transition
output_values = np.dot(initial_values, transition_mat)
print(output_values)
# Output values after 3 iterations
output_values_3 = np.dot(initial_values, transition_mat_3)
print(output_values_3)
# Output values after 10 iterations
output_values_10 = np.dot(initial_values, transition_mat_10)
print(output_values_10)
# Output values after 35 iterations
output_values_35 = np.dot(initial_values, transition_mat_35)
print(output_values_35)

The initial value vector is set to [1.0, 0.5] for the two states, and the transition matrix is initialized as shown above. We then raise the transition matrix to the 3rd, 10th, and 35th powers and multiply the initial value vector by each result. This gives the value of each state after the corresponding number of transitions.
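As the number of transitions grows, the powers of the transition matrix converge, so the printed values settle down. One way to check this (an illustrative sketch, not part of the original code) is to compare the long-run result against the chain’s stationary distribution, the left eigenvector of the transition matrix with eigenvalue 1.

import numpy as np

transition_mat = np.array([[0.7, 0.3],
                           [0.2, 0.8]])

# Left eigenvector with eigenvalue 1 gives the stationary distribution.
eigvals, eigvecs = np.linalg.eig(transition_mat.T)
stationary = np.real(eigvecs[:, np.isclose(eigvals, 1.0)]).flatten()
stationary = stationary / stationary.sum()
print(stationary)  # approximately [0.4, 0.6]

# Each row of a high matrix power approaches the stationary distribution.
print(np.linalg.matrix_power(transition_mat, 35))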

Markov Model with Unity ML-Agent(Huggy)

Source: https://huggingface.co/Andyrasika/ppo-Huggy

The game is a simulation in which Huggy (from Unity ML-Agents) tries to find sticks as soon as they are spawned by a Markov process. The sticks are initialized with predefined probability states, and a transition matrix is provided. At each iteration of the simulation, the stick with the highest self-transition probability is selected while the rest are destroyed. Huggy’s task is to locate that stick at each iteration, and he gets a short rest of 6 seconds whenever he reaches one correctly. Since the transition probabilities are computed very quickly, the steps taken by Huggy are effectively instantaneous. This is a purely randomized distribution of Markov states, where the state-transition probabilities are computed on the fly.
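A rough, hypothetical sketch of the stick-selection logic described above might look as follows in Python; this is an illustrative reimplementation with assumed function names (spawn_sticks, select_target_stick), not the actual Unity/C# code of the Huggy project.

import numpy as np

rng = np.random.default_rng()

def spawn_sticks(n_sticks):
    # Build a random transition matrix whose rows sum to 1.
    matrix = rng.random((n_sticks, n_sticks))
    return matrix / matrix.sum(axis=1, keepdims=True)

def select_target_stick(transition_matrix):
    # The stick with the highest self-transition probability survives;
    # the rest are destroyed in this iteration.
    return int(np.argmax(np.diag(transition_matrix)))

for iteration in range(3):
    transition_matrix = spawn_sticks(n_sticks=4)
    target = select_target_stick(transition_matrix)
    print(f"iteration {iteration}: Huggy should move to stick {target}")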

LinkedIn: You can follow me on LinkedIn to keep up to date with my latest projects and posts. Here is the link to my profile: https://www.linkedin.com/in/ankushsingal/

GitHub: You can also support me on GitHub. There I upload all my Notebooks and other open source projects. Feel free to leave a star if you liked the content. Here is the link to my GitHub: https://github.com/andysingal?tab=repositories

Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.

Remember, each “Like”, “Share”, and “Star” greatly contributes to my work and motivates me to continue producing more quality content. Thank you for your support!


My name is Ankush Singal and I am a traveller, photographer, and Data Science enthusiast.