RL: The foundational elements of the Reinforcement Learning (hello) world

Branavan Selvasingham
Learning Reinforcement Learning
4 min read · Oct 27, 2022


This is a high-level summary of the main takeaways from the various resources (links at the bottom) I’ve used to get ramped up in this domain.

So you’ve chosen your platform to begin your RL journey. I chose OpenAI Gym (link to post below).

Now you want to quickly get to the point where you can sit back like a proud parent and watch your agent interact and learn, through trial and error, its way through the environment and find a way to maximize the reward objective that you defined. Hmmmm… sounds eerily similar to nature and how human and animal psychology evolve over time.

Wait. Are we in a reinforcement learning simulation?? OK, let’s back up. (Some interesting tangents and experiments from this will be discussed much later.)

Let’s start slow and get a good handle on all the foundational elements involved (from one student to another).

Here are high-level summaries of the five foundational elements to understand: wind, water, fire… just kidding… Agent, Environment, State, Action, Reward.

The foundational elements of RL (image by author)

Note: The below groupings are somewhat subjective as the content overlaps quite a bit across the different elements.


Environment

The environment is your hermetically sealed world. It contains an underlying problem statement or maximization objective. The environment is simulated / designed / built with defined rules and dynamics that allow an Agent to interact with it and receive back the new state and the reward associated with that action.

Side note: As far as I have seen, the environment’s rules and dynamics cannot be influenced by the agent, but I wonder whether that holds for use cases such as an RL agent making large transactions in a capital markets scenario, where the environment’s rules and dynamics might themselves be impacted by the agent’s actions.
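To make the environment’s role concrete, here is a minimal sketch of a hypothetical environment (a made-up one-dimensional “LineWorld”, not from the post): the agent submits an action, and the environment applies its own fixed dynamics and hands back the new state and a reward.

```python
class LineWorld:
    """Hypothetical 1-D environment: the agent starts at position 0
    and is rewarded for reaching position +3. The step dynamics are
    fixed by the environment and cannot be altered by the agent."""

    def __init__(self):
        self.position = 0

    def reset(self):
        # Return the environment to its initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (move left) or +1 (move right). The environment
        # applies its dynamics, then returns (new state, reward, done).
        self.position += action
        done = self.position == 3
        reward = 1.0 if done else -0.1  # small cost per move
        return self.position, reward, done
```

The names `reset` and `step` mirror the convention popularized by OpenAI Gym, but this class is just an illustration of the idea, not the Gym API itself.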


Agent

The Agent is your only way to interact with the environment. It is essentially the model you’re looking to train, embodied (even if just as a dot) within the environment. In all the use cases and examples I’ve seen, it is a ‘physically’ present participant in the environment. In other words, it is not omnipresent throughout the environment and is not simply a passive observer. It plays an active role.

One of the interesting variations arises from whether you’re in a single-agent or multi-agent scenario. The distinction is fairly self-explanatory, but the interesting aspect is that even in a multi-agent setting, for any specific agent, the other agents are simply part of the environment (and are represented in the state).

The decision process the agent uses to determine which action to take is called a “Policy”. This is where the Markov Decision Process (MDP) comes into play, and you can learn more about that in the links provided below. The Policy is essentially the agent’s brain and often becomes a synonym for the agent itself.
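A policy can be as simple as a function from the current state to an action. As a toy illustration (my own sketch, with an assumed two-action set of −1 and +1), here is an epsilon-greedy policy over a table of estimated action values:

```python
import random

def epsilon_greedy_policy(q_values, state, epsilon=0.1):
    """Toy policy: with probability epsilon explore (random action),
    otherwise exploit the action with the highest estimated value.
    q_values is a dict mapping (state, action) -> estimated return."""
    actions = [-1, +1]  # assumed action set, for illustration only
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))  # exploit
```

With `epsilon=0` the policy is fully greedy; raising it trades exploitation for exploration, which is one of the central tensions in RL.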


Action

The only way the Agent can interact with the environment is through a defined set of Actions. These could include “Move Left”, “Move Right”, “Fire main thrusters at 75%”, and so on. Once the action is taken at time t within the environment, the agent must observe the results from the environment, via State and Reward.

The available action “set” is referred to as the action space, and action spaces broadly come in two categories: discrete and continuous. A discrete action space is your typical simplified set of finite moves (think old-school Nintendo game controller): “up”, “down”, “left”, or “move bishop to XY”. A continuous action space fits more real-world use cases such as acceleration, speed, turn angle, etc.
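The discrete/continuous distinction can be sketched in a couple of lines (my own illustrative values, including the steering-angle bounds):

```python
import random

# Discrete action space: a finite set of choices.
DISCRETE_ACTIONS = ["up", "down", "left", "right"]
discrete_sample = random.choice(DISCRETE_ACTIONS)

# Continuous action space: any real value within bounds,
# e.g. a steering angle in degrees (bounds are made up here).
def sample_steering_angle(low=-30.0, high=30.0):
    return random.uniform(low, high)
```

Discrete spaces let the agent enumerate its options; continuous spaces cannot be enumerated, which is why many algorithms handle the two cases differently.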

The other main consideration is whether the agent’s actions are episodic or sequential. That is, does each action stand alone for each iteration (e.g. learning to play ping-pong), or must the agent account for all the previous actions that led to the current state (e.g. learning to play chess)?


State

A state, in the truest sense, is a complete representation of the world at a given time t. However, the agent sees the state through observations. The observations may be complete representations of the state (like knowing the full chess board), or they may be partial representations, such as what an autonomous-driving agent sees (usually limited to its own first-person perspective).
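The full-versus-partial distinction can be sketched with a made-up state dictionary (all names and positions here are invented for illustration): a fully observable agent sees everything, while a partially observable one sees only what falls within its sight radius.

```python
# Hypothetical full state: a complete description of the world.
state = {
    "agent_pos": (2, 3),
    "goal_pos": (7, 7),
    "other_agent_pos": (5, 1),
}

def full_observation(state):
    # Fully observable (like chess): the agent sees the entire state.
    return dict(state)

def partial_observation(state, sight=2):
    # Partially observable (like first-person driving): only entities
    # within `sight` squares of the agent are visible.
    ax, ay = state["agent_pos"]
    visible = {}
    for key, (x, y) in state.items():
        if abs(x - ax) <= sight and abs(y - ay) <= sight:
            visible[key] = (x, y)
    return visible
```

Here the partial observer sees itself but neither the goal nor the other agent, since both lie outside its sight radius.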


Reward

Reward is the main thing your agent (via its policy) is trying to maximize. With each action, the agent is trying to maximize this reward, though sometimes the policy should allow for bypassing a local maximum in order to find a larger (ideally global) maximum. Quite a bit of interesting thought needs to go into designing the reward system for the environment itself, because this is non-trivial and there isn’t always an easy answer for what the right reward increment is for a given action. The most interesting thing, which may seem obvious to some, is that the reward is always a scalar. Whether you’re trying to win at chess, drive a car, or maximize happiness, it all gets boiled down to a single number that the agent tries to maximize.


This was a high-level overview of the elements and concepts needed to get you through the door. Now that you’re in the RL world, let’s continue to dive deeper in the next posts.


Awesome starter resources that I’ve found particularly helpful:

1. RL Course by David Silver (DeepMind) at University College London (playlist)

2. OpenAI Spinning Up by Josh Achiam

3. Reinforcement Learning: An Introduction (textbook) by Richard S. Sutton and Andrew G. Barto


