Take the (IoT) world by Reinforcement Learning

Sean Mirchi · Published in grandcentrix · May 13, 2019

Smart services are an important element of Internet of Things (IoT) ecosystems, where the intelligence behind the services is obtained and improved through sensory data. Providing a large amount of labeled training data is not always feasible, so we need to consider alternative approaches that can also work with unlabeled data. In recent years, Reinforcement Learning has achieved great success in several application domains. It is a fitting method for IoT scenarios, where auto-generated data can be partially, but not fully, labeled for training purposes. We will look into the details of Reinforcement Learning algorithms here, but first let’s begin with a short introduction.

When you try to solve a problem with Machine Learning, the first question is always: which category does this problem fall into?

There are 3 main categories in Machine Learning (besides some mixtures and sub-categories):

  • Supervised Learning: Here you have labeled data and basically try to train a model to understand the relations between a label and its data. For example, you train a model to recognise whether a picture shows a dog or a cat by feeding it thousands of cat and dog pictures. You basically tell the model “this is a cat” and “this is a dog” a thousand times (or even more), and expect it to recognise this afterwards for any given picture.
  • Unsupervised Learning: Here you don’t have labeled data, and you are interested in discovering clusters or groups in your data. For example, you would like to put your customers into multiple groups/clusters based on their shopping habits, but the data is too complicated to do this manually and you don’t even know whether there is any relation at all. In such cases you can easily use one of the many unsupervised techniques.
  • Reinforcement Learning: Have you ever trained a dog? If you have, you probably already know this case. When training a dog, you usually reward it when it succeeds and punish or ignore it when it does not. It is the same here: you try to train a model (usually called an Agent) to achieve a specific goal by following some rules in a specific environment. The foundation of these algorithms is a Reward/Penalty system. For each action of the agent you provide a reward or a penalty, and the agent tries to collect the highest total reward in the end. This is what we are going to talk about in detail here, so no worries if it feels complicated now.

Reinforcement Learning

You have probably heard about AlphaGo, the AI that defeated the world’s number one Go player Ke Jie, or AlphaStar, a master at StarCraft II, and, even more recently, OpenAI Five, which beat the reigning Dota 2 world champions. All of these are applications of Reinforcement Learning.

So how does Reinforcement Learning actually work? That is what we are going to cover here. We will also go into some implementation details, using OpenAI Gym as our basis.

Basics

The whole process in Reinforcement Learning is divided into different aspects. Let’s describe it as simply as possible here; later we will explain all of these in detail.

First, you define an Agent. Basically, this means utilising a machine learning algorithm to enable learning from experience; for example, this can be a neural network or just a simple matrix (usually called a Q-Table). Then you define an environment, which contains a set of possible actions and some rules that determine its states and rewards/penalties.

Then the training iteration begins. In each iteration, the agent chooses an action and informs the environment about it. The environment then returns a number: a positive one if the action is rewarded, and a negative one if it is penalised. Additionally, some other data is returned that we will ignore for now. This process of choosing an action and receiving a reward/penalty is called a step. The agent remembers each (action, reward/penalty) combination. The step iteration repeats until the environment tells the agent that it is done, or until we hit the maximum number of steps allowed (which we can define). This is called an episode: a series of actions chosen by the agent, each resulting in some reward or penalty.

After an episode is finished, we reset the whole environment and restart from the beginning. Keep in mind that only the environment is reset; the agent keeps everything it has learned from the previous episode (and from even earlier ones).

This iteration continues for a long time, until we are confident that our agent understands all aspects of our environment.

When the training is done, you can simply save the agent and use it to solve the defined problem.
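As a rough sketch, the whole loop described above looks like this with Gym’s classic Python API (shown here with the built-in CartPole-v1 environment and a purely random agent, just to illustrate the structure; newer Gym versions return slightly different values from reset and step):

```python
import gym

# A minimal sketch of the episode/step loop: reset, step until done, repeat.
env = gym.make("CartPole-v1")

for episode in range(10):                  # a handful of episodes for illustration
    state = env.reset()                    # restart the environment, the agent keeps its knowledge
    done = False
    total_reward = 0

    while not done:
        action = env.action_space.sample()            # the agent chooses an action (randomly here)
        state, reward, done, info = env.step(action)  # the environment answers with a reward/penalty
        total_reward += reward                        # a real agent would learn from this

    print(f"Episode {episode}: total reward {total_reward}")

env.close()
```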

Defining an Environment

As mentioned previously, for Reinforcement Learning you have to define an environment. Creating an environment essentially means defining a clear set of rules: Which actions is the agent allowed to take in this environment? Which states can this environment be in? How are rewards and penalties defined? What is an episode’s termination condition, its done state?

To provide all these rules in an OpenAI Gym environment, you have to define:

  • action_space: possible actions that the agent can select from
  • observation_space: possible states that can be observed in the environment based on an action
  • state: current state of the environment

And define these methods:

  • __init__: initialise the environment with all the default values
  • step: accepts an action, applies it, and returns the new state, the reward, the done flag and optional debug info
  • reset: clear all the variables in the environment and reset it to its initial state
  • render: provide output for better debugging or showcasing

To make all of this clearer, let’s take a look at a simple implementation.

Implementing an Environment

For the purpose of this post, let’s try to solve a simple problem. Imagine we have a field with 4 rows and 6 columns, with only 4 chairs in the centre. We would like to train an agent that prefers sitting on a chair over just walking around the field.

This is our field, with chairs marked as + in the centre

We will create an environment for this and train an agent that walks straight to the chairs and only sits down on a chair.

To define a custom environment in Gym, we have to create a class that inherits from Gym’s base environment class, gym.Env, and define the required methods.
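A skeleton of such a class, before any logic is filled in, could look roughly like this:

```python
import gym


class ChairFieldEnv(gym.Env):
    """Custom Gym environment: a 4x6 field with 4 chairs in the centre."""
    metadata = {"render.modes": ["human"]}

    def __init__(self):
        pass

    def step(self, action):
        pass

    def reset(self):
        pass

    def render(self, mode="human"):
        pass
```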

This would be the bare bones of a custom environment. So now let’s define our initial values by implementing the __init__ method.
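One possible version looks like this; the exact MAP layout and the action numbering are assumptions that match the description in the next paragraph:

```python
import gym
from gym import spaces

# Visual representation of the 4x6 field, with the chairs marked as '+'
# (assumed to occupy rows 1-2 and columns 2-3, i.e. the centre of the field).
MAP = [
    "+-----------+",
    "| : : : : : |",
    "| : :+:+: : |",
    "| : :+:+: : |",
    "| : : : : : |",
    "+-----------+",
]


class ChairFieldEnv(gym.Env):
    metadata = {"render.modes": ["human"]}

    def __init__(self):
        self.max_row = 3   # row indices 0..3 (4 rows)
        self.max_col = 5   # column indices 0..5 (6 columns)

        # 4 * 6 = 24 possible positions, each encoded into a single number
        self.observation_space = spaces.Discrete(24)
        # 0 = up, 1 = down, 2 = left, 3 = right, 4 = sit
        self.action_space = spaces.Discrete(5)

        # start in the top-left corner (row 0, column 0), encoded as 0
        self.state = 0
```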

By defining the MAP variable we create a visual representation of our environment, so that we can show it in our output using the render function that we will define later. In the __init__ function we also define the boundaries of our environment via max_row and max_col; later we can use these variables to detect whether the agent has left the field. Two important variables to define here are observation_space and action_space. The observation space defines the possible states of the environment: since our field has 4 rows and 6 columns, the agent can be in 24 possible states (4 * 6 = 24), one for each position in the field. The action space lists every possible action our agent can take; in our environment these are 5 actions, four for moving around the field and one for sitting down.

In each step the agent chooses one of these actions and sends it to our environment. We then have to apply our reward/penalty policy and answer with the corresponding reward or penalty, based on the current state and the chosen action. For this we have to implement the step function.
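A sketch of step, together with the encode/decode helpers discussed below; the -10/+10 values follow the description in this post, while the small -1 penalty for ordinary moves and the penalty for sitting on the grass are assumptions:

```python
class ChairFieldEnv(gym.Env):
    # ... __init__ as defined above ...

    def encode(self, row, col):
        # Flatten (row, col) into a single number: row index times 6 plus column index.
        return row * 6 + col

    def decode(self, state):
        # Reverse of encode: integer division gives the row, modulus gives the column.
        return state // 6, state % 6

    def step(self, action):
        row, col = self.decode(self.state)
        reward = -1      # assumption: small penalty for just walking around
        done = False     # this environment never signals done by itself

        if action == 0:        # up
            if row == 0:
                reward = -10   # would leave the field: penalise, stay in place
            else:
                row -= 1
        elif action == 1:      # down
            if row == self.max_row:
                reward = -10
            else:
                row += 1
        elif action == 2:      # left
            if col == 0:
                reward = -10
            else:
                col -= 1
        elif action == 3:      # right
            if col == self.max_col:
                reward = -10
            else:
                col += 1
        elif action == 4:      # sit
            if row in (1, 2) and col in (2, 3):   # a '+' cell, i.e. a chair
                reward = 10                       # this is the behaviour we want
            else:
                reward = -10                      # assumption: sitting on the grass is penalised

        self.state = self.encode(row, col)
        # new state, reward, done flag, debug info (left empty here)
        return self.state, reward, done, {}
```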

As you can see, the step function is basically a set of if/else conditions that define a reward for every possible combination of state and action. Since our state consists of two numbers, a row and a column index, we encode it into a single number (just to keep things simpler for our agent), so we define two helper functions to encode and decode the state. These functions are quite simple: for encoding, the row index (0–3) is multiplied by 6 and the column index is added; for decoding, we use integer division and the modulus operator to recover the row and column.

For each incoming action, we take our current state, that is, our agent’s current row and column position, and based on this position and the selected action we determine the reward. For example, if the action is 0, meaning Up, and we are already in the first row, the action would cause our agent to leave the boundaries of the field, so we send a penalty of -10 to let the agent know that this is a wrong action in the current state. And if the agent chooses action 4, that is, Sit, while its current position is a chair, i.e. a row and column that corresponds to a ‘+’ character, we reward it with +10 to let it know that this is exactly the action we are looking for.

At the end of each step we return the new environment state, which in our environment is the agent’s new position in the field, the resulting reward, a boolean indicating whether the environment is done (we don’t really have a done condition in our example), and some debug info that we currently leave empty.

Now we are almost done with all the important functions in our environment. The only things that remain are the reset and render functions, which are quite simple.
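They could look roughly like this; the ‘X’ marker used to highlight the agent is an illustrative choice:

```python
class ChairFieldEnv(gym.Env):
    # ... __init__ and step as defined above ...

    def reset(self):
        # Put the agent back into its starting position; anything the agent
        # has learned lives outside the environment and is not touched here.
        self.state = 0
        return self.state

    def render(self, mode="human"):
        # Print the MAP defined earlier and mark the agent's current cell with an 'X'.
        row, col = self.decode(self.state)
        lines = [list(line) for line in MAP]
        # +1 skips the top border, 2 * col + 1 skips the left border and the ':' separators
        lines[row + 1][2 * col + 1] = "X"
        print("\n".join("".join(line) for line in lines))
```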

In the reset function we just make sure to reset everything that is needed to go back to the initial state. In the render function we print our visual representation of the field and highlight the current position of our agent; this function is used to show the output of our environment, and we will see this output later.


Implementing an Agent

Now that we are done with our environment, we have to define an agent that actually uses it. There are multiple ways to build a learning agent, for example with Neural Networks or Q-Tables. In our example we use a Q-Table, as it is the simpler solution. A Q-Table is basically a matrix of possible states and actions, and since we don’t have many observation states or actions in our environment, it is sufficient. If our environment had far more states, say a million, a Neural Network would be the better choice, as the resulting Q-Table would become too big to handle.

Either way, the basic idea is the same whether you use a Q-Table or a Neural Network. In the early stages you choose a random action, send it to the step function, get a reward or penalty, and learn the result for each state. With a Q-Table, learning means simply writing the result into the corresponding state-action cell of the matrix; with a Neural Network, it means training the model with the combination of state, action and received reward.

In later episodes, instead of choosing a random action, you take an action from the filled matrix (when using a Q-Table) or from the trained model (when using a Neural Network). By using an epsilon variable we encourage the agent to take random actions at first, and with every episode we reduce this variable so that random selection gradually gives way to the learned values. For each step we also call the render function to show some output.
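A training loop along these lines, using a Q-Table and epsilon-greedy action selection; the learning rate, discount factor and epsilon schedule are illustrative values, not the only possible ones:

```python
import random
import numpy as np
import gym

import envs.custom  # assumption: importing this package runs the register() call shown below

env = gym.make("ChairField-v0")

# One row per state, one column per action; all Q-values start at zero.
q_table = np.zeros([env.observation_space.n, env.action_space.n])

alpha = 0.1     # learning rate
gamma = 0.9     # discount factor
epsilon = 1.0   # start fully random ...

for episode in range(50):
    state = env.reset()
    done = False

    while not done:
        if random.random() < epsilon:
            action = env.action_space.sample()       # explore: random action
        else:
            action = int(np.argmax(q_table[state]))  # exploit: best known action

        next_state, reward, done, info = env.step(action)

        # Standard Q-learning update of the state-action cell
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        q_table[state, action] = old_value + alpha * (reward + gamma * next_max - old_value)

        state = next_state
        env.render()   # show the field after every step

    epsilon = max(0.05, epsilon * 0.9)   # ... and rely more on learned values every episode
```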

As you can see in the code, we loop through the steps with a while not done condition. But there was no done condition in our environment, you might ask? Right, our environment does not define any done condition, but there is one more implementation step that I haven’t mentioned yet.

Register your environment

To use your custom environment, you have to register it with OpenAI Gym’s environment registry. To do so, you create a file named __init__.py in your environment package.
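A minimal version could look like this; the exact module path is an assumption that mirrors the folder layout described in the next paragraph:

```python
from gym.envs.registration import register

register(
    id="ChairField-v0",                                     # the name used with gym.make()
    entry_point="envs.custom.ChairFieldEnv:ChairFieldEnv",  # module path : class name
    max_episode_steps=50,                                   # sets done to true after 50 steps
)
```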

Here you can see that we define max_episode_steps, which enforces a maximum of 50 steps per episode and automatically sets the done variable to true after 50 steps. Besides that, we also define our environment’s name by assigning a value to id, here ChairField-v0. Additionally, we point Gym to the actual location of our implementation, in this example the module envs.custom.ChairFieldEnv and, after the colon, the class name ChairFieldEnv.

And now let’s run our agent and look at the output of its first and last episodes.

You can see that in the first episode it is not doing anything useful yet; since the agent has no idea of the environment at this stage, it just tries actions randomly. But now let’s take a look at our agent’s last episode.

From the output it is clearly visible that our agent now understands that it has to walk to the chairs and sit down on one, as it reached the desired goal after only 4 steps. And this is after just 48 episodes of training.

Conclusion

OpenAI’s Gym is a powerful tool for doing Reinforcement Learning. There is basically no limit to what it can do; it can be used for a lot of different problems, from very simple to very complicated ones. Our example here was just for learning purposes, but keep in mind that in a real scenario, when things get complicated and you choose to go with Neural Networks, you either need a local machine that is an absolute beast, the money to do it all on virtual machines, or a lot of time to spend on training. To get a decent result for a complicated environment you need to train your agent for at least a few hundred thousand episodes. And a fun fact at the end: OpenAI Five played the equivalent of 45,000 years of Dota 2, spread over 10 real-time months of training, to reach world-champion level.

While it is not always the fastest or most efficient way to solve a problem, it is really fun to work with OpenAI’s Gym, and it is fascinating to watch a machine teach itself!

Further reading

https://openai.com/blog/first-retro-contest-retrospective/

https://medium.com/@tristansokol/day-11-ac14a299e69d

https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

https://www.novatec-gmbh.de/en/blog/introduction-to-q-learning/

https://www.novatec-gmbh.de/en/blog/deep-q-networks/

https://www.novatec-gmbh.de/en/blog/creating-a-gym-environment/

https://towardsdatascience.com/reinforcement-learning-with-openai-d445c2c687d2

https://www.kaggle.com/charel/learn-by-example-reinforcement-learning-with-gym
