My journey of trying Deep Reinforcement Learning to optimize train scheduling:

Mels Hakobyan · Published in Analytics Vidhya · Jan 9, 2020 · 6 min read

First of all, this is a competition on AIcrowd called Flatland, which I believe is over by the time of writing this. My goal in building an RL algorithm for this problem was not to win the competition; in fact, I never even submitted a solution (well, I am still trying to make it work). Rather, I had been wanting to build train scheduling software using RL for a long time, so imagine my happiness when I accidentally found this amazing competition. My goal in creating such software is to show people that RL is an amazing tool for solving many real-world problems, and also to do research, try things out, be creative, and do a great deal of problem solving, which I will break down for you in this post. This post is about my journey of problem solving for this specific task (which is not over yet).

The problem

So the setting of the problem is that there are n trains on an h×w map (both the number of trains and the map size vary), each of which should reach its destination in as few time-steps as possible. Taking into account resource limits, and the fact that a given section of track can be used by only one train at a time, makes it a tough problem. And if that is not tough enough, the competition organizers add a second round where trains don't all have the same speed, and sometimes (with some probability) a train may break down and get stuck in the middle of the railroad for a while. For more information you can check the Flatland competition page on AIcrowd.

Reward and Observation

The reward is -1 for every step until done, and I also added -2 for every illegal move (an illegal move being, for example, turning left where the railroad doesn't give that option). The observation is a bit more complicated. First of all, every train gets its own individual observation, and the environment can give you three types of observations: TreeObservation, GlobalObservation and LocalObservation. I used TreeObservation and GlobalObservation, and I will briefly tell you about each. TreeObservation….oh boy…..it gives you 12 different features (like cell distance to the next branching, or cell distance to the destination, etc.) after every possible action, and also after every possible action after every possible action, up to a depth of three. GlobalObservation, on the other hand, gives every agent the complete map of transitions in the form of an h×w×16 matrix, plus other features as well, like that agent's direction and position. LocalObservation is the same as GlobalObservation, but instead of the full map, each agent gets a limited view of radius r centered on that agent.
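To make the reward shaping concrete, here is a minimal sketch of the per-agent reward described above. The function name and the `action_was_legal` flag are mine for illustration; in practice that flag would come from checking the transitions available at the agent's current cell.

```python
def shaped_reward(env_step_reward, action_was_legal):
    """Per-agent reward used here: the environment's -1 per step,
    plus an extra -2 penalty when the chosen action is not available
    at the current cell (e.g. turning left where there is no left branch)."""
    illegal_penalty = -2.0 if not action_was_legal else 0.0
    return env_step_reward + illegal_penalty
```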

Quick disclaimer: I won't explain what Reinforcement Learning is, or go over a lot of the other background either.

Many things that I tried

At the very beginning I didn’t even start from reading similar problem solving papers, because I had a much larger problem, variable amount of agents!! My position was to create every algorithm from scratch so I didn’t use RLlib even though it offers great tools both for multi and single agent reinforcement learning settings. One possible way was to make a separate network for every agent, and all is good but there would be no communication between agents, and when they need to come up with a way to share a resource it won’t be possible. Then I thought to use a technique where the Critic is centralized and gets all agents’ observations and give everyone a state-value and Actors adjust their policies in that direction, here all agents have separate networks so the problem of variable agent amount is out the way. Yet there is a problem of decentralized execution where agents have no idea about each other’s decisions during execution. So I was stuck thinking about ways to make a channel of communication between agents, and came up with a great idea, which later I found out is not mine, and some people already wrote an amazing paper about it and named it Multi-agent Bidirectionally-Coordinated Nets ( BiCNet). The idea behind it is the following, every agent in the environment is represented as a bi-directional RNN sequence and passes it’s hidden state to other agents, and as it’s a bi-directional RNN, every agent knows about every other agent through hidden states. And the best part about it, is that you don’t need to design the communication channel, they will learn the best way of communication “language” in the training process.
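To make that concrete, here is a minimal PyTorch sketch of the BiCNet-style communication idea, assuming per-agent feature vectors have already been extracted; the class name and layer sizes are illustrative, not the exact ones from my code.

```python
import torch
import torch.nn as nn

class AgentCommunication(nn.Module):
    """Sketch: treat the agents as a sequence and run a bidirectional GRU
    over their per-agent features, so every agent's output depends on every
    other agent's hidden state. Sizes here are for illustration only."""
    def __init__(self, feat_dim=128, hidden_dim=64, n_actions=5):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.policy_head = nn.Linear(2 * hidden_dim, n_actions)  # 5 discrete rail actions
        self.value_head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, agent_feats):
        # agent_feats: (batch, n_agents, feat_dim); n_agents can vary per episode
        comm, _ = self.rnn(agent_feats)              # (batch, n_agents, 2*hidden_dim)
        logits = self.policy_head(comm)              # per-agent action logits
        values = self.value_head(comm).squeeze(-1)   # per-agent state values
        return logits, values
```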
Now that we have the algorithm architecture, it's time to choose the algorithm itself. From the sea of algorithms like DQN, DDQN, A3C, etc., I chose the very amazing PPO from the Actor-Critic family. I calculated the loss for each agent individually and summed them up at the end for training. Taking the mean instead of the sum works as well, and I think I need to revisit this later, because the sum grows as the number of agents increases, and I don't want that to throw off my network. The choice of PPO was based on the fact that it's from the Actor-Critic family, which in my experience works best, and it's a lot more stable than A3C, so here we are. If you would like to learn more about PPO, I would recommend checking out this amazing video.
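For reference, here is a minimal sketch of the per-agent clipped PPO surrogate and the summing over agents described above; the function and variable names are mine, and the value and entropy terms a full implementation needs are left out.

```python
import torch

def ppo_policy_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped PPO surrogate for one agent's batch of transitions."""
    ratio = torch.exp(new_logp - old_logp)                         # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# One optimizer step combines the per-agent losses into a single scalar:
# total_loss = sum(ppo_policy_loss(lp, olp, adv) for lp, olp, adv in per_agent_batches)
```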
I set up the training process so that complexity increases over time/iterations: training starts with 1 agent on a small map, the map gets bigger over time, and the number of agents increments after several map-size increases. In the future I want to try increasing the complexity based on how well the algorithm is performing rather than just on time; a schedule in that spirit is sketched below.
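As an illustration only, a time-based curriculum like the one described could look like this; the constants and the function name are made up, not the actual values I used.

```python
def curriculum(iteration, base_size=20, iters_per_step=500):
    """Hypothetical curriculum: grow the map every `iters_per_step` iterations
    and add one agent after every three map-size increments."""
    size_steps = iteration // iters_per_step
    map_size = base_size + 5 * size_steps   # map grows over time
    n_agents = 1 + size_steps // 3          # agents added more slowly
    return map_size, n_agents
```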
Now it's time to talk about the network architecture. At first I tried TreeObservation with a different architecture, but later I switched to GlobalObservation, so I will talk about the network I used there. The observation has a size of h×w×22×number_of_agents, so I used CNN layers feeding into an RNN component. As I said, the map size varies, so I used Spatial Pyramid Pooling to turn the CNN output into a fixed-size vector before passing it to my RNN component. I did some tricks to feed the CNN all agents' observations as one batch and later extract the real per-agent batches for the RNN. I used PyTorch because it's awesome: easy to work with, easy to build and test new ideas with, and the code just looks amazing. All the code I am talking about can be found here.
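Here is a minimal sketch of the Spatial Pyramid Pooling step, which is what lets maps of any h×w produce a fixed-length vector for the RNN; the pyramid levels are illustrative and not necessarily the ones in my code.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(features, levels=(1, 2, 4)):
    """Pool the CNN feature map at a few fixed grid sizes so inputs of any
    spatial size produce a fixed-length vector per sample."""
    batch, channels, _, _ = features.shape
    pooled = []
    for level in levels:
        out = F.adaptive_max_pool2d(features, output_size=(level, level))
        pooled.append(out.reshape(batch, channels * level * level))
    return torch.cat(pooled, dim=1)  # (batch, channels * sum(level**2 for level in levels))
```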

Results

I can't brag about my results YET!! The deal is that my code is not well optimized for my computational budget, and I hit a memory error before getting any valuable results. What I can say is that during training there is very noticeable progress towards learning: I can see agents that stop to let another agent pass, and many little details like that. I am far from done for sure, and will spend as much of my free time on this as I can, because I have a lot of ideas on how to make this work.

Ideas for the future

First I want to optimize the code so that it at least stops hitting the memory error. I also have some bugs I would like to get rid of (I want to keep training for several iterations on the same map before changing it, but a bug won't let me). The major change will be to incorporate planning into my algorithm to cut down training time; planning also seems intuitive for a project like this, so it's a must-have going forward. Later I may try Hierarchical Reinforcement Learning, as I have a couple of ideas for that as well.

My journey is not over yet, and I am pretty confident that I can solve this problem using modern RL techniques. The reason I am so focused on RL is that one algorithm can be applied to many other similar problems: after train scheduling, the next could be planes, trucks, ships, you name it. By using learning algorithms we also ensure that the system keeps getting better over time with little to no changes. And the learning process may give us insights into solving the problem in ways no one has thought of before.
That’s it for part 1, see you in part 2.
