The Maze Runner — AI EDITION
Have you ever watched the blockbuster that took the world by storm in 2014, The Maze Runner? That movie certainly didn’t disappoint, earning over $348 million worldwide at the box office against a budget of $34 million. The plot is quite simple: a group of people are trapped in a giant maze that seems impossible to escape from, and the so-called ‘runners’ have to navigate the maze, run from monsters, and finally find a way out. It was a blast.
Now, I have a question for you (bear with me): have you ever wondered what would happen if, instead of Newt, Thomas, and Minho as the ‘runners’ navigating the maze, it was an AI in their place? Yup, that would make for a very different plot and take some extreme imagination to pull off, and we’re not here to talk about a movie whose plot is an AI trying to solve a maze.
But what if I told you this… AI could indeed navigate a maze. Well, okay, not in the physical world (for now at least), but yes, I really mean it as I write this, letter by letter, word by word. It’s all possible thanks to machine learning, specifically reinforcement learning. What is it? And how does it make an unconscious machine able to navigate a maze? Without further ado, let’s get into it!
Part 1 — Reinforcement Learning
Let’s consider this project a fun and educational field trip. Before we get going, we must stock up on some ‘supplies of knowledge’ for the ‘trip’. First of all, what is reinforcement learning, and how does it differ from supervised and unsupervised learning?
Supervised learning is all about ‘labels’: the model learns and trains on a dataset that already contains labels, such as 1 or 0, churn or not churn, spam or not spam, or house-price regression.
Analogy: students with a teacher who gives them both the questions and the answers to study for an exam.
Unsupervised learning is all about finding patterns without any given labels or guidance. It can be used for clustering different groups, for example customer segmentation.
Analogy: you just moved to a new city and decided to explore it without a map or asking anyone, trying to figure out the landmarks and neighborhoods by yourself.
Reinforcement learning is all about interaction through trial and error. The model, which we usually call the ‘agent’, learns by interacting with the ‘environment’. It takes actions and receives feedback: rewards if it achieves the goal we set, or penalties if it does the opposite. Through many trials and errors the agent learns to maximize cumulative reward and make the best decision in each situation, which makes it ‘trained’ to accomplish our goal.
Analogy: you adopt a dog and want to train it, so when it does something good or adorable you give it a treat, but when it does something bad you might scold it. Over time the dog learns to do more of the good things to earn treats and fewer of the bad things to avoid being scolded.
Now that we know the differences and what each kind of machine learning is, our focus is solely on reinforcement learning. Still, the same question arises: how in the world can a computer navigate a maze? Even after all that nitty-gritty explanation, it still seems illogical. This takes us to the next stop on our trip.
Part 2 — Environment Introduction
As I was saying, the maze and the computer are of course implemented in the virtual world. From now on, we will refer to the computer/machine as the agent. Based on the explanation in the previous section, we know that the agent interacts with its environment and is given rewards or punishments depending on the decisions it makes. In our case the environment is the maze, and we give a reward when the agent heads toward the maze’s exit, but a penalty when it hits a wall or takes a wrong turn.
In reinforcement learning there are two terms, ‘exploration’ and ‘exploitation’. The agent ‘explores’ when it tries new things (in our case new directions or paths) to see whether they yield better rewards, while ‘exploitation’ means sticking to the paths it already knows are good. The key is to strike a balance: explore enough to find better paths while also exploiting the paths that have worked well in the past, as in the sketch below.
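One common way to strike that balance is an epsilon-greedy rule: with a small probability the agent picks a random direction (explore), otherwise it picks the best-known one (exploit). The snippet below is only a minimal sketch; the names q_table and epsilon and the four-action setup are my own illustration, not code from the original project.

```python
import random
import numpy as np

def choose_action(q_table, state, epsilon=0.1, n_actions=4):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(n_actions)      # explore: try a random direction
    return int(np.argmax(q_table[state]))       # exploit: best-known direction
```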
The agent then learns through trial and error (learning from experience). It tries different directions, observes the rewards, and learns which directions are better in different parts of the maze, so as time goes by the agent discovers the best path and the fewest steps needed to reach the finish line and collect the reward.
Below is the initial maze that will be used.
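As a rough illustration of how a maze like this can be represented in code, here is a small hypothetical grid where 1 marks a wall and 0 a free cell; the layout below is illustrative and not the exact maze from the figure.

```python
import numpy as np

# Hypothetical 6x6 grid (not the exact maze from the figure): 1 = wall, 0 = free cell
maze = np.array([
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
])

start = (0, 0)   # entrance at the top-left corner
goal = (5, 5)    # exit at the bottom-right corner
```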
Now that we have ‘introduced’ the environment to the agent, we first define the reward and punishment system, described below.
finish_reward = 100
wall_penalty = -10
step_penalty = -1
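Putting the maze and this reward scheme together, the whole environment boils down to a step function: given the current cell and a chosen direction, it returns the next cell and the reward. Below is a minimal sketch that builds on the hypothetical grid above; the function name and the action encoding are my own assumptions.

```python
finish_reward = 100   # reaching the exit
wall_penalty = -10    # bumping into a wall (or leaving the grid)
step_penalty = -1     # every ordinary move, so shorter paths score higher

# Actions: 0 = up, 1 = down, 2 = left, 3 = right
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(maze, state, action, goal):
    """Apply one action; return (next_state, reward, done)."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    # Hitting a wall or stepping off the grid: stay in place and take the penalty
    if not (0 <= r < maze.shape[0] and 0 <= c < maze.shape[1]) or maze[r, c] == 1:
        return state, wall_penalty, False
    if (r, c) == goal:
        return (r, c), finish_reward, True
    return (r, c), step_penalty, False
```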
Without further ado, let’s ‘deploy’ our model, let it explore the maze by itself, and see how it does.
The untrained agent takes a lot of steps because it keeps backtracking and hitting walls. Before it has learned anything, the agent goes through the maze blindly, choosing its path at random.
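To make that concrete, an untrained run is just a loop that picks a random direction at every step until it happens to stumble onto the exit. The sketch below reuses the hypothetical grid and step function from earlier and is illustrative only.

```python
import random

def random_rollout(maze, start, goal, max_steps=10_000):
    """Let an untrained agent wander by picking a random action each step."""
    state = start
    for steps in range(1, max_steps + 1):
        action = random.randrange(4)                 # no knowledge yet: act blindly
        state, _, done = step(maze, state, action, goal)
        if done:
            return steps
    return max_steps

print(random_rollout(maze, start, goal))   # usually far more steps than the shortest path
```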
Part 3 — Agent Training
We will use Q-Learning, a basic reinforcement learning algorithm. It works by updating Q-values (quality values) based on the rewards received during exploration. The agent eventually learns the best action to take in each situation to maximize its total reward.
Below is the update formula (a code sketch of the training loop follows the symbol list):
Q(s, a) ← Q(s, a) + α (r + γ max_a′ Q(s′, a′) − Q(s, a))
- Q(s, a): current Q-value for state s and action a.
- α: learning rate (how much new information overrides the old).
- r: reward received after taking action a.
- γ: discount factor (how much future rewards are considered).
- s′: new state after taking action a.
- max_a′ Q(s′, a′): maximum Q-value for the new state s′ across all possible actions a′.
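In code, this update rule becomes a single line inside the training loop. The sketch below reuses the hypothetical grid, the step function, and the epsilon-greedy choose_action from the earlier snippets; the hyperparameter values and the number of episodes are illustrative choices, not the original project’s settings.

```python
import numpy as np

alpha, gamma, epsilon = 0.1, 0.9, 0.1           # learning rate, discount, exploration rate
n_actions = 4
q_table = np.zeros(maze.shape + (n_actions,))   # one Q-value per (row, col, action)

for episode in range(1000):
    state, done = start, False
    for _ in range(500):                         # safety cap on episode length
        action = choose_action(q_table, state, epsilon, n_actions)
        next_state, reward, done = step(maze, state, action, goal)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = np.max(q_table[next_state])
        q_table[state + (action,)] += alpha * (reward + gamma * best_next - q_table[state + (action,)])
        state = next_state
        if done:
            break
```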
Part 4 — Final Result
Now the agent takes only 11 steps to reach the finish line, showing that the agent, an unconscious computer, can indeed solve a maze.
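That trained run is simply a greedy rollout: in every cell the agent takes the action with the highest learned Q-value. A minimal sketch, assuming the q_table and helpers from the snippets above:

```python
def greedy_rollout(q_table, maze, start, goal, max_steps=100):
    """Follow the best-known action in every cell and return the visited path."""
    state, path = start, [start]
    for _ in range(max_steps):
        action = int(np.argmax(q_table[state]))     # always exploit after training
        state, _, done = step(maze, state, action, goal)
        path.append(state)
        if done:
            break
    return path

print(len(greedy_rollout(q_table, maze, start, goal)) - 1)   # number of steps on the learned path
```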
Conclusions
At the end of our ‘field trip’ we have come a very long way. We now know that even an unconscious computer, called the agent, can solve a maze with just enough training and patience. So, don’t we all need to ‘train’ and be patient to navigate through the maze of life? Anyway, even though this is the end of our journey, this is not a ‘goodbye’, but simply a ‘see you’.