Machine Learning is for 10th-Graders

Write some Python that learns to solve a maze… using 10th-grade math.

How can a single algorithm solve a maze, play tic-tac-toe, or teach a robot to walk, without knowing anything about mazes, tic-tac-toe, or walking?

The algorithm is called Q-Learning: a simple, brilliant piece of artificial intelligence that we’re going to build from scratch, using introductory Python and some 10th-grade math.

If I asked you to write three programs —

(1) solve a maze; (2) play tic-tac-toe; (3) make a robot walk

— you might ask yourself—

(1) solve a maze:      how do I avoid getting lost or stuck?
(2) play tic-tac-toe: how do I win, or at least not lose?
(3) make a robot walk: how do I walk without falling?

You might try to write three different programs, each designed to avoid mistakes, like getting stuck, or losing a game, or falling down.

Q-Learning turns that approach inside-out. QL is a type of reinforcement learning that is perfectly happy to learn from mistakes. Once we learn how to learn, we won’t need to know details about mazes, games, or robots.

If you’re in high school and aren’t afraid of making a few mistakes, keep reading. Machine learning is all about programs that learn from making lots mistakes. How hard can it be to make lots of mistakes?

It’s not hard. I do it all the time.

Examples 1,2,3: can you learn from these mistakes?

(1) I flip a coin and you guess heads, but I say, “no, it’s tails.”

(2) I choose a number from 1 to 10 and you guess 6, but I say, “no, it’s not 6.”

(3) I choose a number from 1 to 10 and you guess 6, but I say, “6 is too high.”

What can you learn from those mistakes? Does calling a coin toss give you any information about the next coin toss? (hint: no).

What about guessing my secret number?

https://github.com/jvon-challenges/guessing-games
(2.1) I am thinking of a number              (1 in 10 = 10.0%)
(2.2) guess of 6 is incorrect!
(2.3) remaining values: [1,2,3,4,5,7,8,9,10] (1 in 9 = 11.1%)
(3.1) I am thinking of a number              (1 in 10 = 10.0%)
(3.2) guess of 6 is too high!
(3.3) remaining values: [1,2,3,4,5] (1 in 5 = 20.0%)

Except in trivial cases: a guess that ends in failure will often clarify the path to success. Don’t ever waste a mistake. Each mistake contains an important lesson of what not to do next (except for flipping coins — which is why football games don’t start with a ref saying I’m thinking of a number…).

So you get trapped in a maze by an evil alien…

Here is the maze (the alien will be along shortly):

https://github.com/jvon-challenges/guessing-games
===================================
... ... ... +++
enter-> (1) ... ... +++
... ... ... +++

... +++ ... ...
... +++ ... ...
... +++ ... ...

... ... +++ ...
... ... +++ ...
... ... +++ ...

+++ +++ ... ...
+++ +++ ... ... <-exit
+++ +++ ... ...

===================================

If I asked you to solve the maze by moving North, South, East, or West, you might take the steps E→E→S→E→S→S, resulting in the solution:

https://github.com/jvon-challenges/guessing-games
===================================
... ... ... +++
enter-> (1) (2) (3) +++
... ... ... +++

... +++ ... ...
... +++ (4) (5)
... +++ ... ...

... ... +++ ...
... ... +++ (6)
... ... +++ ...

+++ +++ ... ...
+++ +++ ... (7) <-exit
+++ +++ ... ...

===================================

Easy. That’s because you know a lot about mazes. You know about walls, boundaries, and blocks. You recognize “enter” and “exit”. You know the meaning of moving N, S, E, or W. You have a goal: reach the exit. You have a strategy: avoid dead ends. You are a total maze expert.

Let’s turn that inside-out. Let’s say you’re trapped in the maze by an evil alien. But: the evil alien doesn’t tell you it’s a maze — just that you are trapped.

In fact, the alien just says, “you are in position [0,0].”

Then you wait, and the alien waits.

Finally, you summon your courage and ask the alien, “what can I do?”

The alien says, “you can choose N,S,E, or W”.

Not knowing what else to do, you take a deep breath and say, “I choose N.

The alien says, “that went badly.”

And then you wait, and the alien waits, but nothing happens.

Finally, you get bored and ask, “can I try again?”

The alien responds, “you are in position [0,0].”

So what do you choose, this time? Probably not N… N went badly.

So you say, “I choose E”.

And the alien says, “you are in position [0,1].”

Hey! We’re getting somewhere.

Write a program to defeat the alien (and its annoying friends)

Before long, you tire of talking to the alien. You write some Python code to deal with not only this alien, but any future aliens who exhibit similar annoying tendencies.

Turns out these aliens have a few things in common:

  • aliens tell you what state you are in (e.g., position [0,0])
  • aliens allow you to choose among actions (e.g., N,S,W,E)
  • aliens reveal the dimensions of all possible states (e.g., 4x4)

Here is the first annoying alien, in Python. Note that whenever you choose an action, the alien responds with three things: (1) your new state, (2) a reward or penalty, and (3) true/false indicating whether your attempt has ended:

press the run button (top, center) — or open in repl.it (top, right)

You need your own Python code, to defeat the alien. After thinking it through, you realize that your code needs to remember what state you are in, and whether taking an action from that state goes well or badly.

So from your first encounter:

  • state[0,0] action N→ goes badly, game over!
  • state[0,0]→ action E→ neither good nor bad.

(you may guess that “0,0” means row 0, column 0, and “0,1” means row 0, column 1, N is North, going north from (0,0) is out of bounds and therefore ‘bad’, E is East… but don’t bother with details! — that will just make it harder to defeat the annoying tic-tac-toe playing alien that lies in your not-too-distant future)

For starters, you will need some code that can take note of things at every possible state, all 4x4=16 of them. Here is some Python that makes room to store at least one thing about every state in a 4x4 universe:

press the run button (top, center) — or open in repl.it (top, right)

Let’s see… you need to note your first mistake:

  • state[0,0] action N→ goes badly, game over!

You could put something in position (0,0) of our array, but then you might want to make a note of your next attempt:

  • state[0,0]→ action E→ neither good nor bad.

Which would also go in position (0,0) of your array, wiping out your first entry (and forgetting your mistake, dooming you possibly to repeat it).

Turns out your array needs a space to make note of the result of every action taken from every state. That’s called a state transition, because it represents the change in state caused by an action, and the resulting array is a q-table, because it tracks the quality of each transition. To track the effect of state transitions, every state needs four entries, one per available action (N,S,E,W):

press the run button (top, center) — or open in repl.it (top, right)

Now that you have a place to keep track of things, just go ahead and make a bunch of mistakes, by moving at random. If you make a thousand random moves, what would your q-table look like?

press the run button (top, center) — or open in repl.it (top, right)
Here is sample output... in my session, I found the exit... once!
  North  South  East  West
[[[-273.    0.    0. -269.]     row = 0, col = 0
[ -95. -80. 0. 0.] row = 0, col = 1
[ -20. 0. -16. 0.] row = 0, col = 2
[ 0. 0. 0. 0.]] ...

[[ 0. 0. -74. -88.] row = 1, col = 0
[ 0. 0. 0. 0.] ...
[ 0. -13. 0. -5.]
[ -1. 0. -2. 0.]]

[[ 0. -15. 0. -26.] row = 2, col = 0
[ -5. -11. -5. 0.] ...
[ 0. 0. 0. 0.]
[ 0. 1. -1. 0.]] <- see the +1? you went S from (2,3)
and found the exit.
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]] row = 3, col = 3 (aka: the exit)

What does the sample output, above, say?

The first row represents actions from state (0,0); it says: [-273, 0, 0, -269]. That means the cumulative reward for taking actions from (0,0) were:

  • state[0,0]→action N -273 (don’t got north! big penalty)
  • state[0,0]→action S 0 (noting happens when you go south)
  • state[0,0]→action E 0 (noting happens when you go east)
  • state[0,0]→action W -269 (don’t got west! big penalty)

Good news: your q-table clearly says “don’t start by going north or west,” which is good advice. Bad news: you wasted 273+269 = 542 out of 1,000 attempts making those two mistakes.

Humph.

Maybe exploring the problem isn’t entirely a random exercise? Maybe sometimes you should explore the problem, and sometimes you should exploit what you’ve learned. Take a look at the source code, to see how to split your actions 50/50 between exploring and exploiting:

press the run button (top, center) — or open in repl.it (top, right)
(my session produced this output... what about yours?)

the annoying alien says that your starting state is = [0, 0]
[[[-237.    0.    0. -229.]  <- I failed 237+229 = 466 times
[ -29. -30. 0. 0.]
[ -41. 0. -26. 0.]
[ 0. 0. 0. 0.]]

[[ 0. 0. -157. -136.]
[ 0. 0. 0. 0.]
[ 0. -29. 0. -12.]
[ -3. 0. -6. 0.]]

[[ 0. -24. 0. -18.]
[ -4. -4. -2. 0.]
[ 0. 0. 0. 0.]
[ 0. 10. -1. -2.]] <- hey! I found the exit 10 times

[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]]

Almost done. One more thing.

Would it help if you could find the exit, starting from the entrance?

Well, you can — else I wouldn’t have asked — but how?

Or: what is special about state[2,3]→action S? In my session, that transition had a q-value of +10 (lots of rewards!). So here’s the real question: if you were adjacent to state[2,3], and you knew from prior mistakes that transitioning to state[2,3] offered a reward, would you try to take advantage of that? If you were right above state[2,3], in state[1,3], and trying to exploit prior mistakes… what would help you to choose action=S, and transition to state[2,3]? How can you move toward a positive reward?

Or: what if the positive reward had a way to move backwards, toward you?

(I warned you that Q-learning does things inside-out, remember?)

Take a really good look at this code, and run it over and over:

press the run button (top, center) — or open in repl.it (top, right)
Here are the q-values upon finding the exit for the first time:
[[[-1.  0.  0. -1.]
[-1. -1. 0. 0.]
[-1. 0. -1. 0.]
[ 0. 0. 0. 0.]]

[[ 0. 0. -1. -1.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. -1.]
[-1. 0. -1. 0.]]

[[ 0. -1. 0. -1.]
[-1. -1. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 1. 0. 0.]] <- see the +1? Found the exit!

[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]]

Here are the q-values upon noticing a future reward (from row=1 to row=2):
[[[-1.   0.   0.  -1. ]
[-1. -1. 0. 0. ]
[-1. 0. -1. 0. ]
[ 0. 0. 0. 0. ]]

[[ 0. 0. -1. -1. ]
[ 0. 0. 0. 0. ]
[ 0. -1. 0. -1. ]
[-1. 0.9 -1. 0. ]] <- see the +0.9? Looked ahead, noticed
that a reward had been found previously
[[ 0. -1. 0. -1. ]
[-1. -1. -1. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 1. 0. 0. ]]

[[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]]

Here are the q-values when the reward works its way backward, to the start:
[[[-1.    0.    0.59 -1.  ]   <- the reward made its way to (0,0)E!
[-1. -1. 0.66 0. ]
[-1. 0.73 -1. 0. ]
[ 0. 0. 0. 0. ]]

[[ 0. 0. -1. -1. ]
[ 0. 0. 0. 0. ]
[ 0. -1. 0.81 -1. ]
[-1. 0.9 -1. 0. ]]

[[ 0. -1. 0. -1. ]
[-1. -1. -1. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 1. 0. -1. ]]

[[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]]

The reward eventually works its way backward, one step at a time, from the end of the problem all the way to the beginning. Once you know the direction of a positive reward from the very first step, it isn’t hard to defeat the annoying alien:

press the run button (top, center) — or open in repl.it (top, right)

That’s it: a complete solution to a problem in the field of reinforcement learning, using Q-learning. It’s not quite as complete as some implementations, and the code is not too compact, but you did that on purpose, so that other readers can follow the meaning without getting lost in the math.

Next time: tic-tac-toe.