Introduction to Deep Q-learning with SynapticJS & ConvNetJS

An application to Connect 4 game

Alexandre Sapet
Sicara's blog
3 min read · Apr 30, 2018

--

fig. 1: Screenshot of my React app using the neural networks computed here.

Read the full article on Sicara’s blog here.

Easily get into Reinforcement Learning with JavaScript by applying deep Q-learning to a simple game: Connect 4. I built my first AI thanks to this!

Why connect 4?

When I decided to learn about Reinforcement Learning, I thought that I could begin with Generals.io. After all, it seemed quite easy to build unbeatable AIs for Chess, Go and Dota 2. How could I fail to do the same?

Being a noob at both Neural Networks and Reinforcement Learning, I finally decided to start small: connect 4. First thing I was told?

Why connect 4? You can easily brute force it!

True. You can find a complete guide here. But that’s not the point. I wanted to try, fail and eventually succeed in building a learning AI.

What is the famous Q-learning?

Q-learning is the first algorithm Google gives back when you search for reinforcement learning. I won’t explain the whole concept; you can find a great introduction here. I’ll just give a quick summary.

Q-learning is a solution to an MDP, aka Markov Decision Process. An MDP is defined by an agent performing actions in an environment. The environment can be in different states.

Given those three, the idea of Q-learning is to reward the agent if the action it took led it to a better state, or punish it if not. Those rewards are used to build the function Q, an estimate of the final gain given the state s and the action a.

To ensure that a play takes the potential future outcome of the game into account, the discount factor gamma is used to leverage the maximum Q-value attainable from the next state s+1, over the possible actions.
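As a minimal sketch of that idea (the variable names and the learning rate alpha below are my own choices, not taken from the article), the classic one-step Q-learning update looks like this in plain JavaScript, with Q stored as a table mapping each state to one value per action:

```javascript
// Minimal sketch of the one-step Q-learning update for a tabular Q.
// `Q` maps a serialized state to an array of values, one per action.
// `alpha` (learning rate) and `gamma` (discount factor) are hyperparameters.
function updateQ(Q, state, action, reward, nextState, alpha = 0.1, gamma = 0.9) {
  const current = Q[state][action];
  // Best value achievable from the next state, over all possible actions.
  const futureEstimate = Math.max(...Q[nextState]);
  // Move Q(s, a) toward the reward plus the discounted future estimate.
  Q[state][action] = current + alpha * (reward + gamma * futureEstimate - current);
}
```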

Temporal Difference Learning

In connect 4, it is not possible to know right away whether an action was a good move or a bad one. You have to wait until the end of the game to know whether the succession of actions you took led you to victory. This is the idea behind Temporal Difference Learning (TD-learning): the agent gets a reward when it wins the game, and this reward is then distributed to the previous actions, with a discount.

With Q-learning and TD-Learning combined, each time the agent is rewarded — or punished — the function Q is updated with the following algorithm:

eq. 1: Update of Q-value for a play that happened n plays before the end of the game.
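To make eq. 1 concrete, here is a minimal sketch of my reading of it (the learning rate alpha and the exact form of the target are assumptions of mine): the end-of-game reward is propagated back over the moves the agent played, discounted by gamma once for each play separating the move from the end of the game.

```javascript
// Sketch: distribute the final reward over the game history with a discount.
// `history` is the list of { state, action } pairs the agent played, in order.
// The last play gets the full reward, the one before it gamma * reward, etc.
function rewardGame(Q, history, finalReward, alpha = 0.1, gamma = 0.8) {
  history.forEach(({ state, action }, index) => {
    const n = history.length - 1 - index; // plays before the end of the game
    const target = Math.pow(gamma, n) * finalReward;
    Q[state][action] += alpha * (target - Q[state][action]);
  });
}
```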

After several iterations, you learn the function Q (got it?).

We need to go deeper!

For simple games (a few actions, a few more states), you can evaluate the exact function Q as a 2D array. For more complex games, you have to approximate the function Q. This is where neural networks come into play: they can be flexible enough for this task.

table 1: Possibilities in tic-tac-toe and connect 4.

The neural network will take the state of the environment as an input and will output an array corresponding to the available actions.

Your mission is to train the neural network to approximate the Q function that will make the best choices.
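As a sketch of what such a network could look like with Synaptic (the exact architecture used in the full article is not shown here; the board encoding, the 64 hidden units and the 0.1 learning rate are assumptions of mine):

```javascript
const { Architect } = require('synaptic');

// A Connect 4 board has 6 x 7 = 42 cells. One input per cell
// (1 for the agent's discs, -1 for the opponent's, 0 for empty),
// and one output per column: the estimated Q-value of each of the 7 moves.
const network = new Architect.Perceptron(42, 64, 7);

// An empty board as the network input.
const board = new Array(42).fill(0);

// Forward pass: Q-value estimates for the 7 columns in the current state.
const qValues = network.activate(board);

// Training step: keep the outputs as they are except for the action we want
// to correct, then backpropagate toward those targets with Synaptic.
const chosenColumn = 3;
const targetQValue = 1; // e.g. the discounted reward computed as in eq. 1
const targets = qValues.slice();
targets[chosenColumn] = targetQValue;
network.propagate(0.1, targets);
```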

In practice

The question is now: how to put this into use for connect 4?

… Read the full article on Sicara’s blog here.
