M2M Day 340: I’m getting better! Now, I only need 167 years to become the world’s best chess player.

Max Deutsch
7 min read · Oct 7, 2017

This post is part of Month to Master, a 12-month accelerated learning project. For October, my goal is to defeat world champion Magnus Carlsen at a game of chess.

Yesterday, I determined that my best chance of defeating Magnus is learning how to numerically compute chessboard evaluations in my head. Today, I will begin to describe how I plan to do this.

How a deep learning-based chess computer works

It’s probably useful to understand how a computer uses deep learning to evaluate chess positions, so let’s start here…

Bitboard representation

The first step is to convert the physical chessboard position into a numerical representation that can be mathematically manipulated.

In every academic paper I’ve read, the chessboard is converted into its bitboard representation, which is a binary string of size 773. In other words, the chessboard is represented as a string of 773 1’s and 0's.

Why 773?

Well, there are 64 squares on a chessboard. For each square, we want to encode which chess piece, if any, is on that square. There are 6 different chess piece types of 2 different colors. Therefore, we need 64 x 6 x 2 = 768 bits to represent the entire chess board. We then use five more bits to represent the side to move (one bit, representing White or Black) and the castling rights (four bits for White’s kingside, White’s queenside, Black’s kingside, and Black’s queenside).

Thus, we need 773 binary bits to represent a chessboard.
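As a sketch, here's what this encoding might look like in Python. The board format and function names are my own illustration (not Deep Pink's actual code), but the bit layout follows the description above: 64 squares × 6 piece types × 2 colors, plus one side-to-move bit and four castling bits.

```python
# A minimal sketch of the 773-bit encoding described above.
# The board is given as a dict mapping squares like "d4" to (piece, color)
# pairs -- a simplified stand-in, not Deep Pink's actual representation.

PIECES = ["pawn", "knight", "bishop", "rook", "queen", "king"]
COLORS = ["white", "black"]

def to_bitboard(board, white_to_move, castling_rights):
    """board: {"d4": ("queen", "white"), ...}
    castling_rights: 4 booleans (W kingside, W queenside, B kingside, B queenside)."""
    bits = []
    for file in "abcdefgh":          # 64 squares...
        for rank in "12345678":
            square = file + rank
            for color in COLORS:     # ...times 2 colors times 6 piece types
                for piece in PIECES:
                    bits.append(1 if board.get(square) == (piece, color) else 0)
    bits.append(1 if white_to_move else 0)                 # 1 side-to-move bit
    bits.extend(1 if r else 0 for r in castling_rights)    # 4 castling bits
    return bits

x = to_bitboard({"d4": ("queen", "white")}, True, [True, True, True, True])
len(x)  # 773
```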

The simple evaluation algorithm

Once the chessboard is converted into its numerical form (which we will denote as the vector x), we want to perform some function on x, called f(x), so that we get a scalar (single number) output y that best approximates the winning chances of white (i.e. the evaluation of the board). Of course, y can be a negative number, signifying that the position is better for black.

The simplest version of this function y = f(x) would be y = w · x, where w is a vector of 773 weights that, when multiplied with the bitboard vector via a dot product, results in a single number (which is effectively a weighted sum of the bitboard bits).

This may be a bit confusing, so let’s understand what this function means for actual chess pieces on an actual chessboard…

This function is basically saying “If a white queen is on square d4 of a chessboard, then add or subtract some corresponding amount from the total evaluation score. If a black king is on square c2, then add or subtract some other corresponding amount from the total evaluation score” and so on for all permutations of piece types of both colors and squares on the chessboard.

By the way, I haven’t mentioned it yet, so now is a good time to do so: On an actual chessboard, each square can be referred to by a letter from a to h and a number from 1 to 8. This is called algebraic chess notation. It’s not super important to fully understand how this notation is used right now, but this is what I mean by “d4” and “c2” above.

Anyway, this evaluation function, as described above, is clearly too crude to correctly approximate the true evaluation function. (The true evaluation function is just the functional form of the brute force approach, which yesterday we determined to be impossible for both computers and humans.)

So, we need something a little bit more sophisticated.

The deep learning approach

Our function from above mapped the input bits from the bitboard representation directly to a final evaluation, which ends up being too simple.

But, we can create a more sophisticated function by adding a hidden layer in between the input x and the output y. Let’s call this hidden layer h.

So now, we have two functions: One function, f¹(x) = h, that maps the input bitboard representation to an intermediate vector called h (which can theoretically be of any size), and a second function, f²(h) = y, that maps the vector h to the output evaluation y.

If this is not sophisticated enough, we can keep adding more intermediate vectors, h¹, h², h³, etc., along with the functions that map from intermediate step to intermediate step, until it is. Each intermediate step is called a layer, and the more layers we have, the deeper our neural network is. This is where the term “deep learning” comes from.
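The two-function pipeline can be sketched with toy dimensions (3 inputs, 2 hidden units) so the mechanics are visible. The weights and the ReLU nonlinearity between layers are illustrative assumptions on my part, not Deep Pink's actual values; a real network would use 773 inputs and hidden layers of width 2048.

```python
# A toy two-layer evaluation: f1 maps input x to hidden vector h,
# f2 maps h to the scalar evaluation y.

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def relu(v):
    """A common nonlinearity applied between layers (an assumption here)."""
    return [max(0.0, vi) for vi in v]

def evaluate(x, W1, w2):
    h = relu(matvec(W1, x))                    # f1(x) = h
    return sum(a * b for a, b in zip(w2, h))   # f2(h) = y

W1 = [[1.0, 0.0, -1.0],
      [0.5, 0.5, 0.5]]    # hidden layer weights: 2 x 3
w2 = [1.0, -2.0]          # output weights: map h (size 2) to scalar y
x = [1, 0, 1]
y = evaluate(x, W1, w2)
```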

In some of the papers I read, the evaluation function had only three layers; in others, it had nine.

Obviously, the more layers, the more computations, but also, the more accurate the evaluation (theoretically).

Number of mathematical operations

Interestingly, even though these evaluation functions are sophisticated, the underlying mathematical operations are very simple, only requiring addition and multiplication. These are both operations that I can perform in my head, at scale, which I’ll discuss in much greater depth in the next couple of days.

However, even if the operations themselves are simple, I would still need to perform thousands of operations to execute these evaluation functions in my brain, which could still take a lot of time.

Let’s figure out just how many operations I would need to perform, and just how long that would take me.

Again, suspend your disbelief about my ability to perform and keep track of all these operations. I’ll explain how I plan to do this soon.

How much computation is required?

Counting the operations

Let’s say that I have an evaluation function that contains only one hidden layer, which has a width of 2048. This means that the function from inputs x to the internal vector h, f¹(x) = h, converts a 773-element vector into a 2048-element vector, by multiplying x by a matrix of size 773 x 2048. Let’s call this matrix W¹. (By the way, I picked this setup because the chess computer Deep Pink uses intermediate layers of size 2048.)

To execute f¹ in this way requires 773 x 2048 = 1,583,104 multiplications and (773 − 1) x 2048 = 1,581,056 additions, totaling 3,164,160 operations.

Then, f² converts h to the output scalar y, by multiplying h by a 2048-element vector, called w², which requires 2048 multiplications and 2047 additions, or 4095 total operations.

Thus, this evaluation function would require that I perform 3,168,255 total operations.
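The counts above follow directly from the layer sizes, since a 773-to-2048 matrix multiply is just 2048 dot products, each needing 773 multiplications and 772 additions:

```python
# Reproducing the operation counts above for one hidden layer of width 2048.
n_in, n_hidden = 773, 2048

f1_mults = n_in * n_hidden          # 1,583,104 multiplications
f1_adds = (n_in - 1) * n_hidden     # 1,581,056 additions
f2_mults = n_hidden                 # 2,048 multiplications
f2_adds = n_hidden - 1              # 2,047 additions

total_ops = f1_mults + f1_adds + f2_mults + f2_adds  # 3,168,255 operations
```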

Counting the memory capacity required

To perform this mental calculation, not only will I need to execute these operations, but I’ll also need to have enough memory capacity.

In particular, I’ll need to have pre-memorized all the values of the matrix W¹, which is 1,583,104 numbers, and the vector w², which is 2048 numbers. I would also need to remember h while computing f¹, so I can use the result to compute f², which requires that I remember another 2048 numbers.

Let’s now convert this memorization effort to specific memory operations.

For the 1,583,104 weights of W¹ and the 2048 weights of w², I would only require a read operation from my memory for each (during game-time computation).

For the 2048 elements of h, I would require both a write operation and a read operation for each.

Thus, I would require 1,587,200 read operations and 2048 write operations, or 1,589,248 in total.
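The same tally, spelled out:

```python
# Reproducing the memory-operation count: one read per stored weight,
# plus a write and a read for each element of the intermediate vector h.
n_in, n_hidden = 773, 2048

reads = n_in * n_hidden + n_hidden + n_hidden   # W1 weights, w2 weights, h
writes = n_hidden                               # storing h while computing f1
total_memory_ops = reads + writes               # 1,589,248 memory operations
```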

How long would this take?

We now know that I would need to execute 3,168,255 mathematical operations and 1,589,248 memory operations to evaluate a given chess position. How long exactly would this take?

This of course depends on the size of the multiplications and additions, and the sizes of the numbers being stored in memory. I’ll talk more about this sizing soon, but for now, I’ll just provide my estimates.

I estimate that I could perform one mathematical operation in 3 seconds and one memory operation in 1 second. Thus, I could evaluate one chess position in 11,094,013 seconds or a little over four months.
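Under those estimates, the arithmetic works out as follows:

```python
# Converting the operation counts into time, using the estimates above:
# 3 seconds per mathematical operation, 1 second per memory operation.
math_ops = 3_168_255
memory_ops = 1_589_248

seconds = math_ops * 3 + memory_ops * 1   # 11,094,013 seconds per evaluation
days = seconds / 86_400                   # roughly 128 days, just over 4 months
```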

Clearly, this is a long time (and doesn’t factor in the fact that I can’t, as a human, process continuously for 4 months), but we’re getting closer.

Of course, in a given chess game, I would need to make more than one evaluation. Since I’m a pretty novice player, I estimate that I would need to evaluate ~10–15 options per move.

Since the average chess game is estimated to be 40 moves, this would be about 500 evaluations per game.

Therefore, playing an entire chess game using this method would take me 2,000 months, or roughly 167 years.

Of course, this is still problematic, but way less problematic than yesterday’s conclusion of one trillion trillion trillion years. In fact, I’m getting closer to an approach that I could actually execute in my lifetime (let’s say I have 75 years left to live).

The next step

I’ve made two assumptions above, one of which is good news for me and one of which is bad news for me.

First, the bad news: It would take me 167 years to play one chess game, assuming that my one-hidden-layer evaluation function was sufficiently good. I suspect that one layer isn’t enough, and that at least two or three layers would be needed, if not more.

The good news is that I’m not a computer, which means there is no reason I need to use bitboard representation as my starting point, potentially allowing me to reduce the size of the problem substantially.

Computers like 1’s and 0’s, and don’t care too much about “big numbers” like 773, which is why bitboard representation is optimal for computers.

But, for me, I can deal with any two- or three-digit number just as well as I can deal with a 1 or 0 from a memory perspective and almost as well as I can from a computational perspective.

I think I could squash my chessboard representation to under 40 digits, which would significantly reduce the number of operations necessary (although, may slightly increase the operation time).

In the next few days, I’ll discuss how I plan to reduce the size of the problem and optimize the evaluation function for my human brain, so that I can perform all the necessary evaluations in a reasonable timeframe.

Read the next post. Read the previous post.

Max Deutsch is an obsessive learner, product builder, and guinea pig for Month to Master.

If you want to follow along with Max’s year-long accelerated learning project, make sure to follow this Medium account.
