ChatGPT, Chess and Chinese Spy Balloons (or what is Concept Drift?)

Aaron Margolis
Published in Machine Minds AI
6 min read · Feb 14, 2023

This past weekend, I got a lot of engagement for a tweet analyzing ChatGPT’s ability to play chess. However, I was wrong, a victim of Concept Drift, the flip side of Confirmation Bias. Also this week, we learned that Unidentified Flying (or Floating?) Objects have been around a lot longer than we thought, and only now has the US realized it. Our failure to notice these balloons is another example of Concept Drift. So how can we avoid it in the future?

For me, it started with a tweet. Specifically, about ChatGPT playing chess.

ChatGPT is not good at chess. In fact, it often plays moves that are not just bad but illegal. Even when allowed to play any move it liked, such as bringing back its captured queen or teleporting its rook across the board, it still lost to Stockfish, a chess engine. ChatGPT is a Large Language Model: it is good at writing moves that look like chess moves, and it knows which moves are more common than others, but it has no sense of which ones are actually playable. I correctly noticed that many of ChatGPT’s moves were illegal, but for every illegal move I caught, I missed many more. I guessed 10% of the moves were illegal. Actually, almost half were.
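If you want to audit a game like this yourself, checking move legality takes only a few lines of code. Here's a minimal sketch using the open-source python-chess library; the move list is a short illustrative fragment, not the full game:

```python
# A minimal sketch of auditing a game transcript for illegal moves,
# using the open-source python-chess library (pip install chess).
import chess

def count_illegal_moves(san_moves):
    """Replay SAN moves, counting those the rules don't allow."""
    board = chess.Board()
    illegal = 0
    for san in san_moves:
        try:
            board.push_san(san)  # raises a ValueError on an illegal move
        except ValueError:
            illegal += 1  # skip it; the board now diverges from ChatGPT's
                          # imagined position, so over a long game this
                          # count is only approximate
    return illegal

opening = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6", "O-O", "O-O"]
print(count_illegal_moves(opening))  # Black's 5...O-O is illegal: prints 1
```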

Here’s the game, with ChatGPT’s illegal moves in bold italics:

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O O-O 6. Bxc6 dxc6 7. Re1 … 8. d3 d6 8. Nbd2 Nd7 9. Nc1 Nf6 10. Bg5 h6 11. Bh4 g5 12. Nxg5 hxg5 13. Bxg5 Nxg5 14. d4 exd4 15. f4 Qh4 16. fxg5 Qh1+ 17. Kxh1 Nf3 18. Qxf3 Qxf3 19. gxf3 Nf6 20. gxf6 gxf6 21. Ng3 Kh8 22. Rad1 Rg8 23. Rxd4 Rxd4 25. Re2 Rg1+ 26. Kxg1 Rg2+ 27. Rxg2 Kg7 28. Nh5+ Kh8 29. Nxf6 gxf6 30. b4 a5 31. bxa5 bxa5 32. Kf2 Ke7 33. h4 h5 34. Rg8 Rxg8 35. Ke3 Ke7 36. Kd4 Kxf6 37. Kd3 Kxd5 38. exd5

After Black’s 18th illegal move, a White pawn captures the Black king, ending the game. (Note: Black didn’t do anything on move 8, but I’m not sure if that was a mistake by ChatGPT or the game creator.) So why did I think there were only a handful of illegal moves?

There were two factors at play. The first is that I am primed to see moves as legal, since no game I watch or read ever contains illegal moves. That caused me to overlook Black’s second obviously illegal move (8…d6), which added a 9th Black pawn to the board, and the third (13… Nxg5), in which the knight moves diagonally rather than in an L. The second factor is that I came to my estimate of 10% illegal moves during the first part of the game and never updated it, even after six illegal moves in a row (moves 15 to 20). ChatGPT had clearly memorized the first few moves, which follow the Ruy Lopez, an opening named after the Spanish priest who popularized it in 1561 in one of the first books about chess. But Black can’t castle on the 5th move, because the bishop on f8 hasn’t moved and is still in the way. That was the illegal move I did notice.

In my defense, I issued a correction after someone pointed out my error. But as of this article’s publication, the original, wrong tweet has been viewed 100 times as often as the correction. The way of the web, I suppose.

Now what does this have to do with the Chinese spy balloon? After the Department of Defense noticed the huge balloon floating over US territory, it went back through old radar data and realized it had missed multiple earlier cases of these balloons, going back years. Like my failure to notice illegal moves, this oversight was an example of Concept Drift.

Concept Drift (or Data Drift) results from Confirmation Bias: we expect the future to resemble the past. It takes a jolt, like a giant balloon or someone pointing out your error, to make you go back and see the problem that was under your nose the whole time. Here’s an explanation from an April 2020 paper:

“Governments and companies are generating huge amounts of streaming data and urgently need efficient data analytics and machine learning techniques to support them making predictions and decisions. However, the rapidly changing environment of new products, new markets and new customer behaviors inevitably results in the appearance of concept drift problem. Concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways.”

I was a victim of the second type of drift identified in the paper: Gradual Drift. In April 2020, data scientists were confronting the more obvious case, Sudden Drift: recommendation engines trained before the pandemic became useless as consumer preferences changed almost overnight.
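To make the two patterns concrete, here's a toy illustration (my own, not the paper's method) that simulates both kinds of drift in a data stream and flags them with a generic two-sample Kolmogorov-Smirnov test against the training window:

```python
# A toy simulation of sudden vs. gradual drift. The detector is a
# generic two-sample Kolmogorov-Smirnov test comparing each new window
# of streaming data against the window the model was trained on.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=1000)  # "pre-pandemic" data

def window(kind, step, n_steps=10, size=500):
    """One window of new data whose distribution drifts over time."""
    if kind == "sudden":
        shift = 0.0 if step < n_steps // 2 else 2.0  # jumps halfway through
    else:  # gradual
        shift = 2.0 * step / n_steps                 # creeps up each window
    return rng.normal(loc=shift, scale=1.0, size=size)

for kind in ("sudden", "gradual"):
    flagged = [step for step in range(10)
               if ks_2samp(train, window(kind, step)).pvalue < 0.01]
    print(f"{kind} drift flagged at windows: {flagged}")
```

A sudden drift lights up the detector all at once; a gradual drift is sneakier, sliding under the significance threshold for the first few windows, which is exactly why I didn't update my 10% estimate.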

So what does this have to do with Chinese Balloons?

On February 6, the commanding general of Northcom, the command in charge of protecting North America from foreign threats, admitted that undetected Chinese spy balloons represented a “domain awareness gap.” After reanalyzing overlooked data, analysts realized that similar incursions had been occurring for over two years. How could there be a gap big enough for something the size of multiple buses to slip through?

To understand how these balloons went undetected, you have to realize that operating a radar, translating raw signals into knowledge, is hard.

Above is an example from Introduction to Radar Systems. A radar operator has to be able to identify those squiggles as a Lockheed WV-2 early-warning radar plane. In particular, a radar operator has to know that those white squiggles mean a friendly US plane, while a different set of squiggles means an unfriendly foreign one. It requires a lot of training, and split-second decisions can mean the difference between getting fighters in the air in time to intercept and being found and blown up.

What makes this problem even harder is that there are a lot of squiggles. To assist in quick identification, radars can use “speed gates” to filter out any radar return moving below a certain speed, for instance 75 knots (86 mph or 140 km/h). Anything slower is assumed to be a bird or a reflection, but definitely not an airplane or anything of concern. In other words, the Department of Defense didn’t find these balloons because it wasn’t looking for them.
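The logic of a speed gate is brutally simple, which is exactly why it creates a blind spot. A stylized sketch, with invented track data:

```python
# A stylized sketch of the "speed gate" described above: returns slower
# than the threshold are discarded before an operator ever sees them.
# The track data is invented; real radar processing is vastly more involved.
SPEED_GATE_KNOTS = 75  # below this, it's "a bird or a reflection"

tracks = [
    {"id": "A1", "speed_knots": 450},  # jet aircraft: passes the gate
    {"id": "B7", "speed_knots": 30},   # flock of birds: filtered out
    {"id": "C3", "speed_knots": 12},   # balloon drifting on the wind: filtered out too
]

alerts = [t for t in tracks if t["speed_knots"] >= SPEED_GATE_KNOTS]
print([t["id"] for t in alerts])  # ['A1'] -- the balloon never shows up
```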

So what can we do about Concept Drift? We need to regularly revisit old assumptions and confirm that our models are still working as intended. We shouldn’t need an event as dramatic as a pandemic, or as in-your-face as a balloon in the sky, to reevaluate our work. It is always tempting to declare a model “finished” as soon as it’s put into production, but checking on its performance 3, 6, and 12 months later is just as important. Having false confidence in an outdated model can be worse than having no model at all.
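In code, that check-up can be as simple as re-scoring the deployed model on fresh labeled data and comparing against its accuracy at deployment. A toy sketch, where the data, model, and 5-point alert threshold are all placeholders:

```python
# A toy sketch of a scheduled model check-up at 3, 6, and 12 months.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_batch(shift, n=2000):
    """Toy data whose true decision boundary drifts as `shift` grows."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + shift * X[:, 1] > 0).astype(int)
    return X, y

X0, y0 = make_batch(shift=0.0)  # the world as of deployment
model = LogisticRegression(max_iter=1000).fit(X0, y0)
baseline = accuracy_score(y0, model.predict(X0))

for months, shift in [(3, 0.3), (6, 0.8), (12, 1.5)]:
    X, y = make_batch(shift)
    acc = accuracy_score(y, model.predict(X))
    flag = "  <- investigate" if baseline - acc > 0.05 else ""
    print(f"month {months}: accuracy {acc:.2f} (was {baseline:.2f}){flag}")
```

The point isn't the specific threshold; it's that the comparison happens on a schedule, instead of waiting for a balloon-sized surprise to force it.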

We need to get away from expecting the expected. Here is a classic psychology test. Count the number of basketball passes in the video below.

How many passes did you count? Did you notice the person in the gorilla costume walking through the video? Or did you miss it because you were focused on the counting?

I was just as guilty of missing the gorilla as anyone watching that video. Because I was expecting a chess game, I missed most of ChatGPT’s illegal moves. Because Northcom was looking for airplanes, it missed balloons. Real-world data will always have things happening beyond the columns and rows. Lots of datasets shifted dramatically in spring 2020 to a “new normal,” and have slowly shifted back to a “new new normal.” Rather than throw them away, we must incorporate these changes into our models. The world is always growing and evolving. Our models of the world must do the same.
