Automating Victory: Beating browser games with accessible Python (Part 2)

Jon Gaul
Published in henngeblog
10 min read · May 12, 2023

This one’s for the nerds

Welcome back to the promised part 2 of my previous post. So why split this up, aside from capturing a greater share of the attention economy? In short, one article per audience. Part 1 is for the people who want to see a speedrun; Part 2 is for the people who want in-depth details on how it works.

This blog post gets to be a grab bag of the topics that I found compelling along the way. So this one’s for the nerds who saw me mention “hackable code samples” last time and put a reminder in their calendars.

Becoming the MVP MVP

Speaking of hackable code samples, the simplest example I could come up with will track a window as you drag it around your screen. To get started, spin up a new Python environment and install MSS and OpenCV (pip install mss opencv-python). Next, take a small screenshot of something visually distinct (I used the sidebar from a Finder window) and save it as “target.png” in the same directory as the following script, like so:

A Finder window containing a folder labeled “venv”, a screenshot of the corner of the window labeled “target.png”, and a Python script labeled “demo.py”
import cv2
import numpy as np
from mss import mss
from time import time

# IMREAD_UNCHANGED keeps the screenshot's alpha channel, so its channel
# count matches the BGRA images that MSS produces.
target = np.array(cv2.imread("target.png", cv2.IMREAD_UNCHANGED), dtype=np.uint8)
screen_capture = mss()
while True:
    start = time()
    # Grab the entire primary monitor...
    screenshot = screen_capture.grab(screen_capture.monitors[1])
    image = np.array(screenshot, dtype=np.uint8)
    # ...and slide the target over it, scoring the match at every position.
    result = cv2.matchTemplate(image, target, cv2.TM_CCOEFF_NORMED)
    # The best-scoring position is our guess for where the target is.
    _min_amt, max_amt, _min_loc, max_loc = cv2.minMaxLoc(result)
    print(
        f"Target at ({max_loc[0]}, {max_loc[1]}),",
        f"{100 * max_amt:.1f}% confidence,",
        f"{(time() - start):.2f} sec"
    )

When you run that script, it should look something like the gif below. It’ll constantly print where it thinks your target image is on the screen, even as you drag the window around.

A gif of a terminal window reporting the location of a Finder window. It updates once every 1.3 seconds
Trust me, this is a gif and not a still image.

This is so agonizingly slow that you might forget how impressive it is that so little code can do something so sophisticated. You might think I’m running this on a potato and not the M1 Mac, with its 16-inch Liquid Retina XDR display (3456 × 2234), that my company gives its engineers.

But as I brag about hardware I don’t technically own, I bring up a good point. The display has a lot of pixels with three color values each, and OpenCV is trying to match all of them. “Find the image” sounds a lot simpler than “match a subset of a 3-D array with millions of entries”. But we can be smarter! How much simpler can we make the task before it fails? Removing color might be a good start, as would dialing down the resolution:

import cv2
import numpy as np
from mss import mss
from time import time

x = 4  # scale factor
# This time, load the target in grayscale: one channel instead of four.
target = np.array(cv2.imread("target.png", cv2.IMREAD_GRAYSCALE), dtype=np.uint8)
screen_capture = mss()
while True:
    start = time()
    screenshot = screen_capture.grab(screen_capture.monitors[1])
    # Drop the screenshot's color information as well...
    image = cv2.cvtColor(
        np.array(screenshot, dtype=np.uint8),
        cv2.COLOR_BGRA2GRAY
    )
    # ...and keep only every x-th pixel in each direction, shrinking the
    # search to 1/16th of the original number of pixels.
    result = cv2.matchTemplate(
        image[::x, ::x], target[::x, ::x],
        cv2.TM_CCOEFF_NORMED
    )
    _min_amt, max_amt, _min_loc, max_loc = cv2.minMaxLoc(result)
    print(
        # Multiply by x to translate back to full-resolution coordinates.
        f"Target at ({x * max_loc[0]}, {x * max_loc[1]}),",
        f"{100 * max_amt:.1f}% confidence,",
        f"{(time() - start):.2f} sec"
    )
A gif of a terminal window reporting the location of a Finder window. It updates once every 0.07 seconds
UNLIMITED POWER

So it’s not quite as confident and takes a bit more code, but it’s roughly 20x faster with just a little cleverness. For a video game, you could speed things up further by identifying the coordinates of the game window and only taking a screenshot of that region.
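For instance, instead of a whole monitor, MSS will happily grab a dictionary describing a single region. A minimal sketch (the coordinates here are invented; you’d measure your own game window):

from mss import mss

screen_capture = mss()

# Hypothetical coordinates -- in practice you'd locate the game window
# first, e.g. with one full-screen template match at startup.
game_region = {"left": 100, "top": 200, "width": 640, "height": 480}

# Grabbing just this region is far cheaper than screenshotting a monitor.
screenshot = screen_capture.grab(game_region)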

Meanwhile, if you’re more interested in how I automatically controlled the mouse, there’s no better overview than Al Sweigart’s incredible book, Automate the Boring Stuff with Python. Specifically, Chapter 20 talks all about the PyAutoGUI library and provides some of the best demo code out there.
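For a taste, here’s a minimal sketch of the kind of PyAutoGUI calls involved (the coordinates are placeholders, not anything from my script):

import pyautogui

# Placeholder coordinates -- in practice they'd come from matchTemplate.
# Glide the cursor to (640, 360) over 0.2 seconds, then click.
pyautogui.moveTo(640, 360, duration=0.2)
pyautogui.click()

One gotcha when pairing this with MSS on a Retina display: screenshots come back in physical pixels while PyAutoGUI works in logical points, so matched coordinates may need to be scaled down by the display’s scaling factor.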

It’s kind of mind-blowing how straightforward it is to hook into the inputs and outputs of literally anything you do on a computer. But you know what wasn’t straightforward? How I optimized the strategies using image processing tricks.

“Convoluted?” I find it quite simple

One of the sections I trimmed down for the general audience was about how I sped up the simplest strategies. Once again, that strategy is: “if a revealed tile’s number is less than or equal to your level, all of its unrevealed neighbors can be clicked” and its pseudocode is:

for row in range(grid_height):
    for column in range(grid_width):
        if grid[row, column] ≤ level:
            for each neighbor of (row, column):
                if neighbor is unclicked:
                    safe_to_click.add(neighbor)

The code above isn’t objectively bad; nobody reading this should get discouraged because their code looks like this. This section is more about how helpful it is to have broad knowledge of different fields in programming. Because “for each element in a grid, do something based on its neighbors” sounds a lot like numerical convolution.

Convolution shows up in a few places; in image processing, it’s used to transform an image pixel by pixel based on each pixel’s neighbors. A moving window of weights, here called a “kernel,” is multiplied element-wise at each pixel and summed to produce a new pixel value, eventually generating a whole new image. Enjoy this gif from Wikipedia that explains it better than I can:

An animation of 2d convolution
https://en.wikipedia.org/wiki/Kernel_(image_processing)

It’s a cool process — the operation is fairly simple but it can be used for complex tasks like blurring or sharpening images. It shows up in machine learning contexts as well, and the coolest part is that a lot of brilliant image processing and machine learning nerds have heavily optimized convolution to be quick and efficient. To get that speed for my project, first I get a boolean array of whether the visible numbers are less than or equal to my current level — easy with NumPy.

A section of a game board, squares with numbers less than or equal to 3 have been highlighted.
(Omitting highlighting for distant low numbers)

Next, I use a kernel that adds up all the neighbor values in that boolean array. Performing that convolution gives an array where every value equals the number of adjacent “safe” numbers. This will be over 0 if I can click that square, but it’ll also flag a bunch of already-revealed empty squares and monsters. Finally, I use NumPy’s boolean array operators to find the overlap between these safe squares and the unrevealed squares.

A follow-up to the previous image. The same section of board has been transformed: all squares adjacent to the highlighted squares are now highlighted. If an adjacent square is unclicked, it is marked with an “x” to denote that it can be clicked safely.

Or in pseudocode:

low_neighbors = known_neighbors <= game_level
kernel = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
# mode="same" keeps the output the same shape as the game board
has_low_neighbor = scipy.signal.convolve2d(low_neighbors, kernel, mode="same") > 0
safe_to_click = numpy.logical_and(has_low_neighbor, unrevealed_tiles)

Just like that, a fast and simple way to find safe tiles to click that uses no for-loops! And this is the simplest use of convolution in this project. There’s an upgrade to this exact strategy that only requires one additional step. You can take the visible numbers and subtract the values of any known adjacent monsters. That resulting grid more accurately reflects the value of unknown tiles and can be used instead of “known_neighbors” above. The best way to get that grid is, you guessed it, another convolution.
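Here’s a toy sketch of that upgrade (the variable names and the tiny board are mine, not the project’s):

import numpy as np
from scipy.signal import convolve2d

# Toy stand-ins for the real game state: each revealed tile shows the sum
# of its neighbors' monster levels, and the center is a revealed level-2
# monster.
visible_numbers = np.array([
    [2, 2, 2],
    [2, 0, 2],
    [2, 2, 2],
])
known_monster_levels = np.array([
    [0, 0, 0],
    [0, 2, 0],
    [0, 0, 0],
])

kernel = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
])

# One convolution sums the levels of the revealed monsters around each tile...
adjacent_monster_levels = convolve2d(known_monster_levels, kernel, mode="same")

# ...so subtracting it from the visible numbers leaves only the threat from
# still-unrevealed tiles. Here everything drops to zero: with the level-2
# monster accounted for, every remaining neighbor is provably safe.
effective_numbers = visible_numbers - adjacent_monster_levels
print(effective_numbers)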

Or if you’re really crazy, you can use convolution to encode the relative location of a single unrevealed neighbor. My code for that is right on the line between brilliant and stupid.
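I won’t spoil which side of the line my version lands on, but one way to pull this off (my own sketch, not necessarily the repo’s approach) is a powers-of-two kernel:

import numpy as np
from scipy.signal import correlate2d

# Give each neighbor direction its own bit. A tile with exactly one
# unrevealed neighbor then produces a unique power of two that says
# exactly where that neighbor sits.
direction_kernel = np.array([
    [1,   2,   4],
    [8,   0,  16],
    [32, 64, 128],
])

unrevealed = np.array([
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
])

# correlate2d slides the kernel without flipping it, which keeps the
# bit-to-direction mapping readable.
codes = correlate2d(unrevealed, direction_kernel, mode="same")

# codes[1, 1] == 2: the center tile's one unrevealed neighbor is due north.
print(codes)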

A pair of 3d CGI lockets opening in sync. One has the numpy logo and “my beloved”, the other has the scikit-learn logo and “my other beloved”

Convolution is a powerful tool that fits the project perfectly, but I only happen to know about it due to some computer vision background. One of my favorite parts of working in programming is how much it pays off to have a diverse set of interests. If you’re looking to diversify your interests, 3blue1brown has an excellent video on what convolution is and why it’s cool.

What’s next?

There are a couple of lingering bugs. One particularly weird one happens because the script puts in too many inputs too fast, and the game can’t update the visuals in time for the next screenshot. Given that this bug arises from the limit of how quickly the game can be played, there’s not much extra time to shave off beyond luckier starts.

Speaking of luck, the next real achievement would be completing the blind difficulties more consistently. As a reminder, you don’t level up in those modes; you only win once you’ve clicked every tile that doesn’t contain a monster. My struggles with that particular mode were a big part of why I started this project in the first place. The mode is hard and takes a lot of guesswork, even with optimal play, but surely there are some ways to guess better.

A repeating gif of the bot quickly failing at the hardest difficulty.
I have lots of footage like this

Machine learning is well-suited for solving problems like this and only has to be better than a random guess to increase my win rate. Here, I could set up a network that takes in the information around a given tile and outputs the estimated value for that tile. Getting training data would be as simple as recording the actual monster positions every time I fail. The biggest problem with this is that even though the actual process is simple, implementing it would take a lot of time and effort for something that may or may not pay off.
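For a sense of scale, the model itself would only be a few lines; the time sink is everything around it. Here’s a sketch using scikit-learn with random stand-in data (the real features and labels would come from those recorded failures):

import numpy as np
from sklearn.neural_network import MLPRegressor

# Random stand-in data: in practice, each row would be the flattened 5x5
# neighborhood around a tile (recorded from failed runs), and each label
# the monster level that actually turned out to be on that tile.
rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(1000, 25))
y = rng.integers(0, 6, size=1000)

model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500)
model.fit(X, y)

# At play time: estimate the value of an unknown tile from its surroundings.
neighborhood = rng.integers(0, 6, size=(1, 25))
print(model.predict(neighborhood))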

An alternative that’s guaranteed to pay off would be to skip the guesswork and use statistics instead. There are multiple possible arrangements of monsters, and the ones that use more common monsters are likelier than others. If we wanted to quantify that likelihood, we could compute all the valid arrangements given the currently-known information, then find the mathematically safest square to click.
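As a toy illustration of that idea (the tiles and constraints are invented, and every arrangement is weighted equally rather than by monster frequency):

from itertools import product

# Toy setup: three unknown tiles A, B, and C can each hold a monster of
# level 1-3 or nothing (0). One revealed number says its unknown neighbors
# A and B sum to 2; another says B and C sum to 3.
levels = range(4)
constraints = [
    (("A", "B"), 2),
    (("B", "C"), 3),
]

valid = []
for a, b, c in product(levels, repeat=3):
    assignment = {"A": a, "B": b, "C": c}
    if all(sum(assignment[t] for t in tiles) == total
           for tiles, total in constraints):
        valid.append(assignment)

# The safest click is the tile that's empty in the largest share of valid
# arrangements. (A real solver would also weight arrangements by how common
# each monster level is.)
for tile in "ABC":
    p_safe = sum(v[tile] == 0 for v in valid) / len(valid)
    print(f"{tile}: safe in {p_safe:.0%} of {len(valid)} arrangements")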

What’s great about statistical strategies is that they’re pretty close to existing problems in the field! Partition problems take a collection of possible values (monster levels) and see if they can be combined to make certain other values (the visible numbers). If I could find an existing solver, I could save some serious effort!

A pile of blocks of different sizes. They are stacked into piles which sum to size ten.
https://codegolf.stackexchange.com/questions/48486/the-partition-problem-sorting-stacks-of-boxes

Except that partition problems are NP-hard. There’s no quick way to calculate the values I need perfectly, and finding one would make me wildly famous. A little research shows that while finite versions of Minesweeper (and its variants) are NP-complete to solve, infinite versions are, in fact, Turing complete. This isn’t the only game to be unexpectedly Turing complete, but it does imply that with a large enough Minesweeper board, you could encode any arbitrary program into the mine placement such that beating the game executes that program.

An illustration of an AND Gate which uses a minesweeper board.
Yes, that means you could program Minesweeper *in* Minesweeper. Image from Richard Carini, “Circuits, Minesweeper, and NP Completeness”

Despite how hopeless this makes it sound, I’m still tempted to revisit this project with improvements. Things like NP-hardness only indicate the class of problem, not necessarily how hard it is to make progress. In real life, a hacked-together neural network or a few passes with a basic statistical solver might result in tangible improvements to this program. After hours and hours of work getting the script to work, getting it to play and win, and now writing up two blog posts on the topic, somehow, I still find this compelling enough to keep going.

My text editor autocompletes “AI is so hot ” with “right now,” so you know it’s true.

It’s funny how this story started because I was annoyed at losing in a video game. But there’s something unsettling in how simple it was to go from there to a script that interacts with the world in a human-like way. The “look → think → click” loop it runs was kind of cute when I finished the project last year, but it’s a little intimidating after half a year of large language models like ChatGPT redefining what’s possible during that “think” step.

A DALL-E generated image with the prompt, “A computer using robot arms to control its own mouse and keyboard, oil painting style”
It’s the same weird feeling I get when I use revolutionary image-generation technology to make a silly little image for my silly little blog post.

I’d be overjoyed if reading this has inspired someone to automate some annoying task or improve their productivity. (Side note: if you look at this and realize that your entire job can be automated, that’s rad as hell, please reach out so I can congratulate you.) But I’m not blind; I can see the litany of obvious antisocial uses for this.

Take Twitter’s controversial decision to shut down its free API tier earlier this year. It will cripple shadowy botnets and innocent personal projects alike, but it can’t stop users from writing tweets. A script like mine, designed to generate and post tweets through the usual UI, is almost impossible to stop, even with added captchas and identity verification.

A screenshot from a captcha-solving service which offers its users the ability to solve captchas via an API or work for the company by solving those captchas.
This unnamed captcha-solving service was not hard to find

For the hackers in the audience, there’s never been a better time to test the limits of what you can automate. For the developers trying to stay ahead of the hackers in the audience, you’ve got a tough job ahead of you. As always, the best defense against automated bad actors is to cultivate a culture where bad actors are rare enough to stand out.

Ultimately, automation is a tool that humans will use to enhance human ability, and I think that, on the whole, that’s a good thing. It’s why I wrote this post. Hopefully, you’re inspired to add something cool to the world. Anyway, here’s my script beating Mamono Sweeper’s hardest difficulty in 13 seconds.

Thanks for reading.

To look at the script I wrote, check it out on my GitHub. Also, check out and order my kid’s book at afraidalot.com.


Jon's passion for learning has led him to Tokyo, Japan, where he works as a software engineer at HENNGE. He can be reached through Afraidalot.com.