How I trained StarCraft 2 AIs using Google’s free GPUs

Franklin He
5 min read · Apr 17, 2018

TL;DR: If you would like to get started with a FREE StarCraft II Machine Learning environment, complete with GPU hardware, you can check out my starter Google Colab notebook here: https://colab.research.google.com/drive/1AzCKV98UaQQz2aJIeGWlExcxBrpgKsIV

Recently, I started a machine learning project for StarCraft II with a few friends. I believe that the ability to train neural networks quickly is crucial to a researcher’s success, and I wanted to give StarCraft AI researchers worldwide a reproducible, efficient environment in which code is also easy to share. So I set out to see if I could get StarCraft II running on Google Colab, Google’s free machine learning environment, which provides free GPU time.

However, after downloading StarCraft II and installing the necessary libraries, you will be greeted by:

I0331 08:30:17.832181 139972195997568 sc_process.py:148] Connection attempt 0 (running: None)
****** omitted reconnection attempts ******
I0331 08:32:17.048350 139972195997568 sc_process.py:148] Connection attempt 119 (running: -11)
I0331 08:32:18.050124 139972195997568 sc_process.py:180] Shutdown gracefully.
I0331 08:32:18.050344 139972195997568 sc_process.py:166] Shutdown with return code: -11
Failed to create the socket.
I0331 08:32:18.056085 139972195997568 sc2_env.py:327] Environment Close

Constructing additional Pylons does not fix this problem.

Figuring out the return code

The first thing I did was to figure out what’s behind the return code.

Reading the PySC2 source, I found the segment that sets the return code.

The poll() is from Python’s subprocess module. After more digging, it turns out that a negative return code of -N means the process was terminated by signal N, so signal 11 is what caused StarCraft to terminate.

Signal 11 is the infamous SIGSEGV, the signal that gives C programmers endless nightmares.
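
The decoding step can be sketched in a few lines of Python. This is only an illustration of the subprocess convention, not code from PySC2, and the helper name describe_return_code is made up for this example:

```python
import signal


def describe_return_code(return_code):
    """Translate a subprocess return code into a readable description."""
    if return_code is None:
        # poll() returns None while the child is still running.
        return "still running"
    if return_code < 0:
        # On POSIX, subprocess reports termination by signal N as -N.
        return "killed by " + signal.Signals(-return_code).name
    return "exited with status {}".format(return_code)


# -11 is the value PySC2 logged as "Shutdown with return code: -11".
print(describe_return_code(-11))
```

On Linux, signal number 11 maps to SIGSEGV, which matches the return code in the log.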

To confirm this, I found the SC2 executable and ran it by itself.

> !~/StarCraftII/Versions/Base59877/SC2_x64
Segmentation Fault (Core Dumped)

Welp.

Debugging, in HARD MODE

Normally, I’d just fire up my favourite debugging tools, and the article would become no more than a walkthrough on how to use GDB.

However, we are dealing with Google Colab, where all we get is a Jupyter Notebook web page. This means:

  1. No Debuggers
  2. No root privileges
  3. Limited tools e.g. no strace
RIP Debugging

When all you have is a web page…

The first step was to try different versions of StarCraft II on the server, as Blizzard provides versions 4.0.2, 3.17 and 3.16.1. Unfortunately, none of those worked.

I then decided to run StarCraft II on my local Linux machine, an environment I can reliably control and debug in. It also allows me to test my hypotheses.

First Guess: Required library not found

My initial suspicion was that StarCraft II, being a game, might require certain OpenGL functions and libraries that were not present in the Google Colab environment.

To test this, I ran StarCraft II on my local machine, this time under strace. This lets me track the system calls StarCraft II makes, and since shared libraries are loaded through the operating system, it lets me track down any missing dependencies or see if anything weird happens.

The full strace log can be seen here but here’s a small snippet:

execve("./SC2_x64", ["./SC2_x64"], 0x7fffc19e08b0 /* 49 vars */) = 0
brk(NULL) = 0x95bd000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
......
openat(AT_FDCWD, "/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
......
openat(AT_FDCWD, "/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
......
openat(AT_FDCWD, "/lib64/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3
......
openat(AT_FDCWD, "/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
...
openat(AT_FDCWD, "/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
......
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
.......

Looking at the results, StarCraft II doesn’t do anything extra other than dynamically linking against the standard C/C++ libraries, which rules out that theory.
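
For a longer log, the same reading can be automated. The following is a minimal sketch, assuming strace output in the format shown in the snippet (no PID prefixes from -f); the function name shared_library_opens is hypothetical:

```python
import re

# Matches open/openat calls as printed by strace, e.g.
#   openat(AT_FDCWD, "/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
OPEN_CALL = re.compile(
    r'^(?:open|openat)\((?:AT_FDCWD, )?"(?P<path>[^"]+)".*\) = (?P<result>-?\d+)'
)
# Shared objects end in ".so", optionally followed by version numbers.
SHARED_OBJECT = re.compile(r"\.so(?:\.\d+)*$")


def shared_library_opens(strace_lines):
    """Split shared-library open attempts into (loaded, failed) path lists."""
    loaded, failed = [], []
    for line in strace_lines:
        match = OPEN_CALL.match(line.strip())
        if match is None or not SHARED_OBJECT.search(match.group("path")):
            continue
        # A negative result (such as -1 ENOENT) means the open failed.
        target = failed if int(match.group("result")) < 0 else loaded
        target.append(match.group("path"))
    return loaded, failed


# Lines taken from the strace snippet above.
sample = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "/lib64/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3',
]
loaded, failed = shared_library_opens(sample)
print("loaded:", loaded)
print("failed:", failed)
```

An empty failed list is what you would expect to see if no dependency is missing.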

Why else would you segfault?

Since the same program does not crash on my local machine, a coding problem on Blizzard’s side is unlikely to be the cause.

A quick Google search on how to debug segfaults reminded me of Valgrind, which, to my surprise, worked on Google Colab.

A snippet of Valgrind’s output can be seen here:

==354== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==354== Access not within mapped region at address 0x8
==354== at 0x6B3DF0: ??? (in /content/StarCraftII/Versions/Base56787/SC2_x64)
==354== by 0x65FF97: ??? (in /content/StarCraftII/Versions/Base56787/SC2_x64)
==354== by 0x89CD5C6: MallocExtension::Initialize() (in /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.3.0)
==354== by 0x89B7D29: ??? (in /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.3.0)
==354== by 0x7B79AD9: call_init.part.0 (dl-init.c:72)
==354== by 0x7B79BEA: call_init (dl-init.c:30)
==354== by 0x7B79BEA: _dl_init (dl-init.c:120)
==354== by 0x7B69ED9: ??? (in /lib/x86_64-linux-gnu/ld-2.26.so)
==354== If you believe this happened as a result of a stack
==354== overflow in your program's main thread (unlikely but
==354== possible), you can try to increase the size of the
==354== main thread stack using the --main-stacksize= flag.
==354== The main thread stack size used in this run was 8388608.
==354==
==354== HEAP SUMMARY:
==354== in use at exit: 0 bytes in 0 blocks
==354== total heap usage: 4 allocs, 4 frees, 72,710 bytes allocated
==354==
==354== All heap blocks were freed -- no leaks are possible
==354==
==354== For counts of detected and suppressed errors, rerun with: -v
==354== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

Well, the only identifiable function is MallocExtension::Initialize() in libtcmalloc.so.4.3.0.
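
Picking that frame out by eye works for a short trace. For a longer one, a small filter helps. This is a rough sketch that assumes the frame layout shown in the Valgrind output above; the helper name resolved_frames is hypothetical:

```python
import re

# Valgrind stack frames look like:
#   ==354== by 0x89CD5C6: MallocExtension::Initialize() (in /usr/lib/.../libtcmalloc.so.4.3.0)
FRAME = re.compile(
    r"^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (?P<symbol>.+?) \((?P<where>.+)\)$"
)


def resolved_frames(valgrind_lines):
    """Return (symbol, object_path) for frames with a known symbol and object."""
    found = []
    for line in valgrind_lines:
        match = FRAME.match(line.strip())
        if match is None:
            continue
        symbol, where = match.group("symbol"), match.group("where")
        # "???" marks an unresolved symbol; "in <path>" names the binary or library.
        if symbol != "???" and where.startswith("in "):
            found.append((symbol, where[len("in "):]))
    return found


# Lines taken from the Valgrind snippet above.
trace = [
    "==354== at 0x6B3DF0: ??? (in /content/StarCraftII/Versions/Base56787/SC2_x64)",
    "==354== by 0x89CD5C6: MallocExtension::Initialize() (in /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.3.0)",
    "==354== by 0x7B79AD9: call_init.part.0 (dl-init.c:72)",
]
for symbol, path in resolved_frames(trace):
    print(symbol, "->", path)
```

Frames that carry a source location instead of an object path, such as the dynamic loader's call_init, are skipped by this filter.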

For those of you who don’t know TCMalloc, it is Google’s custom memory allocator, which is used by products such as Google Chrome.

Wait……

Back when I was stracing StarCraft II, I remember seeing only the C and C++ standard libraries being loaded. This doesn’t seem right. Where did TCMalloc come from?

Turns out, there is a way to force TCMalloc onto programs that are not compiled with it. By setting the LD_PRELOAD environment variable on Linux, you can load the TCMalloc shared library into the program, forcing it to use TCMalloc.

What does this look like on Google Colab I wonder…

AHA!
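
To repeat the check yourself, a minimal Python sketch follows. It assumes the preload is visible in the notebook kernel's environment, and the helper name preloaded_libraries is made up for this example:

```python
import os
import re


def preloaded_libraries(environ=None):
    """Return the shared objects listed in LD_PRELOAD, in order."""
    if environ is None:
        environ = os.environ
    value = environ.get("LD_PRELOAD", "")
    # The dynamic linker accepts entries separated by spaces or colons.
    return [entry for entry in re.split(r"[ :]+", value) if entry]


for library in preloaded_libraries():
    print("preloaded:", library)
```

An entry pointing at a TCMalloc shared library is the thing to look for. An empty result means nothing is being preloaded through this variable.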

The Solution

Unfortunately, changing the LD_PRELOAD environment variable from within the notebook does not seem to propagate to the rest of the environment.

By executing the following

!apt-get remove -y libtcmalloc*

I’ve managed to uninstall TCMalloc, and, despite the error messages, StarCraft II now runs, opening the door to machine learning with StarCraft II.

I’ve filed a bug against Google Colab so that we do not need this hackery in the future, but in the meantime, StarCraft II awaits.
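
On a machine where you control how the process is launched, a less invasive way to test the same hypothesis is to start a single program with LD_PRELOAD stripped from its environment. The sketch below was not part of the original debugging session and is untested on Colab, where environment changes reportedly did not propagate, so treat it as a diagnostic for other setups. The helper names are hypothetical:

```python
import os
import subprocess
import sys


def environment_without_preload(base_env=None):
    """Copy an environment mapping, leaving out LD_PRELOAD."""
    base = os.environ if base_env is None else base_env
    return {key: value for key, value in base.items() if key != "LD_PRELOAD"}


def run_without_preload(command, base_env=None):
    """Run a command with LD_PRELOAD removed and capture its output."""
    return subprocess.run(
        command,
        env=environment_without_preload(base_env),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


if __name__ == "__main__":
    # Probe: a child Python process reports what it sees in LD_PRELOAD.
    probe = [sys.executable, "-c", "import os; print(os.environ.get('LD_PRELOAD'))"]
    print(run_without_preload(probe).stdout.strip())
```

If a binary that segfaults normally runs cleanly when started this way, the preloaded library is the likely culprit.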

FOR STARCRAFT II MACHINE LEARNING

A shoutout to Paul, William and StarAI for telling me about PySC2 and bringing me onboard for some machine learning.

If you need someone who can debug hard problems and get things done: I am currently looking for opportunities.
