How to run AI2-THOR simulations fast with Google Cloud Platform (GCP)

Yusun Liu
Apr 24 · 6 min read
[Figure: an AI2-THOR scene screenshot]

AI2-THOR is a 3D simulation tool. It provides environments that look similar to real-world scenes and are customizable, which makes it a good candidate for testing RL algorithms. However, it bears a computation bottleneck due to its 3D rendering engine, the Unity 3D game engine. In this blog, I want to share some experience on how to use a cloud platform (I used Google Cloud Platform, GCP) to run AI2-THOR for RL training fast and headless.

At first glance, it seems obvious that what you can run on a local machine can be moved to cloud computing to scale up, or at least to “silence” your job. But there are some caveats to conquer when 3D rendering, OpenGL and headless operation are involved. In the following, I describe a recipe of HOWTOs in case of TL;DR.

I will write another post about how to use this setup to run distributed RL training on multiple (2×) GPUs with multiple AI2-THOR instances (6×). The deep-learning framework is PyTorch.


Choose a cloud machine type

AI2-THOR requires OpenGL, and since I will use GPUs to train my neural network, I select a Compute Engine instance with GPUs. In my tests, it turns out that AI2-THOR, or rather the Unity 3D game engine, is rather CPU-hungry, so here is my configuration:

  • 2× Tesla K80 GPUs, each with 12 GB of graphics memory
  • 16 vCPU cores
  • 40–50 GB storage
  • Image: Intel® optimized Deep Learning Image: Base m24 (with Intel® MKL and CUDA 10.0)
[Figure: Compute Engine configuration]

You can also set the “Preemptibility” option to “on” if your job does not need to run continuously for more than 24 hours. This option massively reduces the hourly price.
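For reference, an instance like this can also be provisioned from the command line. A sketch only: the instance name and zone are placeholders, and the image family `common-cu100` is my assumption for a CUDA 10.0 Deep Learning image; check the available image families in the `deeplearning-platform-release` project before running it.

```shell
# Hypothetical example: a preemptible 16-vCPU instance with 2x K80.
# Adjust the name, zone, and image family to your project.
gcloud compute instances create thor-trainer \
  --zone=us-east1-c \
  --machine-type=n1-standard-16 \
  --accelerator=type=nvidia-tesla-k80,count=2 \
  --boot-disk-size=50GB \
  --image-project=deeplearning-platform-release \
  --image-family=common-cu100 \
  --maintenance-policy=TERMINATE \
  --preemptible
```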


Rules of thumb:

A Unity 3D process/thread requires 2–3 CPU cores. Multiple Unity processes (>3) can run on a single K80 GPU.
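As a back-of-the-envelope check, these rules of thumb translate into a simple capacity estimate. This is only a sketch: the 2–3 cores per instance and >3 instances per K80 figures come from my measurements above, and both parameters can be tuned.

```python
def max_thor_instances(vcpus: int, gpus: int,
                       cores_per_instance: int = 3,
                       instances_per_gpu: int = 3) -> int:
    """Estimate how many AI2-THOR (Unity) instances a machine can sustain.

    CPU is usually the binding constraint: each Unity process needs
    roughly 2-3 vCPU cores, while a single K80 can host several (>3)
    processes, so the GPU bound is treated as a soft limit here.
    """
    cpu_bound = vcpus // cores_per_instance
    gpu_bound = gpus * instances_per_gpu
    return min(cpu_bound, gpu_bound)

# On the 16-vCPU, 2-GPU configuration above:
print(max_thor_instances(16, 2))  # -> 5 (CPU-bound: 16 // 3)
```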


Configure headless server

The server image of the chosen cloud machine does not come with an X server and is not attached to a physical display, so I have to install the missing software packages and configure a “None” display. More subtly, the server does not have a sound card; sound is not required, but its absence causes unpleasant debug/rubbish logs in the terminal. There is a tutorial from ml-agents written for the AWS platform, but it applies to GCP as well. I adjusted some steps for the multi-GPU case.

```console
# Install the Nvidia driver if the system does not prompt you to do so;
# normally it does when you first start your cloud machine. Reboot afterwards.
# Install Xorg
$ sudo apt-get update
$ sudo apt-get install -y xserver-xorg mesa-utils
$ sudo nvidia-xconfig -a --use-display-device=None --virtual=1280x1024
# Get the BusID information, write down the bus IDs
$ nvidia-xconfig --query-gpu-info
```
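If you script the setup, the bus IDs can also be pulled out of the `nvidia-xconfig --query-gpu-info` output programmatically. A sketch, assuming the output contains lines of the form `PCI BusID : PCI:0:4:0`; verify the exact format against your driver version.

```python
import re

def parse_bus_ids(query_output: str) -> list:
    """Extract PCI bus IDs from `nvidia-xconfig --query-gpu-info` output."""
    return re.findall(r"PCI BusID\s*:\s*(PCI:\d+:\d+:\d+)", query_output)

# Hypothetical sample output for a 2-GPU machine:
sample = """
GPU #0:
  Name      : Tesla K80
  PCI BusID : PCI:0:4:0
GPU #1:
  Name      : Tesla K80
  PCI BusID : PCI:0:5:0
"""
print(parse_bus_ids(sample))  # -> ['PCI:0:4:0', 'PCI:0:5:0']
```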

You get a GPU bus ID like “PCI:0:4:0”, “PCI:0:5:0” for each GPU. Now create a file “xorg.conf” with the following content and save it anywhere, e.g. in your user home directory.

```
Section "ServerLayout"
    Identifier "Layout0"
    Screen 0 "Screen0"
EndSection

Section "Monitor"
    Identifier "Monitor0"
    VendorName "Unknown"
    ModelName "Unknown"
    HorizSync 28.0-33.0
    VertRefresh 43.0-72.0
    Option "DPMS"
EndSection

Section "Device"
    Identifier "Device0"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName "Tesla K80"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device "Device0"
    Monitor "Monitor0"
    DefaultDepth 24
    Option "UseDisplayDevice" "None"
    SubSection "Display"
        Virtual 1280 1024
        Depth 24
    EndSubSection
EndSection

Section "ServerFlags"
    Option "AllowMouseOpenFail" "true"
    Option "AutoAddGPU" "false"
    Option "ProbeAllGpus" "false"
EndSection
```

After that you can start your X server with the following commands:

```console
$ sudo Xorg -noreset -sharevts -novtswitch -isolateDevice "PCI:0:4:0" -config xorg.conf :0 vt1 &
$ sleep 1
$ sudo Xorg -noreset -sharevts -novtswitch -isolateDevice "PCI:0:5:0" -config xorg.conf :1 vt1 &
```
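With more GPUs, typing these commands by hand gets tedious. A sketch of a small helper that starts one X server per bus ID; the bus IDs and the xorg.conf path are placeholders, and `DRY_RUN=1` prints the commands instead of executing them (the real commands must run as root).

```shell
# Start one X server per GPU bus ID, assigning displays :0, :1, ...
start_xservers() {
    bus_ids="$1"        # e.g. "PCI:0:4:0 PCI:0:5:0"
    display=0
    for bus in $bus_ids; do
        cmd="Xorg -noreset -sharevts -novtswitch -isolateDevice $bus -config xorg.conf :$display vt1"
        if [ -n "$DRY_RUN" ]; then
            echo "$cmd"
        else
            sudo $cmd &
            sleep 1
        fi
        display=$((display + 1))
    done
}
```

Running `DRY_RUN=1 start_xservers "PCI:0:4:0 PCI:0:5:0"` lets you inspect the generated commands before launching them for real.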

Here I start two X servers with the DISPLAY identifiers :0 and :1. By doing so, I can run AI2-THOR separately on each GPU; by default it uses only the first GPU (which is actually complicated in this case). The parameters passed to Xorg are not arbitrary, but they can be adjusted. You can test with the OpenGL app glxgears on two terminals; both should give similar results:

```console
# console 1
$ glxgears -display :0
# console 2
$ glxgears -display :1
```
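With both displays working, each AI2-THOR process can be pinned to a GPU by giving it the display of the X server that owns that GPU. A minimal round-robin sketch; the commented `Controller` call reflects the ai2thor Python API, but check the `x_display` argument against your ai2thor version.

```python
import itertools

def assign_displays(num_envs: int, num_gpus: int) -> list:
    """Spread environments round-robin over the X displays :0, :1, ..."""
    displays = itertools.cycle(f":{i}" for i in range(num_gpus))
    return [next(displays) for _ in range(num_envs)]

# 6 environments over 2 GPUs:
print(assign_displays(6, 2))  # -> [':0', ':1', ':0', ':1', ':0', ':1']

# Each worker process would then start its controller on its display, e.g.:
# from ai2thor.controller import Controller
# controller = Controller(x_display=assign_displays(6, 2)[worker_id])
```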

Fix the sound card issue

As mentioned above, AI2-THOR, or rather Unity 3D, probably loads the ALSA module. Since no sound card is available when the AI2-THOR engine loads, the terminal is flushed with the following messages:

```console
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5007:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM default
```

The workaround to suppress the above messages is to create a file /etc/asound.conf with the following content. A detailed explanation may be found in this link.

```
pcm.!default {
  type plug
  slave.pcm "null"
}
```

Install AI2-THOR and run some tests

Installation of AI2-THOR is straightforward: just follow the README on the project site. Let’s run some tests with a test script and see whether the configured cloud instance can boost the performance of AI2-THOR.

https://gist.github.com/etendue/20c66b694b35568532651f5d7c00252c

This script provides options to run AI2-THOR environments in multiple processes or in a single process, as well as to distribute the environments equally across multiple GPUs. Here is the benchmark result. The numbers in the table are the rendering speed of AI2-THOR in FPS; the bigger, the better!
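The core of such a benchmark is simple to reproduce: time a fixed number of environment steps. A generic sketch, where `step_fn` stands in for whatever advances your AI2-THOR environment (e.g. a `controller.step(...)` call):

```python
import time

def measure_fps(step_fn, n_steps: int = 100) -> float:
    """Return steps per second over n_steps calls to step_fn."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_steps / elapsed

# With a real environment you would pass something like:
#   measure_fps(lambda: controller.step(action="MoveAhead"))
print(measure_fps(lambda: None, 10) > 0)  # -> True
```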

The results lead to the following observations:

  • Running multiple instances of AI2-THOR in a single process does not increase FPS very much, and FPS decreases quickly when too many instances are launched.
  • Running multiple instances in separate processes does boost FPS, roughly doubling it. It also decreases when too many instances are launched.
  • The 2nd GPU also boosts FPS performance.

The decrease is likely caused by CPU usage saturating when too many instances run. As mentioned, one instance of AI2-THOR requires 2–3 vCPU cores. We can postulate that with more CPU cores, e.g. 32, we could probably double the FPS.

Conclusion

Overall, with a proper setup of the GCP machine above, we can achieve a 2–3× speedup of AI2-THOR performance compared to a standard laptop with 8 vCPU cores and 1 GPU card. Note that this comparison considers only the FPS of AI2-THOR rendering. In the next blog, I am going to share some experience of training RL algorithms with AI2-THOR. The framework used is PyTorch, and the topics involved include the torch.distributed and multiprocessing packages.
