Day 4 & 5 of the OpenAI Retro Contest

Getting the baseline Rainbow DQN agent to work & debugging the infamous ImportError: libcuda.so.1: cannot open shared object file

Tristan Sokol
Apr 11, 2018

Fresh from snagging second place on day three by submitting the JERK baseline implementation, I figured that getting the better-performing Rainbow baseline agent to work was the logical next step, both for learning and for climbing the leaderboard.

I thought it would be fairly easy after everything I learned from the JERK agent, but being wrong about how easy something will be seems to be a theme for our team.

Between copying the code from GitHub and getting a successful agent running, I ran into the following snags:

  • I didn’t have TensorFlow installed, but that was a quick pip3 install away.
  • Then I didn’t have something called anyrl 🤷🏻, but pip solved that for me too.
  • I had copied the file from GitHub instead of cloning the repo, so I spent a cycle grabbing sonic_util.py as well (rough setup commands below).
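
All told, the setup was roughly this (cloning the whole baselines repo up front would have saved me that last snag; package and repo names here are from memory):

    # install the dependencies the rainbow agent was complaining about
    pip3 install tensorflow anyrl

    # clone the whole repo so sonic_util.py and friends come along for free
    git clone https://github.com/openai/retro-baselines.git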

Then the real issues started!

Lots & lots of red in the terminal. The lifesaver:
  • pip3 install opencv-python

And now I finally got to the socket connection error from yesterday! I remembered that I needed to build the Docker image, and went through my handy-dandy local evaluation script.
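
It was more or less the build-then-run flow from the contest docs (the image tag, game, and state here are just placeholders):

    # build the agent image from the rainbow dockerfile
    docker build -f rainbow.docker -t rainbow-agent:v1 .

    # run a local evaluation against the image (--no-nv skips nvidia-docker)
    retro-contest run --agent rainbow-agent:v1 \
        --results-dir results --no-nv --use-host-data \
        SonicTheHedgehog-Genesis GreenHillZone.Act1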

Then I waited for gigs of data to push… …. .…. …… ……. ………
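
The push itself is plain Docker; $DOCKER_REGISTRY below stands in for the per-team registry the contest gives you:

    # tag the locally built image for the contest registry and push it up
    docker tag rainbow-agent:v1 $DOCKER_REGISTRY/rainbow-agent:v1
    docker push $DOCKER_REGISTRY/rainbow-agent:v1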

The Big One

I spent twice as much time trying to solve this as I did on any of the other issues. I found out that TensorFlow needs Nvidia’s CUDA drivers to run, which is kind of tricky when your computer does not have a GPU. Apparently a common solution is to just install the drivers anyway, but when I installed them from Nvidia’s site, nothing changed. I tried brew cask install cuda. Same old

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

staring me back in the face. I tried restarting and reinstalling, to no avail. I found a GitHub issue that suggested checking whether you even have the file, find / -type f -name libcuda.so.* -exec dirname {} \; 2>/dev/null, which was neat, and worked, which just left me more confused. Lots of searching and asking on Discord, where other people had the same issue. I was convinced that there was just some kind of location issue where TensorFlow couldn’t find my drivers. That was not the case.
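
In hindsight, the quicker check would have been which TensorFlow build was inside the image in the first place; the GPU build needs libcuda.so.1 just to import, while the plain build doesn’t:

    # tensorflow-gpu needs libcuda.so.1 at import time; plain tensorflow does not
    pip3 show tensorflow tensorflow-gpu

    # importing is the fastest way to trigger (or rule out) the error
    python3 -c "import tensorflow as tf; print(tf.__version__)"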

Turns out I was just continuously building an image that needed a GPU to run, when what I needed was the non-GPU version. I tried finding dockerfiles that used CPU TensorFlow, like this one from Datmo, but that never worked. It turns out you can keep two different dockerfiles: one you build locally without GPU support, and one you build with GPU support for submission. My real issue is that I still don’t really understand how Docker works, which is why I kept trying to use an image that needed a GPU on a machine without one. Luckily another contestant posted this dockerfile that saved me: pgfarley/rainbow.docker
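
The upshot, roughly (the CPU dockerfile name here is made up; the real difference is that it installs plain tensorflow instead of tensorflow-gpu):

    # local image from a CPU dockerfile, for evaluating on a machine with no GPU
    docker build -f rainbow-cpu.docker -t rainbow-agent:cpu .

    # GPU image from the original dockerfile, for submission
    docker build -f rainbow.docker -t rainbow-agent:v1 .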

With my local evaluation working (I had submitted the GPU version a while earlier; no real gains, it seemed), I decided I had had enough “fun” for a couple of days.

--

Tristan Sokol

Software Lead at NorthPoint Development. When I’m not helping automate a real estate company, I’m growing succulents in my back yard. https://tristansokol.com/