Using Mask R-CNN in the streets of Buenos Aires

Nicolás Metallo
4 min read · Mar 27, 2018

Google has recently made Colaboratory (Colab) publicly available: a research tool where you can develop your own deep learning applications with a free NVIDIA Tesla K80 GPU for up to 12 hours at a time.

There are great posts here and here where you can learn more about how to set up your Google Colab environment and what features are included. On our side, we will start by authorizing the notebook with our Google account.
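A minimal sketch of that step, using the standard google.colab auth helper (the original snippet is not reproduced in this post):

from google.colab import auth

# Opens a link in a new tab; sign in with your Google account, accept,
# and paste the generated verification code back into the notebook
auth.authenticate_user()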

The notebook will ask you to click on a link that opens a new tab, where you can sign in with your Google account and accept the Colab terms. Copy the generated code, paste it back into the notebook, and continue.

Facebook AI Research (FAIR) has been doing some amazing work in Computer Vision, and one of the latest things they have open-sourced is their work on Mask R-CNN. These networks are great for object detection and image segmentation, although they are not the only option out there. YOLO, or “You Only Look Once,” works by looking at the whole image at test time, so its predictions are informed by the global context of the image. It runs much faster than R-CNN, around 10x, but lacks image segmentation. Its Tiny-YOLO version is great for mobile applications.

We will be following this implementation in Python 3, Keras, and TensorFlow. The repo is really well documented and has plenty of example notebooks for testing the model in different scenarios.

Source: https://github.com/matterport/Mask_RCNN

After setting up the notebook, we need to import MS COCO, a large-scale object detection, segmentation, and captioning dataset. I had some issues at first, but eventually this code worked out perfectly.

# Dependencies for the COCO API
!pip install -U scikit-image
!pip install -U cython
# Quotes keep the shell from treating '&' as a background operator
!pip install "git+https://github.com/waleedka/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"
# Build and install pycocotools from the official COCO repo
!git clone https://github.com/pdollar/coco.git
!cd coco/PythonAPI && make
!cd coco/PythonAPI && make install
!cd coco/PythonAPI && python3 setup.py install
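With the COCO API installed, the pre-trained weights can be loaded in inference mode. Here is a minimal sketch following the repo's demo notebook; module paths such as mrcnn.model and samples.coco vary between versions of the repo:

import os
import mrcnn.model as modellib
from mrcnn import utils
from samples.coco import coco

class InferenceConfig(coco.CocoConfig):
    # Run detection on one image at a time
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

# Create the model in inference mode and load the COCO-trained weights
model = modellib.MaskRCNN(mode="inference", model_dir="logs",
                          config=InferenceConfig())
if not os.path.exists("mask_rcnn_coco.h5"):
    utils.download_trained_weights("mask_rcnn_coco.h5")
model.load_weights("mask_rcnn_coco.h5", by_name=True)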

Now we need to set up our visualization functions. This implementation uses matplotlib to draw shapes and labels, but in order to run the model over a video stream (camera or external source) we will use OpenCV. The following video by Mark is a great starting point; he goes through every step in great detail.

Step by step tutorial on running a Mask R-CNN with OpenCV
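Since the repo's visualize.display_instances relies on matplotlib, here is a rough OpenCV-based sketch of the same idea for a single detection (the function name, default color, and blending factor are my own choices):

import cv2
import numpy as np

def draw_instance(frame, roi, mask, label, color=(0, 255, 0), alpha=0.5):
    # Blend the binary mask into the frame, then draw the box and label
    y1, x1, y2, x2 = [int(v) for v in roi]
    frame[mask] = ((1 - alpha) * frame[mask] + alpha * np.array(color)).astype(np.uint8)
    cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, label, (x1, y1 - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
    return frame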

There’s a slight chance that you’ll find yourself faced with this problem:

AttributeError: module 'PIL.Image' has no attribute 'register_extensions'

So here’s the solution that worked for me:

# !pip install --no-cache-dir -I pillow

import PIL.Image

def register_extension(id, extension):
    PIL.Image.EXTENSION[extension.lower()] = id.upper()
PIL.Image.register_extension = register_extension

def register_extensions(id, extensions):
    for extension in extensions:
        register_extension(id, extension)
PIL.Image.register_extensions = register_extensions

In this tutorial, I’m going to focus on video streams found on YouTube to show how this network might work on real footage. During my tests, and depending on the video resolution, I got an average of 1 fps, far from the 5 fps reported in the original paper and even further from the 35 fps reported for YOLOv3.

This is the function I’ll be using to download videos from YouTube to our notebook folder. Source here and here.

from __future__ import unicode_literals
!pip install --upgrade youtube-dl  # install if you don't have it
import youtube_dl

def YouTube_download(url):
    ydl_opts = {
        'outtmpl': 'yt-video.%(ext)s'
    }
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

Once we have our MP4 video, we will run a function that goes through each frame and runs our detection model. This includes detecting the bounding boxes, with their corresponding x, y positions for each instance detected (car, person, etc.), and applying a mask layer on top of each detected object to match its shape. I also added a bit of code to choose whether you want all instances of the same class to share one color or every new instance, regardless of class, to get a different one. I found the first option much simpler to follow. Here’s a great example.

Source: Karol Majek @ https://www.youtube.com/watch?v=OOT3UIXZztE
Single frame detection

This code is not optimized for speed, so be prepared to wait around 30 seconds per second of video for processing.
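The full frame-processing function isn’t shown in this post, but a hypothetical sketch with the same signature could look like the following; it assumes model is the network loaded earlier, class_names holds the COCO labels, draw_instance is the OpenCV helper from above, and max_fps caps the number of frames processed:

import os
import cv2

def video_to_frames(input_vid, output_loc, max_fps=None):
    os.makedirs(output_loc, exist_ok=True)
    capture = cv2.VideoCapture(input_vid)
    count = 0
    while capture.isOpened():
        ok, frame = capture.read()
        if not ok or (max_fps is not None and count >= max_fps):
            break
        # The model expects RGB; OpenCV reads frames as BGR
        r = model.detect([cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)], verbose=0)[0]
        for i in range(r['rois'].shape[0]):
            frame = draw_instance(frame, r['rois'][i], r['masks'][:, :, i],
                                  class_names[r['class_ids'][i]])
        cv2.imwrite(os.path.join(output_loc, 'frame-%05d.jpg' % count), frame)
        count += 1
    capture.release()

With that in place, the driver code below downloads a video and processes it: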

IMAGE_DIR = "output-dir"  # dir to save images

# Download YT video
YouTube_download("https://www.youtube.com/watch?v=bAhprdemJKE")

# Run detection and output frames
video_to_frames(input_vid="yt-video.mp4", output_loc=IMAGE_DIR, max_fps=30*60)

Once all the resulting frames are processed, you can join them again into a single MP4 file with OpenCV, through a slightly modified version of this code.
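As a rough sketch of that step (the frame naming pattern and the 30 fps output rate are assumptions):

import glob
import cv2

# Collect the processed frames in order and write them out as one MP4
frames = sorted(glob.glob('output-dir/frame-*.jpg'))
height, width, _ = cv2.imread(frames[0]).shape
writer = cv2.VideoWriter('result.mp4', cv2.VideoWriter_fourcc(*'mp4v'),
                         30.0, (width, height))
for path in frames:
    writer.write(cv2.imread(path))
writer.release()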

Final result of a Mask R-CNN running on Buenos Aires streets
Source: https://www.youtube.com/watch?v=xTByN4uVoCo

Finally, you can also train the model on your own dataset by following the original repo, but be mindful that the original paper mentions really long training times.
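For reference, a hypothetical sketch of fine-tuning with the repo's API, assuming config is a training configuration for your dataset and dataset_train / dataset_val subclass mrcnn.utils.Dataset:

# Start from the COCO weights, skipping the layers whose shapes depend
# on the number of classes
model = modellib.MaskRCNN(mode="training", model_dir="logs", config=config)
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# Train only the randomly initialized head layers first
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=30, layers='heads')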

You can always refer to my own notebook, where you can go step by step through every function mentioned in this post. Hope this is useful, and I’m looking forward to your comments!
