How to embed Detectron2 in your computer vision project

Use the power of the Detectron2 model zoo.

Jarosław Gilewski
Dec 1, 2019 · 9 min read

This article will show you how to efficiently use Detectron2 pre-trained models for inference, using a modular computer vision pipeline for video and image processing.

Table of contents:

  1. What is Detectron2?
  2. Project setup
  3. Project structure
  4. Image processing
  5. Video processing
  6. Background separation

What is Detectron2?

Detectron2 is the object detection and segmentation platform released by Facebook AI Research (FAIR) as an open-source project.

It is the second generation of the library: the original Detectron was written in Caffe2 and later reimplemented in PyTorch 1.0 as maskrcnn-benchmark. Detectron2 is a ground-up rewrite and extension of those earlier efforts in PyTorch.

FAIR’s team states the motivation behind the project well:

“We built Detectron2 to meet the research needs of Facebook AI and to provide the foundation for object detection in production use cases at Facebook. We are now using Detectron2 to rapidly design and train the next-generation pose detection models that power Smart Camera, the AI camera system in Facebook’s Portal video-calling devices. By relying on Detectron2 as the unified library for object detection across research and production use cases, we are able to rapidly move research ideas into production models that are deployed at scale.”

The Detectron 2 model examples

Beyond state-of-the-art object detection algorithms, Detectron2 includes numerous models for instance segmentation, panoptic segmentation, pose estimation, DensePose, and TridentNet.

You can look at the Detectron2 Model Zoo site to find a broad set of baseline results and trained models to start with.

In addition, Facebook AI’s computer vision engineers created Detectron2go, an extra layer that will allow easy, optimized model deployment to production (not yet released at the time of writing).

Project setup

Start with cloning the following repository:

$ git clone git://
$ cd detectron2-pipeline
$ git checkout 9460e3806c3ef5208ba8e5b4099fcb75ef6f39d1

The commit 9460e3806c3ef5208ba8e5b4099fcb75ef6f39d1 indicates the source code compatible with the content of this story.

Create and activate the environment using Conda:

$ conda env create -f environment.yml
$ conda activate detectron2-pipeline

The created environment includes all the requirements we need to set up Detectron2 and run our project.

Next, we need to clone and install Detectron2 itself.

# We want to clone Detectron2 outside of our detectron2-pipeline repo
$ cd ..
$ git clone https://github.com/facebookresearch/detectron2.git
$ cd detectron2
$ git checkout 3def12bdeaacd35c6f7b3b6c0097b7bc31f31ba4
$ python setup.py build develop

or if you are on macOS:

$ MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py build develop

The commit 3def12bdeaacd35c6f7b3b6c0097b7bc31f31ba4 indicates the Detectron2 source code compatible with the commit 9460e3806c3ef5208ba8e5b4099fcb75ef6f39d1 of the detectron2-pipeline repository. You can check out the latest version of both repositories, but I cannot guarantee that it will work as described here.

In case of any problems, please refer to the Detectron2 installation guide.

Project structure

├── assets
├── configs
├── environment.yml
├── output
├── pipeline
│   ├── libs
│   └── utils
├── pytest.ini
└── tests

As you can see, this is not yet another hello-world example project. I created it to gather some best practices for processing videos and images efficiently, and to make it easy to experiment with different models from the Detectron2 Model Zoo.

The structure of the project is based on the modular image processing pipeline described in my previous stories.

It was greatly extended with the following elements:

  • faster video reading using a separate thread, so the main thread of the application is not blocked by reading and decoding the frames.
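The threaded reading mentioned above follows a standard producer/consumer pattern. Below is a simplified sketch, not the project's actual code: `read_frame` is a stand-in for a capture call such as `cv2.VideoCapture.read`, and the queue size is an arbitrary assumption:

```python
import queue
import threading

class ThreadedFrameReader:
    """Read and decode frames on a background thread so the main
    processing loop never blocks on I/O."""

    def __init__(self, read_frame, maxsize=128):
        # read_frame() must return (ok, frame); ok=False signals end of stream.
        self._frames = queue.Queue(maxsize=maxsize)
        self._thread = threading.Thread(
            target=self._run, args=(read_frame,), daemon=True)
        self._thread.start()

    def _run(self, read_frame):
        while True:
            ok, frame = read_frame()
            self._frames.put((ok, frame))
            if not ok:  # propagate end-of-stream and stop the thread
                break

    def read(self):
        # Blocks only if the reader thread has not produced a frame yet.
        return self._frames.get()
```

The main loop then calls `read()` repeatedly until it receives `ok=False`, while decoding happens concurrently in the background.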

Part of the project content is boilerplate code.

“In computer programming, boilerplate code or just boilerplate are sections of code that have to be included in many places with little or no alteration. When using languages that are considered verbose, the programmer must write a lot of code to accomplish only minor functionality. Such code is called boilerplate.” — Wikipedia

As boilerplate code we can consider:

  • utils/: common utility scripts.

This is the part of the code that we have to include over and over in our computer vision projects.

The custom code is the code interacting with Detectron2 libraries:

  • the image annotation pipeline task in the pipeline directory.

All the project model configurations are stored in the configs directory and can be used with the --config-file option:

├── configs
│   ├── COCO-Detection
│   │   ├── faster_rcnn_R_50_FPN_3x.yaml
│   │   └── retinanet_R_50_FPN_3x.yaml
│   ├── COCO-InstanceSegmentation
│   │   └── mask_rcnn_R_50_FPN_3x.yaml
│   ├── COCO-Keypoints
│   │   └── keypoint_rcnn_R_50_FPN_3x.yaml
│   └── COCO-PanopticSegmentation
│       └── panoptic_fpn_R_50_3x.yaml

I encourage you to review the code, and if you have any questions, don’t hesitate to ask in the responses section of the story.

Image processing

There are two main scripts we can run:

  • one to process images,
  • one to process video.

Let’s start with the image processing script and look at the available options (run it with the -h flag to print them).

By default, the script will process images from the --input directory, perform instance segmentation, and save the results in the --output directory.

$ python -i assets/images/others -p
An instance segmentation example
Instance segmentation (Image by Free-Photos from Pixabay)

We can also try another model from the configs directory, such as keypoint estimation, on this particular image without touching the code, simply by changing the configuration:

$ python -i assets/images/others/couple.jpg -p --config-file configs/COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml
A keypoint estimation example
Keypoint estimation

The visualization is handled by the AnnotateImage task in the pipeline directory, where we use detectron2.utils.visualizer.Visualizer from Detectron2.

The pipeline runs separate processes for model execution on the GPU (if your machine has one) or CPU, working in parallel with the root process. You can increase the number of GPU or CPU workers with the --gpus or --cpus options, but don’t mix the two.

For implementation details of the multiprocess, asynchronous prediction, see the modules in pipeline/libs/ and pipeline/, where the code was partially ported from the Detectron2 demo code.
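The core of that asynchronous pattern can be sketched as follows. This is a simplified illustration, not the project's actual code (the real version also binds each worker to a specific GPU or CPU device); `AsyncPredictor` here is just a queue-fed pool of worker processes:

```python
import multiprocessing as mp

def _worker(predict, tasks, results):
    # Each worker process pulls frames until it sees the None sentinel.
    while True:
        item = tasks.get()
        if item is None:
            break
        idx, frame = item
        results.put((idx, predict(frame)))

class AsyncPredictor:
    """Sketch of a multiprocess asynchronous predictor: frames go into
    a task queue, worker processes run `predict`, and results come
    back tagged with their frame index."""

    def __init__(self, predict, num_workers=2):
        self._tasks = mp.Queue()
        self._results = mp.Queue()
        self._procs = [
            mp.Process(target=_worker, args=(predict, self._tasks, self._results))
            for _ in range(num_workers)
        ]
        for p in self._procs:
            p.start()

    def put(self, idx, frame):
        self._tasks.put((idx, frame))

    def get(self):
        # Results may arrive out of order; the caller reorders by index.
        return self._results.get()

    def shutdown(self):
        for _ in self._procs:
            self._tasks.put(None)  # one sentinel per worker
        for p in self._procs:
            p.join()
```

Because results can arrive out of order, the consumer side keeps a small reordering buffer keyed by frame index before displaying or writing frames.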

Including more GPUs will always speed up the pipeline execution, but the same is not true for CPUs. A GPU is dedicated to inference, whereas the CPU is also occupied by many other system processes and threads. In my experiments, more than two CPU workers didn’t help, but that could depend on the number of available CPU cores.

Let’s say you have a GPU available but are currently having problems with your CUDA installation and configuration, and you want to see some results immediately. Instead of struggling with the setup, you can force the pipeline to use the CPU only by providing the --gpus 0 option on the command line.

You can also run the processing in a single process using the --single-process option.

Video processing

You can try the same with video or a webcam, but be warned that realtime video processing with inference is computationally expensive and time-consuming, depending on GPU availability. Video stream processing may feel slow and sluggish.

Let’s look at the available options; run the video script with the -h flag to print them.

They are almost the same as for the image processing script. The instance segmentation model is used by default here as well.

If your computer is equipped with a webcam, you can test it by running this command:

$ python -i 0 -d -p

-i 0 indicates that your input is the default camera, -d displays the result window, and -p displays progress info.

Unless you are equipped with a decent GPU card (or two), realtime camera processing can lag a lot. In that case, I suggest saving your camera video and processing it as a file, as shown below.

To run predictions on a video file, you can trigger:

$ python -i assets/videos/walk.small.mp4 -p -d -ov walk.avi

The -ov walk.avi option saves the output to output/walk.avi. -d displays the window with the result, but if you want to process your video faster, just remove this option, since displaying video takes some CPU time. The progress will still be visible thanks to the -p option.

An instance segmentation on video
Instance segmentation (Video by sferrario1968 from Pixabay)

Let’s try panoptic segmentation on a video of road traffic:

$ python -i assets/videos/traffic.small.mp4 -p -d -ov traffic.avi --config-file configs/COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml
A panoptic segmentation on video
Panoptic segmentation (Video by MabelAmber from Pixabay)

Background separation

Now, armed with the Detectron2 model arsenal, we are limited only by our imagination in creating and testing unusual computer vision solutions.

I’ve prepared a simple example of background separation using the Detectron2 instance segmentation model. The background is separated by blurring it. This gives an effect similar to taking a picture with a large-aperture lens, which creates a shallow depth of field: the subject is in focus and the background is out of focus. Additionally, for a fancier picture, we can desaturate or grayscale the background.

Background separation algorithm

The algorithm is quite simple (see the source code in the pipeline directory). First, we perform instance segmentation using Detectron2 to extract object masks from the scene; then this information is passed to the SeparateBackground pipeline task, where we do some simple math on the input image and mask using OpenCV methods.

# foreground, background and mask are float arrays; the mask is 1.0 on
# the subject and 0.0 elsewhere, with soft values along the edges.
# Multiply the foreground with the mask
foreground = cv2.multiply(foreground, mask)
# Multiply the background with (1 - mask)
background = cv2.multiply(background, 1.0 - mask)
# Add the masked foreground and background
dst_image = cv2.add(foreground, background)
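The same blend can be written with plain NumPy, which makes the math explicit. This is a sketch under stated assumptions: float images in [0, 1], a single-channel mask broadcast over the color channels, and `blur` standing in for whatever background treatment you choose (Gaussian blur, desaturation, etc.):

```python
import numpy as np

def separate_background(image, mask, blur):
    """Keep the masked subject sharp and replace everything else
    with a processed (e.g. blurred or desaturated) background.

    image: float array in [0, 1], shape (H, W, 3)
    mask:  float array in [0, 1], shape (H, W, 1); 1.0 on the subject
    blur:  callable that returns the processed background image
    """
    foreground = image * mask                # subject pixels only
    background = blur(image) * (1.0 - mask)  # everything else, processed
    return foreground + background
```

A soft (non-binary) mask makes the transition between subject and background gradual instead of producing a hard cut-out edge.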

Run the pipeline with an image:

$ python -i assets/images/others/couple.jpg -sb

to get a picture like the one shown earlier, or you can run the same task on a video:

$ python -i assets/videos/walk.small.mp4 -sb -d
Background separation on a video file


Deep Learning in Computer Vision

Thanks to Sławomir Gilewski

Jarosław Gilewski

Written by

I’m a senior software engineer involved in software development for more than 20 years. Currently, I’m focused on computer vision and deep learning.
