YOLACT (Real-Time Instance Segmentation)

Anmol Dua
Aug 16, 2019 · 4 min read

(Important Points)


YOLACT breaks up the complex task of instance segmentation into two simpler, parallel tasks whose outputs are assembled to form the final masks.

First Task or Prototype Generation Branch

The first branch uses an FCN to produce a set of image-sized “prototype masks” that do not depend on any one instance.

The prototype generation branch (protonet) predicts a set of k prototype masks for the entire image. Protonet is implemented as an FCN whose last layer has k channels (one for each prototype) and is attached to a backbone feature layer. While this formulation is similar to standard semantic segmentation, it differs in that there is no explicit loss on the prototypes. Instead, all supervision comes from the final mask loss after assembly.

Finally, the authors find it important that the protonet’s output be unbounded, as this allows the network to produce large, overpowering activations on prototypes it is very confident about (e.g., obvious background).

The authors apply ReLU to the protonet output for more interpretable prototypes.

Second Task or Mask Coefficient Branch

The second branch adds an extra head to the object detection branch to predict a vector of “mask coefficients” for each anchor that encodes an instance’s representation in the prototype space.

Typical anchor-based object detectors have two branches in their prediction heads: one branch to predict c class confidences, and the other to predict 4 bounding box regressors.

For mask coefficient prediction, we simply add a third branch in parallel that predicts k mask coefficients, one corresponding to each prototype. Thus, instead of producing 4 + c coefficients per anchor, we produce 4 + c + k.
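To make the 4 + c + k layout concrete, here is a minimal NumPy sketch that splits a per-anchor prediction array into its three parts. The function name `split_head_output` and the ordering (box, class, mask coefficients) are assumptions made for illustration; an actual implementation would typically use separate convolutional layers for each branch rather than slicing one array.

```python
import numpy as np

def split_head_output(pred, num_classes, num_prototypes):
    """Split a (num_anchors, 4 + c + k) array into its three parts.

    The column order (box, class, mask coefficients) is an assumption
    made for this sketch, not something the paper prescribes.
    """
    c, k = num_classes, num_prototypes
    assert pred.shape[1] == 4 + c + k
    boxes = pred[:, :4]              # 4 bounding box regressors
    class_conf = pred[:, 4:4 + c]    # c class confidences
    mask_coeff = pred[:, 4 + c:]     # k mask coefficients, one per prototype
    return boxes, class_conf, mask_coeff

# Toy example: 5 anchors, c = 3 classes, k = 8 prototypes.
rng = np.random.default_rng(0)
pred = rng.normal(size=(5, 4 + 3 + 8))
boxes, class_conf, mask_coeff = split_head_output(pred, num_classes=3, num_prototypes=8)
print(boxes.shape, class_conf.shape, mask_coeff.shape)
```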

Last Step or Mask Assembly

For each instance that survives NMS, a mask is constructed by combining the work of the two branches: the prototype masks are combined linearly, with the mask coefficients as weights. A sigmoid nonlinearity is then applied to produce the final masks.

M = σ(P Cᵀ)

where P is an h × w × k matrix of prototype masks, C is an n × k matrix of mask coefficients for the n instances surviving NMS and score thresholding, and Cᵀ is the transpose of C.
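The assembly step is a single matrix product followed by a sigmoid. Below is a minimal NumPy sketch with random toy values. The helper names `sigmoid` and `assemble_masks` are hypothetical, and the shapes follow the definitions of P and C given above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def assemble_masks(prototypes, coeffs):
    """Linearly combine prototypes with per-instance coefficients.

    prototypes: (h, w, k) array P
    coeffs:     (n, k) array C
    returns:    (h, w, n) array of masks, one channel per instance
    """
    return sigmoid(prototypes @ coeffs.T)

# Toy example: 6x8 prototypes, k = 4 prototypes, n = 3 instances.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 8, 4))
C = rng.normal(size=(3, 4))
M = assemble_masks(P, C)
print(M.shape)
```

Each output pixel of instance i is the sigmoid of the dot product between that pixel's k prototype values and row i of C.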


Three losses are used to train the model:

  • classification loss
  • box regression loss
  • mask loss

The classification and box regression losses are calculated in the same way as in other object detectors. To compute the mask loss, we simply take the pixel-wise binary cross entropy between the assembled masks M and the ground truth masks M_gt:

L_mask = BCE(M, M_gt)
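As a rough illustration of the mask loss, the sketch below computes a pixel-wise binary cross entropy in NumPy. The function name `mask_bce_loss`, the clipping constant `eps`, and the choice to average over all pixels are assumptions made for this example; the notes above do not specify how the loss is reduced or normalized.

```python
import numpy as np

def mask_bce_loss(masks, gt_masks, eps=1e-7):
    """Pixel-wise binary cross entropy, averaged over all pixels.

    masks:    predicted masks with values in (0, 1), i.e. after the sigmoid
    gt_masks: ground truth masks with values 0 or 1, same shape
    """
    masks = np.clip(masks, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(gt_masks * np.log(masks) + (1.0 - gt_masks) * np.log(1.0 - masks))
    return per_pixel.mean()  # mean reduction is an assumption of this sketch

# Toy 2x2 mask that mostly agrees with its ground truth.
M = np.array([[0.9, 0.1], [0.8, 0.2]])
M_gt = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = mask_bce_loss(M, M_gt)
print(loss)
```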


The general consensus around instance segmentation is that because FCNs are translation invariant, the task needs translation variance added back in.

Thus methods like FCIS and Mask R-CNN try to add translation variance explicitly, whether by directional maps and position-sensitive repooling (FCIS) or by putting the mask branch in the second stage (Mask R-CNN) so that it does not have to deal with localizing instances.

In YOLACT, the only translation variance added is to crop the final mask with the predicted bounding box. However, their method also works without cropping for medium and large objects, so this is not a result of cropping.
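The cropping step can be sketched in a few lines of NumPy: everything outside the predicted box is set to zero. The function name `crop_mask` and the box convention (integer pixel coordinates `(x1, y1, x2, y2)` with exclusive upper bounds) are assumptions made for this example.

```python
import numpy as np

def crop_mask(mask, box):
    """Zero out every pixel of `mask` that lies outside `box`.

    box = (x1, y1, x2, y2) in integer pixel coordinates, with x2 and y2
    exclusive. This convention is an assumption of the sketch.
    """
    x1, y1, x2, y2 = box
    cropped = np.zeros_like(mask)
    cropped[y1:y2, x1:x2] = mask[y1:y2, x1:x2]
    return cropped

# Toy example: a 5x6 mask that is active everywhere, cropped to a 3x3 box.
mask = np.ones((5, 6))
cropped = crop_mask(mask, (1, 2, 4, 5))
print(cropped)
```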

YOLACT learns how to localize instances on its own via different activations in its prototypes.

  • Prototype activations for the solid red image (image a) in Figure 5 are actually not possible in an FCN without padding.
  • The consistent rim of padding in modern FCNs like ResNet gives the network the ability to tell how far away from the image’s edge a pixel is. This means ResNet, for instance, is inherently translation variant, and YOLACT makes heavy use of that property (images b and c exhibit clear translation variance).
  • Many prototypes are observed to activate on certain “partitions” of the image. That is, they only activate on objects that are on one side of an implicitly learned boundary. In Figure 5, prototypes 1, 4, 5, and 6 are such examples (with 6 partitioning the background and not the foreground).
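The effect of padding described in the bullets above can be reproduced with a toy example. The sketch below applies a single 3×3 all-ones filter with zero padding to a constant image: even though every input pixel is identical, the response differs between corners, edges, and the interior, so the output carries information about distance to the border. The function `filter2d_zero_pad` is a hypothetical helper written for this illustration and is not taken from any YOLACT code.

```python
import numpy as np

def filter2d_zero_pad(image, kernel):
    """Slide `kernel` over `image` with zero padding so the output keeps the input size."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))  # zeros by default
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# A constant ("solid color") image and an all-ones 3x3 filter.
image = np.ones((5, 5))
kernel = np.ones((3, 3))
response = filter2d_zero_pad(image, kernel)
print(response)
```

Without the padding, every valid window of a constant image would see exactly the same values, so a single layer like this could not produce a spatially varying response.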

By combining these partition maps, the network can distinguish between different (even overlapping) instances of the same semantic class.

When protonet combines the functionality of multiple prototypes into one, the mask coefficient branch can learn which situations call for which functionality.

For an implementation of YOLACT, thanks to dbolya for this repository:
