Cutout: Dropout in input space, with an Albumentations implementation

Prakash Jay
3 min read · Jun 24, 2023

Cutout randomly removes multiple small patches from an image, as opposed to the single patch removed in context autoencoders, so the features associated with those patches are absent throughout the network. Dropout, in contrast, explicitly removes features at a particular layer. This simple yet effective technique has been shown to improve accuracy and speed up convergence compared to models that do not use it.

A similar technique is used in Masked Autoencoders. We will see how to implement this efficiently using the einops library.

Algorithm

  1. Create a grid of image patches.
  2. Drop x% of the patches and fill them with zeros.
  3. Recreate the image from the patches.

To implement this technique, we first decide on the patch size and the percentage of the image to be masked out as patches. We will integrate this functionality into the albumentations library so that it handles both images and masks. Extending the approach to bounding boxes is possible but requires additional effort; we discuss that at the end.

Step-1 — Grid the image.

To create a grid of image patches, we first rearrange the image into shape (total_patches, pixel_data). This turns the data from 3D into 2D, so dropping some rows becomes simple.

The einops rearrange does this efficiently.

import numpy as np 
from PIL import Image
from einops import rearrange

x = Image.open("cow.png").resize((320, 224)).convert("RGB")
img = np.asarray(x) #(224, 320, 3)
rimg = rearrange(img, '(h p1) (w p2) c -> (h w) (p1 p2 c)', p1=8, p2=8)
rimg.shape # (1120, 192)

Step-2 — Select rows to mask.

rimg[np.random.randint(1120, size=int(0.4*1120)), :] = 0  # zero out ~40% of the 1120 patches
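
Note that np.random.randint can draw the same index more than once, so slightly fewer than 40% of the patches may actually be zeroed. If you want exactly 40%, one option (a small sketch, not from the original post) is np.random.choice without replacement:

drop_idx = np.random.choice(1120, size=int(0.4*1120), replace=False)  # unique patch indices
rimg[drop_idx, :] = 0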

Step-3 — Recreate the image.

timg = rearrange(rimg, '(h w) (p1 p2 c) -> (h p1) (w p2) c', p1=8, p2=8, h=224//8, w=320//8)
timg.shape #(224, 320, 3)
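
To eyeball the result, the array can be converted back to a PIL image (assuming timg is still uint8; the output file name is just an example):

Image.fromarray(timg).save("cow_cutout.png")  # dropped patches show up as black squares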

Step-4 — Implementing with Albumentations.

import numpy as np
from einops import rearrange
from albumentations.core.transforms_interface import DualTransform


class RandomCutout(DualTransform):
    """Randomly zero out patches of both the image and its mask."""

    def __init__(self, always_apply=False, p=0.1, patch_size=(8, 8), mask_perc=0.1, renormalize=False):
        super().__init__(always_apply=always_apply, p=p)
        self.patch_size = patch_size
        self.mask_perc = mask_perc
        self.renormalize = renormalize

    def apply(self, image, mask_tokens=None, **params):
        if self.renormalize:
            image = 2 * image - 1
        return self._apply(image, mask_tokens)

    def apply_to_mask(self, mask, mask_tokens=None, **params):
        return self._apply(mask, mask_tokens)

    def _apply(self, img, mask_tokens):
        h, w = img.shape[:2]
        was_2d = img.ndim == 2  # masks come in as (H, W)
        if was_2d:
            img = np.expand_dims(img, 2)
        rimg = rearrange(img, '(h p1) (w p2) c -> (h w) (p1 p2 c)',
                         p1=self.patch_size[0], p2=self.patch_size[1])
        rimg[mask_tokens, :] = 0
        timg = rearrange(rimg, '(h w) (p1 p2 c) -> (h p1) (w p2) c',
                         p1=self.patch_size[0], p2=self.patch_size[1],
                         h=h // self.patch_size[0], w=w // self.patch_size[1])
        if was_2d:
            timg = np.squeeze(timg, 2)
        return timg

    @property
    def targets_as_params(self):
        return ["image"]

    def get_params_dependent_on_targets(self, params):
        # sample the patch indices once so image and mask get the same cutout
        height, width = params["image"].shape[:2]
        tokens = (height // self.patch_size[0]) * (width // self.patch_size[1])
        mask_tokens = np.random.randint(tokens, size=int(self.mask_perc * tokens))
        return dict(mask_tokens=mask_tokens)

    def get_transform_init_args_names(self):
        return ("patch_size", "mask_perc", "renormalize")
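
A minimal usage sketch, assuming an albumentations version that still supports targets_as_params / get_params_dependent_on_targets, and an image whose height and width are divisible by the patch size (the probability and variable names below are placeholders):

import albumentations as A

transform = A.Compose([
    RandomCutout(p=0.5, patch_size=(8, 8), mask_perc=0.4),
])

# img: (H, W, 3) uint8 image, seg_mask: (H, W) integer mask
out = transform(image=img, mask=seg_mask)
aug_img, aug_mask = out["image"], out["mask"]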

According to the paper, Cutout performs better when the image is normalized to the range (-1, 1) rather than (0, 1). So, if the input image pixel values fall within (0, 1), it is recommended to set the parameter “renormalize” to True to keep the normalization consistent.

I have tested this code and it usually takes 1–10 ms depending on the image size (224–512). Comment below if there is a faster alternative implementation.

Bounding boxes

In the case of bounding boxes, the output will remain the same as the input in most scenarios, since the dropped patches usually occlude only a small part of each object. However, if the patch size is comparable to the size of very small objects, it is advisable to calculate the Intersection over Union (IoU) between the dropped patches and the ground-truth bounding boxes. By setting a threshold intuitively based on the data, you can remove ground-truth boxes whose IoU exceeds that threshold, as in the sketch below.
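
A rough sketch of that idea (the helper names and the 0.5 threshold are illustrative, not part of the library): compute the IoU of every dropped patch against every ground-truth box and keep only the boxes whose maximum overlap stays below the threshold.

def box_iou(a, b):
    # a, b: (x_min, y_min, x_max, y_max) in pixels
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_boxes(boxes, dropped_patches, iou_thresh=0.5):
    # boxes, dropped_patches: lists of (x_min, y_min, x_max, y_max)
    kept = []
    for box in boxes:
        if max((box_iou(box, p) for p in dropped_patches), default=0) < iou_thresh:
            kept.append(box)
    return kept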

Upvote if this works for you. Thanks
