Review — Unsupervised Visual Representation Learning by Context Prediction (Self-Supervised)

Self-Supervised Learning: Context Prediction Without Using Ground-Truth Labels

Sik-Ho Tsang
Nerd For Tech
Aug 9, 2021


The task for learning patch representations involves randomly sampling a patch (blue) and then one of eight possible neighbors (red)

In this story, Unsupervised Visual Representation Learning by Context Prediction (ContextPrediction), by Carnegie Mellon University and University of California, is reviewed.

In standard supervised learning, networks are trained with ground-truth labels. However, annotating labels is often expensive and time-consuming, especially when the dataset is large, e.g. ImageNet.

In this paper:

  • The feature representation learned using the within-image context indeed captures visual similarity across images.
  • For example, in the figure above, can you guess the spatial configuration of the two pairs of patches? Note that the task is much easier once you have recognized the object!
  • This is a kind of self-supervised learning.

This is a paper in 2015 ICCV with over 1300 citations.


  1. Motivations & Conceptual Ideas
  2. Learning Visual Context Prediction
  3. Implementation Details
  4. Experimental Results

1. Motivations & Conceptual Ideas

  • Efforts to scale supervised methods to Internet-scale datasets (i.e. hundreds of billions of images) are hampered by the sheer expense of the human annotation required.
  • A natural way to address this difficulty would be to employ unsupervised learning. Unfortunately, unsupervised methods have not yet been shown to extract useful information from large collections of full-sized, real images.

How can one write an objective function to encourage a representation to capture, for example, objects, if none of the objects are labeled?

  • In the text domain, given a large text corpus, the idea is to train a model that maps each word to a feature vector, such that it is easy to predict the words in the context (i.e., a few words before and/or after) given the vector.
  • This converts an apparently unsupervised problem (finding a good similarity metric between words) into a “self-supervised” one: learning a function from a given word to the words surrounding it.
  • This paper aims to provide a similar “self-supervised” formulation for image data.
The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled

Random pairs of patches are sampled in one of eight spatial configurations. The algorithm must then guess the position of one patch relative to the other.

The underlying hypothesis is that doing well on this task requires understanding scenes and objects, i.e. a good visual representation.

2. Learning Visual Context Prediction

2.1. A Pair of AlexNet-Like Networks

A Pair of AlexNet-Like Networks for Pair Classification
  • A convolutional neural network (CNN) is used to learn an image representation for the pretext task, i.e., predicting the relative position of patches within an image.
  • The network must feed the two input patches through several convolution layers, and produce an output that assigns a probability to each of the eight spatial configurations.

Note, however, that the ultimate goal is to learn a feature embedding for individual patches, such that patches which are visually similar (across different images) are close in the embedding space.

  • A pair of AlexNet-style architectures is used to process each patch separately, until a depth analogous to fc6 in AlexNet, after which point the representations are fused.
  • Weights are tied/shared (Convs with dashed line) between both sides of the network, such that the same fc6-level embedding function is computed for both patches.
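
The weight sharing and late fusion described above can be illustrated with a toy sketch. This is not the authors' network: the conv stack up to fc6 is replaced by a single hypothetical linear layer with ReLU (`embed`), and all sizes and weights are made up. The point is only that one set of weights embeds both patches before the fused layers score the eight configurations.

```python
import math
import random

NUM_CONFIGS = 8  # eight possible relative positions


def embed(patch, shared_weights):
    """Toy stand-in for the conv stack up to fc6: linear map followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, patch)))
            for row in shared_weights]


def pair_probabilities(patch_a, patch_b, shared_weights, fusion_weights):
    """Embed both patches with the SAME weights, concatenate, then classify."""
    fused = embed(patch_a, shared_weights) + embed(patch_b, shared_weights)
    logits = [sum(w * f for w, f in zip(row, fused)) for row in fusion_weights]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
patch_dim, embed_dim = 12, 4
shared_weights = [[rng.gauss(0, 1) for _ in range(patch_dim)]
                  for _ in range(embed_dim)]
fusion_weights = [[rng.gauss(0, 1) for _ in range(2 * embed_dim)]
                  for _ in range(NUM_CONFIGS)]
patch_a = [rng.random() for _ in range(patch_dim)]
patch_b = [rng.random() for _ in range(patch_dim)]
probs = pair_probabilities(patch_a, patch_b, shared_weights, fusion_weights)
```

Because `embed` is the only function that sees a single patch, it is the part that would be reused as the patch representation after training.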

2.2. Training Samples

  • To obtain training examples given an image, the first patch is sampled uniformly, without any reference to image content.
  • Given the position of the first patch, the second patch is sampled randomly from the eight possible neighboring locations.
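
A minimal sketch of this sampling step, working in grid cells rather than pixels. The label ordering in `NEIGHBOR_OFFSETS` and the rule of only choosing neighbors that stay inside the grid are assumptions made for this example, not details from the paper.

```python
import random

# (row offset, column offset) of the second patch relative to the first,
# in grid units. The list index is the class label (0..7); this ordering
# is an assumption of the sketch.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def sample_pair(grid_rows, grid_cols, rng):
    """Pick a first cell uniformly, then one of its in-bounds neighbors.

    Requires a grid of at least 2x2 cells so a neighbor always exists.
    """
    row = rng.randrange(grid_rows)
    col = rng.randrange(grid_cols)
    candidates = [(label, (row + dr, col + dc))
                  for label, (dr, dc) in enumerate(NEIGHBOR_OFFSETS)
                  if 0 <= row + dr < grid_rows and 0 <= col + dc < grid_cols]
    label, second = rng.choice(candidates)
    return (row, col), second, label


rng = random.Random(0)
first_cell, second_cell, label = sample_pair(4, 5, rng)
```

The returned `label` is the target of the eight-way classification; no human annotation is involved.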

2.3. Avoiding “Trivial” Solutions

2.3.1. Low-Level Cues

  • Care must be taken to ensure that the task forces the network to extract the desired information (high-level semantics), without taking “trivial” shortcuts.
  • Low-level cues, such as boundary patterns or textures continuing between patches, can be exploited as shortcuts.

It was important to include a gap between patches (approximately half the patch width). Each patch location is randomly jittered by up to 7 pixels.
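
To see what the gap and jitter do to the geometry, here is a small sketch. It uses the 96-pixel patch size, 48-pixel gap, and ±7-pixel jitter listed under Implementation Details, and assumes the two patches are jittered independently. The function names are hypothetical.

```python
import random

PATCH = 96   # patch side in pixels
GAP = 48     # nominal gap between neighboring patches (half the patch width)
JITTER = 7   # maximum jitter per patch along each axis, in pixels


def jittered_pair(first_xy, offset, rng):
    """Return the top-left corners of both patches after independent jitter.

    `offset` is (dx, dy) with each component in {-1, 0, 1}.
    """
    stride = PATCH + GAP

    def jitter():
        return rng.randint(-JITTER, JITTER)

    x, y = first_xy
    first = (x + jitter(), y + jitter())
    second = (x + offset[0] * stride + jitter(),
              y + offset[1] * stride + jitter())
    return first, second


def horizontal_gap(first, second):
    """Pixels of empty space between two patches placed side by side."""
    left, right = sorted([first[0], second[0]])
    return right - (left + PATCH)


rng = random.Random(0)
first, second = jittered_pair((200, 200), (1, 0), rng)
gap = horizontal_gap(first, second)
```

Under these assumptions the actual gap varies between 34 and 62 pixels, so the network cannot rely on a fixed offset or on edges that continue directly from one patch into the other.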

2.3.2. Chromatic Aberration

  • Another problem is chromatic aberration.
  • It arises from differences in the way the lens focuses light at different wavelengths. In some cameras, one color channel (commonly green) is shrunk toward the image center relative to the others.
  • The network can learn a trivial solution by detecting the separation between green and magenta (red + blue), since that separation reveals where a patch lies relative to the image center. This is a kind of shortcut that we don’t want the network to learn.
  • To deal with this problem, two types of pre-processing are tried.
  • One is to shift green and magenta toward gray (‘projection’).
  • Specifically, let a = [−1, 2,−1] (the ‘green-magenta color axis’ in RGB space).
  • B is a matrix that subtracts the projection of a color onto the green-magenta color axis. Every pixel value is multiplied by B.
  • An alternative approach is to randomly drop 2 of the 3 color channels from each patch (‘color dropping’), replacing the dropped colors with Gaussian noise (standard deviation 1/100 the standard deviation of the remaining channel).
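
Both pre-processing options can be sketched in a few lines. For the projection, a matrix that removes the component along a (treated as a column vector) can be written as B = I − a aᵀ / (aᵀ a), which matches the description above. For color dropping, the sketch assumes mean-subtracted patches, so the noise is centered at zero, and it represents a patch as a flat list of `[r, g, b]` pixels. These representations and function names are choices made for this example.

```python
import random
import statistics

A = [-1.0, 2.0, -1.0]  # green-magenta color axis in RGB space


def projection_matrix(axis):
    """B = I - a a^T / (a^T a): removes the component along `axis`."""
    norm_sq = sum(v * v for v in axis)
    return [[(1.0 if i == j else 0.0) - axis[i] * axis[j] / norm_sq
             for j in range(3)] for i in range(3)]


def apply_matrix(matrix, pixel):
    """Multiply a 3x3 matrix by an RGB pixel."""
    return [sum(m * p for m, p in zip(row, pixel)) for row in matrix]


def drop_colors(patch, rng):
    """Keep one random channel; replace the other two with Gaussian noise.

    The noise standard deviation is 1/100 of the kept channel's standard
    deviation. Zero-mean noise assumes the patch is already mean-subtracted.
    """
    keep = rng.randrange(3)
    kept_values = [pixel[keep] for pixel in patch]
    noise_std = statistics.pstdev(kept_values) / 100.0
    return [[pixel[c] if c == keep else rng.gauss(0.0, noise_std)
             for c in range(3)] for pixel in patch]


B = projection_matrix(A)
gray = apply_matrix(B, [0.5, 0.5, 0.5])      # gray has no green-magenta component
greenish = apply_matrix(B, [0.2, 0.9, 0.2])  # green excess is removed

rng = random.Random(0)
patch = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1], [-0.3, 0.2, 0.2], [0.0, 0.1, -0.4]]
dropped = drop_colors(patch, rng)
```

After projection, every pixel has zero component along the green-magenta axis, so the relative shift between those channels is no longer visible to the network.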

3. Implementation Details

  • ImageNet 2012 training set (1.3M images), but discarding labels.
  • Each image is resized to between 150K and 450K total pixels, preserving the aspect-ratio.
  • Patches are sampled at resolution 96×96, in a grid-like pattern.
  • A gap of 48 pixels is left between the sampled patches in the grid, and the location of each patch is also jittered by −7 to 7 pixels in each direction.
  • Patches are preprocessed by (1) mean subtraction, (2) projecting or dropping colors, and (3) randomly downsampling some patches to as little as 100 total pixels and then upsampling them, to build robustness to pixelation.
  • Batch normalization is used for those conv layers without using LRN. (According to this Github.)
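
The resizing rule in the list above is stated in total pixels rather than in width or height, so a short sketch of the arithmetic may help. Scaling both sides by the square root of the area ratio preserves the aspect ratio. Drawing the target area uniformly between the two bounds is an assumption of this example, and the function names are hypothetical.

```python
import math
import random

MIN_PIXELS = 150_000
MAX_PIXELS = 450_000


def resized_shape(width, height, target_pixels):
    """Scale (width, height) so that the area is about `target_pixels`."""
    scale = math.sqrt(target_pixels / (width * height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def random_resized_shape(width, height, rng):
    """Pick a target area in [MIN_PIXELS, MAX_PIXELS] and resize to it."""
    # Uniform sampling of the target area is an assumption of this sketch.
    target = rng.uniform(MIN_PIXELS, MAX_PIXELS)
    return resized_shape(width, height, target)


new_w, new_h = resized_shape(4000, 3000, 300_000)
rng = random.Random(0)
rand_w, rand_h = random_resized_shape(4000, 3000, rng)
```

Because of rounding, the resulting area only approximates the target, which is usually acceptable for this kind of augmentation.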

4. Experimental Results

  • The trained network is applied in two domains.
  • First, pre-training for a standard vision task with only limited training data: specifically, VOC 2007 object detection.
  • Second, visual data mining, where the goal is to start with an unlabeled image collection and discover object classes.
  • Finally, the performance is analyzed on the layout prediction “pretext task” to see how much is left to learn from this supervisory signal.

4.1. Nearest Neighbors

Examples of patch clusters obtained by nearest neighbors (fc6 features from a random initialization of our architecture, AlexNet fc7 after training on labeled ImageNet, and the fc6 features learned from the proposed method)
  • The goal is to understand how similar the patches are.
  • fc7 and higher layers are removed. Nearest neighbor (NN) search is performed on the fc6 features.
  • Results for some patches (selected out of 1000 random queries) are shown above.
  • The proposed context prediction algorithm performs well.
  • The AlexNet in the middle outperforms the proposed context prediction approach, because this AlexNet is trained with supervision.

The patches grouped by NN show that the proposed context prediction approach is able to learn an image representation without labels.
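
A nearest-neighbor lookup over feature vectors can be sketched as follows. The use of cosine similarity as the metric, the tiny three-dimensional vectors, and the function names are assumptions made for illustration; real fc6 features are far larger. Feature vectors are assumed to be nonzero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, features, k):
    """Indices of the k feature vectors most similar to `query`."""
    ranked = sorted(range(len(features)),
                    key=lambda i: cosine_similarity(query, features[i]),
                    reverse=True)
    return ranked[:k]


# Hypothetical features: two "similar" pairs pointing in different directions.
features = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.0, 1.0, 0.0], [0.1, 0.9, 0.1]]
neighbors = nearest_neighbors([1.0, 0.05, 0.25], features, k=2)
```

With a good representation, the patches returned for a query should depict visually similar content even when they come from different images.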

4.2. Adopting into R-CNN for Object Detection

Object Detection Network
  • The R-CNN pipeline is used.
  • However, a 227×227 input is used rather than 96×96, so the network needs to be modified.
  • With this input size, pool5 is 7×7 spatially.
  • A new “conv6” layer is created by converting the fc6 layer into a convolution layer.
  • conv6 layer has 4096 channels, where each unit connects to a 3×3 region of pool5.
  • Another layer after conv6 (called conv6b), using a 1×1 kernel, is added to reduce the dimensionality to 1024 channels.
  • The outputs are fed through a pooling layer to a fully connected layer (fc7), which in turn connects to a final fc8 layer that feeds into the softmax.
  • conv6b, fc7, and fc8 start with random weights.
  • fc7 is used as the final representation.
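
The shape bookkeeping for the converted layers can be checked with a few lines of arithmetic. The sketch assumes stride 1 and no padding for conv6 and conv6b, which the text above does not state explicitly, and uses the standard output-size formula for convolutions.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (floor division convention)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


POOL5_SIZE = 7          # pool5 is 7x7 for a 227x227 input
CONV6_KERNEL = 3        # each conv6 unit sees a 3x3 region of pool5
CONV6_CHANNELS = 4096
CONV6B_CHANNELS = 1024

# Stride 1 and zero padding are assumptions of this sketch.
conv6_size = conv_output_size(POOL5_SIZE, CONV6_KERNEL)
conv6b_size = conv_output_size(conv6_size, 1)  # a 1x1 kernel keeps the size

conv6_activations = conv6_size * conv6_size * CONV6_CHANNELS
conv6b_activations = conv6b_size * conv6b_size * CONV6B_CHANNELS
```

Under these assumptions conv6 produces a 5×5 map, and conv6b cuts the number of activations passed toward fc7 by a factor of four.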
AP and mAP Results (%) on PASCAL VOC-2007
  • Scratch-Ours: The architecture trained from scratch (random initialization) performs slightly worse than AlexNet trained from scratch (Scratch-R-CNN).

Ours-projection/Ours-color-dropping: Pre-training boosts the from-scratch number by 6% mAP, and outperforms an AlexNet-style model trained from scratch on Pascal by over 5%.

Only 8% behind ImageNet-R-CNN, i.e. R-CNN pre-trained with ImageNet labels. This is the best result (at that moment) on VOC 2007 without using labels outside the dataset.

  • Ours-Yahoo100m: A randomly-selected 2M subset of the Yahoo/Flickr 100-million Dataset [51] is used, which was collected entirely automatically. The performance after fine-tuning is slightly worse than Ours-projection/Ours-color-dropping, but there is still a considerable boost over the from-scratch model.
  • Ours-VGG: VGGNet is also tried, and its mAP is close to that of ImageNet-R-CNN.

4.3. Visual Data Mining

Object clusters discovered
  • Visual data mining, or unsupervised object discovery, aims to use a large image collection to discover image fragments which happen to depict the same semantic objects.
  • First, sample a constellation of four adjacent patches from an image.
  • Then, find the top 100 images which have the strongest matches for all four patches, ignoring spatial layout.
  • Images where the four matches are not geometrically consistent are filtered out.
  • The number beside each cluster indicates its ranking, determined by the fraction of the top matches that are geometrically verified.
  • Some of the resulting patch clusters are shown above.
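
The ranking rule in the list above is simple enough to sketch. The cluster names and verification flags below are made up for illustration, and the geometric verification itself is treated as already done.

```python
def verified_fraction(verified_flags):
    """Fraction of a cluster's top matches that passed geometric verification."""
    return sum(verified_flags) / len(verified_flags)


def rank_clusters(clusters):
    """Sort cluster ids by verified fraction, best first."""
    return sorted(clusters,
                  key=lambda cid: verified_fraction(clusters[cid]),
                  reverse=True)


# Hypothetical clusters: one boolean per top match (True = verified).
clusters = {
    "seed_a": [True, True, False, True],
    "seed_b": [True, False, False, False],
    "seed_c": [True, True, True, True],
}
ranking = rank_clusters(clusters)
```

Clusters whose matches mostly agree on spatial layout rise to the top, which is why highly ranked clusters tend to show a consistent object or scene element.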
Clusters discovered from the Paris Street View dataset
  • The proposed representation captures scene layout and architectural elements.

Pre-training with the proposed Context Prediction outperforms random initialization, and unlike supervised pre-training, it does not need ground-truth labels.


