Published in Nerd For Tech

Review — Unsupervised Visual Representation Learning by Context Prediction (Self-Supervised)

Self-Supervised Learning: Context Prediction Without Using Ground-Truth Labels

The task for learning patch representations involves randomly sampling a patch (blue) and then one of eight possible neighbors (red)


1. Motivations & Conceptual Ideas

The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled
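
The sampling step described in the two captions above can be sketched in a few lines. This is only a minimal illustration, not the authors' code: the function name `sample_patch_pair`, the label ordering in `NEIGHBOR_OFFSETS`, and the default patch size and gap are assumptions chosen for the example. It returns patch coordinates rather than pixels so that it stays independent of any image library.

```python
import random

# (row, col) offsets of the eight neighbors in a 3x3 grid.
# The index in this list is the class label (0..7); the ordering is an assumption.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]

def sample_patch_pair(height, width, patch=96, gap=48, rng=random):
    """Return (center_box, neighbor_box, label); each box is (top, left, size).

    `patch` and `gap` are illustrative defaults, not values stated in this review.
    """
    step = patch + gap  # distance between top-left corners of adjacent grid cells
    if height < 2 * step + patch or width < 2 * step + patch:
        raise ValueError("image too small for a 3x3 grid of patches")
    # Place the center patch so that all eight neighbors fit inside the image.
    top = rng.randint(step, height - step - patch)
    left = rng.randint(step, width - step - patch)
    # Pick one of the eight spatial configurations uniformly at random.
    label = rng.randrange(8)
    dr, dc = NEIGHBOR_OFFSETS[label]
    neighbor = (top + dr * step, left + dc * step, patch)
    return (top, left, patch), neighbor, label
```

A classifier would then be given the two cropped patches and trained to predict `label`, so the supervision comes from the image layout itself and no manual annotation is needed.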

2. Learning Visual Context Prediction

2.1. A Pair of AlexNet-Like Networks

A Pair of AlexNet-Like Networks for Pair Classification

2.2. Training Samples

2.3. Avoiding “Trivial” Solutions

2.3.1. Low-Level Cues

2.3.2. Chromatic Aberration

3. Implementation Details

4. Experimental Results

4.1. Nearest Neighbors

Examples of patch clusters obtained by nearest neighbors (fc6 features from a random initialization of our architecture, AlexNet fc7 after training on labeled ImageNet, and the fc6 features learned from the proposed method)

4.2. Adopting into R-CNN for Object Detection

Object Detection Network
AP and mAP Results (%) on PASCAL VOC-2007

4.3. Visual Data Mining

Object clusters discovered
Clusters discovered from the Paris Street View dataset



