Paper Review 1 — Reading Digits in Natural Images with Unsupervised Feature Learning

Srikar Appalaraju
3 min read · Apr 10, 2018


This paper, published at NIPS 2011, is a rather old paper by now, but it had some interesting ideas at the time. It describes how automated feature learning with auto-encoders led to vastly improved performance on identifying street numbers in real-world images. Performance was compared against traditional HOG and hand-crafted feature-engineering methods.

Nowadays, CV papers hardly ever use custom hand-crafted features for image data. Using some deep-learning method to obtain a representative feature vector for an image (to then be used in a classifier) has become established practice. It goes to show the advances made in the CV field.

Data

Coming back to this paper, another important contribution is the generation of the SVHN (Street View House Numbers) dataset. The data consists of a little over 73K images for training and 26K images for testing. It also has a larger "extra" set of 531K images.
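(As an aside, anyone wanting to poke at the data today can load all three splits through a standard loader. The snippet below uses torchvision as a modern convenience; it is not part of the paper's own tooling.)

```python
# Minimal sketch: loading the three SVHN splits with torchvision
# (a modern convenience, not something the original paper used).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

train = datasets.SVHN(root="./svhn", split="train", transform=to_tensor, download=True)
test  = datasets.SVHN(root="./svhn", split="test",  transform=to_tensor, download=True)
extra = datasets.SVHN(root="./svhn", split="extra", transform=to_tensor, download=True)

print(len(train), len(test), len(extra))  # roughly 73K, 26K, 531K cropped 32x32 digits
```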

Samples from the SVHN dataset. Notice the large variation in font, color, lighting conditions, etc. Blue boxes are the AMT-worker-marked bounding boxes of the individual characters.

This data was generated from Google Street View images using the following process:

1. A large set of images was randomly sampled from urban areas in various countries.
2. Using a sliding-window house-number detector, the images containing house numbers were selected. To limit bias from the detector, thresholds were chosen so that more false-positive images were passed along to Amazon Mechanical Turk (AMT) workers.
3. AMT workers then manually labelled the images.

This produced the train and test datasets. The extra images were chosen with high precision but low recall, which presumably favors images with "easy" to detect house numbers.
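To make the thresholding idea in step 2 concrete, here is a toy illustration; the `candidate_windows` helper and the scores are made up for this sketch, not the paper's actual detector.

```python
# Toy illustration (not the paper's actual detector): choose a low confidence
# threshold so the sliding-window detector over-fires, trading precision for
# recall; the resulting false positives are filtered out later by AMT workers.
def candidate_windows(detections, threshold=0.2):
    """detections: list of (window, score) pairs from a hypothetical detector."""
    return [win for win, score in detections if score >= threshold]

detections = [("win_a", 0.95), ("win_b", 0.40), ("win_c", 0.15)]
# The low threshold keeps win_b (a possible false positive) for human review.
print(candidate_windows(detections))  # ['win_a', 'win_b']
```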

Methodology

The whole pipeline is broken down into 1) a detection stage, i.e. locating individual house numbers in an image, and 2) a recognition stage, i.e. classifying the digits within each detected region. This paper shows improvements in step 2, the recognition stage.
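In code, the two stages compose roughly as below; `detect_house_numbers` and `classify_digit` are hypothetical stand-ins, not functions from the paper.

```python
# Schematic two-stage pipeline (detector and classifier are hypothetical stand-ins).
def read_house_number(image, detect_house_numbers, classify_digit):
    digits = []
    for crop in detect_house_numbers(image):   # stage 1: locate digit regions
        digits.append(classify_digit(crop))    # stage 2: recognize each cropped digit
    return "".join(str(d) for d in digits)
```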

As baselines, the paper uses HOG features and hand-crafted binary features; these are compared against features learned automatically with unsupervised methods, namely stacked sparse auto-encoders and K-means. This was before CNNs were mainstream and end-to-end back-propagation was standard practice. The sparse auto-encoders were trained layer-wise and then stacked; the decoder was discarded, and the encoder was used as a nonlinear function that maps an input image to a K-dimensional feature vector. That feature vector was then fed to a linear SVM for digit classification.
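As a rough, single-layer sketch of that recipe (the paper stacks several greedily trained layers and adds a sparsity penalty that I omit here), the idea looks something like the following; the architecture sizes and placeholder data are my own illustrative choices.

```python
# Sketch of "train an auto-encoder, keep the encoder, feed a linear SVM".
# Sizes, training details, and data are illustrative, not the paper's.
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

class AutoEncoder(nn.Module):
    def __init__(self, d_in=32 * 32, k=500):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, k), nn.Sigmoid())
        self.decoder = nn.Linear(k, d_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_autoencoder(ae, x, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(ae(x), x)   # reconstruction objective (sparsity penalty omitted)
        loss.backward()
        opt.step()
    return ae

# x_train: (n, 1024) flattened grayscale digit crops; y_train: digit labels 0-9.
x_train = torch.rand(256, 32 * 32)                 # placeholder data
y_train = torch.randint(0, 10, (256,)).numpy()     # placeholder labels

ae = train_autoencoder(AutoEncoder(), x_train)
features = ae.encoder(x_train).detach().numpy()    # decoder discarded; encoder = feature map
svm = LinearSVC().fit(features, y_train)           # linear classifier on learned features
```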

The other unsupervised feature extractor was K-means based, where a bank of K linear filters is learnt (the filters being the K centroids).
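A rough sketch of that flavour, assuming Coates-style patch-based feature learning: cluster normalized image patches and treat the centroids as a filter bank. The patch size, K, and placeholder data below are illustrative choices, not the paper's exact settings.

```python
# Rough sketch of K-means feature learning: centroids of normalized image
# patches act as a bank of K linear filters.
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(images, patch=8, n_patches=5000, rng=np.random.default_rng(0)):
    """images: (n, 32, 32) grayscale array; returns (n_patches, patch*patch)."""
    out = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        y, x = rng.integers(0, 32 - patch, size=2)
        out.append(img[y:y + patch, x:x + patch].ravel())
    return np.asarray(out, dtype=np.float32)

images = np.random.rand(100, 32, 32).astype(np.float32)   # placeholder data
patches = extract_patches(images)
patches -= patches.mean(axis=1, keepdims=True)             # simple per-patch normalization

kmeans = KMeans(n_clusters=500, n_init=4).fit(patches)
filters = kmeans.cluster_centers_        # (500, 64): each centroid is a linear filter
# Features for a new patch p can then be e.g. the responses filters @ p,
# pooled over the image before being fed to the linear SVM.
```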

Assumptions

The street numbers in the images are assumed to be horizontally aligned, with no overlap between digits. If the images were randomly rotated, these algorithms' performance would probably degrade severely. The authors themselves suggest a deskewing preprocessing step to correct such images.
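For concreteness, a common moment-based deskew looks like the sketch below; the paper only suggests deskewing and does not prescribe this exact method.

```python
# A generic moment-based deskew (my own illustration, not the paper's recipe):
# estimate the slant from second-order image moments and shear it away.
import numpy as np
from scipy import ndimage

def deskew(image):
    """image: 2D grayscale array; undoes horizontal slant via image moments."""
    r, c = np.mgrid[:image.shape[0], :image.shape[1]]
    total = image.sum() + 1e-8
    mr, mc = (r * image).sum() / total, (c * image).sum() / total
    var_r = ((r - mr) ** 2 * image).sum() / total
    cov_rc = ((r - mr) * (c - mc) * image).sum() / total
    alpha = cov_rc / (var_r + 1e-8)            # estimated slant
    affine = np.array([[1.0, 0.0], [alpha, 1.0]])
    center = np.array(image.shape) / 2.0
    offset = np.array([mr, mc]) - affine @ center
    return ndimage.affine_transform(image, affine, offset=offset)
```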

Results

The results clearly show that the unsupervised feature-learning techniques blew the traditional hand-crafted features out of the water, with K-means performing better than the stacked sparse auto-encoders. It's interesting to note that human performance was measured at 98%. A section on human performance also explains that human errors were concentrated in cases where image "context" was missing (humans were shown only part of the house number) or where the image itself was blurry. Another interesting observation (which has become common knowledge by now) is that more training data increased the accuracy of the unsupervised methods.

Didn’t completely get

On page 5 (the paragraph above section 4), the description of how the K-means based feature learning system extracts image features is not entirely clear to me. In particular, it's not clear how the large bank of K linear filters was learnt.

Conclusion

Overall, an interesting paper to read in 2018. It shows the advances in the CV field in general, and how this problem is still not as "solved" as MNIST.
