TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Almost Any Image Is Only 8k Vectors

23 min read · Mar 30, 2022


Figure 1. Not only have some members of the variational family of autoencoders significantly improved the reconstruction fidelity of images, but their encoders can now also compress and map input images to a fixed vocabulary of learned vectors. This fixed vocabulary makes it possible to represent images as a token stream for self-supervised learning, as we do in NLP. Additionally, this reduced-dimensionality representation lightens the burden on downstream models, which no longer have to compress bits without information loss while also extracting semantic features. However, the fixed-vocabulary mapping of input images has so far been used largely for generative tasks. Using the same approach for discriminative tasks has not been tried to date; if that proves successful, such a vocabulary would qualify as the word analog for images. This post examines two candidates for mapping input to a fixed vocabulary, one of which is quite promising: it maps the input image (bird on the left) to a combination drawn from a fixed vocabulary of 8192 learned vectors (a distributed representation). That combination of vectors is then fed to a decoder that reconstructs the original image with almost no perceptual loss (bird on the right). Image of the bird on the left from the DIV2K dataset. This post uses images from the DIV2K dataset with explicit permission from the authors. Thank you, Radu Timofte.
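To make the mapping in Figure 1 concrete, the sketch below shows the core vector-quantization step: every position in the encoder's output grid is snapped to its nearest entry in a fixed codebook of 8192 learned vectors, and the decoder only ever sees those codebook entries (or, equivalently, their indices). This is a minimal illustration, not the exact model behind the figure; the latent grid size, the vector dimensionality, and the random stand-ins for the encoder output and codebook are assumptions made purely for demonstration.

```python
# Minimal sketch of the quantization step described in Figure 1: each spatial
# position of an encoder's feature map is replaced by its nearest entry in a
# fixed codebook of 8192 learned vectors. All shapes and the random data are
# illustrative assumptions, not the dimensions of any particular model.
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 8192   # fixed vocabulary of learned vectors
EMBED_DIM = 256     # dimensionality of each codebook vector (assumed)
GRID = 32           # e.g. a 32x32 grid of latents for one image (assumed)

codebook = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))      # stand-in for learned vectors
encoder_out = rng.normal(size=(GRID * GRID, EMBED_DIM))  # stand-in for encoder features

# Nearest-neighbour lookup: squared L2 distance from every latent to every codebook entry.
d2 = (encoder_out ** 2).sum(1, keepdims=True) \
     - 2.0 * encoder_out @ codebook.T \
     + (codebook ** 2).sum(1)
tokens = d2.argmin(axis=1)    # (GRID*GRID,) discrete "word" ids in [0, 8192)
quantized = codebook[tokens]  # what the decoder actually receives

print(tokens[:10])            # the image as a short stream of vocabulary ids
print(quantized.shape)        # (1024, 256): 1024 positions, each a codebook vector
```

In a trained model of this kind (a VQ-VAE or a discrete VAE, for example), the codebook is learned jointly with the encoder and decoder rather than sampled randomly as above; the point of the sketch is only that the whole image ends up expressed as a sequence of ids drawn from a fixed, modest-sized vocabulary.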

Overview

The search for an answer to the question “what is the analog of words in images?” appears to be proceeding broadly along two paths (with a few exceptions), driven by multiple factors:

  • the nature of the task being solved (discriminative or generative)
  • how the model is trained (self-supervised vs. supervised)
  • the nature of the training process (e.g. BERT-style masking of the input and predicting the masked pieces; a minimal sketch of this appears right after this list), etc.
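Since the BERT-style recipe comes up repeatedly below, here is a rough sketch of what masked prediction looks like once an image has been reduced to a stream of ids from a fixed vocabulary such as the 8192-entry codebook sketched above. The sequence length, masking ratio, and reserved mask id are assumptions for illustration; the model that would actually fill in the masked positions is omitted, and only the corruption and target construction are shown.

```python
# Rough sketch of BERT-style masked modeling applied to discrete image tokens:
# mask a random subset of the token stream and ask a model to predict the
# original ids at the masked slots. Sizes and the masking ratio are assumptions.
import numpy as np

rng = np.random.default_rng(1)

VOCAB_SIZE = 8192
SEQ_LEN = 1024                  # e.g. a 32x32 grid of image tokens (assumed)
MASK_ID = VOCAB_SIZE            # extra id reserved for the [MASK] symbol

tokens = rng.integers(0, VOCAB_SIZE, size=SEQ_LEN)  # stand-in image token stream

mask = rng.random(SEQ_LEN) < 0.15        # mask roughly 15% of positions, as in BERT
corrupted = np.where(mask, MASK_ID, tokens)

# A trained model would map `corrupted` to a distribution over the 8192-entry
# vocabulary at every position; the loss is cross-entropy at the masked slots only.
targets = tokens[mask]
print(corrupted[:20])
print(f"{mask.sum()} positions to reconstruct out of {SEQ_LEN}")
```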

This question has come to the fore recently, in large part due to the success of transformers in vision for both discriminative tasks (classification, segmentation, object detection) and generative tasks, rivaling if not surpassing their convolutional counterparts on standard benchmarks. In particular, the desire to emulate in vision the success of self-supervised transformer models like BERT, which learn a fixed vocabulary of words to represent any input, raises this question, because pixels don't seem to be the analog of words; they are closer to characters than words. Even the analogy of pixels to characters feels like a stretch: the number of color combinations for a pixel makes it a very large…


Ajit Rajasekharan