The end of the ImageNet era

Aleph Alpha
Published in Aleph Alpha Blog
Feb 14, 2022

World-models are overtaking specialized and supervised models in performance and generalizability. With our new research, we lift this development from the language space (with models like GPT-3) into a multimodal space, while retaining the impressive functionality known from large language models.

The supervised world for visual data

ImageNet was a tremendously influential dataset and publication. For the first time, a computer vision dataset was available that was large and diverse enough to unlock the potential of the emerging, larger deep networks. ImageNet paired each image with a label, one of more than 20,000 human-annotated categories describing the main subject of the image. This was the spark that ignited modern image classification and, later, object detection (when labels moved to bounding boxes and pixel masks). The following period of innovation was directly responsible for the success of my earlier start-up and for the boom in autonomous systems overall.

World knowledge and understanding cannot be (only) supervised

While great at the time, from today's perspective the limitation to seemingly arbitrary categories, defined by PhDs after maybe too much coffee, seems very restrictive, especially when compared with the capabilities of self-supervised language models like GPT-3. What about the reflection of a pedestrian in a window front? Someone dressed up for carnival? Or a poster ad that shows a full-sized human?

For humans, these observations are not a problem at all. We have all the world knowledge needed to understand these things and make sense of their implications. Putting this world knowledge into labels for a classifier to learn is hopeless (although we tried; one of the most noteworthy attempts by my team and me was a dataset in which we stacked masks in layers on top of each other, with properties like "semi-transparent", "light-effect" and "reflection").

Image: the HCI Benchmark Suite

When I looked at the early results of our multimodal model, I knew that the ImageNet era was coming to an end. If we are smart about combining a giant language model with one (or more) image representations, we can leverage the power and complexity of language.

The amazing thing about language is that it is specifically built to capture the complete relevant complexity of our world and map it to our understanding. This is why large language models are so powerful: they piggyback on millennia of human understanding encoded in a standardized form. For very large models, this seems to implicitly include capabilities like answering detailed questions, writing summaries and comparing semantics and meaning.

If we succeed in combining this with visual understanding, we immediately gain all this power in the visual domain. Tell a story inspired by some images? Describe the difference between two observations? Answer a question about image content? Read text? Learn new visual concepts? All of these suddenly become possible with the almost unlimited power of language expression. Why would you ever go back to labels?
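To make the contrast concrete, here is a minimal, purely illustrative sketch in Python: a classifier in the ImageNet tradition can only ever answer with one of its predefined categories, while with language as the interface the same image supports arbitrary instructions. The label list and the example instructions below are invented for illustration and are not our actual interface.

```python
# A classifier in the ImageNet tradition answers with one of a fixed set of
# categories, no matter what we actually want to know about the image.
FIXED_LABELS = ["pedestrian", "car", "traffic light", "shop window", "poster"]

# With language as the interface, the very same image supports open-ended
# tasks, each expressed as a plain-text instruction.
OPEN_ENDED_TASKS = [
    "Tell a short story inspired by this image.",
    "Describe the difference between this image and the previous one.",
    "What does the text on the sign say?",
    "Is the person in the shop window a reflection, a poster, or a real pedestrian?",
]

for task in OPEN_ENDED_TASKS:
    # In a real system the image and the task would be sent to the multimodal
    # model; here we only print the prompts to show the shape of the interaction.
    print(f"[image] + {task}")
```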

I am sure there are good reasons for certain use cases; however, looking at our first results, I believe the era of ImageNet is over. Our model outperforms BERT-based techniques for Outside Knowledge Visual Question Answering, as it retains the encyclopaedic knowledge of the language model. A more technical write-up by the research team is coming soon.

Thanks, ImageNet, for all the fish.

To inspire your ideas, here is what our large multimodal world-model offers:

  • almost unlimited representation power of language and images
  • language and images can be combined freely, in any order and quantity (e.g. by adding several images to a context, with implicit meaning in their sequential order; see the sketch after this list)
  • known large language model tricks work (like QA)
  • learns new visual concepts few-shot
  • better at reading than most dedicated OCR systems
  • can be combined with our other tools (WorldPointer, HybridInterface)
  • I can’t wait to see what else our partners will discover
  • won’t fit on your gaming GPU, sorry
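As a rough sketch of how such a freely combinable context can be pictured: the model consumes one ordered sequence of text and image segments, and the order itself can carry meaning, for example a few-shot question-and-answer pattern. The classes, file paths and prompt below are illustrative assumptions, not our actual client API.

```python
from dataclasses import dataclass
from typing import List, Union

# Illustrative prompt segments: the model sees a single ordered sequence,
# so text and images can be mixed freely in any order and quantity.
@dataclass
class Text:
    content: str

@dataclass
class Image:
    path: str  # a real client would load and encode the pixels

Prompt = List[Union[Text, Image]]

# Few-shot visual QA: two worked examples followed by a new image.
# The sequential order carries the implicit "image, question, answer" pattern.
prompt: Prompt = [
    Image("examples/street_scene_1.jpg"),
    Text("Q: How many pedestrians are visible? A: Three."),
    Image("examples/street_scene_2.jpg"),
    Text("Q: How many pedestrians are visible? A: None."),
    Image("query/street_scene_3.jpg"),
    Text("Q: How many pedestrians are visible? A:"),
]
```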

Some examples out of the lab, still work in progress, but you can already see the possibilities:

World-class OCR combined with context understanding (what is relevant — how would a human answer)

Learning and applying new concepts in the visual space based on one example

Reading and understanding graphs and diagrams

World knowledge included

This article was originally published on September 30, 2021 on https://www.aleph-alpha.com/research.
