Human vs. Deep Learning AI: Who Can Compare Images Better?

Deep Cognition
Apr 16, 2018
The Totally-Looks-Like dataset: pairs of perceptually similar images

Artificial intelligence (AI) is breaking ground right before our eyes. From the virtual assistants on every smartphone to their spread into smart speakers, it is clear that these systems hold potential to make our lives easier. However, as a still-developing technology, AI is bound to have limitations.

Virtual assistants like Siri and Alexa depend on the Internet to perform searches, check weather updates, and look up business operating hours. Deep-learning systems follow the same trajectory: they are newly developed technologies whose shortcomings are often detected only once they are put to use.

Rosenfeld and colleagues from York University, Toronto, noticed this limitation and asked whether it extends to deep-learning machines’ capability in the field of computer vision. In their 2018 study “Totally Looks Like — How Humans Compare, Compared to Machines,” they examined how humans and machines fare at judging image similarity. Their objective: to determine whether the human mind is more capable than a deep neural network at identifying similarities in a dataset of images.

The researchers made use of images from Totally Looks Like (TLL), a for-entertainment website that hosts pairs of images that users have flagged as looking alike. The site’s images include objects, animals, and faces, among others. Although the TLL collection is smaller than standard computer-vision datasets, Rosenfeld and colleagues note that “the diversity and complexity of the images in the dataset implicitly captures many aspects of human perception of image similarity, beyond current datasets which are larger but at the same time narrower in scope.”

A total of 6,016 image pairs were pooled for the study and presented to both humans and deep-learning machines for similarity judgments. Given the significant advances deep learning has made in recent years, one might expect machines to be on par with, or even better than, the human mind at identifying image similarities.

But who did better? Humans and deep-learning machines each have their own advantages over the other. Humans, for starters, come to the test with a lifetime’s databank of images at their disposal, so their ability to spot similarities between images remains ahead even after the deep-learning machine has been fine-tuned. Still, the researchers argue that this gap should be bridgeable: “(W)e believe that generic enough visual features should be able to reproduce the same similarity measurements without being explicitly trained to do so, just as humans do.”
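The “generic visual features” the researchers mention are typically vectors extracted from a pretrained network, with similarity measured geometrically between them. A minimal sketch of that idea, using made-up stand-in feature values rather than outputs of any real network:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in "deep features" for a pair of images; in practice these would
# come from a pretrained network's penultimate layer.
left = np.array([0.9, 0.1, 0.4])
right = np.array([0.8, 0.2, 0.5])
unrelated = np.array([-0.3, 0.9, -0.1])

print(cosine_similarity(left, right))      # high score: the pair looks alike
print(cosine_similarity(left, unrelated))  # low score: the images differ
```

The hope expressed in the quote is that features trained for generic tasks (such as classification) would already place perceptually similar image pairs close together under a measure like this, without any similarity-specific training.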

Another case where the human mind excelled was when both humans and machines were shown a cartoon image alongside a photograph of the face of actor Nicolas Cage. Humans fared better than machines here, partly because the human mind can readily point to the facial features that the cartoon character and the human face share.

Automatic retrieval errors: using distances between state-of-the-art deep learned representations often does not do well in reproducing human similarity judgments. Each row shows a query image on the left, five retrieved images and the ground-truth on the right.
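The retrieval setup behind this figure can be sketched as a nearest-neighbor search: embed every candidate image, rank the gallery by distance to the query, and check whether the ground-truth match lands among the top retrievals. The toy example below uses random vectors in place of real deep features:

```python
import numpy as np

def retrieve_top_k(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k gallery vectors closest to the query (Euclidean distance)."""
    dists = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 8))   # 100 candidate image features (toy data)
truth_idx = 42                        # index of the ground-truth match
# The query is a slightly perturbed copy of its match, mimicking a
# representation that places a look-alike pair close together.
query = gallery[truth_idx] + rng.normal(scale=0.05, size=8)

top5 = retrieve_top_k(query, gallery, k=5)
print(truth_idx in top5)  # True when the features place the pair close together
```

The failure mode the figure illustrates is exactly the opposite situation: the deep representation puts the ground-truth partner far from the query, so other gallery images crowd it out of the top five.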

This does not mean that deep-learning machines are incapable in the field of computer vision. The study also found that in the cartoon-versus-Nicolas Cage comparison, machines did not fall far behind humans. The factors driving similarity, the researchers add, “must be multi-modal and conditioned on both images: examples of such factors are (1) facial features (2) facial expressions (3rd row in Figure 4), requiring a robust comparison between facial expressions in different modalities (3) a similarity of the texture or structure of part of the image (last row, person’s hair).”

This endeavor is not the first of its kind. Several studies have pitted human capability against that of machines on tasks involving high-level image attributes. One identified several discrepancies between human and machine similarity judgments, while another found that humans outperformed machines when asked to compare a normal image with its distorted version.

Rosenfeld and colleagues’ conclusion is not far from the findings of other researchers in computer vision. The human mind had been exposed to a wealth of visual experience well before the study was conducted, so a bias is inevitable when it is pitted against a newly developed technology.

The future of deep-learning machines in computer vision still holds promise, but some fine-tuning needs to be done before these machines can match the human mind at identifying image similarities. Asked for recommendations on future studies of a similar nature, the researchers suggest “that the comparison will be akin to visual-question-answering (VQA), in the form ‘why should image A be regarded as similar / dissimilar to image B?’”

This initial research gives us a glimpse of how computer vision differs from human vision on similarity tasks. Further exploration is needed to build computer vision systems that match human performance.