VizWiz: Computer Vision Researchers Join Forces for Social Good
Artificial intelligence is not just making our lives easier by automating mundane and tedious tasks; it is also unlocking myriad possibilities for people with disabilities and promising them unique ways of experiencing the world. Thanks to AI-powered technologies, the assistive technology industry has advanced far beyond wheelchairs, prostheses, or vision and hearing aids. For instance, visual assistive technologies like Seeing AI or OrCam help blind people overcome daily challenges by facilitating simple everyday tasks and breaking down accessibility barriers. Computer vision methods such as object recognition, scene understanding, Visual Question Answering (VQA), and Visual Dialogue hold great promise for making the lives of blind people much easier.
In this regard, the research community has been working on using computer vision models to advance visual assistive technologies for blind people. Almost ten years ago, a group of researchers developed the VizWiz app, which enabled blind users to take pictures with their phones, ask questions about those pictures, and receive near-real-time spoken answers from remote sighted workers. Fast-forward to today: in light of recent advances in VQA models, researchers from the computer vision community took advantage of the data collected through the app and put together the VizWiz dataset, comprising over 31,000 questions collected from blind people. In compliance with privacy restrictions, a rigorous filtering and anonymization process was first applied to the data to eliminate any samples that could reveal individuals’ identities. The remaining question, however, was how to develop ‘sighted’ VQA models under natural settings, allowing blind people to capture images of objects, ask questions about these images, and get timely spoken answers.
VizWiz Grand Challenge: Pushing Visual Assistive Technology Research
This year’s ECCV conference featured a VizWiz Grand Challenge with the aim of urging the research community to join forces, solve the challenges of the VizWiz dataset and the VQA task at large, and come up with new approaches that meet the needs of blind people. We are committed to working on harnessing the power of artificial intelligence for social good, and that is why the SAP Leonardo Machine Learning Research team participated in the challenge and was among the top three performing teams. We also presented an extended abstract at the VizWiz workshop elaborating on our solutions and highlighting the shortcomings and limitations of current VQA models and evaluation metrics.
Challenges and Limitations of the VizWiz Dataset
While VQA models and algorithms have shown remarkable progress over the past few years, they usually perform well only on artificially curated datasets with high-quality, clear images and direct written questions that the algorithm can easily identify and respond to. When it comes to deploying such algorithms in real-world scenarios, however, several shortcomings and limitations emerge.
Unlike standard VQA datasets, the VizWiz dataset is based on real-life data originating from blind people, making this dataset both appealing and challenging. For example, the images provided by blind people are often of poor quality (e.g., see the fuzzy pictures in the second row of figure 1). Moreover, the questions asked are mostly conversational or suffer from audio recording issues. Additionally, in many cases, the questions cannot be answered because the image is irrelevant to the question or the object in question is out of focus (e.g., see the “unanswerable” and “unsuitable” examples in figure 1). To address these issues, the VizWiz Grand Challenge comprised two tasks: 1) predict the answer to a visual question, and 2) predict whether a visual question can be answered at all.
Our Solutions to the VizWiz Challenge
Task One: Predicting Answers to Visual Questions
Our solution for the first task centers on the uncertainty, or subjectivity, of most answers in VizWiz. We exploit the notion of “uncertainty-aware” training in VQA models: we model the uncertainty of an answer according to the agreement between human annotators, i.e., the frequency of each answer in the ground-truth set. We employ a loss function that takes into consideration the contribution and the uncertainty of each answer given by human annotators. The loss is computed as the weighted average of the negative log-probabilities of each unique ground-truth answer, which allows the model to optimize for multiple correct answers simultaneously.
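The loss described above can be sketched as a small function. This is a minimal illustration, not the team's actual implementation: the dictionary-based interface (`probs` mapping answers to model probabilities, `gt_answers` as the raw annotator list) is a hypothetical layout chosen for clarity.

```python
import math

def uncertainty_aware_loss(probs, gt_answers):
    """Weighted soft cross-entropy over the ground-truth answer set.

    probs      -- dict mapping each answer string to the model's predicted
                  probability (hypothetical interface for illustration).
    gt_answers -- list of answers given by the human annotators, with repeats.

    Each unique answer is weighted by its annotator frequency, so the loss
    simultaneously rewards every answer the annotators agreed on, in
    proportion to that agreement.
    """
    total = len(gt_answers)
    loss = 0.0
    for ans in set(gt_answers):
        weight = gt_answers.count(ans) / total      # annotator agreement
        # Negative log-probability of this answer, weighted by its frequency;
        # the epsilon guards against log(0) for answers the model missed.
        loss += -weight * math.log(probs.get(ans, 1e-12))
    return loss
```

For example, if seven of ten annotators answered “yes” and three answered “maybe”, a prediction that spreads probability mass across both answers incurs a lower loss than one that concentrates on a single answer the annotators disagreed about.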
Task Two: Predicting whether a Question is Answerable
For the second task of the challenge, we use a binary model similar to the one used for predicting the answer, but this time we train it with binary labels (answerable/unanswerable). Preliminary analysis of the dataset showed that most of the samples are answerable. Since the evaluation metric for predicting whether a visual question can be answered is Average Precision, we balanced the dataset by up-sampling the unanswerable samples. This allowed us to also outperform the state of the art on the second task.
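The up-sampling step can be sketched as follows. This is a generic minority-class up-sampler, not the exact procedure used in the challenge; the record layout (dicts with a boolean `answerable` field) is assumed for illustration.

```python
import random

def upsample_minority(samples, label_key="answerable"):
    """Balance a binary dataset by duplicating minority-class samples.

    samples   -- list of dicts, each carrying a boolean label under
                 `label_key` (hypothetical layout; real VizWiz records
                 include the image, question, and answers as well).
    Returns a shuffled list in which both classes are equally frequent.
    """
    pos = [s for s in samples if s[label_key]]
    neg = [s for s in samples if not s[label_key]]
    minority, majority = (neg, pos) if len(neg) < len(pos) else (pos, neg)
    # Draw with replacement until the minority class matches the majority.
    extra = random.choices(minority, k=len(majority) - len(minority))
    balanced = samples + extra
    random.shuffle(balanced)
    return balanced
```

Up-sampling (rather than down-sampling the majority class) keeps every original training example, which matters when the dataset is small to begin with.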
A Close-Up Look at VQA Task and its Evaluation Metrics
Solving the challenge is only the first step towards fine-tuning VQA algorithms to assist blind people in overcoming their daily challenges. We elaborated on our solutions and examined various shortcomings of VQA models in our extended abstract, “When the Distribution is the Answer: An Analysis of the Responses in VizWiz.” On the one hand, we analyzed the distribution of the answers in the VizWiz dataset and showed how it is skewed towards very few frequent answers. Models can exploit this imbalance to achieve state-of-the-art performance by merely predicting the most frequent answers, without actually learning to understand the images and the questions asked. On the other hand, the current VQA evaluation metric has multiple flaws. First, it does not capture the semantic similarities between different answers: e.g., “Dog” and “Chihuahua” are considered as different as “Dog” and “Cake.” Second, it does not account for the subjectivity of the provided answers. Third, since a prediction gains accuracy even when it appears only once or twice in the ground-truth set, the metric incentivizes the model to predict the “safe” answer, which is the most frequent one. We believe the research community needs to continue addressing these shortcomings to develop more robust and accurate VQA models that we can rely on in real-world scenarios, e.g., in the context of visual assistive technologies.
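The flaws above are easy to see in the commonly used form of the VQA accuracy metric, where a predicted answer scores full credit if at least three of the ten annotators gave it, and partial credit otherwise. (The official metric averages this over annotator subsets; the simplified min-form below is a close approximation used for illustration.)

```python
def vqa_accuracy(predicted, gt_answers):
    """Simplified VQA accuracy for a single question.

    predicted  -- the model's answer string.
    gt_answers -- the ten annotator answers for this question.

    An answer matching at least 3 of the 10 annotators counts as fully
    correct; fewer matches earn proportional partial credit. Note that
    matching is exact string equality: a semantically close answer that
    no annotator typed scores zero.
    """
    matches = sum(1 for a in gt_answers if a == predicted)
    return min(matches / 3.0, 1.0)
```

With eight annotators answering “dog” and two answering “chihuahua”, the metric gives “dog” full credit, “chihuahua” only partial credit, and a semantically related answer no annotator typed (say, “puppy”) zero, exactly like the unrelated answer “cake”. This is why models gravitate toward the most frequent, “safe” answers.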
Could Computer Vision be an ‘Artificial’ Eye to Blind People?
The technology is still in its infancy, and the most significant breakthroughs are yet to come. The VizWiz challenge initiated dialogue within the research community, stimulating further research toward VQA systems tailor-made to fit the needs of blind people. Ultimately, the goal is to develop algorithms that perform well under the limiting factors often found in real-world situations, such as data scarcity, label imbalance, noisy labels, grounding of concepts, and composability. This would lead to cutting-edge visual assistive technologies that promise to someday transform the lives of blind people by facilitating the simple tasks of everyday life and granting them more independence and freedom.
About the Author: Denis Dushi is a master’s student at the Polytechnic University of Milan and KTH Royal Institute of Technology in Stockholm. During his internship at SAP Leonardo Machine Learning Research, he focused on Visual Question Answering and Domain Adaptation.