Tiki AI — A Virtual Assistant for VQA

Brent B · Published in Voice Tech Podcast · Aug 12, 2019

Tiki was developed for UC Berkeley’s MIDS W251 Final Project by Brent Biseda, Vincio De Sola, Pri Nonis, and Kevin Stone.

https://github.com/facebookresearch/pythia

The goal was to create a prototype for a virtual assistant that can combine a live video stream with natural English queries. In my view, this is a product that will become commonplace for Amazon Alexa, Apple Siri, and similar assistants over the next few years.

We based our backend VQA service on Facebook Research’s state-of-the-art Pythia model. Pythia is built with PyTorch and was the winning entry in the 2018 VQA Challenge.

Tiki makes any video stream queryable via text or voice. Users can query the image in natural language:

  • “What objects are in the image?”
  • “Which # is taking a shot?”
  • “What is the weather outside?”

Tiki runs in any browser and requires no app installation.

Prototype Screenshot

A diagram of our selected architecture is shown below:

Project Infrastructure
APIs and Models

There are a number of applications for VQA and virtual assistants; a few examples are shown below. I would suggest that there is a market in home surveillance and user-defined alerts. For instance, I have a Nest camera installed in my home, but I immediately disabled its motion alerts because they fire constantly. Instead, you could make an intelligent query such as “alert me if the baby wakes up.”


Is Tiki an optimist or pessimist?

Do I need to get my baby from the crib?

Pythia Technical Details

  • Uses an object detector to extract image features with bottom-up attention.
  • Uses ResNet-101 as the backbone network.
  • Uses Visual Genome, a knowledge base that connects structured image concepts to language.
  • The question text is then used to compute top-down attention over the image regions.
  • Uses GloVe (Global Vectors) word embeddings fed into a GRU network and a question attention module to extract text features (a toy sketch of this pathway follows the list).
  • Reached 70.34% on VQA 2.0 with an ensemble of 30 models.
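To make the bottom-up / top-down attention idea concrete, here is a minimal PyTorch sketch of a VQA head in that style. This is not the Pythia code; the layer sizes, module names, and classifier head are illustrative only.

```python
# A minimal sketch of the bottom-up / top-down attention idea behind Pythia.
# This is NOT the Pythia code; layer sizes and names are illustrative only.
import torch
import torch.nn as nn

class ToyVQA(nn.Module):
    def __init__(self, vocab_size=20000, num_answers=3129,
                 emb_dim=300, hidden_dim=1024, region_dim=2048):
        super().__init__()
        # GloVe-sized (300-d) word embeddings feeding a GRU question encoder
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Top-down attention: score each image region conditioned on the question
        self.att = nn.Sequential(
            nn.Linear(region_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # Fuse the attended image feature with the question feature and classify
        self.classifier = nn.Sequential(
            nn.Linear(region_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, question_ids, region_feats):
        # question_ids: (B, T) token ids
        # region_feats: (B, R, 2048) bottom-up features from an object detector
        _, q = self.gru(self.embed(question_ids))    # q: (1, B, hidden_dim)
        q = q.squeeze(0)                             # (B, hidden_dim)
        q_rep = q.unsqueeze(1).expand(-1, region_feats.size(1), -1)
        scores = self.att(torch.cat([region_feats, q_rep], dim=-1))  # (B, R, 1)
        attn = torch.softmax(scores, dim=1)
        attended = (attn * region_feats).sum(dim=1)  # (B, region_dim)
        return self.classifier(torch.cat([attended, q], dim=-1))
```

The attention weights are computed per image region conditioned on the question vector, which is what lets the question “which number is taking a shot?” focus on the relevant player rather than the whole frame.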

Flask Implementation Performance

Initialization of the model takes 16 seconds on a V100, with inference taking 1.5 seconds. Due to the cost of running the V100 server, the machine is no longer running.

Initialization: 16.125 Seconds
Tiki : Initializing : Device Type is cuda
Tiki : Initializing : Building - Text Processors
Tiki : Initializing : Building - ResNet152
Tiki : Initializing : Building - Detectron
Tiki : Initializing : Building - PythiaVQA
Inference: 1.5 seconds
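For reference, a minimal Flask endpoint wrapping the VQA model might look like the sketch below. The `load_pythia_vqa` and `answer_question` helpers are hypothetical placeholders standing in for the actual Tiki model-loading and inference code.

```python
# Hypothetical sketch of a Flask inference endpoint for the VQA service.
# `load_pythia_vqa` and `answer_question` are illustrative placeholders,
# not functions from the actual Tiki or Pythia codebases.
import io

from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

# Load once at startup so the ~16 s initialization is paid a single time.
model = load_pythia_vqa()

@app.route("/vqa", methods=["POST"])
def vqa():
    image = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    question = request.form["question"]
    answer = answer_question(model, image, question)  # ~1.5 s on a V100
    return jsonify({"answer": answer})
```

Keeping the model resident in the long-lived Flask process means each request pays only the roughly 1.5-second inference cost, not the 16-second initialization.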

Google Cloud Speech-to-Text Overview

We also run a Flask server that accepts WAV files and forwards them to Google Cloud Speech-to-Text for transcription; a minimal example call is sketched after the list below.

  • Processes real-time streaming or prerecorded audio
  • Can return recognized text from audio stored in a file
  • Capable of analyzing short-form and long-form audio
  • Can stream text results, immediately returning text as it’s recognized from streaming audio or as the user is speaking
  • The API recognizes 120 languages and variants
  • Automatically identifies spoken language
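As a concrete reference, transcribing one of the uploaded WAV files server-side could look roughly like the snippet below. It assumes the `google-cloud-speech` Python client (class names differ slightly across library versions); the filename, sample rate, and encoding are illustrative.

```python
# Rough example of transcribing an uploaded WAV file with Google Cloud
# Speech-to-Text. Assumes the google-cloud-speech client (v2.x style API);
# the filename, sample rate, and encoding below are illustrative.
from google.cloud import speech

client = speech.SpeechClient()

with open("query.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```

The recognized transcript is then passed on as the natural-language question for the VQA endpoint.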

Conclusion

Visual question answering (VQA) is a young field that combines computer vision and NLP to answer questions posed in ordinary human language. This project is an implementation of a state-of-the-art VQA model (Pythia) in a web app. Over time, this type of service will become more common across smart devices such as Alexa, Siri, and others, and Tiki suggests that rollout may be imminent. Here we demonstrate the ability to answer a voice query about any image from a remote camera.

Pythia Service in Action

References

  • Singh, Amanpreet; Natarajan, Vivek; Shah, Meet; Jiang, Yu; Chen, Xinlei; Batra, Dhruv; Parikh, Devi; Rohrbach, Marcus. “Towards VQA Models That Can Read.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. https://github.com/facebookresearch/pythia

See the Code:
