Developing an Offline Mobile Navigation App for Visually Impaired People
¹ DevOps Engineering Intern, RubensteinTech, NYC
² Forecasting & Business Analyst Intern, Novo Nordisk, NJ
Current smartphone navigation apps vary in their functionality and requirements. Although there are numerous navigation platforms, we are not aware of an app that helps visually impaired people navigate when they have no mobile network or Wi-Fi access. Consider a blind or visually impaired person walking on a sidewalk who loses their balance and falls. Without a mobile service plan or Wi-Fi, they cannot use popular navigation apps, but they can use MobileAI: a smartphone app, trained in advance, that uses the camera’s video or image data to predict the user’s geographic location, announce it by voice, and recommend routes. The app uses the InceptionV3 model, one of Keras’s pre-trained applications for deep learning, to classify the objects in the video or image taken by the user and output the user’s location from the set of pre-trained locations. The goal of the app is to give visually impaired and blind people greater mobility and independence without requiring outside human help or an internet connection on their smartphone.
In 2018, approximately 1.3 billion people worldwide were living with some form of visual impairment. The World Health Organization states that approximately 80% of visual impairment is avoidable. A study supported by the National Institutes of Health projected that the number of visual impairment cases in the U.S. will double to more than 8 million by 2050, and that an additional 16.4 million Americans will have difficulty seeing because of refractive errors. Although research on treatments is ongoing, few solutions help these people navigate accurately in an emergency, and existing assistance tools require the device to have a mobile network or Wi-Fi connection.
The most commonly used navigation apps, such as Google Maps, Apple Maps, and Waze, operate worldwide, and their margins of error can lead to inaccurate results. More common causes of failure, however, are loss of internet connectivity and cellular network outages, as seen in Figure 1.
The proposed app, MobileAI (pronounced “Mobile-Eye”), was developed as a prototype that performs object detection on the video or images it captures and outputs the user’s location. Because the app’s second task is to report accurate real-time locations, recognizing landmarks in the camera feed gives it fine-grained visual cues. Information about the visually impaired user’s surroundings is processed with InceptionV3, a Keras Application, to tell them where they are.
With the rise of deep learning, convolutional neural networks (CNNs) have quickly become the core of most sophisticated computer vision solutions for a wide range of tasks. A CNN powers the core functionality of the proposed application, and the InceptionV3 model is the foundation that lets the app perform its tasks. The model, based on the original paper “Rethinking the Inception Architecture for Computer Vision” by Szegedy et al., is the culmination of many ideas developed by multiple researchers over the years.
A visually impaired or blind person can start navigating their surroundings simply by saying “Hey Google, open MobileAI” to their phone. They then walk around their neighborhood while the app acts as a navigator, helping them explore without human intervention or assistance. When the app recognizes one of the locations in its training data, it informs the user with a spoken audio note.
The app is based on a widely used image recognition model that can achieve remarkable accuracy. The model is made up of symmetric and asymmetric building blocks, including convolutions, max pooling, concatenations, dropouts, and fully connected layers. It is built using TensorFlow’s high-level Estimator API, which enforces a separation between the model and input portions of the code: the user defines model_fn and input_fn functions, which correspond to the model and to the input pipeline and preprocessing stages of the TensorFlow graph, respectively.
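The asymmetric building blocks come from an idea in the Szegedy et al. paper: an n×n convolution can be replaced by a 1×n convolution followed by an n×1 convolution. The sketch below only counts weights to show why this is cheaper; it ignores biases and assumes, for simplicity, that the input, intermediate, and output layers all have the same channel count (192 here is an arbitrary illustrative value).

```python
from typing import Tuple


def conv_weights(kh: int, kw: int, c_in: int, c_out: int) -> int:
    """Number of weights in a kh x kw convolution (biases ignored)."""
    return kh * kw * c_in * c_out


def square_vs_factorized(n: int, channels: int) -> Tuple[int, int]:
    """Compare one n x n convolution with a 1 x n followed by an n x 1 convolution.

    Assumption: every layer keeps the same number of channels.
    """
    square = conv_weights(n, n, channels, channels)
    factorized = conv_weights(1, n, channels, channels) + conv_weights(n, 1, channels, channels)
    return square, factorized


for n in (3, 7):
    square, factorized = square_vs_factorized(n, channels=192)
    saving = 1 - factorized / square
    print(f"{n}x{n}: {square} weights vs {factorized} factorized ({saving:.0%} fewer)")
```

For a 3×3 kernel the factorized pair needs 6 instead of 9 weights per channel pair, about a third fewer; the saving grows with n.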
Before the model can recognize images, it must be trained, usually through supervised learning on a large set of labeled images. A blind person usually takes the same routes every day: doing so helps them avoid possible obstacles, and as they grow used to a route, they become familiar with its surroundings. To test this idea, we chose a route and mapped out distinct, recognizable locations along it that a CNN could plausibly learn to detect.
We then took more than 60 photos of each chosen location. To speed this up, we used the camera’s burst (“shutter”) mode, which captures 4–5 images per second. We then moved and photographed the same location from a different angle to capture a different perspective of the view. This variety helps the CNN make more reliable predictions when distinguishing among the locations stored in the network. We repeated this process for every chosen location, collected all the images in one place, and sorted them into one labeled folder per location. Finally, we ran TensorFlow’s retraining script on the collected images; it uses the Inception v3 model to train a convolutional neural network for our locations.
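One practical detail of a one-folder-per-location dataset is how images are divided into training, validation, and test sets. The sketch below is modeled loosely on the hashing approach in TensorFlow’s image-retraining example: each file name is hashed, so an image always lands in the same split across runs. The folder layout, the `.jpg` extension, and the 10%/10% percentages are assumptions. Because burst-mode frames are nearly identical, naming the frames of one burst with a shared prefix plus a `_nohash_` suffix (a hypothetical naming scheme for this app) keeps a burst in one split instead of leaking near-duplicates into validation.

```python
import hashlib
import re
from pathlib import Path
from typing import Dict, List

MAX_IMAGES_PER_CLASS = 2**27 - 1


def assign_split(file_name: str, validation_pct: float, testing_pct: float) -> str:
    """Deterministically assign an image to a split from a hash of its name.

    Anything after '_nohash_' is ignored, so frames named e.g.
    burst1_nohash_0.jpg and burst1_nohash_1.jpg always share a split.
    """
    hash_name = re.sub(r"_nohash_.*$", "", file_name)
    digest = hashlib.sha1(hash_name.encode("utf-8")).hexdigest()
    percentage = (int(digest, 16) % (MAX_IMAGES_PER_CLASS + 1)) * (100.0 / MAX_IMAGES_PER_CLASS)
    if percentage < validation_pct:
        return "validation"
    if percentage < validation_pct + testing_pct:
        return "testing"
    return "training"


def build_image_lists(
    image_dir: Path, validation_pct: float = 10.0, testing_pct: float = 10.0
) -> Dict[str, Dict[str, List[str]]]:
    """Map each label (lower-cased folder name) to its per-split file names."""
    lists: Dict[str, Dict[str, List[str]]] = {}
    for folder in sorted(p for p in image_dir.iterdir() if p.is_dir()):
        splits: Dict[str, List[str]] = {"training": [], "testing": [], "validation": []}
        for image in sorted(folder.glob("*.jpg")):
            splits[assign_split(image.name, validation_pct, testing_pct)].append(image.name)
        lists[folder.name.lower()] = splits
    return lists
```

Hashing names rather than shuffling means adding new photos later does not move existing photos between splits.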
The arguments passed to the retraining script specify the locations where the output files are saved. The retrain.py script runs on the collected image dataset and writes the retrained CNN, along with the required files, to the given locations. Mobile devices have limited memory, and apps must be downloaded, so by default the Android and iOS versions of TensorFlow include only the operations that are common in inference and have no large external dependencies. One unsupported operation is DecodeJpeg, because its implementation relies on libjpeg, which is painful to support on Android and iOS and would increase the binary footprint. A new implementation could be written, but most mobile applications do not need to decode JPEGs because they work directly with camera image buffers.
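Working directly with camera buffers means the app must turn raw pixels into the float tensor the graph expects. The sketch below assumes packed ARGB integers (the format Android’s `Bitmap.getPixels` returns), an image already resized to the model’s input size (299×299 for Inception v3), and a mean of 128 and standard deviation of 128 for normalization; these constants are assumptions and should be checked against the values used when the graph was trained.

```python
from typing import List, Sequence

# Assumed normalization constants for a retrained Inception v3 graph.
INPUT_MEAN = 128.0
INPUT_STD = 128.0


def argb_to_input(pixels: Sequence[int], width: int, height: int) -> List[float]:
    """Convert packed ARGB pixels into a flat, normalized RGB float list.

    The output is laid out row by row as [R, G, B, R, G, B, ...], i.e. a
    height x width x 3 image. The alpha channel is dropped.
    """
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    values: List[float] = []
    for p in pixels:
        for shift in (16, 8, 0):  # red, green, blue bytes of the ARGB int
            channel = (p >> shift) & 0xFF
            values.append((channel - INPUT_MEAN) / INPUT_STD)
    return values


# A 2 x 1 image: one white pixel, one black pixel.
print(argb_to_input([0xFFFFFFFF, 0xFF000000], width=2, height=1))
```

With these constants, channel values 0 to 255 map to the range -1.0 to about 0.99.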
Unfortunately, the Inception model we retrained includes a DecodeJpeg operation. Inference can bypass it by feeding image data directly into the Mul node, but on platforms that do not support the operation, an error still occurs when the graph is loaded, even if the op is never called. To avoid this, the optimize_for_inference script removes all nodes that are not needed for a given set of input and output nodes. We build the script with Bazel, a build tool similar to CMake and Make. The first build of TensorFlow can take a long time; later builds, after pulling updates from master, should be faster because Bazel, like other build tools, does not rebuild targets whose dependencies have not changed.
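The pruning idea can be sketched as a backward walk from the output nodes that stops at the chosen input nodes. This is a simplified illustration, not the optimize_for_inference implementation, and the toy graph’s node names are illustrative only; the real Inception graph has more preprocessing nodes between DecodeJpeg and Mul.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def prune_graph(graph: Dict[str, List[str]], inputs: Iterable[str], outputs: Iterable[str]) -> Set[str]:
    """Return the names of the nodes needed to compute `outputs` from `inputs`.

    `graph` maps each node name to the names of the nodes it reads from.
    Input nodes are kept, but their own inputs are not followed, which is
    what cuts DecodeJpeg out of the graph when Mul is fed directly.
    """
    input_set = set(inputs)
    keep: Set[str] = set()
    queue = deque(outputs)
    while queue:
        name = queue.popleft()
        if name in keep:
            continue
        keep.add(name)
        if name not in input_set:
            queue.extend(graph.get(name, []))
    return keep


# Toy stand-in for the retrained graph (node names are illustrative).
toy_graph = {
    "DecodeJpeg/contents": [],
    "DecodeJpeg": ["DecodeJpeg/contents"],
    "Cast": ["DecodeJpeg"],
    "ResizeBilinear": ["Cast"],
    "Mul": ["ResizeBilinear"],
    "conv": ["Mul"],
    "pool_3": ["conv"],
    "final_result": ["pool_3"],
}
print(sorted(prune_graph(toy_graph, inputs=["Mul"], outputs=["final_result"])))
```

Everything upstream of Mul, including DecodeJpeg, is dropped, so the loader never encounters the unsupported op.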
The script also performs a few other optimizations that improve speed, such as merging explicit batch normalization ops into the convolutional weights to reduce the number of calculations. It is run on the retrained graph, with the names of the input and output nodes as arguments.
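Merging batch normalization into a convolution works because both are affine per output channel. With scale s = γ / sqrt(σ² + ε), the folded weights are w·s and the folded bias is (b − μ)·s + β. The sketch below checks this for one output channel, treating the convolution at one position as a dot product over its receptive field; all numbers, including ε = 1e-3, are arbitrary illustrative values.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchNorm:
    gamma: float
    beta: float
    mean: float
    var: float
    eps: float = 1e-3  # assumed epsilon


def fold_batch_norm(weights: List[float], bias: float, bn: BatchNorm) -> Tuple[List[float], float]:
    """Fold one output channel's batch normalization into its weights and bias."""
    scale = bn.gamma / math.sqrt(bn.var + bn.eps)
    return [w * scale for w in weights], (bias - bn.mean) * scale + bn.beta


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


weights, bias = [0.5, -1.25, 2.0], 0.1
bn = BatchNorm(gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
patch = [1.0, 2.0, 3.0]  # input values under the kernel at one position

# Unfolded: convolution, then batch normalization (two steps at inference).
y = dot(weights, patch) + bias
unfolded = bn.gamma * (y - bn.mean) / math.sqrt(bn.var + bn.eps) + bn.beta

# Folded: a single convolution with adjusted weights and bias.
folded_w, folded_b = fold_batch_norm(weights, bias, bn)
folded = dot(folded_w, patch) + folded_b
print(unfolded, folded)  # equal up to floating-point rounding
```

The folding is done once, offline, so inference pays for one convolution instead of a convolution plus a normalization.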
Running the script creates a new file at /location/optimized_graph.pb. At this point the retrained model is still 87MB, which guarantees a large download for any app that includes it. Apple distributes apps in .ipa packages, which compress all assets with zip, but models usually do not compress well because their weights are all slightly different floating-point values. Much better compression can be achieved simply by rounding all the weights within each constant to 256 levels while still leaving them in floating-point format. This gives the compression algorithm far more repetition to exploit, requires no new operators, and reduces precision only slightly (typically less than a 1% drop in precision). The quantize_graph script applies these changes.
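The effect of rounding on compression can be reproduced with the standard library. This sketch mirrors the idea rather than the quantize_graph implementation: it snaps synthetic Gaussian “weights” (an assumption standing in for one real weight tensor) to 256 evenly spaced float values between the tensor’s minimum and maximum, then compares zlib-compressed sizes, since zlib uses the same DEFLATE algorithm as zip.

```python
import random
import struct
import zlib
from typing import List


def round_to_levels(weights: List[float], levels: int = 256) -> List[float]:
    """Snap every weight to one of `levels` evenly spaced values between min and max.

    The results are still floating-point numbers; only their variety shrinks.
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (levels - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def compressed_size(values: List[float]) -> int:
    """Bytes after packing as float32 and compressing with zlib (DEFLATE)."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return len(zlib.compress(raw))


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.05) for _ in range(50_000)]  # synthetic weights
rounded = round_to_levels(weights)

print("distinct values:", len(set(weights)), "->", len(set(rounded)))
print("max abs error:", max(abs(a - b) for a, b in zip(weights, rounded)))
print("zlib bytes:", compressed_size(weights), "->", compressed_size(rounded))
```

The raw size is unchanged (every value is still a 4-byte float), which matches the observation that the file on disk stays the same size until it is compressed.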
On disk, the raw size of the rounded_graph.pb file is still 87MB, but compressing it (for example, by right-clicking it in the Finder and choosing “Compress”) yields a file of only about 24MB. That reflects the size increase you would actually see in a compressed .ipa on iOS or an .apk on Android. The final processing step is memory mapping. Because the buffers holding the model’s weights total 87MB, loading them into the app can put heavy pressure on RAM in iOS even before the model runs. This can cause stability problems, because the OS may unpredictably kill apps that use too much memory. Fortunately, these buffers are read-only, so they can be mapped into memory in a way that lets the OS discard them behind the scenes under memory pressure, avoiding those crashes.
To support this, the model must be rearranged so that the weights are held in sections that can be loaded separately from the main GraphDef, although they all remain in one file. This rearrangement is done with a separate conversion command.
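The benefit of read-only memory mapping can be illustrated with Python’s `mmap` module. This is a generic sketch of the operating-system mechanism, not TensorFlow’s memory-mapped model format: a small file of float32 values stands in for the converted model’s weight sections, and values are read straight from the mapped pages instead of being copied into a heap buffer first.

```python
import mmap
import os
import struct
import tempfile
from typing import List, Sequence


def write_weights(path: str, values: Sequence[float]) -> None:
    """Write float32 weights to a file (a stand-in for the converted model file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_mapped_weights(path: str, indices: Sequence[int]) -> List[float]:
    """Read selected float32 weights through a read-only memory map."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # The pages are file-backed and read-only, so the OS can drop them
        # under memory pressure and page them back in from disk when touched.
        return [struct.unpack_from("<f", mapped, 4 * i)[0] for i in indices]


fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
try:
    write_weights(path, [0.25, -1.5, 3.0, 0.125])
    print(read_mapped_weights(path, [1, 3]))
finally:
    os.remove(path)
```

Because nothing in the mapped region is ever written, the OS never has to keep a private copy of it in RAM.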
We then wrote the application in Java in Android Studio and bundled the final “graph.pb” file with it. The labels are added to the app’s strings.xml file at [android app location]/android/res/values/strings.xml.
The recognition code runs in the background and checks whether the image’s confidence score exceeds 60%. If it does, the app calls Google’s text-to-speech API to speak the text stored for that location in the string values as an audio note.
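The decision step described above can be sketched independently of Android. The function below picks the top-scoring location and returns the phrase to speak only when its confidence is strictly above 60%; the labels and phrases are hypothetical, and the `phrases` dictionary stands in for the entries in strings.xml. Passing the returned text to a text-to-speech engine is left out.

```python
from typing import Dict, List, Optional

CONFIDENCE_THRESHOLD = 0.60  # the 60% cut-off described in the text


def announcement(scores: List[float], labels: List[str], phrases: Dict[str, str]) -> Optional[str]:
    """Pick the text to speak for one classifier result, or None if unsure.

    `scores` are the classifier's output probabilities, aligned with `labels`
    (one label per trained location). A label without a phrase is spoken as-is.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] <= CONFIDENCE_THRESHOLD:
        return None
    return phrases.get(labels[best], labels[best])


labels = ["bus stop", "library entrance", "crosswalk"]  # hypothetical locations
phrases = {"library entrance": "You are at the library entrance."}
print(announcement([0.10, 0.85, 0.05], labels, phrases))
```

Returning None for uncertain frames keeps the app silent rather than announcing a wrong location.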
Rather than building a CNN and using Bazel to reduce its size and strip features that are not required, TensorFlow Lite offers a simpler and faster path. TensorFlow Lite is a lightweight successor to TensorFlow Mobile: it can do almost everything TensorFlow Mobile does, but much faster. It focuses on mobile systems such as Android and iOS and enables on-device machine learning inference with low latency and a small binary size. It also supports hardware acceleration through the Android Neural Networks API. It provides a set of core operators, both quantized and float, that have been tuned specifically for mobile platforms; they incorporate pre-fused activations and biases to further improve performance and quantized accuracy. Custom operations are also supported. TensorFlow Lite defines a new model file format based on FlatBuffers, an open-source, efficient, cross-platform serialization library. FlatBuffers is similar to protocol buffers, but the primary difference is that FlatBuffers does not need a parsing or unpacking step to a secondary representation (often coupled with per-object memory allocation) before data can be accessed. In addition, the code footprint of FlatBuffers is an order of magnitude smaller than that of protocol buffers.
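The “no parsing step” property can be illustrated with a toy binary layout. This is not the FlatBuffers format; it is a minimal sketch of the same zero-copy idea: a small header stores the offset and length of each tensor, and a reader returns a view into the original buffer for the one tensor it needs, without decoding anything else or copying the data.

```python
import struct
from array import array
from typing import List


def build_buffer(tensors: List[List[float]]) -> bytes:
    """Serialize float tensors as: count, (offset, length) pairs, then raw data.

    A toy layout for illustration only. Tensor data uses native byte order.
    """
    header = struct.pack("<I", len(tensors))
    body = b""
    offset = 4 + 8 * len(tensors)  # data starts right after the header
    for values in tensors:
        raw = array("f", values).tobytes()
        header += struct.pack("<II", offset, len(values))
        body += raw
        offset += len(raw)
    return header + body


def tensor_view(buffer: bytes, index: int) -> memoryview:
    """Return a float view of tensor `index` that points into `buffer` (no copy)."""
    (count,) = struct.unpack_from("<I", buffer, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offset, length = struct.unpack_from("<II", buffer, 4 + 8 * index)
    return memoryview(buffer)[offset : offset + 4 * length].cast("f")


buf = build_buffer([[1.0, 2.0], [0.5, -0.25, 8.0]])
print(tensor_view(buf, 1).tolist())
```

Reading tensor 1 touches only the header and that tensor’s bytes, which is why such formats load quickly on memory-constrained devices.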
On top of this, TensorFlow Lite has a new mobile-optimized interpreter whose key goals are keeping apps lean and fast. The interpreter uses a static graph ordering and a custom (less dynamic) memory allocator to minimize load, initialization, and execution latency. TensorFlow Lite also provides an interface for hardware acceleration, when available on the device, through the Android Neural Networks library released as part of Android 8.1 (O-MR1). At the time of writing, TensorFlow Lite is still in a preview state, so not all TensorFlow features are supported, but it is expected to become the reference for mobile and embedded devices and could be used to design and develop apps similar to this one.
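A static graph ordering is what makes simple up-front memory planning possible: once the execution order is fixed, every tensor’s lifetime is known, and tensors that are never alive at the same time can share memory. The sketch below illustrates that general idea with a greedy, largest-first placement into one arena; it is not TensorFlow Lite’s actual allocator, and the four-op chain and byte sizes are made-up values.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Tuple

Op = Tuple[List[str], str]  # (input tensor names, output tensor name)


def plan_memory(ops: Dict[str, Op], sizes: Dict[str, int]) -> Tuple[List[str], Dict[str, int], int]:
    """Fix an execution order once, then assign arena offsets to tensors.

    Tensors with overlapping lifetimes never share bytes; tensors that are
    never live at the same time may reuse the same region of the arena.
    """
    producer = {out: name for name, (_, out) in ops.items()}
    deps = {name: {producer[t] for t in ins if t in producer} for name, (ins, _) in ops.items()}
    order = list(TopologicalSorter(deps).static_order())

    # Lifetime of each tensor as [first step, last step] in the static order.
    start: Dict[str, int] = {}
    end: Dict[str, int] = {}
    for step, name in enumerate(order):
        inputs, output = ops[name]
        for t in inputs:
            start.setdefault(t, 0)  # graph inputs are live from the first step
            end[t] = step           # steps increase, so this ends at the last consumer
        start[output] = step
        end.setdefault(output, len(order) - 1)  # unconsumed outputs stay live to the end

    # Greedy placement: largest tensors first, at the lowest non-conflicting offset.
    offsets: Dict[str, int] = {}
    for t in sorted(start, key=lambda name: (-sizes[name], name)):
        live = [u for u in offsets if start[u] <= end[t] and start[t] <= end[u]]
        for off in sorted({0} | {offsets[u] + sizes[u] for u in live}):
            if all(off + sizes[t] <= offsets[u] or offsets[u] + sizes[u] <= off for u in live):
                offsets[t] = off
                break
    arena_size = max((offsets[t] + sizes[t] for t in offsets), default=0)
    return order, offsets, arena_size


ops = {
    "conv1": (["image"], "a"),
    "conv2": (["a"], "b"),
    "conv3": (["b"], "c"),
    "softmax": (["c"], "probs"),
}
sizes = {"image": 100, "a": 80, "b": 80, "c": 40, "probs": 4}
order, offsets, arena = plan_memory(ops, sizes)
print(order, offsets, arena, "bytes instead of", sum(sizes.values()))
```

All of this work happens once, before the first inference, so running the model later needs no dynamic allocation.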
In conclusion, MobileAI is an application that visually impaired users, and even sighted users, can use to help them reach their destinations. It is a mobile application that uses TensorFlow for object detection and navigates according to the location it recognizes from videos or images, without using online services. The app can act as the user’s eyes in times of need.
MobileAI could become a beneficial technology for helping blind or visually impaired people navigate in real time without an internet connection. One important direction for future work is to train the model on Google Street View images, which provide access to street imagery worldwide. Another is safety: if the video captures violence or danger, the app could choose the safest route accordingly. This GitHub repository from a previous hackathon project offers a potential starting point for that feature: https://github.com/tedrand/wiki-violence-reporting-api.
Varma, R., et al. “Visual Impairment and Blindness in Adults in the United States: Demographic and Geographic Variations From 2015 to 2050.” JAMA Ophthalmology, 2016, doi:10.1001/jamaophthalmol.2016.1284.
Szegedy, C., et al. “Rethinking the Inception Architecture for Computer Vision.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, doi:10.1109/CVPR.2016.308.