Machine Learning-based Navigation System for the Visually Impaired
The field of machine learning has advanced enormously over the last decades. Thanks to huge leaps in computing performance, even mobile devices can now perform object detection on a live camera stream at 30 fps. One of the groups that might benefit most from this is blind and visually impaired people: visual intelligence can compensate for their missing eyesight and recognize their surroundings for them.
In this article you’ll learn how we (as a railway company) use machine learning and other cutting-edge technologies to help visually impaired people on their journey.
In December 2020 we released our app “SBB Inclusive” (iOS, Android), aimed specifically at people with poor eyesight. The app detects which train you are sitting in or which train station you are at and provides you with all the information relevant to your situation (the app is, of course, heavily optimized for accessibility).
One of the pain points for visually impaired people when using public transport is finding the door (or, if the door is closed, the door button) to enter a train or bus. Tactile guiding lines on the platform let them feel where the platform ends, but there is no way to know where the train doors are. Often they are forced to “scan” the train with their hands in search of a door button. And although we’d say that our trains are quite clean, well, they still operate outdoors. Far more important, though, is the safety issue: imagine what could happen if the train suddenly started moving while a blind person is scanning it with their hands.
Since our team is quite experienced with image classification and object detection, we decided to take advantage of our knowledge to help the visually impaired. Our initial idea was:
- Collect many pictures of trains at train stations and label doors, door buttons etc. Then train an object detection model that can recognize these objects.
- Feed the model in real time from the user’s mobile phone camera.
- Use VoiceOver and vibrations to guide the user to the door or door button (similar to how an avalanche beacon works).
Since most visually impaired people are iOS users, we decided to start with a proof of concept for iOS and to catch up on Android later if the idea proved successful.
Collecting pictures and creating a model
Although the Covid situation had the entire world holding its breath, we were actually able to use it to our advantage. Many employees weren’t able to follow their usual work routine, so we reached out and asked them to support us by taking pictures at train stations and labelling them at home. We used Microsoft’s CustomVision platform for the labelling process (thanks to Microsoft for letting us use their platform free of charge for this use case). We also used CustomVision to train a model which we could then use in our app. The CustomVision platform turned out to be a good choice because it was very easy to use for all our labellers.
In-app object detection in real time
While the CustomVision platform was very convenient for creating a model, the generated CoreML model was not really what we expected. CustomVision produces a CoreML v1 model (v3 is the current standard) and forces you to use Microsoft’s CustomVisionMobile library. We also noticed that Microsoft developers are only human: if you configure something wrong, the library crashes without giving you the smallest hint about what went wrong (fatalError). A few nerve-wracking hours later we managed to get it running and were able to test our model live in our app for the first time.
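With a model converted to a recent CoreML version, detection can instead be driven directly through Apple’s Vision framework, without a wrapper library. The sketch below illustrates that route; the `DoorDetector` class name and the model handling are our illustration, not the exact production code:

```swift
import Vision
import CoreML

// Minimal sketch of running an object detection model via Vision.
// Assumes a modern CoreML object detector whose results arrive
// as VNRecognizedObjectObservation (label + bounding box).
final class DoorDetector {
    private let request: VNCoreMLRequest

    init(model: MLModel) throws {
        let vnModel = try VNCoreMLModel(for: model)
        request = VNCoreMLRequest(model: vnModel)
        request.imageCropAndScaleOption = .scaleFill
    }

    /// Runs one detection pass on a camera frame and returns labelled bounding boxes.
    func detect(in pixelBuffer: CVPixelBuffer) throws -> [VNRecognizedObjectObservation] {
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        try handler.perform([request])
        return (request.results as? [VNRecognizedObjectObservation]) ?? []
    }
}
```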
Although we were able to process 30 images per second on a modern iPhone, we noticed that doing so drained the battery considerably. So we decided to reduce the object detection rate to around one cycle per second. The downside, as you might imagine, was that the bounding box no longer updated smoothly while moving the camera around.
Luckily we found a solution for this, and it is called object tracking. Object tracking follows the bounding boxes of already detected objects from frame to frame. Under the hood it also relies on machine learning, but it uses significantly less power than object detection.
We also had to experiment to find the best settings here (specifically the object detection rate), so we allowed our test users to change it on the fly. Currently, the sweet spot seems to be a rate of one detection per second.
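Combining the two ideas, the frame loop runs the expensive detector only once per interval and lets Vision’s object tracking keep the box fresh in between. The following is an illustrative sketch, assuming Vision’s `VNTrackObjectRequest`; the class name and throttling logic are ours, not the app’s actual code:

```swift
import Vision
import QuartzCore

/// Runs full object detection only once per `detectionInterval` seconds and
/// keeps the bounding box up to date in between via Vision object tracking.
final class ThrottledTracker {
    private let sequenceHandler = VNSequenceRequestHandler()
    private var lastDetection: CFTimeInterval = 0
    private var trackedObservation: VNDetectedObjectObservation?
    var detectionInterval: CFTimeInterval = 1.0  // the current sweet spot

    func process(_ pixelBuffer: CVPixelBuffer,
                 detect: (CVPixelBuffer) -> VNDetectedObjectObservation?) {
        let now = CACurrentMediaTime()
        if trackedObservation == nil || now - lastDetection >= detectionInterval {
            // Expensive path: run the full CoreML detector.
            trackedObservation = detect(pixelBuffer)
            lastDetection = now
        } else if let observation = trackedObservation {
            // Cheap path: let Vision track the last known bounding box.
            let tracking = VNTrackObjectRequest(detectedObjectObservation: observation)
            tracking.trackingLevel = .fast
            try? sequenceHandler.perform([tracking], on: pixelBuffer)
            trackedObservation = tracking.results?.first as? VNDetectedObjectObservation
        }
    }
}
```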
Bonus: Measuring the distance
So far we had achieved our first goal: knowing where the coach door button is. However, we still did not know how far away it is. To measure the distance, we had two options:
- Using the LiDAR sensor
✅ Very precise
❌ Only the newest and most expensive iPhone 12 Pro is equipped with it, and it only measures the distance to a single point directly in front of the sensor.
- Using DepthData
✅ Works on all newer iPhones (with at least two cameras). You get a “topographical” map of the entire camera content.
❌ Less precise (especially for points that are very far away).
Since in our case it’s not crucial to know whether a door is 2.53 or 2.54 m away, we decided to experiment with DepthData. If you are asking yourself how you can calculate distances using two different cameras, the secret lies in the different lenses (angles of view) the two cameras use. If you know where an object lies in the two images, you can calculate its distance. We won’t go into too much detail here, but if you’re interested, this video will give you all the answers.
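The underlying geometry boils down to one formula: for two parallel cameras separated by a baseline b, with focal length f (in pixels), an object shifted by a disparity of d pixels between the two images lies at distance z = f·b/d. A small sketch with made-up numbers (the camera parameters below are illustrative, not the actual iPhone geometry):

```swift
/// Distance from stereo disparity: z = focalLength * baseline / disparity.
func stereoDistance(focalLengthPixels: Double,
                    baselineMeters: Double,
                    disparityPixels: Double) -> Double? {
    guard disparityPixels > 0 else { return nil }  // zero disparity = infinitely far
    return focalLengthPixels * baselineMeters / disparityPixels
}

// With f = 1000 px and b = 0.01 m, a disparity of 4 px means 2.5 m,
// while 2 px already means 5 m. Precision degrades quickly as the
// disparity shrinks, which is why DepthData is less precise far away.
```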
Now, DepthData returns a two-dimensional matrix of measured distances for every frame. Since working with a textual representation of those matrices is not very convenient for testing, we decided to visualize the measured distances directly in our camera stream: pixels that are further away are darker, while pixels closer to the camera get a lighter gray tone.
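Mapping a depth value to a gray tone is a simple normalization. A minimal sketch of the per-pixel mapping (the 0–5 m clamping range is an assumption for illustration, not the app’s exact setting):

```swift
/// Maps a measured distance (in meters) to a gray value in 0...1,
/// where near pixels are light (1.0) and far pixels are dark (0.0).
func grayTone(forDistance meters: Float, maxDistance: Float = 5.0) -> Float {
    let clamped = min(max(meters, 0), maxDistance)
    return 1.0 - clamped / maxDistance
}
```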
If you look closely at the above image, you’ll notice that the measured distances are not ultra-precise. For example, the reference point in the center is not 2 meters further away than the reference point in the top right corner. Depth data also seems to struggle with certain patterns, such as the tactile guiding lines. Yet if you don’t need the best precision, it works quite well. And the good news is: the nearer an object is, the better DepthData works. Honestly, getting to explore DepthData was one of our personal highlights during the development process.
By now we had all the relevant information at hand: we knew where the door (button) is and we could calculate its distance. One relevant piece was still missing, though: we needed to navigate the visually impaired user to the door.
Navigating blind people
Until now, visually impaired people could use the SBB Inclusive app with VoiceOver and/or large content sizes. However, we felt that this would not be enough for navigation. Luckily, existing products from other areas provided plenty of inspiration. One example of pedestrian navigation are avalanche beacons, which guide the user by beeping louder and faster as you get closer. In the end, we decided to try a solution using both VoiceOver and vibration intensity/speed (known as CoreHaptics in the iOS world):
- Detected objects are prioritized: Open door > door button > closed door.
- The highest-priority object is announced by VoiceOver upon initial detection.
- The distance to the object is announced by VoiceOver in three categories (more than 2 m, more than 0.5 m, less than 0.5 m).
- The direction of the object (left/right and in theory also top/bottom) is “felt” by vibration patterns.
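The prioritization and distance categories above can be sketched in a few lines. The type and function names below are our illustration (the real app additionally maps the object’s direction to CoreHaptics vibration patterns):

```swift
/// The detected object classes, ordered by navigation priority:
/// open door > door button > closed door.
enum DoorObject: Int, Comparable {
    case closedDoor = 0, doorButton = 1, openDoor = 2
    static func < (lhs: DoorObject, rhs: DoorObject) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

/// Picks the highest-priority object among the current detections.
func target(from detections: [DoorObject]) -> DoorObject? {
    detections.max()
}

/// The three distance categories that VoiceOver announces.
func distanceCategory(forMeters d: Double) -> String {
    if d > 2.0 { return "more than 2 meters" }
    if d > 0.5 { return "more than 50 centimeters" }
    return "less than 50 centimeters"
}
```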
Will the presented solution end the pain of visually impaired people when trying to board a train or bus? We’re not yet able to give a definite answer. Currently, our test users are trying out this new feature and starting to give feedback. We’re really eager to hear what they have to say.
On the one hand, we are very confident that it is possible to navigate users to the door, because the technical part of detecting objects and measuring distances seems to work well. On the other hand, we are less certain whether our navigation pattern will go into production untouched.
Apart from the navigation pattern, there are still a few things to improve before this new feature can be released:
- Door button recognition does not yet work from far away (we hope that training our own YOLO-based model will improve this; first results obtained by students of EPFL in Lausanne during our challenge at LauzHack seem to support this thesis).
- Door recognition can be improved for certain train types (this is really simple: we need more images of those train types).