Sparky Vision: HandsFree Empowering the Blind with AI Insight

Divya Chandana
Apr 2, 2024


Project published on Hackster.io

Part 2 — Technical deep dive: https://medium.com/thedeephub/sparky-vision-handsfree-empowering-the-blind-with-ai-insight-08fd659db168

Image by Author

Introduction

Welcome to my new project, Sparky Vision, an assistive AI technology that opens up a new world of possibilities for individuals with visual impairments. Sparky Vision transforms visual information into audible content, making books, graphs, images, and research papers more accessible than ever before.

Accessibility Features

Sparky Vision is designed to be user-friendly for all, especially for those who have visual impairments. It starts by welcoming users with a friendly audio greeting as they approach, thanks to its motion sensor. As it works, Sparky Vision keeps users informed with audio updates — like when a book is detected or content is being processed — so they always know what’s happening. We’ve also thought about how someone would place a book; that’s why there’s a special spot marked with raised ‘Lego’ bumps. By feeling for this shape, users can position their materials perfectly every time, making the experience smooth and predictable.

Project Demo

Video Demo by Author

Technology Overview

Nvidia Jetson Nano: Powers the AI, providing edge computing capabilities for real-time processing.

Docker Container: Utilizes an Nvidia PyTorch base image to create a reliable and reproducible environment for running Sparky Vision.

Motion Detection: Initiates interaction, waking Sparky Vision when a user is near using sensitive motion sensors.
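As a rough illustration of this wake-up step, here is a minimal Python sketch that polls a PIR motion sensor through the Jetson.GPIO library; the pin number, polling interval, and wiring are my assumptions, since the post does not specify them.

```python
# Minimal wake-on-motion sketch. The PIR pin and polling interval are
# illustrative assumptions, not the project's actual configuration.
import time

import Jetson.GPIO as GPIO

PIR_PIN = 7  # hypothetical board pin carrying the PIR sensor's output

GPIO.setmode(GPIO.BOARD)
GPIO.setup(PIR_PIN, GPIO.IN)

def wait_for_user(poll_interval=0.2):
    """Block until the motion sensor reports activity."""
    while not GPIO.input(PIR_PIN):
        time.sleep(poll_interval)
```

Once this returns, Sparky Vision can play its audio greeting and move on to object detection.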

Object Detection: Implemented using the ssdlite320_mobilenet_v3_large model, fine-tuned to detect books with high accuracy.
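The project uses a fine-tuned version of this model; purely as an illustration, the sketch below runs the off-the-shelf COCO-pretrained ssdlite320_mobilenet_v3_large from torchvision and checks whether a "book" detection crosses a confidence threshold. The COCO label id 84 for "book" and the 0.5 threshold are assumptions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Off-the-shelf COCO-pretrained detector (the post's model was additionally
# fine-tuned for books; this pretrained version is an illustration only).
model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(weights="DEFAULT")  # pretrained=True on older torchvision
model.eval()

BOOK_LABEL = 84  # "book" in the COCO category ids used by the pretrained weights (assumed)

def contains_book(frame_rgb, score_threshold=0.5):
    """Return True when the model reports a confident 'book' detection in an RGB frame."""
    with torch.no_grad():
        prediction = model([to_tensor(frame_rgb)])[0]
    for label, score in zip(prediction["labels"], prediction["scores"]):
        if label.item() == BOOK_LABEL and score.item() >= score_threshold:
            return True
    return False
```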

Image Capture & Preprocessing: Handles automatic acquisition of visual content and prepares it for analysis.

OCR (Optical Character Recognition): Tesseract-OCR translates images of text to machine-encoded text for processing.
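With the page image cleaned up, the Tesseract call itself is a one-liner through pytesseract; the --psm 6 page-segmentation setting below is an assumption.

```python
import pytesseract

def extract_text(page_image):
    """Run Tesseract on the preprocessed page and return plain text."""
    # --psm 6 treats the page as a single uniform block of text (assumed setting)
    return pytesseract.image_to_string(page_image, config="--psm 6").strip()
```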

Summarization: Employs Google’s GEMINI model to generate concise summaries from the detected text when online.
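A minimal sketch of this online step with the google-generativeai client could look like the following; the model name, prompt wording, and API-key handling are assumptions.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; load from a secure source in practice
model = genai.GenerativeModel("gemini-pro")  # assumed model name

def summarize(text):
    """Ask Gemini for a short, listener-friendly summary of the OCR'd page."""
    prompt = f"Summarize the following page for a listener in a few short sentences:\n\n{text}"
    response = model.generate_content(prompt)
    return response.text
```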

Text-to-Speech: Google Cloud Text-to-Speech provides natural-sounding audio online, while pyttsx3 serves as the offline voice response system.
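The offline path is the simpler of the two; a minimal pyttsx3 sketch is below (the speech rate is an assumption). The online path swaps this out for Google Cloud Text-to-Speech.

```python
import pyttsx3

def speak_offline(text):
    """Voice a status update or page text without an internet connection."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 160)  # assumed rate, slightly slower than default for clarity
    engine.say(text)
    engine.runAndWait()
```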

Python Libraries: Incorporates libraries such as pytesseract, numpy, pandas, and more for various functionalities from image processing to data manipulation.

Internet Connectivity: Checks for an internet connection to decide between online summarization or offline text conversion.
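A simple way to implement this check is to attempt a short TCP connection to a well-known host; the probe target and timeout below are assumptions.

```python
import socket

def is_online(host="8.8.8.8", port=53, timeout=3):
    """Return True when a short TCP connection to a public DNS server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```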

Auditory Feedback: Gives users step-by-step voice updates during the process, enhancing the understanding of Sparky Vision’s status.

User Interaction Design: Includes physical ‘Lego-like’ reference points to guide users in correctly placing books for scanning.

Image by Author

My Problems and Resolutions

  • For better text recognition, upgraded from a low-quality Logitech C270 webcam to a higher-resolution camera.
  • Encountered detection issues with the Raspberry Pi camera; it wasn’t recognized by the device.
  • Faced version compatibility problems with PyTorch; resolved by using a specific Nvidia Docker image for PyTorch.
  • Encountered multiple installation dependencies with Python libraries; streamlined the process with a Dockerfile.
  • Experienced issues with CUDA device recognition on an old SD card; re-flashing a new SD card solved the problem.
  • Had trouble with client library inconsistencies; switched to using API services as a workaround.

Setting Up the Jetson Nano

I have added more detailed setup steps in the project documentation: https://www.hackster.io/divyachandana/sparkyvision-handsfree-empowering-the-blind-with-ai-insight-3dd450

Sparky Flow Diagram

Image by Author

Device Initialization: The process begins with the Nvidia Jetson Nano, which runs a Docker container that provides a consistent and isolated environment for the AI models.

Detection Phase:

  • Motion Detection: The system first detects if there is motion in front of the camera, indicating a user is present.
  • Object Detection: If motion is detected, the system then identifies whether the object in front of the camera is a book (or possibly other materials like graphs, images, or research papers).

Content Processing:

  • Capture Content: If a book is confirmed, the system captures the content of the page.
  • Preprocess Image: The captured image goes through preprocessing to enhance it for better text recognition.

Connectivity Check:

  • Internet Connectivity: The system checks if there is an internet connection available.

OCR and Output:

  • OCR to Summary (online): if an internet connection is available, the image is converted to text and a summary is generated.
  • OCR to Text (offline): if there is no connection, the system still converts the image to text but skips the summarization step.
  • Text to Speech: finally, the text (or the summary, when one was generated) is converted into speech output for the user to hear. A minimal sketch of this branch follows after the list.
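Putting the pieces together, the branch above can be expressed in a few lines. This sketch reuses the hypothetical helper functions from the earlier snippets and, for brevity, voices both branches with the offline engine, whereas the real system uses Google Cloud Text-to-Speech when online.

```python
# End-to-end sketch of one pass through the flow diagram. It assumes the
# hypothetical helpers defined in the earlier snippets are in scope.
import cv2

def run_once():
    wait_for_user()                                   # motion detection
    frame = capture_page()                            # image capture
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    if not contains_book(rgb):                        # object detection
        speak_offline("I could not find a book. Please place it on the marked spot.")
        return
    text = extract_text(preprocess_for_ocr(frame))    # OCR
    if is_online():
        speak_offline(summarize(text))                # online: read the summary aloud
    else:
        speak_offline(text)                           # offline: read the raw text aloud
```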

To be continued…

I will cover the detailed project setup and run steps in the following post.

Part 2 — Technical deep dive: https://medium.com/thedeephub/sparky-vision-handsfree-empowering-the-blind-with-ai-insight-08fd659db168
