Sparky Vision: HandsFree Empowering the Blind with AI Insight

Divya Chandana
Published in The Deep Hub
9 min read · Apr 2, 2024


Technical walkthrough

Project also published on Hackster.io

Part 1 (Introduction): https://medium.com/@divya.chandana/sparky-vision-handsfree-empowering-the-blind-with-ai-insight-956db25c9836

This blog covers the more technical steps: setting up the container and running the Vision project.

Hardware Components needed

  1. NVIDIA Jetson Nano Developer Kit
  2. PIR Motion Sensor (generic)
  3. Wi-Fi module AC8265
  4. Logitech webcam
  5. 8MP High Resolution Web Camera
  6. Mini Speaker
  7. Jumper wires (generic)
  8. Legos
  9. Flexible camera mount
  10. Power bank
  11. Flash Memory Card, MicroSDHC Card

Set up the Jetson Nano:

I have added more details about the steps to set up the NVIDIA Jetson Nano in the project documentation: https://www.hackster.io/divyachandana/sparkyvision-handsfree-empowering-the-blind-with-ai-insight-3dd450

Software components / APIs needed:

For online mode, I used Google's Gemini API, for which you will need to create your own API key.

For offline mode, the pytesseract library is used for OCR.

For online mode, I used Google Text-to-Speech for more human-like speech.

For offline mode, the pyttsx3 library is used.

Don't worry, I have added all of these in the Dockerfile. You can run this project in either online or offline mode, depending on your API availability.
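For reference, here is a rough sketch of the kind of dependencies the Dockerfile pulls in; the exact packages and versions may differ, so treat the Dockerfile in the repo as the source of truth:

# Sketch only: system packages for Tesseract OCR and offline speech,
# plus the Python libraries used in the steps below
apt-get install -y tesseract-ocr espeak
pip3 install pytesseract pyttsx3 google-cloud-texttospeech requests opencv-python numpy pillow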

Let's get started

Project setup

Getting Started

To get started with Sparky, follow these steps:

1. Clone the Repository

Open your terminal and clone the Sparky repository:

git clone https://github.com/divyachandana/Sparky.git

2. Navigate to the Sparky Directory

Change the directory to the Sparky repository:

cd Sparky

Run with Docker

To simplify setup and usage, Sparky can be run within a Docker container.

3.1. Build the Docker Container

Build the Sparky Docker container using the provided Dockerfile:

sudo docker build -t sparky .

3.2. Run the Docker Container

Run the Sparky Docker container with the following command:

This Docker command runs a container from the "sparky" image with:

Nvidia GPU support (`--runtime nvidia`),

access to the camera (`--device /dev/video0:/dev/video0`), and

access to the audio devices (`--device /dev/snd:/dev/snd`).

It maps port 8888 on the container to port 8888 on the host (`-p 8888:8888`) and mounts the local directory `/home/dc/Documents` to `/workspace` in the container (`-v /home/dc/Documents:/workspace`).

Additionally, it launches Jupyter Lab without authentication (`--allow-root --NotebookApp.token='' --NotebookApp.password=''`).

sudo docker run --runtime nvidia -it --rm \
--privileged \
-p 8888:8888 \
--device /dev/video0:/dev/video0 \
--device /dev/snd:/dev/snd \
-v /home/dc/Documents:/workspace \
sparky \
jupyter lab --ip=0.0.0.0 --allow-root --NotebookApp.token='' --NotebookApp.password=''

This command sets up the container environment with necessary permissions and volume mappings and launches Jupyter Lab with Sparky’s functionality.

4. Access Jupyter Lab

Once the container is running, access Jupyter Lab in your web browser by navigating to http://localhost:8888.

5. Start Using Sparky

You’re now ready to start using Sparky

6. For headless mode:

  • Find IP Address:

To find the IP address of your Jetson Nano, open the terminal and run the command below (`hostname -i` may return only a loopback address on some setups, so `-I` is more reliable):

hostname -I

  • Power up Device:

Connect your Jetson Nano to a power bank to turn it on.

  • Access via SSH:

On another device, open the terminal.

Type in the username of your Jetson Nano followed by its local IP address:

ssh [username]@[host_ip_address]

When prompted, enter the password. This will let you remotely access your Jetson Nano.
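Jupyter Lab on the Nano is bound to port 8888, so in headless mode you can also forward that port over the same SSH connection and open it from the other device's browser. A minimal example, assuming the default port mapping used above:

ssh -L 8888:localhost:8888 [username]@[host_ip_address]

With the tunnel open, http://localhost:8888 on the remote device reaches the Jupyter Lab instance running on the Nano.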

Steps Involved
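The step snippets below are shown without their imports or module-level configuration. Here is a consolidated sketch of what the notebook is assumed to set up first; the libraries follow from the code in each step, but the file names for the service-account key and the output MP3 are placeholders, so adjust them to your setup:

# Assumed imports and module-level configuration for the step snippets below
import base64
import json
import time

import cv2
import numpy as np
import pytesseract
import pyttsx3
import requests
import torch
import torchvision.transforms as T
import Jetson.GPIO as GPIO
from google.cloud import texttospeech
from IPython.display import Audio, display
from PIL import Image

# Placeholder paths; replace with the ones used in your notebook
service_account_file = 'service_account.json'  # Google Cloud service-account key for Text-to-Speech
summary_file_path = 'summary.mp3'              # Where Step 9 writes the synthesized audio
corrected_image_path = 'corrected_image.jpg'   # OCR-ready image written in Step 5 and reused in the final step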

Step 1:

The motion_sensor_detection function monitors a PIR motion sensor connected to pin 23 of the Jetson Nano (physical/BOARD numbering). When motion is detected, it plays a greeting audio file to welcome the user and then proceeds with the rest of the process.

# Step 1: Motion Sensor Detection and Greeting
def motion_sensor_detection():
    SENSOR_PIN = 23  # Adjust pin according to your setup
    GPIO.setmode(GPIO.BOARD)  # Use physical pin numbering
    GPIO.setup(SENSOR_PIN, GPIO.IN)
    try:
        while True:
            if GPIO.input(SENSOR_PIN):
                # Print greeting message when motion is detected
                # print("Helloo, I'm Sparky. Your AI assistant. I'm here to assist you with your books, papers, research papers, images, graphs, pictures.")
                play_audio('greetings.mp3')
                time.sleep(11)
                # Return True to indicate motion detection
                return True
            else:
                # Print message when no motion is detected
                print('No motion detected')

            # Sleep to avoid continuous looping
            time.sleep(2)  # Adjust the sleep time as needed
    finally:
        # Clean up GPIO pins after use
        GPIO.cleanup()

Step 2:

The prompt_for_book function repeatedly checks whether a book has been placed in front of the camera. Once one is detected, an audio file is played to confirm the book's presence and a fresh image is captured for processing.

# Step 2: Prompt for Book Detection
def prompt_for_book():
    detected = False

    # Keep prompting until a book is detected
    while not detected:
        # Capture image and check for book object
        image_path = capture_image('detect_object.jpg')
        detected = check_for_book_object(image_path)

        # If book is detected, return image
        if detected:
            play_audio('book.mp3')
            time.sleep(5)
            return capture_image()
        else:
            # Prompt user to place a book in front of the camera
            print("Please place a book or paper in front of the camera.")

            # Sleep to allow time for placing the book
            time.sleep(5)  # Adjust the sleep time as necessary

Step 3:

This step loads a pre-trained SSD (Single Shot MultiBox Detector) MobileNet model, which is efficient for edge devices like the Jetson Nano, and uses it to detect objects in images (here we need to detect a book). The check_for_book_object function processes an input image: it resizes, normalizes, and converts it into a tensor suitable for the model.

The function then checks the model's predictions to see whether a 'book' is recognized, indicated by its class ID (typically 84 in the COCO dataset). If a book is detected, it returns True.

# Step 3: Check for Book Object
def check_for_book_object(image_path):
    # Define transformation for the image
    transform = T.Compose([
        T.Resize(320),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    # Load and transform image
    img = Image.open(image_path).convert("RGB")
    img = transform(img).unsqueeze(0)  # Add batch dimension
    img = img.to(device)
    # Detect objects in the image
    with torch.no_grad():
        prediction = model(img)
    # Check if 'book' is among the detected classes (class ID for 'book' may vary)
    labels = prediction[0]['labels'].tolist()
    return any(label == 84 for label in labels)  # 84 is often the class ID for 'book' in COCO dataset
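The snippet above uses module-level model and device objects that are not shown in this step. A minimal sketch of how they might be initialized with torchvision; the SSDLite320 MobileNet V3 variant is an assumption, so check the repo for the exact SSD MobileNet checkpoint used:

import torch
import torchvision

# Assumed initialization of the global `device` and `model` used in check_for_book_object.
# The specific SSD MobileNet variant below is an assumption.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(pretrained=True)
model.to(device)
model.eval()  # inference mode; required before calling model(img)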

Step 4:

The capture_image function activates the first connected camera and takes a high-resolution picture.

# Step 4: Capture Image
def capture_image(image_path='page.jpg'):
    # Initialize the camera
    cap = cv2.VideoCapture(0)  # Assumes the first camera is the one you want to use

    # Set the resolution (adjust the values based on your camera's supported resolution)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)

    # Warm-up time for the camera
    time.sleep(2)
    # Capture multiple frames to allow the camera to auto-adjust
    for i in range(5):
        ret, frame = cap.read()
    if ret:
        # Save the final frame
        cv2.imwrite(image_path, frame)
    # Release the camera
    cap.release()
    return image_path

Step 5:

As the name suggests, enhance_for_ocr tweaks the image so it is easier for the OCR process to read: it rotates the image 90 degrees clockwise so the orientation matches the expected OCR input, converts it to grayscale, and increases the brightness to improve the legibility of the text in the image.

# Step 5: Enhance Image for OCR
def enhance_for_ocr(image_path, brightness_value=20):
    # Read the image
    image = cv2.imread(image_path)

    # Rotate the image
    rotated = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
    # Convert to grayscale
    gray_image = cv2.cvtColor(rotated, cv2.COLOR_BGR2GRAY)
    # Create a numpy array with the same shape as the grayscale image
    brightness_array = np.full(gray_image.shape, brightness_value, dtype=np.uint8)
    # Increase brightness
    brightened_image = cv2.add(gray_image, brightness_array)
    # Save the image ready for OCR
    corrected_image_path = 'corrected_image.jpg'  # Define the path for the corrected image
    cv2.imwrite(corrected_image_path, brightened_image)
    time.sleep(1)

    return brightened_image

Step 6:

This step converts the enhanced image into text using OCR. Tesseract analyzes the image and returns the recognized text as a string.

# Step 6: Convert Image to Text using OCR
def image_to_text(image_array):
    # Convert numpy array to PIL image
    image = Image.fromarray(image_array)

    # Use Tesseract to perform OCR on the image
    text = pytesseract.image_to_string(image)

    # Encode the text to UTF-8 and decode to ASCII
    return text.encode('utf-8').decode('ascii', 'ignore')
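pytesseract is only a wrapper around the Tesseract binary, which must be installed in the container (for example via the tesseract-ocr apt package). If the binary is not on the PATH, pytesseract can be pointed at it explicitly; the path below is the usual Debian/Ubuntu location and is an assumption:

import pytesseract

# Only needed if the tesseract binary is not already on the PATH;
# adjust the path for your image
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'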

Step 7:

This step converts an image file into a Base64-encoded string so the image can be sent to Google's Gemini API for the image-to-summary conversion.

# Step 7: Convert Image to Base64 Encoding
def image_to_base64(image_path):
    # Open the image file in binary mode and read its contents
    with open(image_path, "rb") as image_file:
        # Encode the binary data as base64 and decode it as a string
        encoded_string = base64.b64encode(image_file.read()).decode()
    return encoded_string

Step 8:

This step generates a summary of the content found in the image by sending a request to Google's Generative Language API, which uses the Gemini model for content generation.

# Step 8: Summarize Text in Image
def summarize_text_image(base64_string):
    google_api_key = "YOUR_GOOGLE_API_KEY_HERE"  # Replace with your Google API key
    url = f'https://generativelanguage.googleapis.com/v1beta/models/gemini-pro-vision:generateContent?key={google_api_key}'
    # Prepare the JSON payload
    payload = {
        "contents": [
            {
                "parts": [
                    {"text": "what is in this image?"},
                    {
                        "inline_data": {
                            "mime_type": "image/jpeg",
                            "data": base64_string  # Use the Base64 string obtained from Step 7
                        }
                    }
                ]
            }
        ]
    }
    # Make the POST request
    headers = {'Content-Type': 'application/json'}
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    # Extract and return the response text
    if response.status_code == 200:
        result = response.json()
        return result['candidates'][0]['content']['parts'][0]['text']
    else:
        return f"Error: {response.status_code}, {response.text}"

Step 9:

The text_to_speech function calls the Google Text-to-Speech API to generate speech from the combined text and summary. The audio content received in the response is saved to the file specified by summary_file_path.

# Step 9: Convert Text to Speech
def text_to_speech(text, summary):
    # Combine text and summary
    full_text = f"{text} Summary for the page is {summary}"

    # Initialize TextToSpeechClient
    client = texttospeech.TextToSpeechClient.from_service_account_json(service_account_file)

    # Set up synthesis input with the combined text
    synthesis_input = texttospeech.SynthesisInput(text=full_text)

    # Set voice selection parameters
    voice = texttospeech.VoiceSelectionParams(language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL)

    # Set audio configuration
    audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

    # Synthesize speech
    response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=audio_config)

    # Write audio content to file
    with open(summary_file_path, "wb") as out:
        out.write(response.audio_content)

    # Print confirmation message
    print(f"Audio content written to file {summary_file_path}")

Step 10:

The play_audio function automatically plays the audio file specified by the file_path parameter.

# Step 10: Play Audio
def play_audio(file_path):
    # Display and play audio file
    return display(Audio(file_path, autoplay=True))

Step 11:

The pyttsx3 library works offline for text-to-speech conversion.

# Step 11: Offline Text-to-Speech
def offline_text_speak(text):
    # Initialize text-to-speech engine
    engine = pyttsx3.init()  # object creation
    # Set speech rate
    rate = engine.getProperty('rate')
    engine.setProperty('rate', 100)
    # Set voice (0 for male, 1 for female)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)  # 1 for female
    # Speak the text
    engine.say(text)

    # Run and wait for speech to finish
    engine.runAndWait()

    # Stop the engine
    engine.stop()

Step 12 :

Check connectivity so Sparky can decide whether to run in online or offline mode.

# Step 12: Check Internet Connection
def check_internet_connection():
    try:
        # Send a request to Google to check internet connectivity
        response = requests.get('http://www.google.com', timeout=1)
        # Return True if the response status code is 200 (OK)
        return response.status_code == 200
    except requests.ConnectionError:
        # Return False if there's a connection error
        return False

Final Step :

The final step ties all the previous components together. It uses online resources when an internet connection is available and falls back to offline processing otherwise, so the user always gets accessible content.

if motion_sensor_detection():
    image_array = enhance_for_ocr(prompt_for_book())
    play_audio('processing.mp3')
    text = image_to_text(image_array)
    if check_internet_connection():
        print("Device has internet connection.")
        summary = summarize_text_image(image_to_base64(corrected_image_path))
        text_to_speech(text, summary)  # Convert summary to speech
        play_audio(summary_file_path)
    else:
        print("Device does not have internet connection.")
        offline_text_speak(text)

Outputs

Original Image
Processed Image for OCR

Audio Output listen here: https://github.com/divyachandana/Sparky/blob/main/final_output_1/summary.mp3

More outputs

Original
Image for OCR

Audio output Listen here:

https://github.com/divyachandana/Sparky/blob/main/final_output_2/summary.mp3

Future plans for Project

The next version of Sparky will introduce multilingual support for both OCR and text-to-speech, enabling users from diverse linguistic backgrounds to access and interact with printed materials in their native languages and broadening the project's impact and usability. I also plan to add a voice assistant for two-way communication, making it easier for users to dive deeper into complex information.

Code Repo: https://github.com/divyachandana/Sparky
