Sparky Vision: HandsFree Empowering the Blind with AI Insight

Divya Chandana
Published in The Deep Hub
9 min read · Apr 2, 2024


Technical walkthrough

Project also published on Hackster.io

Part 1 (Introduction): https://medium.com/@divya.chandana/sparky-vision-handsfree-empowering-the-blind-with-ai-insight-956db25c9836

This blog covers the more technical steps: setting up the container and running the Vision project.

Hardware Components needed

  1. NVIDIA Jetson Nano Developer Kit
  2. PIR Motion Sensor (generic)
  3. Wi-Fi module AC8265
  4. Logitech webcam
  5. 8MP High Resolution Web Camera
  6. Mini Speaker
  7. Jumper wires (generic)
  8. Legos
  9. Flexible camera mount
  10. Power bank
  11. Flash Memory Card, MicroSDHC Card

Set up the Jetson Nano:

I have added more details about the steps to set up the NVIDIA Jetson Nano in the project documentation: https://www.hackster.io/divyachandana/sparkyvision-handsfree-empowering-the-blind-with-ai-insight-3dd450

Software components / APIs needed:

For online mode, I used Google's Gemini API, for which you will need to create your own API key.

For offline mode, the pytesseract library is used for OCR.

For online mode, I used Google Text-to-Speech for more human-like speech.

For offline mode, the pyttsx3 library is used.

Don't worry, I have added all of these in the Dockerfile. You can run this project in either online or offline mode, depending on your API availability.
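For reference, here is a rough sketch of the kind of dependencies the Dockerfile pulls in; the exact packages and versions may differ, so treat the Dockerfile in the repo as the source of truth:

# Sketch only: system packages for Tesseract OCR and offline speech,
# plus the Python libraries used in the steps below
apt-get install -y tesseract-ocr espeak
pip3 install pytesseract pyttsx3 google-cloud-texttospeech requests opencv-python numpy pillow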

Let's get started

Project setup

Getting Started

To get started with Sparky, follow these steps:

1. Clone the Repository

Open your terminal and clone the Sparky repository:

git clone https://github.com/divyachandana/Sparky.git

2. Navigate to the Sparky Directory

Change the directory to the Sparky repository:

cd Sparky

Run with Docker

To simplify setup and usage, Sparky can be run within a Docker container.

3.1. Build the Docker Container

Build the Sparky Docker container using the provided Dockerfile:

sudo docker build -t sparky .

3.2. Run the Docker Container

Run the Sparky Docker container with the following command:

This Docker command runs a container from the "sparky" image with:

Nvidia GPU support (`--runtime nvidia`),

access to the camera (`--device /dev/video0:/dev/video0`), and

access to the audio devices (`--device /dev/snd:/dev/snd`).

It maps port 8888 on the container to port 8888 on the host (`-p 8888:8888`) and mounts the local directory `/home/dc/Documents` to `/workspace` in the container (`-v /home/dc/Documents:/workspace`).

Additionally, it launches Jupyter Lab without authentication (`--allow-root --NotebookApp.token='' --NotebookApp.password=''`).

sudo docker run --runtime nvidia -it --rm \
--privileged \
-p 8888:8888 \
--device /dev/video0:/dev/video0 \
--device /dev/snd:/dev/snd \
-v /home/dc/Documents:/workspace \
sparky \
jupyter lab --ip=0.0.0.0 --allow-root --NotebookApp.token='' --NotebookApp.password=''

This command sets up the container environment with necessary permissions and volume mappings and launches Jupyter Lab with Sparky’s functionality.

4. Access Jupyter Lab

Once the container is running, access Jupyter Lab in your web browser by navigating to http://localhost:8888.

5. Start Using Sparky

You’re now ready to start using Sparky

6. For headless mode:

  • Find IP Address:

To find the IP address of your Jetson Nano, open the terminal and run the command below (`hostname -i` may return only a loopback address on some setups, so `-I` is more reliable):

hostname -I

  • Power up Device:

Connect your Jetson Nano to a power bank to turn it on.

  • Access via SSH:

On another device, open the terminal.

Type in the username of your Jetson Nano followed by its local IP address:

ssh [username]@[host_ip_address]

When prompted, enter the password. This will let you remotely access your Jetson Nano.
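Jupyter Lab on the Nano is bound to port 8888, so in headless mode you can also forward that port over the same SSH connection and open it from the other device's browser. A minimal example, assuming the default port mapping used above:

ssh -L 8888:localhost:8888 [username]@[host_ip_address]

With the tunnel open, http://localhost:8888 on the remote device reaches the Jupyter Lab instance running on the Nano.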

Steps Involved
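The step snippets below are shown without their imports or module-level configuration. Here is a consolidated sketch of what the notebook is assumed to set up first; the libraries follow from the code in each step, but the file names for the service-account key and the output MP3 are placeholders, so adjust them to your setup:

# Assumed imports and module-level configuration for the step snippets below
import base64
import json
import time

import cv2
import numpy as np
import pytesseract
import pyttsx3
import requests
import torch
import torchvision.transforms as T
import Jetson.GPIO as GPIO
from google.cloud import texttospeech
from IPython.display import Audio, display
from PIL import Image

# Placeholder paths; replace with the ones used in your notebook
service_account_file = 'service_account.json'  # Google Cloud service-account key for Text-to-Speech
summary_file_path = 'summary.mp3'              # Where Step 9 writes the synthesized audio
corrected_image_path = 'corrected_image.jpg'   # OCR-ready image written in Step 5 and reused in the final step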

Step 1:

The motion_sensor_detection function monitors a PIR motion sensor connected to pin 23 of the Jetson Nano (physical/BOARD numbering). When motion is detected, it plays a greeting audio file to welcome the user and then proceeds with the rest of the process.

# Step 1: Motion Sensor Detection and Greeting
def motion_sensor_detection():
    SENSOR_PIN = 23  # Adjust pin according to your setup
    GPIO.setmode(GPIO.BOARD)  # Use physical pin numbering
    GPIO.setup(SENSOR_PIN, GPIO.IN)
    try:
        while True:
            if GPIO.input(SENSOR_PIN):
                # Print greeting message when motion is detected
                # print("Helloo, I'm Sparky. Your AI assistant. I'm here to assist you with your books, papers, research papers, images, graphs, pictures.")
                play_audio('greetings.mp3')
                time.sleep(11)
                # Return True to indicate motion detection
                return True
            else:
                # Print message when no motion is detected
                print('No motion detected')

            # Sleep to avoid continuous looping
            time.sleep(2)  # Adjust the sleep time as needed
    finally:
        # Clean up GPIO pins after use
        GPIO.cleanup()

Step 2:

The prompt_for_book function repeatedly checks whether a book has been placed in front of the camera. Once one is detected, an audio file is played to confirm the book's presence and a fresh image is captured for processing.

# Step 2: Prompt for Book Detection
def prompt_for_book():
    detected = False

    # Keep prompting until a book is detected
    while not detected:
        # Capture image and check for book object
        image_path = capture_image('detect_object.jpg')
        detected = check_for_book_object(image_path)

        # If book is detected, return image
        if detected:
            play_audio('book.mp3')
            time.sleep(5)
            return capture_image()
        else:
            # Prompt user to place a book in front of the camera
            print("Please place a book or paper in front of the camera.")

            # Sleep to allow time for placing the book
            time.sleep(5)  # Adjust the sleep time as necessary

Step 3:

This step loads a pre-trained SSD (Single Shot MultiBox Detector) MobileNet model, which is efficient for edge devices like the Jetson Nano, and uses it to detect objects in images (here we need to detect a book). The check_for_book_object function processes an input image: it resizes, normalizes, and converts it into a tensor suitable for the model.

The function then checks the model's predictions to see whether a 'book' is recognized, indicated by its class ID (typically 84 in the COCO dataset). If a book is detected, it returns True.

# Step 3: Check for Book Object
def check_for_book_object(image_path):
    # Define transformation for the image
    transform = T.Compose([
        T.Resize(320),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    # Load and transform image
    img = Image.open(image_path).convert("RGB")
    img = transform(img).unsqueeze(0)  # Add batch dimension
    img = img.to(device)
    # Detect objects in the image
    with torch.no_grad():
        prediction = model(img)
    # Check if 'book' is among the detected classes (class ID for 'book' may vary)
    labels = prediction[0]['labels'].tolist()
    return any(label == 84 for label in labels)  # 84 is often the class ID for 'book' in COCO dataset
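The snippet above uses module-level model and device objects that are not shown in this step. A minimal sketch of how they might be initialized with torchvision; the SSDLite320 MobileNet V3 variant is an assumption, so check the repo for the exact SSD MobileNet checkpoint used:

import torch
import torchvision

# Assumed initialization of the global `device` and `model` used in check_for_book_object.
# The specific SSD MobileNet variant below is an assumption.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(pretrained=True)
model.to(device)
model.eval()  # inference mode; required before calling model(img)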

Step 4:

The capture_image function activates the first connected camera and takes a high-resolution picture.

# Step 4: Capture Image
def capture_image(image_path='page.jpg'):
    # Initialize the camera
    cap = cv2.VideoCapture(0)  # Assumes the first camera is the one you want to use

    # Set the resolution (adjust the values based on your camera's supported resolution)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)

    # Warm-up time for the camera
    time.sleep(2)
    # Capture multiple frames to allow the camera to auto-adjust
    for i in range(5):
        ret, frame = cap.read()
    if ret:
        # Save the final frame
        cv2.imwrite(image_path, frame)
    # Release the camera
    cap.release()
    return image_path

Step 5:

As the name suggests, enhance_for_ocr tweaks the image so it is easier for the OCR process to read: it rotates the image 90 degrees clockwise so the orientation matches the expected OCR input, converts it to grayscale, and increases the brightness to improve the legibility of the text in the image.

# Step 5: Enhance Image for OCR
def enhance_for_ocr(image_path, brightness_value=20):
    # Read the image
    image = cv2.imread(image_path)

    # Rotate the image
    rotated = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
    # Convert to grayscale
    gray_image = cv2.cvtColor(rotated, cv2.COLOR_BGR2GRAY)
    # Create a numpy array with the same shape as the grayscale image
    brightness_array = np.full(gray_image.shape, brightness_value, dtype=np.uint8)
    # Increase brightness
    brightened_image = cv2.add(gray_image, brightness_array)
    # Save the image ready for OCR
    corrected_image_path = 'corrected_image.jpg'  # Define the path for the corrected image
    cv2.imwrite(corrected_image_path, brightened_image)
    time.sleep(1)

    return brightened_image

Step 6:

This step converts the enhanced image into text using OCR. Tesseract analyzes the image and returns the recognized text as a string.

# Step 6: Convert Image to Text using OCR
def image_to_text(image_array):
    # Convert numpy array to PIL image
    image = Image.fromarray(image_array)

    # Use Tesseract to perform OCR on the image
    text = pytesseract.image_to_string(image)

    # Encode the text to UTF-8 and decode to ASCII
    return text.encode('utf-8').decode('ascii', 'ignore')
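pytesseract is only a wrapper around the Tesseract binary, which must be installed in the container (for example via the tesseract-ocr apt package). If the binary is not on the PATH, pytesseract can be pointed at it explicitly; the path below is the usual Debian/Ubuntu location and is an assumption:

import pytesseract

# Only needed if the tesseract binary is not already on the PATH;
# adjust the path for your image
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'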

Step 7:

This step converts an image file into a Base64-encoded string so the image can be sent to Google's Gemini API for the image-to-summary conversion.

# Step 7: Convert Image to Base64 Encoding
def image_to_base64(image_path):
    # Open the image file in binary mode and read its contents
    with open(image_path, "rb") as image_file:
        # Encode the binary data as base64 and decode it as a string
        encoded_string = base64.b64encode(image_file.read()).decode()
    return encoded_string

Step 8:

This step generates a summary of the content found in the image by sending a request to Google's Generative Language API, which uses the Gemini model for content generation.

# Step 8: Summarize Text in Image
def summarize_text_image(base64_string):
    google_api_key = "YOUR_GOOGLE_API_KEY_HERE"  # Replace with your Google API key
    url = f'https://generativelanguage.googleapis.com/v1beta/models/gemini-pro-vision:generateContent?key={google_api_key}'
    # Prepare the JSON payload
    payload = {
        "contents": [
            {
                "parts": [
                    {"text": "what is in this image?"},
                    {
                        "inline_data": {
                            "mime_type": "image/jpeg",
                            "data": base64_string  # Use the Base64 string obtained from Step 7
                        }
                    }
                ]
            }
        ]
    }
    # Make the POST request
    headers = {'Content-Type': 'application/json'}
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    # Extract and return the response text
    if response.status_code == 200:
        result = response.json()
        return result['candidates'][0]['content']['parts'][0]['text']
    else:
        return f"Error: {response.status_code}, {response.text}"

Step 9:

The text_to_speech function calls the Google Text-to-Speech API to generate speech from the combined text and summary. The audio content received in the response is saved to the file specified by summary_file_path.

# Step 9: Convert Text to Speech
def text_to_speech(text, summary):
    # Combine text and summary
    full_text = f"{text} Summary for the page is {summary}"

    # Initialize TextToSpeechClient
    client = texttospeech.TextToSpeechClient.from_service_account_json(service_account_file)

    # Set up synthesis input with the combined text
    synthesis_input = texttospeech.SynthesisInput(text=full_text)

    # Set voice selection parameters
    voice = texttospeech.VoiceSelectionParams(language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL)

    # Set audio configuration
    audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

    # Synthesize speech
    response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=audio_config)

    # Write audio content to file
    with open(summary_file_path, "wb") as out:
        out.write(response.audio_content)

    # Print confirmation message
    print(f"Audio content written to file {summary_file_path}")

Step 10:

The play_audio function automatically plays the audio file specified by the file_path parameter.

# Step 10: Play Audio
def play_audio(file_path):
    # Display and play audio file
    return display(Audio(file_path, autoplay=True))

Step 11:

The pyttsx3 library works offline for text-to-speech conversion.

# Step 11: Offline Text-to-Speech
def offline_text_speak(text):
    # Initialize text-to-speech engine
    engine = pyttsx3.init()  # object creation
    # Set speech rate
    rate = engine.getProperty('rate')
    engine.setProperty('rate', 100)
    # Set voice (0 for male, 1 for female)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)  # 1 for female
    # Speak the text
    engine.say(text)

    # Run and wait for speech to finish
    engine.runAndWait()

    # Stop the engine
    engine.stop()

Step 12 :

Check connectivity so Sparky can decide whether to run in online or offline mode.

# Step 12: Check Internet Connection
def check_internet_connection():
    try:
        # Send a request to Google to check internet connectivity
        response = requests.get('http://www.google.com', timeout=1)
        # Return True if the response status code is 200 (OK)
        return response.status_code == 200
    except requests.ConnectionError:
        # Return False if there's a connection error
        return False

Final Step :

The final step ties all the previous components together. It uses online resources when an internet connection is available and falls back to offline processing otherwise, so the user always gets accessible content.

if motion_sensor_detection():
    image_array = enhance_for_ocr(prompt_for_book())
    play_audio('processing.mp3')
    text = image_to_text(image_array)
    if check_internet_connection():
        print("Device has internet connection.")
        summary = summarize_text_image(image_to_base64(corrected_image_path))
        text_to_speech(text, summary)  # Convert summary to speech
        play_audio(summary_file_path)
    else:
        print("Device does not have internet connection.")
        offline_text_speak(text)

Outputs

Original Image
Processed Image for OCR

Audio Output listen here: https://github.com/divyachandana/Sparky/blob/main/final_output_1/summary.mp3

More outputs

Original
Image for OCR

Audio output Listen here:

https://github.com/divyachandana/Sparky/blob/main/final_output_2/summary.mp3

Future plans for Project

The next version of Sparky will introduce multilingual support for both OCR and text-to-speech, enabling users from diverse linguistic backgrounds to access and interact with printed materials in their native languages and broadening the project's impact and usability. I also plan to add a voice assistant for two-way communication, making it easier for users to dive deeper into complex information.

Code Repo: https://github.com/divyachandana/Sparky
