Virtual Eyes — A simple surroundings-describing app for the visually impaired

Nirmal Mendis · Published in Geek Culture · 12 min read · Aug 31, 2021

In the past few years, image captioning has made a tremendous impact on the field of machine learning, with new applications emerging every day. Image captioning is the use of Artificial Intelligence, Computer Vision, and Machine Learning to generate descriptions of images, similar to how a human would describe their contents. In this article, I will discuss how I created a simple app that uses image captioning to describe the surroundings to a visually impaired individual.

To give an overview: the app uses the phone’s main camera to capture a photo when the user gives a voice command, and, using Azure Computer Vision services, it generates a caption as well as location data for detected objects and speaks the result back to the user in natural language.

Skip to the end of this article if you are eager to watch a demo of the application.

Development of the Android App

To develop this application I used Unity along with Azure Computer Vision services. So, let’s see how it was done.

  1. Creating the Azure Computer Vision Resource.

I have written an entire article on how to create an Azure Computer Vision resource and use its services in Unity here. You can refer to it to learn more about setting up Azure, since I will not go into detail on that here.

2. Creating the Unity Android Application

I started by creating a Unity 2D project and I had to change a few configurations to suit this project as given below:

1. Change Platform to Android

(File -> Build Settings -> Android -> Switch Platform)

2. Import the Newtonsoft DLL and set up a csc.rsp file in Assets (a minimal csc.rsp example is sketched below).

3. Change the API Compatibility Level

(Edit -> Project Settings -> Player -> Other Settings -> API Compatibility Level: from ‘.NET Standard 2.0’ to ‘.NET 4.x’)

All the above tasks have been explained (with reasons) in my article on Image Captioning in Unity Android using Azure Computer Vision.
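
As a rough sketch of step 2 (assuming, as in my other article, that the only extra assembly needed is System.Web for the HttpUtility class used later in ‘caption.cs’), the csc.rsp file placed directly in the Assets folder can be as small as this:

-r:System.Web.dll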

4. Setting Up Text to Speech and Speech to Text

To perform this task, I used the plugin provided by j1mmyto9 in his GitHub repository. This tutorial explains how to use the plugin in your Unity project.

Here’s how I did it:

  1. Download the GitHub repository and extract the files.
  2. Open the ‘SpeechToText_AppleAPI/Assets/’ folder and copy the ‘Plugins’ and ‘SpeechAndText’ folders into the Assets folder of your Unity project.
  3. Create two empty GameObjects named ‘SpeechToText’ and ‘TextToSpeech’ and add the C# scripts ‘SpeechToText.cs’ and ‘TextToSpeech.cs’ to them as components, respectively.

Note: The names mentioned above must match exactly.

Also, uncheck the ‘IsShowPopupAndroid’ option of the SpeechToText script in the inspector.

Here’s how it looked:

4. Create an empty GameObject and name it ‘SpeechHandler’ (this name can differ).

5. Create a C# script named ‘SpeechHandler.cs’ and add it as a component to the ‘SpeechHandler’ GameObject created above. This script will contain the code that handles speech-related functions.

5. Create a C# script to manage functions

Create a C# script named ‘caption.cs’ to manage all other functions, including the API calls. Next, create an empty GameObject named ‘captionHandler’ and add the ‘caption.cs’ script to it.

6. Create the UI components in Unity

Create a Panel and, under it, create a RawImage (set its Z rotation to -90) and a Text GameObject as children; name them ‘CameraFeed’ and ‘SpeechStatus’ respectively. I also changed the aspect ratio to 2160x1080 Portrait in the Game window, and you might want to resize the UI elements to fit.

Here’s how my setup looked:

Now the setup process is complete. Next is the coding task.

First, I’ll discuss the ‘caption.cs’ script.

caption.cs

This script contains the following member variables:

// Add your Computer Vision subscription key and endpoint
static string subscriptionKey = "PASTE_YOUR_COMPUTER_VISION_SUBSCRIPTION_KEY_HERE";
//azure endpoint
static string endpoint = "PASTE_YOUR_COMPUTER_VISION_ENDPOINT_HERE";
//azure endpoint service accessed
static string captionBase = "vision/v3.2/describe?"; //endpoint to generate image captions
static string objectBase = "vision/v3.2/analyze?"; //endpoint to get object locations
//UI Components
[SerializeField] private RawImage cameraFeed;
//webcamtexture
private WebCamTexture webcamTexture;

The above variables hold the Azure subscription key, the Azure API endpoint, the captionBase and objectBase strings that are appended to the endpoint, a RawImage UI component, and a WebCamTexture.

The captionBase and objectBase determine which features are requested from Azure: the ‘describe’ endpoint returns the image caption, whereas the ‘analyze’ endpoint (with the ‘Objects’ visual feature) returns the positions of the detected objects within the image.
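
To make this concrete, once the request parameters are appended (as done in the ‘getCaption’ and ‘getObjects’ methods below), the final request URIs look roughly like this, assuming an endpoint of the form ‘https://<your-resource>.cognitiveservices.azure.com/’:

https://<your-resource>.cognitiveservices.azure.com/vision/v3.2/describe?maxCandidates=1&language=en&model-version=latest
https://<your-resource>.cognitiveservices.azure.com/vision/v3.2/analyze?visualFeatures=Objects&language=en&model-version=latest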

In this project, I have chosen to follow a singleton design pattern; therefore, I have created a static instance of the class itself as shown below.

//instance of caption class
private static caption instance;
public static caption Instance
{
    get
    {
        if (instance == null)
        {
            instance = FindObjectOfType<caption>();
        }
        return instance;
    }
}

Next, we need to check for user permissions when the application starts. This can be done with the method below.

void CheckPermission()
{
    if (!Permission.HasUserAuthorizedPermission(Permission.Camera))
    {
        Permission.RequestUserPermission(Permission.Camera);
    }
    if (!Permission.HasUserAuthorizedPermission(Permission.Microphone))
    {
        Permission.RequestUserPermission(Permission.Microphone);
    }
    if (Permission.HasUserAuthorizedPermission(Permission.Camera) && Permission.HasUserAuthorizedPermission(Permission.Microphone))
    {
        //start camera feed if both permissions granted
        startCamera();
    }
}

This method checks whether the user has granted the Microphone and Camera permissions and, if not, asks the user to grant them. If both permissions are granted, it calls the ‘startCamera’ method, which starts the camera and assigns the camera feed to the ‘CameraFeed’ RawImage. Obviously, the on-screen feed is of no use to a visually impaired user, but I have implemented it for our convenience.

void startCamera()
{
    //get all camera devices
    WebCamDevice[] cam_devices = WebCamTexture.devices;
    //Set a camera to the webcamTexture
    webcamTexture = new WebCamTexture(cam_devices[0].name, 480, 640, 30);
    //Set the webcamTexture to the texture of the rawimage
    cameraFeed.texture = webcamTexture;
    cameraFeed.material.mainTexture = webcamTexture;
    //Start the camera
    webcamTexture.Play();
}

The next problem was where to call the ‘CheckPermission’ method. Since permission requests are handled asynchronously, the app keeps running without waiting for the user to respond; if the call is placed in the ‘Start’ method, this results in a white ‘CameraFeed’ RawImage instead of the camera feed. Therefore, I call ‘CheckPermission’ from the ‘OnApplicationFocus’ callback, which runs each time the application regains focus, so the permissions are checked repeatedly until both are granted. Given below is the code to do this.

private void OnApplicationFocus(bool focus)
{
    //check for permissions
    CheckPermission();
}

Next, I created an IEnumerator method named ‘SaveImage’ to capture an image from the camera feed and pass its byte data to another method that handles the API calls. It is an IEnumerator because it uses ‘yield return’ to wait for the end of the current frame. You’ll also notice a call to the ‘RotateTexture’ method, which rotates a Texture2D by a given angle. This method can be obtained from here.
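
Since that helper isn’t shown in this article, here is a minimal sketch of what a RotateTexture method can look like. This is an assumption on my part and not necessarily identical to the linked implementation; it only handles the 90-degree clockwise case, which is all this project needs (it always passes -90):

Texture2D RotateTexture(Texture2D source, int angle)
{
    //this sketch ignores 'angle' and always rotates 90 degrees clockwise
    int w = source.width;
    int h = source.height;
    Color32[] src = source.GetPixels32();
    Color32[] dst = new Color32[src.Length];
    //source pixel (x, y) moves to (y, w - 1 - x) in the rotated texture
    for (int y = 0; y < h; y++)
    {
        for (int x = 0; x < w; x++)
        {
            dst[y + (w - 1 - x) * h] = src[x + y * w];
        }
    }
    //the rotated texture has its width and height swapped
    Texture2D rotated = new Texture2D(h, w, source.format, false);
    rotated.SetPixels32(dst);
    rotated.Apply();
    return rotated;
}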

The SaveImage() method also accepts a boolean argument ‘isDescribe’, which is true if the user wants an image caption and false if the user wants position data.

Given below is this method:

public IEnumerator SaveImage(bool isDescribe)
{
    //Create a Texture2D with the size of the rendered image on the screen.
    Texture2D texture = new Texture2D(webcamTexture.width, webcamTexture.height, TextureFormat.ARGB32, false);
    //wait till end of frame
    yield return new WaitForEndOfFrame();
    //save webcam frame to texture
    texture.SetPixels(webcamTexture.GetPixels());
    //rotate texture
    texture = RotateTexture(texture, -90);
    texture.Apply();
    //check user requirement whether caption or object position
    if (isDescribe)
    {
        getCaption(texture.EncodeToPNG());
    }
    else
    {
        getObjects(texture.EncodeToPNG());
    }
}

Next, I will discuss the ‘getCaption’ and ‘getObjects’ methods, which are called from the ‘SaveImage’ method.

The ‘getCaption’ method builds the final uriBase along with the necessary request parameters. It then passes this data, together with the image byte array, to the ‘MakeRequest’ method, which calls the API and returns the response. The response is then processed by the ‘convertCaption’ method to extract the caption, which is finally passed to the SpeechHandler (implemented shortly) to be spoken out loud.

public async void getCaption(byte[] imageBytes)
{
    //uri
    string uriBase = endpoint + captionBase;
    // Request parameters
    var requestParameters = HttpUtility.ParseQueryString(string.Empty);
    requestParameters["maxCandidates"] = "1";
    requestParameters["language"] = "en";
    requestParameters["model-version"] = "latest";
    // call MakeRequest method to make API call
    String result = await MakeRequest(uriBase, requestParameters, imageBytes);
    //extract caption
    string convResult = convertCaption(result);
    //speak caption
    SpeechHandler.Instance.StartSpeaking(convResult);
}

The ‘getObjects’ method is similar to ‘getCaption’: it has its own uriBase and request parameters, calls ‘MakeRequest’, and then calls ‘convertObjects’, which extracts the position data and builds a sentence. Finally, it passes that sentence to the SpeechHandler to be spoken out loud.

public async void getObjects(byte[] imageBytes)
{
    // uri
    string uriBase = endpoint + objectBase;
    // Request parameters
    var requestParameters = HttpUtility.ParseQueryString(string.Empty);
    requestParameters["visualFeatures"] = "Objects";
    requestParameters["language"] = "en";
    requestParameters["model-version"] = "latest";
    // call MakeRequest method to make API call
    String result = await MakeRequest(uriBase, requestParameters, imageBytes);
    //extract position sentence
    string convResult = convertObjects(result);
    //speak
    SpeechHandler.Instance.StartSpeaking(convResult);
}

Next is the ‘MakeRequest’ method, which makes the actual call to the Azure API and returns the response. To learn more about this method, visit here. Here’s the ‘MakeRequest’ method:

async Task<String> MakeRequest(string uriBase, NameValueCollection requestParameters, byte[] byteData)
{
    //initialize variable for result
    String responseText = "";
    try
    {
        HttpClient client = new HttpClient();
        // Request headers
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
        // Assemble the URI for the REST API method.
        string uri = uriBase + requestParameters;
        HttpResponseMessage response;
        // Request body
        using (var content = new ByteArrayContent(byteData))
        {
            content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
            // Asynchronously call the REST API method.
            response = await client.PostAsync(uri, content);
            // Asynchronously get the JSON response.
            responseText = await response.Content.ReadAsStringAsync();
        }
    }
    catch
    {
        responseText = "";
    }
    return responseText;
}

Now, let’s discuss the ‘convertCaption’ and ‘convertObjects’ methods that were called earlier.
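
To make the parsing easier to follow, here is an abridged example of what a ‘describe’ response typically looks like (the values are purely illustrative):

{
  "description": {
    "tags": [ "person", "indoor", "table" ],
    "captions": [
      { "text": "a person sitting at a table", "confidence": 0.87 }
    ]
  },
  "requestId": "...",
  "metadata": { "width": 480, "height": 640, "format": "Png" }
}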

The ‘convertCaption’ method takes the response string from the API, deserializes it into a JSON object, and extracts the nested object that contains the captions. This captions object holds an array of captions, so it is first parsed into a JArray and then the first caption is extracted. Here’s how it is done:

public string convertCaption(string responseText)
{
    //initialize variable for caption
    string textCaption = "";

    try
    {
        //convert result to a dictionary
        var jsonResult = JsonConvert.DeserializeObject<Dictionary<string, dynamic>>(responseText);
        //obtain captions object from jsonResult
        var captionsObj = jsonResult["description"]["captions"];
        //convert captionsObj to JArray
        JArray captionArray = JArray.Parse(captionsObj.ToString());
        //get caption string from array
        textCaption = captionArray[0]["text"].ToString();
    }
    catch
    {
        textCaption = "Couldn't get description, Please try again !";
    }
    return textCaption;
}
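
With the abridged sample response shown above, ‘convertCaption’ would return the string "a person sitting at a table".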

The ‘convertObjects’ method also takes the response string from the API, deserializes it, and extracts the nested ‘objects’ JSON array. This array is converted to a JArray to access the position and size of each detected object in the image. The method also calls the ‘position’ method, which works out whether an object lies in the left, middle, or right section of the image. Using this information, three lists named left, front, and right are created and the relevant objects are appended to them.

Finally, a sentence is created for the app to speak out where the objects are located.
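
For reference, an abridged ‘analyze’ response with the ‘Objects’ feature typically looks like this (again, the values are purely illustrative):

{
  "objects": [
    {
      "rectangle": { "x": 30, "y": 100, "w": 140, "h": 300 },
      "object": "chair",
      "confidence": 0.62
    }
  ],
  "requestId": "...",
  "metadata": { "width": 480, "height": 640, "format": "Png" }
}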

Here’s how it is done:

public string convertObjects(string responseText)
{
    //initialize string to store final sentence
    string objectsText = "";
    try
    {
        //convert result to dictionary
        var jsonResult = JsonConvert.DeserializeObject<Dictionary<string, dynamic>>(responseText);
        //obtain objects
        var objects = jsonResult["objects"];
        //obtain width of image
        var width = jsonResult["metadata"]["width"];
        //convert width to double
        double imageWidth = double.Parse(width.ToString());
        //convert objects to JArray
        JArray objsArray = JArray.Parse(objects.ToString());
        //initialize 3 lists to store objects in left, front, right
        List<string> left = new List<string>();
        List<string> front = new List<string>();
        List<string> right = new List<string>();
        //for each object in array
        foreach (var item in objsArray)
        {
            //get x value
            double x = double.Parse(item["rectangle"]["x"].ToString());
            //get width of object detected
            double w = double.Parse(item["rectangle"]["w"].ToString());
            //check where the object is located
            if (position(imageWidth, x, w) == "left")
            {
                left.Add(item["object"].ToString());
            }
            if (position(imageWidth, x, w) == "front")
            {
                front.Add(item["object"].ToString());
            }
            if (position(imageWidth, x, w) == "right")
            {
                right.Add(item["object"].ToString());
            }
        }
        //if lists are not empty, create sentence
        if (!(left.Count == 0))
        {
            objectsText = string.Join(",", left) + " to your left, ";
        }
        if (!(front.Count == 0))
        {
            objectsText = objectsText + string.Join(",", front) + " in your front, ";
        }
        if (!(right.Count == 0))
        {
            objectsText = objectsText + "and " + string.Join(",", right) + " to your right";
        }
    }
    catch
    {
        objectsText = "Couldn't get position data, Please try again !";
    }
    return objectsText;
}
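
Using the abridged ‘analyze’ sample from earlier (image width 480, a chair at x = 30 with w = 140), the chair’s midpoint is at 30 + 70 = 100, which falls in the left third of the image, so the spoken sentence would be "chair to your left".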

Next, we have the ‘position’ method. As described earlier, this method calculates which section of the image a particular object is located in. It accepts three arguments: the width of the entire image, the starting x coordinate of the object, and the width of the object.

The image is divided into three vertical sections: left, front, and right. Half the object’s width is added to its x value to obtain the x coordinate of the object’s midpoint. Using this midpoint, it is possible to determine which section of the image the object falls in. This is done using the code below:

public string position(double width, double x, double w)
{
    //divide the image into 3 vertical sections and obtain the length of one section
    double oneSection = width / 3;
    //left section ends at oneSection
    double left = oneSection;
    //front section ends at 2 * oneSection
    double front = oneSection * 2;
    //check in which section the object's midpoint (x + w/2) falls
    if ((x + w / 2.0) < left)
    {
        return "left";
    }
    else if ((x + w / 2.0) <= front)
    {
        return "front";
    }
    else
    {
        return "right";
    }
}
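
As a quick sanity check with made-up numbers (purely illustrative):

//image width 480: each section is 160 px wide; left ends at 160, front at 320
//an object at x = 300 with w = 100 has its midpoint at 350, which is past 320
Debug.Log(position(480, 300, 100)); // prints "right"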

SpeechHandler.cs

Next, I’ll discuss the SpeechHandler.cs script, which handles the Text to Speech and Speech to Text functionality.

This class contains variables for a UI Text component, a language code, and a boolean ‘isListening’. The UI Text component shows whether the app is currently listening for voice input; again, this is only for our convenience, as it isn’t useful to a visually impaired user. The ‘isListening’ variable keeps track of whether the application is listening; when it isn’t, the Update method starts the listening process. Here are the above-mentioned variables:

//UI text
[SerializeField] private Text speechStatus;
//language code constant
const string LANG_CODE = "en-US";
//variable to know if app is listening to voice
bool isListening = true;

As mentioned earlier, this project follows a singleton design pattern, and therefore, I have created an instance of this class as follows:

private static SpeechHandler instance;
public static SpeechHandler Instance
{
    get
    {
        if (instance == null)
        {
            instance = FindObjectOfType<SpeechHandler>();
        }
        return instance;
    }
}

The ‘Setup’ method configures the TextToSpeech and SpeechToText settings, setting the language as well as the pitch and rate of the speech.

void Setup(string code)
{
    TextToSpeech.instance.Setting(code, 1, 0.8f);
    SpeechToText.instance.Setting(code);
}

Next, a method named ‘StartSpeaking’ is defined to start the app speaking a message.

public void StartSpeaking(string message)
{
    TextToSpeech.instance.StartSpeak(message);
}

Next, a callback is defined to handle operations when speaking stops. Here, I set ‘isListening’ to false so that the Update method will call ‘StartListening’ on the next frame.

public void OnSpeakStop()
{
    //set isListening to False
    isListening = false;
}

Given below is the ‘StartListening’ function. Here the UI Text component is updated as well.

public void StartListening()
{
    SpeechToText.instance.StartRecording();
    speechStatus.text = "Listening...";
}

Similarly, the StopListening method stops listening to the voice and updates the UI Text component.

public void StopListening()
{
    SpeechToText.instance.StopRecording();
    speechStatus.text = "Stopped Listening";
}

Next is the ‘OnFinalSpeechResult’ method, which calls the caption class instance based on the recognized keyword. If ‘placement’ is recognized in the user’s voice input, the ‘SaveImage’ coroutine is called with the argument set to false (meaning the user wants position data rather than a description). The opposite happens if the ‘caption’ keyword is detected. If neither is detected, ‘isListening’ is set to false so that listening restarts. Given below is the ‘OnFinalSpeechResult’ method.

void OnFinalSpeechResult(string result)
{
    speechStatus.text = result;
    try
    {
        //check if the 'placement' or 'caption' keyword is in the sentence spoken by the user
        if (result.ToLower().Contains("placement"))
        {
            //set isListening to true because otherwise, the app will listen to the caption spoken by itself from the Update method
            isListening = true;
            //stop listening
            StopListening();
            //call SaveImage
            StartCoroutine(caption.Instance.SaveImage(false));
        }
        else if (result.ToLower().Contains("caption"))
        {
            //set isListening to true because otherwise, the app will listen to the position sentence spoken by itself from the Update method
            isListening = true;
            //stop listening
            StopListening();
            //call SaveImage
            StartCoroutine(caption.Instance.SaveImage(true));
        }
        else
        {
            //set isListening to false
            isListening = false;
        }
    }
    catch
    {
    }
}

Next, both the ‘OnFinalSpeechResult’ and ‘OnSpeakStop’ callbacks must be registered with the respective classes. This is done in the Start method, which also calls the ‘Setup’ and ‘StartListening’ methods.

void Start()
{
    //call setup method
    Setup(LANG_CODE);
    //register onResultCallback
    SpeechToText.instance.onResultCallback = OnFinalSpeechResult;
    //register onDoneCallback
    TextToSpeech.instance.onDoneCallback = OnSpeakStop;
    //start listening to voice
    StartListening();
}

Finally, the ‘Update’ method checks if the variable ‘isListening’ is false and if so, calls the ‘StartListening’ method.

void Update()
{
    //if not listening, start listening
    if (!isListening)
    {
        StartListening();
    }
}

That completes the coding task of this project.

Next, before building the application, the ‘CameraFeed’ RawImage and ‘SpeechStatus’ Text UI components need to be assigned, in the Inspector, to the caption script component on the captionHandler GameObject and the SpeechHandler script on the SpeechHandler GameObject, respectively.

That completes the development of this project.

Here’s a demonstration of the application.

That’s it for this project. Please let me know your feedback and suggestions. Thank you! Cheers!😀
