Image Captioning in Unity Android using Azure Computer Vision

Nirmal Mendis
Published in CodeX
10 min read · Aug 28, 2021
Image Captioning Azure

Image captioning is the use of Artificial Intelligence and Machine Learning to generate descriptions for images, similar to how a human would describe the contents of an image.

Developing such image captioning models is done using Artificial Neural Networks; LSTMs are among the most commonly used network types in this field. Training such a model to generate captions with high accuracy is a very intensive task. Fortunately, Microsoft Azure provides a free service to analyze images and extract information, including captions, from them.

This article is about how to use the Azure Computer Vision service in Unity. We will use it to generate a caption for an image selected from the mobile gallery, but this is only to demonstrate one method of using Azure Computer Vision in Unity. The service has many features other than image captioning, which can be used simply by changing a few parameters in the request URL.
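For instance, with the v3.2 REST API used later in this article, switching features is mostly a matter of changing the path and query string. The parameters below are illustrative; the exact versions and parameters should be verified against the Azure documentation:

```
// image captioning (what this article uses)
{endpoint}vision/v3.2/describe?maxCandidates=1

// tags, categories, faces and more in one call
{endpoint}vision/v3.2/analyze?visualFeatures=Tags,Description

// printed and handwritten text recognition (Read API)
{endpoint}vision/v3.2/read/analyze
```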

Azure Computer Vision API — “An AI service that analyzes content in images and video” as defined on the Azure Computer Vision website.

You can find plenty of information about Azure Computer Vision on their website, which also includes a demo for testing the API.

So without further ado, let’s get started.

1. Create an Azure Computer Vision Service Resource in Azure

To do this, you need an Azure account, which you can sign up for here.

Next, log in to the Azure portal and go to this link to create a Computer Vision Resource. You should get the following page.

Azure Computer Vision Resource Creation

Here, select your subscription type and resource group (create one if needed), then select the region closest to your location, give the instance a name (this name will be used in the endpoint as well) and pick a pricing tier as you wish.

The free tier provides 5,000 API calls per month, at up to 20 calls per minute.

Next, accept the terms and conditions and click on ‘Review + create’. Now review the details and click on ‘Create’ and you will be redirected to a page as below.

Azure Computer Vision Resource

Next click on ‘Go to resource’ where you will be taken to the resource page.

Azure Computer Vision Resource Keys and Endpoint

Click on ‘Keys and Endpoint’ to get your API key and endpoint which will be used in the Unity application.

Now we have completed the creation of the Vision API Resource. You can test your resource here.

2. Create the Unity Project

Let’s create a Unity project and change the platform to Android from the build settings.

(File -> Build Settings -> Android -> Switch Platform)

Next, let’s add a Panel, and inside it a RawImage, a Button and a Text; resize and reposition them as you wish. I have also changed the aspect ratio to 2160x1080 Portrait in the Game window. Here is my setup:

Unity Scene Set up

After that, let’s create a C# script in the Assets section and name it ‘caption.cs’ (or any name you want).

Before starting to code, there are several things to set up in this project.

1. Setting Up the Newtonsoft Library

Here, we will use Newtonsoft to handle JSON-related operations. In a typical C# project, we could simply install the Newtonsoft NuGet package and use it, but this is not the case in Unity. When using such libraries, we need to manually add the DLL file to the Assets folder. Here’s how we can do it.

Go to this link and click on ‘Download package’ to download the Newtonsoft NuGet package. Once downloaded, you will get a file named ‘newtonsoft.json.13.0.1.nupkg’. Rename it to ‘newtonsoft.json.13.0.1.zip’ and extract its contents. The extracted folder contains a ‘lib’ folder, inside which is a folder named ‘net45’. There you will find a DLL file named ‘Newtonsoft.Json.dll’. Copy this file into the Assets folder of the Unity project.

2. Setting Up csc.rsp file

The next thing we need to do is to set up a ‘csc.rsp’ file in the Assets section of the Unity project. You can use notepad to create this file and save it as ‘csc.rsp’. (The file extension should be ‘.rsp’ and not ‘.txt’)

Include the following in this file and save it:

-r:System.Net.Http.dll 
-r:System.Web.dll

The reason to do the above is that we will be using the above libraries in our project and therefore Unity needs access to the relevant DLL files.

3. Changing the API Compatibility Level

Next, we need to change the API compatibility level to .NET 4.x. To do this, in your Unity project go to:

Edit -> Project Settings -> Player -> Other Settings -> API Compatibility Level, and change it from ‘.NET Standard 2.0’ to ‘.NET 4.x’.

The reason for this is that we will be using ‘dynamic’ objects and JSON deserialization, both of which gave errors under ‘.NET Standard 2.0’.

(If you only need the string result returned from the API call and do not need to deserialize the JSON, you do not need to change the API compatibility level.)

4. Setting Up NativeGallery Plugin

Since we will be accessing the gallery on the phone, we can use the NativeGallery plugin, which is freely available thanks to its developer. You can download the package from GitHub or import it directly from the Unity Asset Store. If you download the package from GitHub, drag and drop the package file into the Unity Assets folder and select ‘Import All’ in the pop-up window that appears.

Now we have finished setting up, so let’s start coding.

C# script

First, let’s import the libraries we need for this project. You can skip this and later add the libraries whenever an error is indicated (the required library import will be automatically suggested by Visual Studio).

using Newtonsoft.Json;
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Web;
using UnityEngine;
using UnityEngine.UI;

Now let’s set up the member variables needed for our project. Create static string variables to store the subscription key, the endpoint and the uriBase (which includes the services we request from Azure). Next, create two public variables to store the RawImage and the text gameobjects we created previously. To do this, open the ‘caption.cs’ script in Visual Studio and create the following variables in the caption class:

// Add your Computer Vision subscription key and endpoint
static string subscriptionKey = "PASTE_YOUR_COMPUTER_VISION_SUBSCRIPTION_KEY_HERE";
//azure endpoint
static string endpoint = "PASTE_YOUR_COMPUTER_VISION_ENDPOINT_HERE";
//azure endpoint service accessed
static string uriBase = endpoint + "vision/v3.2/describe?";
public Text showText;
public RawImage imgView;

Next, let’s create a function to make the API call. Sample code for this task is provided by Azure here. My code is also based on this repository, changed to suit our scenario.

First, let’s create an async method with a try-catch block in it as below:

async void MakeRequest(string path)
{
    try
    {
    }
    catch (Exception e)
    {
    }
}

In the try block, create an HttpClient, add the subscription key, create the URI and setup the request parameters as follows:

HttpClient client = new HttpClient();
var requestParameters = HttpUtility.ParseQueryString(string.Empty);
// Request headers
client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
// Request parameters
requestParameters["maxCandidates"] = "1";
requestParameters["language"] = "en";
requestParameters["model-version"] = "latest";
// Assemble the URI for the REST API method.
string uri = uriBase + requestParameters;
HttpResponseMessage response;

Also, declare an HttpResponseMessage object as above.

Next, we need to get the image as a byte array. Include the following code to do this:

// Request body
byte[] byteData = GetImageAsByteArray(path);

‘GetImageAsByteArray()’ converts an image file to a byte array; we will implement it later, so ignore any errors for now.

Next, include the following code snippet, to call the API and get the result asynchronously.

using (var content = new ByteArrayContent(byteData))
{
    content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
    // Asynchronously call the REST API method.
    response = await client.PostAsync(uri, content);
    // Asynchronously get the JSON response.
    String responseText = await response.Content.ReadAsStringAsync();
    try
    {
        //convert result to a dictionary
        var jsonResult = JsonConvert.DeserializeObject<Dictionary<string, dynamic>>(responseText);
        //obtain captions object from jsonResult
        var captionsObj = jsonResult["description"]["captions"];
        //convert to string
        String captions = captionsObj.ToString();
        //remove '[' and ']' symbols
        captions = captions.Replace("[", "");
        captions = captions.Replace("]", "");
        //reuse dictionary to store the captions dictionary
        jsonResult = JsonConvert.DeserializeObject<Dictionary<string, dynamic>>(captions);
        //get the caption text object
        var captionText = jsonResult["text"];
        //set the text view to the caption
        showText.text = captionText.ToString();
    }
    catch (Exception e)
    {
        //display any exception along with the raw response
        showText.text = e.Message + "\n " + responseText;
    }
}

To briefly explain the above code: using the byte array of the image, we make a POST request to the URI created earlier and store the result in the ‘responseText’ string variable. The result is returned in JSON format, as shown here.

Azure Computer Vision API call result
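For reference, the ‘describe’ response has roughly the following shape (the values here are illustrative, not real output):

```json
{
  "description": {
    "tags": ["outdoor", "grass", "dog"],
    "captions": [
      { "text": "a dog sitting in the grass", "confidence": 0.94 }
    ]
  },
  "requestId": "...",
  "metadata": { "height": 1080, "width": 2160, "format": "Png" }
}
```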

So, first, we deserialize the JSON string into a Dictionary using JsonConvert. Next, we obtain the ‘description’ object and, from that, the ‘captions’ object. This is done using the following code (already included in the previous code segment):

//obtain captions object from jsonResult
var captionsObj = jsonResult["description"]["captions"];

Unfortunately, we cannot directly obtain the caption text, because ‘captions’ is a JSON array (note the ‘[’ and ‘]’ symbols), which prevents us from deserializing it straight into a Dictionary. Since we requested only one candidate, we can convert ‘captionsObj’ to a string, remove the ‘[’ and ‘]’ symbols, deserialize the remaining string back into a Dictionary, and obtain the caption text. This is done by the following code:

//convert to string
String captions = captionsObj.ToString();
//replace '[' and ']' symbols
captions = captions.Replace("[", "");
captions = captions.Replace("]", "");
//reuse dictionary to store captions dictionary
jsonResult = JsonConvert.DeserializeObject<Dictionary<string, dynamic>>(captions);
//get CaptionText Object
var captionText = jsonResult["text"];
//set Textview to caption
showText.text = captionText.ToString();

This code was already included in the MakeRequest() function above, so take care not to add it again. The final MakeRequest() function should look as follows:

async void MakeRequest(string path)
{
    try
    {
        HttpClient client = new HttpClient();
        var requestParameters = HttpUtility.ParseQueryString(string.Empty);
        // Request headers
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
        // Request parameters
        requestParameters["maxCandidates"] = "1";
        requestParameters["language"] = "en";
        requestParameters["model-version"] = "latest";
        // Assemble the URI for the REST API method.
        string uri = uriBase + requestParameters;
        HttpResponseMessage response;
        // Request body
        byte[] byteData = GetImageAsByteArray(path);
        using (var content = new ByteArrayContent(byteData))
        {
            content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
            // Asynchronously call the REST API method.
            response = await client.PostAsync(uri, content);
            // Asynchronously get the JSON response.
            String responseText = await response.Content.ReadAsStringAsync();
            try
            {
                //convert result to a dictionary
                var jsonResult = JsonConvert.DeserializeObject<Dictionary<string, dynamic>>(responseText);
                //obtain captions object from jsonResult
                var captionsObj = jsonResult["description"]["captions"];

                //convert to string
                String captions = captionsObj.ToString();
                //remove '[' and ']' symbols
                captions = captions.Replace("[", "");
                captions = captions.Replace("]", "");
                //reuse dictionary to store the captions dictionary
                jsonResult = JsonConvert.DeserializeObject<Dictionary<string, dynamic>>(captions);
                //get the caption text object
                var captionText = jsonResult["text"];
                //set the text view to the caption
                showText.text = captionText.ToString();
            }
            catch (Exception e)
            {
                //display any exception along with the raw response
                showText.text = e.Message + "\n " + responseText;
            }
        }
    }
    catch (Exception e)
    {
        //surface any network or file errors instead of failing silently
        showText.text = e.Message;
    }
}
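As a side note, the bracket-stripping above only works because we requested a single caption candidate. A cleaner (hypothetical) alternative sketch, assuming the response shape shown earlier, is to deserialize the response into small typed classes and index the captions array directly:

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical classes mirroring only the parts of the
// 'describe' response that we need.
class CaptionEntry
{
    public string text;
    public double confidence;
}

class DescriptionResult
{
    public List<CaptionEntry> captions;
}

class DescribeResponse
{
    public DescriptionResult description;
}

static class CaptionParser
{
    // Returns the text of the first caption candidate.
    public static string FirstCaption(string responseText)
    {
        var response = JsonConvert.DeserializeObject<DescribeResponse>(responseText);
        return response.description.captions[0].text;
    }
}
```

With these classes, the whole inner try block reduces to `showText.text = CaptionParser.FirstCaption(responseText);`, and it works for any number of candidates, not just one.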

Now let’s create the GetImageAsByteArray() function. It is provided in the Azure sample on GitHub.

byte[] GetImageAsByteArray(string imageFilePath)
{
    // Open a read-only file stream for the specified file.
    using (FileStream fileStream = new FileStream(imageFilePath, FileMode.Open, FileAccess.Read))
    using (BinaryReader binaryReader = new BinaryReader(fileStream))
    {
        // Read the file's contents into a byte array.
        return binaryReader.ReadBytes((int)fileStream.Length);
    }
}

This function takes the path of an image file and returns its contents as a byte array.
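Incidentally, the .NET standard library can do the same in one call; this equivalent sketch could replace the method above in ‘caption.cs’:

```csharp
using System.IO;

byte[] GetImageAsByteArray(string imageFilePath)
{
    // File.ReadAllBytes opens the file, reads all of its
    // contents into a byte array and closes it for us.
    return File.ReadAllBytes(imageFilePath);
}
```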

Now, to open the gallery on the phone and allow the user to pick an image and generate a caption, let’s include the following code:

public void PickImage()
{
    //maximum size (in pixels) for the loaded texture
    int maxSize = 512;
    NativeGallery.Permission permission = NativeGallery.GetImageFromGallery((path) =>
    {
        if (path != null)
        {
            //show a waiting message while the API call runs
            showText.text = "Waiting...";
            // Create a Texture from the selected image
            Texture2D texture = NativeGallery.LoadImageAtPath(path, maxSize);
            if (texture == null)
            {
                Debug.Log("Couldn't load texture from " + path);
                return;
            }
            // set the image texture
            imgView.texture = texture;
            //pass the path to get the caption
            MakeRequest(path);
        }
    }, "Select a PNG image", "image/*");
}

Inside the callback of the GetImageFromGallery() function, we call the MakeRequest() function and pass the image file path to obtain the caption.

For more information on NativeGallery functions, visit the Github repository of the developer.

Now we are done with coding. Let’s go back to Unity and add the ‘caption.cs’ script as a component of the Button gameobject we created. Next, drag and drop the RawImage and Text gameobjects onto the caption script’s variables as shown below.

Unity Scene

Next, select the button, and in the inspector window, in the Button section, under the OnClick() section, click on the + icon and add a click event. Select the button as the object and set the function to ‘caption.PickImage’ as shown above.

Now we have completed our project. Build and run the project to test the application. Here’s what I got:

Unity Android and Azure Image Captioning Example

This application might seem quite useless because, obviously, we can see what’s in the image😅, but I created it to demonstrate how to use the Azure Computer Vision service in Unity. There are many other useful applications of this service that can be integrated into Unity and Android. Here’s an example where I created an application to describe the surrounding environment for the visually impaired — Virtual Eyes — A simple surrounding describing app for the visually impaired.

That’s it for this project. Please let me know your feedback. Thank you! Cheers!😀
