Microsoft Azure’s Computer Vision Cognitive Service: A practical example of its use through an OutSystems mobile app

João Oliveira
Published in valanticlcs
7 min read · Dec 1, 2022
Photo by Austin Distel on Unsplash

1) Computer Vision

What is it?

Vision is arguably the most important human sense, since it provides a detailed three-dimensional description of the reality in which a person is embedded (Bebis et al., 2003).

In this sense, Computer Vision is a field of AI that allows computers to understand and interpret visual information from still images and video sequences. Its main functions encompass the processing, analysis and understanding of digital images and the consequent extraction of relevant information, in order to build detailed descriptions about a given event (Klette, 2014).

If AI allows computers to “think”, Computer Vision provides them with the ability to observe and understand (IBM, 2021).

How does it work?

Computer Vision works according to one of the most common Machine Learning (ML) techniques applied in image analysis and recognition: deep learning using convolutional neural networks (CNN).

Computer Vision needs vast amounts of data to be effective. For example, to train a computer to recognize various types of car tires, the model must be fed large numbers of images of tires and tire-related components, so that the computer “learns” the similarities and differences between each type and can later suggest the most suitable option to the user.

With this in mind, ML uses algorithmic models that allow a computer or system to be “self-taught” in the context of visual data. If enough data is fed through the model, the computer will “look” at the data and learn to differentiate one image from another. This is one of the advantages of these algorithms: they allow the machine to “learn” by itself.

CNNs, in turn, support the ML model by decomposing images into pixels and cataloging them with tags or labels. Much like humans, CNNs let the model recognize the texture of an image’s components and the distances between them by analyzing their shape and positioning, accumulating information about them as it iterates on its recognition predictions (IBM, 2021).
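The pixel-level filtering at the heart of a CNN layer can be illustrated with a toy 2D convolution in plain Python. The hand-written vertical-edge kernel below is an assumption for illustration; a real CNN learns thousands of such kernels from training data:

```python
def convolve2d(image, kernel):
    """Slide a small kernel over a 2D image (valid padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            # Weighted sum of the pixel neighbourhood under the kernel.
            acc = sum(
                image[y + dy][x + dx] * kernel[dy][dx]
                for dy in range(kh)
                for dx in range(kw)
            )
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge kernel: responds where brightness changes left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

# 4x4 "image": dark left half, bright right half.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]

print(convolve2d(image, kernel))  # strong response: the edge sits under every window
```

Stacking many such learned filters, interleaved with non-linearities and pooling, is what lets a CNN progressively turn raw pixels into the tags and labels mentioned above.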

Photo by Ion Fet on Unsplash

2) Computer Vision as a Microsoft Azure Cognitive Service

The Computer Vision API from Microsoft Azure gives software developers access to advanced image-processing algorithms without requiring in-depth knowledge of ML or AI. By uploading a photo file or specifying an image URL, Microsoft Azure’s algorithms can analyze the visual content in different ways, based on user-defined inputs.

Using this API brings several advantages and improves software production quality: compared to other services, the learning curve of Microsoft Azure Computer Vision is much gentler, since the platform ships with numerous tutorials, examples, and ready-made quickstarts.

After implementation, response times are usually quite good, and Microsoft provides a high-level SLA (Service Level Agreement) guaranteeing that image processing will be available 99.9% of the time.
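As a quick sanity check, a 99.9% availability guarantee translates into a bounded amount of downtime per year:

```python
hours_per_year = 24 * 365          # 8,760 hours
max_downtime_hours = hours_per_year * (1 - 0.999)
print(round(max_downtime_hours, 2))  # about 8.76 hours of allowed downtime per year
```

In other words, under the 99.9% SLA the service may be unavailable for at most roughly nine hours in a whole year.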

Photo by Turag Photography on Unsplash

3) Microsoft Azure’s Computer Vision and OutSystems

The example provided below refers to an academic project where an application for the management of sports performance in CrossFit was developed.

Among the many features implemented in the app, the one that brought added value to the project was the one that allowed users to record their daily workout without having to manually type it into the app.

In CrossFit, workouts are commonly written on a whiteboard, so when inserting the daily workout into the app, users can record it through a photo: the application automatically transcribes the content of the submitted image into the appropriate field, sparing the user the task of typing it manually.

In the OutSystems coding platform, the connection to the cognitive services of Azure was made through a REST API, composed of two methods (figure 1):

- PostRecognizeText;

- GetTextOperation.

The first method takes an image in JPEG or PNG format, at least 50x50 pixels and at most 10000x10000 pixels, and returns a URL in text format, named OperationLocation.

On the other hand, the second method receives the OperationLocation from the first and returns all characters found in the submitted image.
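The two methods together form a submit-then-poll flow, which can be sketched with the standard library. The endpoint, subscription key, and the `v2.0/recognizeText` path below are assumptions for illustration; check the API version configured on your Azure resource:

```python
import json
import urllib.request

# Placeholders: substitute your own Azure resource endpoint and key.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<subscription-key>"

def build_recognize_request(image_bytes):
    """First method (PostRecognizeText): submit the image for analysis."""
    return urllib.request.Request(
        f"{ENDPOINT}/vision/v2.0/recognizeText?mode=Printed",
        data=image_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": KEY,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )

def recognize_text(image_bytes):
    """Run both methods: POST the image, then fetch the OperationLocation."""
    with urllib.request.urlopen(build_recognize_request(image_bytes)) as resp:
        # The URL of the pending operation comes back as a response header.
        operation_location = resp.headers["Operation-Location"]
    # Second method (GetTextOperation): retrieve the recognition result.
    poll = urllib.request.Request(
        operation_location,
        headers={"Ocp-Apim-Subscription-Key": KEY},
    )
    with urllib.request.urlopen(poll) as resp:
        return json.load(resp)
```

In practice the second call may need to be retried while the operation status is still “Running”; the sketch omits that polling loop for brevity.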

These two methods are later used in two server actions so that the user can digitize the workout. As can be seen in figure 1, the AzureRecognizeText action is responsible for analyzing and processing the textual content present in the submitted image, while AzureGetText extracts the characters, making them visible to the user.

Figure 1 — REST API methods

From the user’s point of view, obtaining a description of their workout through the “Scan WOD” functionality requires three specific actions.

First, the image must be uploaded to the application, either by taking a photo on the spot or by picking one from the smartphone’s gallery (figure 2).

Figure 2 — Image upload

Then, the image is uploaded to the cloud as soon as the user presses the “Process WOD” button (figure 3).

Figure 3 — Image content processing

Internally, what happens is the following (figure 4):

1) As soon as Microsoft Azure Blob Storage receives a binary file (image), it triggers the platform’s event handler, Azure Functions.

2) Then, the API call is made, which returns a URL in text format, called OperationLocation (PostRecognizeText method).

3) All metadata about the loaded image, including API processing results, are stored in Cosmos DB.

4) After that, the URL of the loaded image is sent to the app (OperationLocation).

Figure 4 — Detail steps of the first API call
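The four steps above can be sketched as a stubbed pipeline. `call_post_recognize_text`, `on_blob_uploaded`, and the `cosmos_db` dictionary below are stand-ins for the Azure services, not real SDK calls:

```python
def call_post_recognize_text(image_bytes):
    """Stand-in for the PostRecognizeText API call (step 2).

    The real call returns the OperationLocation URL of the pending analysis.
    """
    return "https://example.invalid/textOperations/42"

def on_blob_uploaded(image_id, image_bytes, cosmos_db):
    """Stand-in for the Azure Functions handler triggered by Blob Storage (step 1)."""
    # Step 2: call the API, which returns the OperationLocation URL.
    operation_location = call_post_recognize_text(image_bytes)
    # Step 3: store the image metadata and processing result in Cosmos DB.
    cosmos_db[operation_location] = {"image_id": image_id}
    # Step 4: return the OperationLocation so it can be sent back to the app.
    return operation_location

db = {}
loc = on_blob_uploaded("wod-001", b"...jpeg bytes...", db)
print(loc, db[loc])
```

Keying the Cosmos DB record by OperationLocation is what later lets the second API call find the right image from the URL alone.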

On the user’s side, a message is displayed confirming that the content of the image has been processed successfully; the user then presses the “Get WOD Description” button to obtain it (figure 5).

Figure 5 — Text returned from the API

Again, the internal process is carried out as follows (figure 6):

1) The OperationLocation is sent back triggering the event processor, Azure Functions.

2) Next, Cosmos DB is queried for the image data, using the received OperationLocation.

3) The image ID is sent to Blob Storage, responsible for storing all binary files that are loaded by the smartphone.

4) Then, the binary file referring to the received OperationLocation is returned, and the API call is executed.

5) The API, in turn, returns a structure in JSON format, containing all image parameters (GetTextOperation method).

6) Before the data is sent to the user, it is processed so that only the lines containing characters are kept.

Figure 6 — Detail steps of the second API call
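Step 6, keeping only the lines that actually contain characters, can be sketched over the kind of JSON structure the GetTextOperation method returns. The sample payload below is illustrative, not a captured response:

```python
def extract_workout_text(result):
    """Join the recognized lines into the text shown in the app's WOD field."""
    lines = result.get("recognitionResult", {}).get("lines", [])
    # Filter: keep only the lines where the API actually found characters.
    return "\n".join(line["text"] for line in lines if line.get("text"))

# Illustrative shape of a completed text-recognition operation.
sample = {
    "status": "Succeeded",
    "recognitionResult": {
        "lines": [
            {"text": "WOD"},
            {"text": ""},  # empty line, filtered out
            {"text": "21-15-9"},
            {"text": "Thrusters / Pull-ups"},
        ]
    },
}
print(extract_workout_text(sample))
```

The joined string is what finally lands in the app’s workout-description field, completing the “Scan WOD” flow.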

4) Final Thoughts

Traditionally, the process of analyzing and implementing services such as Computer Vision required a wide range of knowledge in ML, proving to be time-consuming and very detailed work due to the need to train the data set at hand.

However, in today’s industry, intelligent services that provide Computer Vision abstract away these complexities behind a simple web API call, allowing much faster and simpler software development, since the knowledge needed is reduced to consuming RESTful endpoints.

In the example shown, we could see that workout digitization, through Computer Vision, presents itself as a very powerful and revolutionary tool to optimize the user’s sporting experience in a very practical and intuitive way.

Thanks for reading!

References

Bebis, G., Egbert, D., & Shah, M. (2003). Review of computer vision education. IEEE Transactions on Education, 46(1), 2–21. https://doi.org/10.1109/TE.2002.808280

IBM. (2021). What is computer vision?. Accessed August 4, 2021 at https://www.ibm.com/topics/computer-vision

Oliveira, João (2021). Aplicação destinada à gestão da performance desportiva no CrossFit suportada por um serviço cognitivo de inteligência artificial [Master’s dissertation, Instituto Superior de Engenharia do Porto]. http://hdl.handle.net/10400.22/19548

Klette, R. (2014). Concise Computer Vision (1st ed.). Springer. https://doi.org/10.1007/978-1-4471-6320-6
