How Do I Use the Azure API for Object Detection?

Divyesh Dharaiya
Sep 12 · 10 min read

This blog is the first in a series; each post will cover a different part of Object Detection.

Wondering what all the hype around deep learning is about? How can you, as a practitioner, use it to add value to your organization? This series of blog posts will help you understand what object detection is in general, which key performance metrics to keep an eye on, and how you can leverage state-of-the-art methods to get the job done in less time.

Outline:

1. Understanding the Problem
2. Using Azure API for Object Detection
3. Overview of Deep Learning Models

Prerequisites:

1. Knowledge of Machine Learning services
2. Knowledge about Web APIs and their working
3. Introductory knowledge about Performance Metrics

What is Object Detection?

Object Detection, in a nutshell, is about outputting bounding boxes along with class labels signifying the objects enclosed within those boxes. There can be multiple objects in a single image, such as a chair, handbag, desk, and laptop. The objects can be of the same type (say, two bottles) or of different types, and they may also overlap with each other.

Image URL: https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Detected-with-YOLO--Schreibtisch-mit-Objekten.jpg/330px-Detected-with-YOLO--Schreibtisch-mit-Objekten.jpg

Object Detection vs Image Segmentation:

Object Detection is different from Image Segmentation in that, in Image Segmentation, we try to mark the exact pixels: typically we want to label each pixel that belongs to an object, say a handbag. Hence, in Object Detection we care about bounding boxes, while in Image Segmentation we care about pixels. Algorithms that work at the pixel level, like U-Nets, are time-consuming; they try to build a pixel map saying, for instance, that all of these pixels belong to the laptop. Since we want object detection to be fast, we work here with bounding boxes.

Now we know what the input and the output are: the image is the input, and bounding boxes with class labels are the output. There are multiple ways of representing a bounding box, say a corner point plus a width and a height, or two opposite corners. For each bounding box, we also want to know which object is enclosed within it.
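As a quick, hypothetical sketch (not part of the Azure API; the function names are mine), here is how the two most common bounding-box conventions relate to each other: a top-left corner plus width and height versus a pair of opposite corners.

<pre><code>
# A minimal sketch showing two common bounding-box representations and how to
# convert between them. Function names are illustrative only.

def xywh_to_corners(x, y, w, h):
    """(top-left x, top-left y, width, height) -> (x_min, y_min, x_max, y_max)."""
    return x, y, x + w, y + h

def corners_to_xywh(x_min, y_min, x_max, y_max):
    """Two opposite corners -> (x, y, width, height)."""
    return x_min, y_min, x_max - x_min, y_max - y_min

print(xywh_to_corners(730, 66, 135, 85))   # (730, 66, 865, 151) -- values from the JSON below
</code></pre>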

Using Azure API for Object Detection:

We will work our way through with the Azure API, since its trial doesn't charge you, while Google Compute requires a credit card and a lot of readers might not have one. Keep in mind that I am assuming you know the basics of web APIs and how they work; I am not going to explain the underlying details.

If you just do a Google search for "azure object detection python", the very first search result is a page with sample code showing how to do it.


Image URL: https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/images/windows-kitchen.jpg

If you go through the documentation, you will see that an image goes in as input and JSON comes back as output, which is pretty much the same format used by every web API these days.

<pre>
<code>
{
  "objects": [
    {
      "rectangle": {
        "x": 730,
        "y": 66,
        "w": 135,
        "h": 85
      },
      "object": "kitchen appliance",
      "confidence": 0.501
    },
    {
      "rectangle": {
        "x": 523,
        "y": 377,
        "w": 185,
        "h": 46
      },
      "object": "computer keyboard",
      "confidence": 0.51
    },
    {
      "rectangle": {
        "x": 471,
        "y": 218,
        "w": 289,
        "h": 226
      },
      "object": "Laptop",
      "confidence": 0.85,
      "parent": {
        "object": "computer",
        "confidence": 0.851
      }
    },
    {
      "rectangle": {
        "x": 654,
        "y": 0,
        "w": 584,
        "h": 473
      },
      "object": "person",
      "confidence": 0.855
    }
  ],
  "requestId": "a7fde8fd-cc18-4f5f-99d3-897dcd07b308",
  "metadata": {
    "width": 1260,
    "height": 473,
    "format": "Jpeg"
  }
}
</code>
</pre>

Let's try to make sense of the JSON output: the object returned contains an array comprising several sub-objects, each signifying a possible object detected by the model in the input picture. For each bounding box (which is a rectangle), we get back the top-left x and y coordinates together with the width and height. For each prediction, the API also returns a confidence value, a probability between 0 and 1 indicating how confident the model is that the prediction is correct. You may even see a hierarchy of objects, such as the laptop also being predicted as a computer via its parent field, which matches how we group things in the real world.
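As a minimal sketch (not taken from the official documentation), here is how you might walk over that JSON in Python and print each detection:

<pre><code>
import json

# response_text is assumed to hold the raw JSON string returned by the API,
# e.g. the payload shown above; here it is abbreviated to a single object.
response_text = '''
{
  "objects": [
    { "rectangle": {"x": 730, "y": 66, "w": 135, "h": 85},
      "object": "kitchen appliance", "confidence": 0.501 }
  ],
  "metadata": {"width": 1260, "height": 473, "format": "Jpeg"}
}
'''

result = json.loads(response_text)
for detection in result["objects"]:
    rect = detection["rectangle"]
    print("{} ({:.2f}) at x={}, y={}, w={}, h={}".format(
        detection["object"], detection["confidence"],
        rect["x"], rect["y"], rect["w"], rect["h"]))
</code></pre>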

Limitations of the Azure API:

Going through some of its limitations:

1) It cannot detect objects which are less than 5% of the total area of the image.

2) Objects that are stacked close together are difficult for it to identify.

3) It can’t differentiate brand or product names.

For the final point, there is a different API altogether from Microsoft; let's pay it a visit.

Computer Vision API-2.0

The Object Detection feature is part of the Analyze Image API. Assuming you know how web-based APIs work, the client sends a request to the server and gets output back based on the request variables/parameters. If you ask only for objects, it will return only that; when nothing is specified, other valid feature types such as adult content, brands, colors, faces, celebrities, landmarks, etc. are returned as well. You can also specify the language you want the output in, for example English, Chinese, Japanese, or Spanish.
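For instance (a minimal sketch with placeholder values), the query string for a request that asks only for objects, with English output, could be built like this:

<pre><code>
from urllib.parse import urlencode

# 'visualFeatures' and 'language' are the parameters discussed above;
# 'Objects' restricts the response to object detection only.
params = urlencode({
    "visualFeatures": "Objects",
    "language": "en",
})
print(params)  # visualFeatures=Objects&language=en
</code></pre>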

You can also look at the JSON given below: similar to what we saw above, it describes the several kinds of content detected in the input image, each prediction accompanied by a confidence value. At the very end you can see the bounding box of the object detected in the image.

<pre>
<code>
{
  "categories": [
    {
      "name": "abstract_",
      "score": 0.00390625
    },
    {
      "name": "people_",
      "score": 0.83984375,
      "detail": {
        "celebrities": [
          {
            "name": "Satya Nadella",
            "faceRectangle": {
              "left": 597,
              "top": 162,
              "width": 248,
              "height": 248
            },
            "confidence": 0.999028444
          }
        ],
        "landmarks": [
          {
            "name": "Forbidden City",
            "confidence": 0.9978346
          }
        ]
      }
    }
  ],
  "adult": {
    "isAdultContent": false,
    "isRacyContent": false,
    "adultScore": 0.0934349000453949,
    "racyScore": 0.068613491952419281
  },
  "tags": [
    {
      "name": "person",
      "confidence": 0.98979085683822632
    },
    {
      "name": "man",
      "confidence": 0.94493889808654785
    },
    {
      "name": "outdoor",
      "confidence": 0.938492476940155
    },
    {
      "name": "window",
      "confidence": 0.89513939619064331
    }
  ],
  "description": {
    "tags": [
      "person",
      "man",
      "outdoor",
      "window",
      "glasses"
    ],
    "captions": [
      {
        "text": "Satya Nadella sitting on a bench",
        "confidence": 0.48293603002174407
      }
    ]
  },
  "requestId": "0dbec5ad-a3d3-4f7e-96b4-dfd57efe967d",
  "metadata": {
    "width": 1500,
    "height": 1000,
    "format": "Jpeg"
  },
  "faces": [
    {
      "age": 44,
      "gender": "Male",
      "faceRectangle": {
        "left": 593,
        "top": 160,
        "width": 250,
        "height": 250
      }
    }
  ],
  "color": {
    "dominantColorForeground": "Brown",
    "dominantColorBackground": "Brown",
    "dominantColors": [
      "Brown",
      "Black"
    ],
    "accentColor": "873B59",
    "isBWImg": false
  },
  "imageType": {
    "clipArtType": 0,
    "lineDrawingType": 0
  },
  "objects": [
    {
      "rectangle": {
        "x": 25,
        "y": 43,
        "w": 172,
        "h": 140
      },
      "object": "person",
      "confidence": 0.931
    }
  ]
}
</code></pre>

Big tech companies usually provide code in their documentation. In the Python code below, you can see that we provide the subscription key as part of the request headers, along with several of the request parameters we discussed. In the visualFeatures parameter you specify whether you want objects, faces, or something else. The snippet then establishes an HTTPS connection, sends the request, and reads the response back as a byte stream; to pretty-print it, you can decode it with something like print(json.dumps(json.loads(data), indent=2)).

<pre><code>
########### Python 2.7 #############
import httplib, urllib, base64

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': '{subscription key}',
}

params = urllib.urlencode({
    # Request parameters
    'visualFeatures': 'Objects',
    'details': '{string}',
    'language': 'en',
})

try:
    conn = httplib.HTTPSConnection('westcentralus.api.cognitive.microsoft.com')
    conn.request("POST", "/vision/v2.0/analyze?%s" % params, "{body}", headers)
    response = conn.getresponse()
    data = response.read()
    print(data)
    conn.close()
except Exception as e:
    print("Error: {0}".format(e))

####################################

########### Python 3.2 #############
import http.client, urllib.request, urllib.parse, urllib.error, base64

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': '{subscription key}',
}

params = urllib.parse.urlencode({
    # Request parameters
    'visualFeatures': 'Categories',
    'details': '{string}',
    'language': 'en',
})

try:
    conn = http.client.HTTPSConnection('westcentralus.api.cognitive.microsoft.com')
    conn.request("POST", "/vision/v2.0/analyze?%s" % params, "{body}", headers)
    response = conn.getresponse()
    data = response.read()
    print(data)
    conn.close()
except Exception as e:
    print("Error: {0}".format(e))

####################################
</code>
</pre>

If you're interested in looking at the complete code, see the official API reference documentation.
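For reference, the same call can also be made with the requests library; this is only a sketch (not from the official sample), and the region, subscription key, and image URL are placeholders you would replace with your own:

<pre><code>
import requests

# Placeholders: substitute your own subscription key and endpoint region.
subscription_key = "{subscription key}"
analyze_url = "https://westcentralus.api.cognitive.microsoft.com/vision/v2.0/analyze"

headers = {"Ocp-Apim-Subscription-Key": subscription_key}
params = {"visualFeatures": "Objects", "language": "en"}
body = {"url": "https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/images/windows-kitchen.jpg"}

response = requests.post(analyze_url, headers=headers, params=params, json=body)
response.raise_for_status()
print(response.json())
</code></pre>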

Performance Metrics:

How do we measure if an algorithm is good or not? There are a couple of terms you should be comfortable using:

1) Ground Truth: the absolute truth, generally labeled or given by a human; here, a bounding box drawn by a human annotator when asked to do so. In Machine Learning lingo it is usually denoted y.

2) Prediction: the output made by the machine/model. In Machine Learning lingo it is usually denoted ŷ (y-hat).

You want to see how close the machine's prediction is to the human annotation. What do you do? You take both rectangles, a.k.a. bounding boxes, and compute something called Intersection over Union, or IoU: the area of their intersection divided by the area of their union.


Now, what is the ideal case? When both bounding boxes overlap completely, the IoU value is 1. What is the worst case? When the boxes don't overlap at all, i.e. the intersection is 0, and consequently the IoU is 0. The common threshold used is: if IoU ≥ 0.5, the prediction is counted as positive (in a binary classification setup). This is also sometimes called the 50% IoU criterion. Now, this is the performance for one bounding box. When there are multiple objects in the same image, there will be many bounding boxes as well. So what then?
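To make the definition concrete, here is a small sketch (boxes are assumed to be in (x, y, w, h) form, as in the Azure responses above) that computes IoU for two rectangles:

<pre><code>
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Corners of the intersection rectangle.
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 100, 100), (0, 0, 100, 100)))   # 1.0  (perfect overlap)
print(iou((0, 0, 100, 100), (50, 0, 100, 100)))  # 0.33 (half-shifted box)
</code></pre>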

We were checking whether the predicted rectangle overlaps 50% or more with the ground-truth rectangle, so the rectangle-overlap problem is converted into a binary classification problem. But with multiple object types we have a multi-class classification problem. For each class (chair, person, ...), we compute the Average Precision (AP) over all objects of that class, which can be calculated as the area under the precision-recall curve.

Once you have computed the average precision for each class, take the mean of all of them and you get mean Average Precision (mAP). Many research papers use notation like mAP@0.5, signifying the IoU threshold the mAP is calculated with. Don't confuse it with MAP (maximum a posteriori) from statistics.
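As a rough sketch (assuming detections have already been matched to ground truth at IoU ≥ 0.5, so each prediction is marked as a true or false positive; the data below is made up), per-class AP and the final mAP could be computed with scikit-learn like this:

<pre><code>
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical per-class results: 1 = detection matched a ground-truth box
# (IoU >= 0.5), 0 = it did not; scores are the model's confidence values.
per_class_results = {
    "person": ([1, 1, 0, 1], [0.95, 0.90, 0.60, 0.55]),
    "chair":  ([1, 0, 1],    [0.80, 0.70, 0.40]),
}

aps = []
for cls, (matched, scores) in per_class_results.items():
    ap = average_precision_score(matched, scores)  # area under the precision-recall curve
    aps.append(ap)
    print("AP({}) = {:.3f}".format(cls, ap))

print("mAP = {:.3f}".format(np.mean(aps)))
</code></pre>

(A full mAP implementation also has to penalize ground-truth boxes that were never detected; this sketch skips that detail.)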

Overview of Deep Learning Models:

So far we have covered how someone without a deep learning background can use the available APIs to perform object detection without worrying about the algorithmic details. Now we move towards the algorithmic section. As you have seen, our input is an image or a video; we can break a video into a sequence of images and feed those to the model, and that works perfectly fine. What is the output? We want bounding boxes and the associated object class labels. We will use the COCO dataset for our discussion. It contains 80 object classes and more than a hundred thousand labeled images, so it is a fairly large, well-curated dataset for tasks such as Image Segmentation and Object Detection.

Now, the main trade-off is speed vs. mAP. Speed here basically means: given an input image, how fast can the algorithm give you the output? It can be measured in milliseconds per image or in frames per second. So, if an algorithm takes 50 ms to process an image, that is roughly 20 fps: since 1 s = 1000 ms, it handles 20 images per second. Humans see the world at roughly 24 fps. For systems like self-driving cars, where there is little time to identify other vehicles and the lane, or real-time face detection systems, where there is only a little time to identify the person entering, speed is critical. In other settings, mAP (average precision) is what matters most, say medical diagnosis or Optical Character Recognition (OCR), where we cannot afford too many mistakes for the sake of faster results.
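The latency-to-throughput conversion is just a division; as a tiny sketch:

<pre><code>
def fps_from_latency(latency_ms):
    """Frames per second achievable at a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms

print(fps_from_latency(50))  # 20.0 fps, the example above
print(fps_from_latency(25))  # 40.0 fps
</code></pre>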

For good mAP we have algorithms like R-CNN, Fast R-CNN, Faster R-CNN, and Feature Pyramid Network-based FRCNs; faster variants of these are available, but fundamentally they are not designed for speed. Then we have single-shot algorithms like SSD and RetinaNet. There is also the YOLO family: YOLO v1 (2015), YOLO v2 / YOLO9000 (so called because it can recognize over 9000 object categories by leveraging ImageNet labels), and the most recent in the lineup, YOLO v3 (April 2018). There are 30+ other algorithms for the same purpose.

Source: https://pjreddie.com/media/files/papers/YOLOv3.pdf

Now, looking at the benchmark chart, we can see that YOLO v3 has three variants according to the size of the input image it works with; YOLOv3-320 basically means the input images are 320x320. Our objective is higher mAP in less time, and YOLO v3 is super fast while still achieving a very good mAP. When the input image is smaller, it takes less time to process; when it is larger, it takes more time, but there is a better chance of correctly detecting smaller objects. So the choice among the YOLO v3 variants depends on how small the objects you wish to detect are. YOLO v3 is a really great architecture that aggregates good ideas from various other models.

In the next blog post, we will take a look at the architecture and the tweaks that make YOLOv3 one of the best models in the object detection space. Until then, happy learning!
