Identifying the sport in an image using the Sport Vision API
A couple of months ago, we launched the Sport Vision API, with the objective of making computer vision models tailored to sport accessible to the many. You can see an introduction to the API here:
This API offers multiple services, mostly targeting pictures of sport and of sport products. If you want to try it out, make sure to look at the documentation, follow the authentication process, and join our mailing list.
In this article, we will take a deeper look at the sport classifier endpoint, which uses computer vision to identify which sport is practiced in an image. We will discuss the computer vision model behind the endpoint, show how you can use the API, give some examples of results, and talk about what’s next for the API.
The article is organized as follows:
- An image classification problem
- Tackling the problem using transfer learning
- Training the model
- Deploying the computer vision model in production
- Calling our API
- Some examples of results
- What’s next for the API
Let’s go :D
1. An image classification problem
At Decathlon Developers, we offer a number of APIs to help you build your sport app or website. One of them is the Sports API, which provides you with an exhaustive list of sports practiced in the world, along with related information (description, relationship to other sports, and so on) and a popularity indicator.
The objective of the sport classifier endpoint is to identify, in a picture or a video, what sport from the Sports API is practiced. This is called an image classification problem, which aims to associate a category from a predefined list to an image.
We’ve already talked about image classification in a previous blog post. Image classification is generally done using something called a convolutional neural network (CNN). To properly understand what a CNN is, we need to understand that an image is nothing more than a table of numbers.
Each number in the table describes the intensity of a pixel. If the image is in color, the table has a third dimension, generally of size 3 assuming it is an RGB image.
In a nutshell, in a CNN, we move something called a filter across the image. A filter is simply a smaller image, containing a given shape. In the example below, the filter describes an “X” shape.
While moving the filter across the image, we multiply the values of the overlapping pixels, sum the results, and place that sum in another matrix, called a convolved feature matrix. For instance, when the filter (Figure 2) is placed at the top left corner of the image (Figure 1), four pixels match, hence the number 4 at the top left corner of the convolved feature matrix below. Similarly, when you place the filter (Figure 2) at the bottom left corner of the image (Figure 1), only two pixels match, hence the number 2 at the bottom left corner of the convolved feature matrix below.
As such, the convolved feature matrix indicates, for each location in the image, how similar that region is to the filter. Keeping the example of Figure 3, it thus tells us that the top left corner of the image (value of 4) is twice as similar to an “X” shape as the bottom left corner of the image (value of 2)!
In other words, a CNN helps us decompose an image into the basic shapes it contains. By training a classification model (generally a feed-forward neural network) on top of the convolutional layers, you can identify, given the shapes found in the image, which category it belongs to.
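To make the sliding-filter idea concrete, here is a minimal sketch in plain Python. The 5 x 5 image and the 3 x 3 “X” filter are made-up values chosen for illustration (they are not the ones from the figures), and the function name convolve2d is ours. The image really is just a table of numbers, here a list of lists.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return
    the convolved feature matrix as a list of lists."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply overlapping pixels and sum the products.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        feature.append(row)
    return feature


# Hypothetical 3 x 3 filter describing an "X" shape.
X_FILTER = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

# Hypothetical 5 x 5 black-and-white image (1 = pixel on, 0 = pixel off).
IMAGE = [
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

if __name__ == "__main__":
    for row in convolve2d(IMAGE, X_FILTER):
        print(row)
    # [5, 0, 2]
    # [0, 3, 0]
    # [2, 0, 3]
```

In this toy image, the top left corner contains a complete “X” (score of 5), while the bottom left corner only matches two pixels of the filter (score of 2). A real CNN learns the values of many such filters instead of using hand-written ones.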
2. Tackling the problem using transfer learning
Training a computer vision model takes a lot of data. For each category you want your computer vision model to classify, a good rule of thumb is to have at least 100–1000 representative images available.
In our case, we are lucky to already have a large in-house database of sport images. If you are not so lucky, make sure to look at open-source datasets like COCO, Open Images and Google Conceptual Captions to get you started.
When you have a dataset of limited size, it is always a good idea to look at the technique of transfer learning. We explained this technique in greater detail in a previous post. In a nutshell, transfer learning means beginning with a model already trained to classify images, but for different categories. A lot of these models are available on TensorFlow Hub or at tf.keras.applications.
Generally, you will want to keep the CNN layers of the model (those that decompose the image into its basic shapes, as described above, a step called feature extraction), and replace the classification model built on top of the CNN. By reusing part of an existing classification model instead of training from scratch, you can generally reach a much higher accuracy with less data.
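As an illustration of this setup, here is a minimal Keras sketch. It is not our production code: the choice of InceptionV3 as the pretrained feature extractor, the size of the hidden layer, and the function name build_transfer_model are all assumptions made for the example.

```python
import tensorflow as tf


def build_transfer_model(num_classes, weights="imagenet", hidden_units=256):
    """Frozen pretrained CNN + new feed-forward classifier on top.

    Assumption: InceptionV3 is used as the feature extractor; any model
    from tf.keras.applications could be swapped in.
    """
    # Feature extraction step: pretrained CNN without its original classifier.
    base = tf.keras.applications.InceptionV3(
        include_top=False,
        weights=weights,
        input_shape=(299, 299, 3),
        pooling="avg",
    )
    base.trainable = False  # keep the pretrained filters untouched

    # Inputs are expected to be already scaled with
    # tf.keras.applications.inception_v3.preprocess_input.
    inputs = tf.keras.Input(shape=(299, 299, 3))
    x = base(inputs, training=False)

    # New classification model, trained from scratch on our categories.
    x = tf.keras.layers.Dense(hidden_units, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling build_transfer_model(num_classes=150) downloads the ImageNet weights on first use. Only the two Dense layers are updated during training, which is why a relatively small dataset can be enough.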
3. Training the model
For a couple of years now, there have been very nice and freely available solutions to train your neural network. In our team, our favorite is Google Colab.
In Colab, you get free access to a fairly powerful GPU (currently a P100) for up to 12 hours. There is even free access to a TPU, which we found very useful in the case of large datasets. Google Colab has provided us with sufficient power to train most of our image classification models, at least for the first few stages of development. On the few occasions we needed more processing power or computation time, we started a virtual machine on Google Cloud Platform.
It took a bit of research and trial and error to reach the levels of accuracy that we have today. To get you started more quickly on your project, here are a few conclusions we reached along the way:
- proper data cleaning is as important as having a large dataset. Make sure you go over your dataset multiple times, because just a few wrongly labeled images per category in your training dataset can significantly decrease your accuracy;
- put a lot of energy into the hyperparameter tuning step. Some of the most relevant hyperparameters are related to the structure of your feed-forward classification model (number of layers, hidden layer size), the optimization algorithm (Adam worked well, in our case) and the learning rate; and
- we also found that fine-tuning was critical to achieve high accuracy. Fine-tuning, as explained in a previous post, is the idea of slightly updating the parameters in your CNN after you have trained your feed-forward classification model.
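The fine-tuning step from the last point can be sketched in Keras as follows. This is an illustration under assumptions, not our exact recipe: the helper name prepare_for_fine_tuning, the number of unfrozen layers and the learning rate of 1e-5 are all hypothetical values that you would tune for your own model.

```python
import tensorflow as tf


def prepare_for_fine_tuning(model, base, n_unfrozen, learning_rate=1e-5):
    """Unfreeze the last `n_unfrozen` layers of `base` and recompile `model`.

    `base` is the pretrained CNN used inside `model`. Call this only after
    the classification layers on top have been trained with `base` frozen.
    """
    base.trainable = True
    n_frozen = max(0, len(base.layers) - n_unfrozen)
    for layer in base.layers[:n_frozen]:
        layer.trainable = False  # earliest, most generic filters stay fixed

    # Recompile so the new trainable flags take effect. The learning rate is
    # kept very low so the pretrained weights are only slightly updated.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

After this call, you would run model.fit again for a few epochs. Keeping most of the CNN frozen and using a small learning rate reduces the risk of destroying what the pretrained model already knows.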
Using these different approaches, we managed to develop a sport classification model with an accuracy of about 90%, which is quite good given that it can distinguish between more than 150 different sports. Some examples of results are provided below.
4. Deploying the computer vision model in production
Once our model was trained, we sadly found out that only half of the job was done ;)
It turns out that deploying an AI model efficiently is as challenging as training one, but it’s definitely achievable!
At the core of our deployment approach is TensorFlow Serving, a serving system for TensorFlow models. Once you have trained your model, you can save and serialize it using the SavedModel format (follow the TensorFlow documentation to do so). Make sure that you have given a proper name to your input layer, as you are going to need it when calling the model. If you are unsure, use the command line utility saved_model_cli to get the exact name and input shape.
Afterwards, you can serve the model by launching TensorFlow Serving and making requests using REST. In our case, we wrap TensorFlow Serving within a Django or Flask server, responsible for the image preprocessing and postprocessing.
The main preprocessing steps consist of loading the image, reshaping it to the input shape the model expects, and sending the payload to TensorFlow Serving.
The postprocessing step consists of formatting the output of the model (generally, a vector describing the probability that the image belongs to each category) and sending the response back to the user as JSON.
Once you have made your first deployment, you will want to look at adding authentication and increasing the speed of your service. To reduce response time to a minimum, factors you can consider include the location of your servers, the TensorFlow binary you use, and the use of Protobufs.
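The preprocessing and postprocessing steps described above can be sketched as two small functions around the TensorFlow Serving REST predict endpoint. This is a simplified illustration, not our server code: the function names, the model name, the input layer name and the label list are placeholders, and the image is assumed to be already loaded and reshaped into nested lists of pixel values.

```python
import json


def build_predict_request(pixels, model_name, input_name,
                          host="localhost", port=8501):
    """Return (url, body) for a TensorFlow Serving REST predict call.

    `pixels` is one preprocessed image as nested lists (height x width x 3).
    `input_name` is the name of the model's input layer, as reported by
    saved_model_cli.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": [{input_name: pixels}]})
    return url, body


def top_prediction(response_text, labels):
    """Turn the probability vector returned by the model into a small dict."""
    probabilities = json.loads(response_text)["predictions"][0]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {"sport": labels[best], "probability": probabilities[best]}


if __name__ == "__main__":
    # Hypothetical model output for a classifier with three categories.
    fake_response = '{"predictions": [[0.1, 0.7, 0.2]]}'
    print(top_prediction(fake_response, ["running", "cycling", "yoga"]))
    # {'sport': 'cycling', 'probability': 0.7}
```

In a Django or Flask view, you would POST the body to the URL with your HTTP client of choice, pass the response text to top_prediction, and return the resulting dictionary as JSON.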
5. Calling our API
Calling our API is very simple.
First, take a look at our documentation. You’ll see how to get your authentication token, and how to subscribe to our mailing list.
Then, simply make a request as follows:
curl -X POST sportvision.api.decathlon.com/sportclassifier/predict/ -H 'Accept: application/json' \
-H 'Authorization: XXX' \
-H 'Content-Type: multipart/form-data' \
Here, replace ‘XXX’ with your authentication token and give us the path to your image. If you want to minimize the response time, it is good practice to resize the image to 299 x 299 before sending it to the API, which will reduce image loading time.
After a few hundred milliseconds, you should receive the JSON response:
The response tells you the sport practiced in the image, and the probability that the correct answer was found. To get more information about each sport (description, tags, picture, …), use the id received in the response and call the Sports API.
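If you prefer Python to curl, the same request can be sketched with the standard library only. Two things here are assumptions on our side: the https scheme in front of the host, and the name of the multipart form field, which is deliberately left as a required argument (field_name) so that you take it from the documentation rather than from this example. The function name build_predict_request is ours as well.

```python
import uuid
import urllib.request

# Host and path from the curl example; the https scheme is an assumption.
API_URL = "https://sportvision.api.decathlon.com/sportclassifier/predict/"


def build_predict_request(image_bytes, filename, token, field_name):
    """Build (without sending) a multipart/form-data POST request.

    `field_name` is the form field expected by the API for the image:
    check the documentation for its actual value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    headers = {
        "Accept": "application/json",
        "Authorization": token,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(
        API_URL, data=head + image_bytes + tail, headers=headers, method="POST"
    )


# Usage sketch (performs a real network call, so it is left commented out):
# with open("my_image.jpg", "rb") as f:
#     request = build_predict_request(f.read(), "my_image.jpg", "XXX", field_name=...)
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```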
6. Some examples of results
Let’s put the API to the test! To do so, we went on Instagram, and manually extracted images with hashtags related to our brands (#decathlon, #quechua, #kalenji, #domyos, …). These are images that the Sport Vision API had never seen before.
We kept all the images of people practicing sport, and we queried the Sport Vision API to identify the sport practiced in each image. Here are the results:
As you can see, the API rarely makes mistakes, and generally finds the exact sport which is practiced in the image.
As such, the API represents a great tool for applications such as social listening: if you have a sport app or company, you can use the API to better know your users given the pictures that they post on your platform or on social media.
You can also use the API to build a smart search engine: that is, every time you add a picture to your database, you call the API to attach the right sport to it, such that you can easily search for the right picture in the future.
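As a rough sketch of the smart search engine idea, the small class below tags each picture once, when it is added, and then answers searches from memory. Everything in it is hypothetical: the class name, and in particular the classify callable, which stands in for a function of yours that sends the image to the API and returns the name of the predicted sport.

```python
from collections import defaultdict


class SportImageIndex:
    """In-memory index from sport name to the images tagged with it."""

    def __init__(self, classify):
        # `classify` is a callable: image path -> sport name.
        # In practice it would wrap a call to the sport classifier endpoint.
        self._classify = classify
        self._by_sport = defaultdict(list)

    def add(self, image_path):
        """Tag the image once, when it enters the database."""
        sport = self._classify(image_path)
        self._by_sport[sport].append(image_path)
        return sport

    def search(self, sport):
        """Return all images tagged with `sport` (empty list if none)."""
        return list(self._by_sport.get(sport, []))


if __name__ == "__main__":
    # Stand-in classifier with made-up results, to avoid a network call.
    fake_results = {"a.jpg": "Running", "b.jpg": "Yoga", "c.jpg": "Running"}
    index = SportImageIndex(classify=fake_results.get)
    for path in fake_results:
        index.add(path)
    print(index.search("Running"))  # ['a.jpg', 'c.jpg']
```

In a real application, the dictionary would be replaced by a column or tag in your database, but the flow stays the same: classify on insert, filter on search.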
7. What’s next for the API?
The API is currently in heavy development.
In the case of pictures of sport, we are currently working on extracting additional intelligence from the image: for instance, outdoor/indoor and single/team tags. We are also working on a moderation endpoint, to identify if the picture is of sport, or of an unrelated or irrelevant topic.
In the case of pictures of sport products, we just launched a new version of our products classifier endpoint, to identify not only the type of product in the picture, but the sport that you can practice with it as well. Here’s an example of the new format of the json that you obtain:
"sports_group": "Inline skating"
"sports_group": "Roller skating"
"sports_group": "Kids bicycle"
"sports_group": "Road cycling"
"sports_group": "Mountain biking"
We are also working on our social listening services, to tag your image with all the different sport products that you can find.
Stay tuned for our next updates, and do not hesitate to contact us if there is any feature you would like us to develop!
We are hiring!
Are you interested in applying AI and computer vision to improve sport accessibility and user experience? Luckily for you, we are hiring! Follow https://developers.decathlon.com/careers to see the different exciting opportunities.