Which AI is hungrier for food?

Key words: nutrition detection, food image recognition, artificial intelligence, deep learning, machine learning, food label, food apps

Purpose

The goal of this study is to compare four image recognition services (Amazon Rekognition, Google Vision, Clarifai, and Instagaze) and determine which deep-learning service is the most capable and precise at detecting food labels from images.

Background

Automatic food image recognition is gaining momentum, as food traceability can address problems ranging from wellness and nutritional deficiencies to healthcare applications and diet management. Artificial intelligence in food image detection can also help epicures and snackers alike make smarter, more conscious nutrition-based decisions. Every day we take countless pictures of food without looking at nutrition information or realizing how the food affects our health.

In our previous study, we collected images from internet sources and analyzed them using leading image recognition services: Google Vision, Amazon Rekognition, Microsoft Computer Vision, and Instagaze. We concluded that Instagaze had the highest image precision and label precision of the services tested, followed by Google Vision.

Given the findings of our previous study, we tested the Google Vision API with an image of a cheese pizza taken with a smartphone. Surprisingly, Google Vision was unable to precisely detect the smartphone image, even though it had correctly recognized an extremely similar image from an internet source.

Figure 1: Labels generated by Google Vision for a slice of pizza. The image on the left is from the internet; the image on the right was captured with a smartphone.

Food image recognition is challenging due to the nature of food items, and advances in food image label detection have been scant. Foods are typically deformable objects, which makes it difficult to define their structure. Furthermore, only limited information can be gained from a food image, such as the food's color, how well lit it is, and its density. Despite these obstacles, deep neural networks have outperformed traditional approaches, but they can become biased and unreliable in the real world if trained on professionally curated images.

To gain deeper insight, we tested 100 food images taken with a smartphone and benchmarked four services: Amazon Rekognition, Google Vision, Clarifai, and Instagaze. Clarifai and Instagaze each have a specialized deep learning “Food” model that recognizes food items in images.

Experiment & Procedure

We chose images from different cuisines to avoid bias in our study. The images were first resized to 640x480 pixels and converted to JPEG format so that every service processed them in the same format.

Figure 2: Personal images collected using a smartphone: strawberry cupcake (upper left), avocado egg toast (right), and vegetable pasta topped with cheese (bottom left).

For each image, the machine learning services returned a set of labels with their respective confidence scores. These were stored, along with the original image URL and the correct label, in separate datasets. The datasets along with the source code can be found here.
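
As a rough illustration of the stored data, the sketch below models one row per returned label and writes the rows as CSV. The field names, the `LabelRecord` class, the `write_records` helper, the service name, the URL, and the confidence values are all hypothetical; the actual dataset layout may differ.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class LabelRecord:
    """One label returned by one service for one image (hypothetical layout)."""
    service: str
    image_url: str
    correct_label: str
    label: str
    confidence: float


def write_records(records, handle):
    """Write the records as CSV, with a header row taken from the field names."""
    writer = csv.writer(handle)
    writer.writerow([f.name for f in fields(LabelRecord)])
    for record in records:
        writer.writerow(astuple(record))


# Illustrative values only; not taken from the study's datasets.
records = [
    LabelRecord("ServiceA", "https://example.com/pho.jpg", "Chicken Pho", "Pho", 0.93),
    LabelRecord("ServiceA", "https://example.com/pho.jpg", "Chicken Pho", "Dish", 0.88),
]
buffer = io.StringIO()
write_records(records, buffer)
csv_text = buffer.getvalue()
```

Keeping the correct label on every row makes it possible to review each returned label against the ground truth without joining files.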

Data Analysis

We analyzed the data based on three criteria:

  • Acceptable Label Categorization
  • Label Precision
  • Image Precision

Acceptable Label Categorization

Categorizing labels as acceptable was a challenge because Amazon Rekognition, Google Vision, Clarifai, and Instagaze each generated multiple labels per image. To sort acceptable from not acceptable labels, our trained data analysts manually curated all labels for the food images. For example, in Figure 3, “Pho”, a generic name for Chicken Pho, is acceptable, whereas “Dish”, a generic word for prepared food, is not acceptable.

Figure 3: Acceptable and Not acceptable labels for Chicken Pho
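
The curation itself was manual, but once analysts have listed the acceptable labels for an image, applying that list can be sketched as below. The `split_labels` function and the case-insensitive matching are assumptions made for illustration, not the study's actual tooling.

```python
def split_labels(returned_labels, acceptable_labels):
    """Split a service's labels into (acceptable, not_acceptable).

    `acceptable_labels` is the analyst-curated list for one image.
    Matching ignores case and surrounding whitespace (an assumption).
    """
    curated = {label.strip().lower() for label in acceptable_labels}
    acceptable, not_acceptable = [], []
    for label in returned_labels:
        if label.strip().lower() in curated:
            acceptable.append(label)
        else:
            not_acceptable.append(label)
    return acceptable, not_acceptable


# The Chicken Pho example from the text: "Pho" is acceptable, "Dish" is not.
ok, rejected = split_labels(["Pho", "Dish"], ["pho", "chicken pho"])
```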

Label Precision

After reviewing all the labels generated by Amazon Rekognition, Google Vision, Clarifai, and Instagaze, we found that each machine learning service generated a different number of labels for each image. Clarifai generated the most labels across all images, while Amazon Rekognition generated the fewest.

Figure 4: Acceptable Labels vs Not acceptable Labels across all services

Label precision was calculated as follows:

Total Label Precision = Total acceptable labels per image / Total labels generated
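
A minimal sketch of this calculation for one service follows. It assumes the formula means the acceptable labels summed over all images, divided by all labels the service generated. The `label_precision` function and the counts are illustrative, not the study's data.

```python
def label_precision(per_image_counts):
    """Compute label precision for one service.

    `per_image_counts` is a list of (acceptable_labels, total_labels)
    pairs, one pair per image. Returns 0.0 when no labels were generated.
    """
    acceptable = sum(a for a, _ in per_image_counts)
    generated = sum(t for _, t in per_image_counts)
    if generated == 0:
        return 0.0
    return acceptable / generated


# Three hypothetical images: 1 of 8, 2 of 12, and 0 of 5 labels acceptable.
counts = [(1, 8), (2, 12), (0, 5)]
precision = label_precision(counts)  # 3 acceptable out of 25 generated
```

Under this reading, a service that returns many generic labels per image lowers its own label precision even when it also returns the correct one.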

Figure 5: Label precision across all services

We found that Instagaze had the highest label precision, at 14.30%, and Amazon Rekognition had the lowest, at 5.75%. Instagaze generated the most correct labels, followed by Google Vision, Clarifai, and Amazon Rekognition. Correct label generation is highly important for nutritional information and dietary management.

Image Precision

Image precision is a significant aspect of this study because a higher image precision can ultimately help us estimate the portion size, nutritional value, and total calories consumed during a meal. We define image precision as the share of images that were correctly detected with at least one acceptable label.

Image Precision = Total images detected with an acceptable label/ Total number of images
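
The formula above can be sketched as follows for one service. The `image_precision` function and the sample flags are illustrative. Treating an image that received no labels at all as not detected, while still counting it in the denominator, is an assumption.

```python
def image_precision(acceptability_per_image):
    """Compute image precision for one service.

    `acceptability_per_image` holds one list per image; each list has a
    boolean per returned label (True if the label was judged acceptable).
    An image counts as detected if at least one of its labels is acceptable.
    """
    if not acceptability_per_image:
        return 0.0
    detected = sum(1 for flags in acceptability_per_image if any(flags))
    return detected / len(acceptability_per_image)


# Four hypothetical images; the third received no labels at all.
flags = [[True, False], [False, False], [], [False, True]]
result = image_precision(flags)  # 2 of 4 images have an acceptable label
```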

Figure 6: Images with acceptable and Not acceptable labels across all services. *Note: Google Vision and Instagaze were unable to detect one image.

Figure 7: Image Precision across all services

Across the four benchmarked image recognition services, Instagaze had the highest image precision, at 85%, and Amazon Rekognition had the lowest, at 39%. Precise image recognition is immensely helpful for creating workout plans, encouraging healthy eating, and calculating food nutrition.

Conclusion

Instagaze performed better in both label and image precision than Google Vision, Amazon Rekognition, and Clarifai. Google Vision and Amazon Rekognition provide image recognition APIs built on domain-agnostic CNNs, and both focus on classifying what is present in the image (for example, food, plate). Unlike Amazon Rekognition, Google Vision did not perform as well as expected on food images taken with a smartphone compared with images taken from the internet. Instagaze outperformed all other services with an image precision of 85% and maintained its higher standard of results for label precision and label volume. Instagaze’s image precision for real-world and internet images remained similar, suggesting that Instagaze’s additional machine learning layers on top of a specialized CNN strongly favor food image recognition. Artificial intelligence, with the help of deep neural networks, can provide better food recognition technology in the near future and aid us in living a healthier lifestyle, and Instagaze is getting closer to making that a reality.