Pizza is all around me

Jerzy Świniarski
Published in codequest · 7 min read · Apr 29, 2019

Do you remember the biggest mistake of your life? I have an opportunity to make one every week, during codequest’s pizza Fridays. Apparently some of our employees like pineapple on their dough, and so the HAWAIIAN PIZZA is ordered for them. So, as usual, the pizza arrived, everybody rushed for the meal, and in the battle chaos I reached for this abomination by mistake. And in that moment my eyes were opened: never make this mistake again. Build a pizza recognition app!

Life is a bumpy road and the best way to avoid making mistakes is to trust your developers. Even better when you are a developer yourself and you can spend every Friday on a project of your choice. That’s how I decided to try machine learning tools on iOS.

Neurons… Neurons everywhere!

One of the most popular techniques for image recognition nowadays is to use neural networks. Every neuron takes a vector of input values, multiplies them by corresponding weights, and passes the sum to an activation function, which gives the neuron’s response.

https://www.kdnuggets.com/2016/10/artificial-intelligence-deep-learning-neural-networks-explained.html
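To make this concrete, here is a toy Swift sketch of a single neuron; the hand-picked weights and the step activation are illustrative assumptions, not part of any real network:

```swift
// A toy single neuron: a weighted sum of the inputs plus a bias,
// passed through a step activation function.
struct Neuron {
    var weights: [Double]
    var bias: Double

    func respond(to inputs: [Double]) -> Double {
        var sum = bias
        for (input, weight) in zip(inputs, weights) {
            sum += input * weight          // weighted sum of inputs
        }
        return sum > 0 ? 1.0 : 0.0         // step activation
    }
}

// Hypothetical, hand-picked weights; a real network learns them during training.
let neuron = Neuron(weights: [0.9, -0.5, 0.3], bias: -0.1)
print(neuron.respond(to: [1, 0, 1]))       // 1.0
```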

A single neuron can be taught to recognize simple forms, like a letter on a 6x6 matrix (36 inputs of the neuron),

http://edu.pjwstk.edu.pl/wyklady/nai/scb/wyklad3/w3.htm

hence for recognition of more sophisticated objects, one uses networks consisting of thousands of neurons. What’s more, they can be arranged in layers responsible for recognising higher-level features: e.g. the 1st layer will recognise ovals and lines, the 2nd layer will light up when presented with an oval with a line inside (aye, it’s an eye!), and the 3rd layer will wake up when two eyes and two ears appear on its input, resulting in an indication of a cat.

This type of neural network is called a CNN (convolutional neural network) and is widely used for image recognition.

https://www.semanticscholar.org/paper/Steganalysis-via-a-Convolutional-Neural-Network-Couchot-Couturier/7e9708d9dc8b0a4ac2fa52eb384d67f52d7cbbe4

Training the model

Getting bored? Me too. No wonder, I am an iOS developer and creating abstract single neurons that are incapable of doing anything serious is dull. Don’t reinvent the wheel! Many people have been there already, and CNNs trained for classifying images are out there, along with tools for retraining them. Just reach for them and use TRANSFER LEARNING for high-precision classification of your own categories (i.e. pizza/hawaiian).

The most popular Mobile Machine Learning Frameworks are:

  • Apple’s CoreML
  • Google’s ML Kit (built on TensorFlow Lite)

However, many other iOS ML tools are listed here: https://github.com/alexsosn/iOS_ML

We will talk about creating the model later in this article.

CoreML vs Google’s ML Kit

Gather tools before going on an adventure, ask, compare and decide. Google got its foot in the iOS ML door by introducing MLKit in 2018, working on both platforms! Below is a short comparison of both frameworks:

CoreML

  • iOS only
  • converters to *.mlmodel provided by Apple
  • runs locally and locally only
  • Vision, Natural Language, GameplayKit

MLKit

  • Android & iOS, easy cooperation with Firebase services
  • converters to *.tflite provided by Google
  • Model evaluation in the cloud (10000+ labels, 1000 requests/month for free) or limited evaluation on the device (400+ labels)
  • image labeling, text recognition, face detection, barcode scanning, landmark recognition, smart reply (pending)

Importantly, MLKit also supports hardware acceleration on iOS:

https://medium.com/tensorflow/tensorflow-lite-now-faster-with-mobile-gpus-developer-preview-e15797e6dee7

Although MLKit seems like a good option, I tested the native iOS ML framework: CoreML.

Machine learning on iOS is like an onion. It has layers.

CoreML, introduced in iOS 11, is a foundation for higher-level ML frameworks: Vision (image analysis), Natural Language (language processing) and GameplayKit (game logic evaluation). It works as a middle layer built on low-level primitives that allow faster operations, e.g. GPU computations and built-in neural network subroutines.

What is the main task of CoreML? To classify your pizza! Well, not only pizza. CoreML provides the possibility of running previously trained models (*.mlmodel) on the device. And it does so in a simple manner. And by simple, I mean it (using Vision here):

3 main steps (get a model, run classification, display the results) = 3 lines of code.
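The snippet below is a minimal sketch of those three steps for a still photo; `PizzaClassifier` stands for the class Xcode generates from the *.mlmodel, and the function name is a placeholder:

```swift
import Vision
import CoreML
import CoreGraphics

// A minimal sketch: classify a still photo with a bundled Core ML model.
// `PizzaClassifier` is the (hypothetical) class generated from the .mlmodel.
func classify(_ image: CGImage, completion: @escaping (String) -> Void) {
    // 1. Get the model.
    guard let model = try? VNCoreMLModel(for: PizzaClassifier().model) else { return }

    // 2. Run the classification.
    let request = VNCoreMLRequest(model: model) { request, _ in
        let best = (request.results as? [VNClassificationObservation])?.first
        // 3. Display the results (here: just hand back the top label).
        completion(best?.identifier ?? "unknown")
    }
    try? VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
}
```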

Hardware support

This strong CoreML-hardware connection implies an important, obvious fact: not all devices and iOS versions support CoreML.

The first chip to support CoreML is the A7 (starting with the iPhone 5s), and for better performance the A11 chip (iPhone 8/8 Plus/X) introduced a dedicated “Neural Engine” (NPU). Yup, animating the poo Animoji runs on dedicated hardware.

Live recognition support

Taking a photo and running a classification worked nicely, but tapping a “take photo” button with your fingers covered with cheese and pepperoni was hard. That’s why I decided to introduce live recognition, and that’s where the Vision framework came in handy.

Vision has built-in methods for:

  • analysis of still images (including face, face landmark, text and barcode detection),
  • image registration
  • object tracking

Evaluation of recognition is computationally expensive, so you shouldn’t run it on every frame; instead, you should make sure the image is still first. To achieve this, record the history of the last N frames (e.g. 20) and save their x/y alignment. When the average alignment change is below a specified threshold, the image is stable!

Apple’s examples show how to trigger the ML classification only when the camera is still.
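Here is a minimal sketch of that idea, loosely following Apple’s sample: it uses VNTranslationalImageRegistrationRequest to measure how far each frame moved relative to the previous one; the 20-frame history and the 20-pixel threshold are assumptions, not values from the article:

```swift
import Vision
import CoreGraphics
import CoreVideo

// Scene-stability detection: keep the translation between consecutive frames
// for the last `historyLength` frames and call the scene stable once the
// average movement drops below `threshold` pixels.
final class SceneStabilityDetector {
    private let sequenceHandler = VNSequenceRequestHandler()
    private var previousBuffer: CVPixelBuffer?
    private var translationHistory: [CGPoint] = []
    private let historyLength = 20
    private let threshold: CGFloat = 20   // pixels

    /// Feed consecutive camera frames; returns true when the scene is stable.
    func isSceneStable(after pixelBuffer: CVPixelBuffer) -> Bool {
        defer { previousBuffer = pixelBuffer }
        guard let previous = previousBuffer else { return false }

        // Ask Vision how far the new frame is shifted relative to the previous one.
        let registration = VNTranslationalImageRegistrationRequest(targetedCVPixelBuffer: pixelBuffer)
        do { try sequenceHandler.perform([registration], on: previous) } catch { return false }
        guard let alignment = registration.results?.first as? VNImageTranslationAlignmentObservation else {
            return false
        }

        // Store the x/y translation and keep only the most recent frames.
        let transform = alignment.alignmentTransform
        translationHistory.append(CGPoint(x: transform.tx, y: transform.ty))
        if translationHistory.count > historyLength { translationHistory.removeFirst() }
        guard translationHistory.count == historyLength else { return false }

        // Average x/y movement over the recorded history.
        let sum = translationHistory.reduce(CGPoint.zero) {
            CGPoint(x: $0.x + abs($1.x), y: $0.y + abs($1.y))
        }
        let average = CGPoint(x: sum.x / CGFloat(historyLength), y: sum.y / CGFloat(historyLength))
        return (average.x * average.x + average.y * average.y).squareRoot() < threshold
    }
}
```

In the camera output delegate you would call isSceneStable(after:) for every frame and only run the classification once it returns true.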

Setting up Vision consists of 3 stages:

  • creating the queue of vision tasks
  • creating an analysis request with the specified model
  • running the analysisRequest with VNImageRequestHandler on the previously created queue

VNCoreMLRequest has a VNClassificationObservation result that stores the classification identifier (our folder names: pizza or hawaiian) and the confidence of the classification. Voilà! Just update the UI and inform the user about the classification. All three stages are sketched below.
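A sketch of those three stages wired together for live frames; `PizzaClassifier` and the queue label are placeholders rather than the article’s actual code:

```swift
import Foundation
import Vision
import CoreML
import CoreVideo

// A sketch of the three Vision setup stages described above.
final class LivePizzaClassifier {
    // 1. A serial queue so Vision work stays off the main thread.
    private let visionQueue = DispatchQueue(label: "com.example.pizzaApp.vision")

    // 2. The analysis request bound to the Core ML model.
    private lazy var analysisRequest: VNCoreMLRequest? = {
        guard let model = try? VNCoreMLModel(for: PizzaClassifier().model) else { return nil }
        return VNCoreMLRequest(model: model) { request, _ in
            guard let best = (request.results as? [VNClassificationObservation])?.first else { return }
            DispatchQueue.main.async {
                // `identifier` is the training folder name: "pizza" or "hawaiian".
                print("\(best.identifier) (\(Int(best.confidence * 100))%)")
            }
        }
    }()

    // 3. Run the request on a camera frame with VNImageRequestHandler.
    func classify(_ pixelBuffer: CVPixelBuffer) {
        guard let request = analysisRequest else { return }
        visionQueue.async {
            let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
            do { try handler.perform([request]) } catch { print("Vision error: \(error)") }
        }
    }
}
```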
The whole process is nicely (with code) described here: https://developer.apple.com/documentation/vision/training_a_create_ml_model_to_classify_flowers

CreateML

Ok. So we know how to evaluate a trained model. But how do we get one? The previously mentioned transfer learning can be done with many ML tools, like TensorFlow or Keras, but in 2018 Apple introduced its own CreateML framework. It’s extremely easy to use, it’s fast (hardware optimisation) and it outputs a *.mlmodel, ready to be imported into your project. Training an image classifier can be done in playgrounds, again, with 3 lines of code:
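For reference, this is the kind of three-line macOS playground snippet meant here, using the CreateMLUI builder that provides the drag & drop GUI described below:

```swift
// macOS playground (Xcode 10+): opens the Create ML training GUI in the live view.
import CreateMLUI

let builder = MLImageClassifierBuilder()
builder.showInLiveView()
```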

In the provided GUI you should drag & drop the prepared image sets (divided into subfolders named after your custom classes). After that, the transfer learning starts automatically. You can select training options, like image augmentation and the number of training iterations.

Gathering the data

One of the most crucial tasks in machine learning is data preparation. This meant that I had to take photos of plenty of pizzas (single slices, full pizzas, in good light conditions, in the dark, untouched, half-chewed, from different angles, etc.) and the same amount of hawaiian slices. You can imagine how difficult it was to take photos of pizza while people were eating it.

It’s important that the sets have comparable sizes; otherwise the model would tend to classify images into the class with the larger set. You could obviously try getting the images from the endless internet, but the users of your app won’t use internet pizza images in real life. They WILL use their phones, with their cameras, in conditions far from perfect.

Training the model with about 1200 images (600 pizza and 600 hawaiian) took a few minutes. The greater part of this time is spent on image feature extraction. With all the image augmentation options enabled, the process took about 30 minutes (on a 2016, 2.7GHz MacBook Pro).

Create ML by default sets aside a random subset of the images to test your model (the 80/20 rule). I achieved a validation accuracy of around 90%. Seems good for a first try.

Where to go from here? MORE MORE MORE

Data quality and quantity are important in machine learning. To make my model more accurate and bulletproof, I need more images from different devices and more variations of pizza/hawaiian. It’s not possible to retrain the model on the device, but it is possible to ship a new model to users. The idea is to let users help save the world: take a photo, label it properly and upload it to our servers. We will retrain our model and push a new classifier to your device. https://developer.apple.com/documentation/coreml/core_ml_api/downloading_and_compiling_a_model_on_the_user_s_device
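A sketch of that update path using Core ML’s on-device compilation API; the remote URL and the storage handling are placeholders:

```swift
import CoreML
import Foundation

// Download a retrained .mlmodel from our (hypothetical) server, compile it
// on the device and load it, without shipping a new app version.
func fetchUpdatedClassifier(from remoteURL: URL,
                            completion: @escaping (MLModel?) -> Void) {
    URLSession.shared.downloadTask(with: remoteURL) { tempFileURL, _, _ in
        guard let tempFileURL = tempFileURL,
              let compiledURL = try? MLModel.compileModel(at: tempFileURL),  // .mlmodel -> .mlmodelc
              let model = try? MLModel(contentsOf: compiledURL) else {
            completion(nil)
            return
        }
        // A real app should move `compiledURL` to a permanent location
        // (e.g. Application Support) before reusing it across launches.
        completion(model)
    }.resume()
}
```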

“It ain’t a pineapple on your dough, is it?” From now on, there should be no unanswered question: is it hawaiian, or is it a pizza?
