AI-first, a modern anti-pattern.

Anton Grbin
14 min read · May 10, 2020


Engineering lifecycle of an AI-powered app

The goal

This post shares my thoughts about:

  • A rotation-invariant algorithm to detect a display in the image and read the weight value written on it

In this article, I describe a robust solution for display orientation localization, which serves as a preprocessor for the digit detection problem. On top of that, I share my thoughts on the overall process of building an AI-powered product.

  • The state of ML tools available on Google Cloud today

Google Cloud AutoML Object Detection and the whole AI Platform worked surprisingly well for me. An even bigger surprise is that the trained models can be exported, so you can run them on a runtime of your own! Once again, engineering time in real-world applications is mostly spent on massaging the data.

  • ML principles for real-world applications

Unless you are writing a school project or sharpening your ML skills, the deeper ML challenges in real-world projects come last. Engineering is dominated by data selection / gathering / cleaning / reformatting and by evaluation frameworks.

Target audience

  • Software Engineers, Data Scientists, Product Managers and other Rock and Roll people.
  • Less likely: if you are using the photo weigh-in application, you can learn here how the AI behind it works.

Introduction

I started drilling on SnapScale in mid-2018 with the goal of creating a weigh-in habit for a friend of mine. I converted their old-school scale into a smart one using the phone’s camera: the user scans the weight measurement with the phone, and the app recognizes the value, logs it, and optionally exports it to their FitBit profile.

Feel free to try it out (no log-in or account creation is required): SnapScale — build a weigh-in habit.

This report gives a timeline of an end-to-end real-world application heavily relying on ML tools.

App implementation

User-facing product

The goal of SnapScale is to instill a weigh-in habit in humans. Taking a weight measurement has an amplified awareness effect if the number is logged, and if you can compare it with last week or last month. It’s the trend that matters, not day-to-day oscillations.

The app lets you log a weight measurement by taking a photo of the scale’s display while you’re standing on the scale. The user doesn’t need to lean down; a photo taken from normal hand height works. The AI model reads out the digits and logs the weight.

There are 3 screens in the app:

  • a screen to submit a new weigh-in measurement, which launches the camera
  • a screen to view your previous measurements
  • a settings screen that optionally sets up FitBit export and no-weigh-in reminders

Tapping the weigh-in button launches the camera.

We intentionally reduced clutter that could waste the user’s time in the app. There is no login or account creation required to start using the app. Only if a user wants to export their data to FitBit is a login with FitBit required.

AI Last

Engineering-wise, the central and most “sexy” part of the application is the component that reads the weight measurement from the image. I delayed tackling that part of the problem. Instead, I tried adopting the “AI Last” principle. That meant first doing the following:

  • Built a mobile app for photo-based weight-logging, reminders, and export to FitBit.
  • Launched with a fake backend: each image would trigger a notification on my phone so I could read the numbers instead of the ML model.
  • Implemented the UI for reading the numbers from images and labelling them.
  • Implemented the evaluation framework for the image recognition problem.

Only then, when everything was ready and it seemed that the product might work, did I tackle the AI:

  • picking the technology and implementing the image recognition solution

This is what I mean by AI comes last.

By the time I started iterating on the model, I had labeled ~450 images and had a good idea of the corner cases that might come up. I also had a dataset of images, defined metrics, a debugging frontend, and an evaluation framework in place.

One risk of this approach is betting everything on the assumption that the last step (the AI) will work. It could have happened that the OCR at the end would be too difficult to implement, and the project would “fail slowly”.

Mobile app framework

The application is built with the Expo framework, a wrapper around React Native. It’s suitable for fast prototyping, which is exactly what I needed, and the same codebase runs on both platforms. If Expo had a book it would be called “App development for dummies”.

Backend

The backend is used to store weight logs and the images that will later be used for model training. The clients don’t do any authentication, but there is a lightweight handshake step to guard against spamming.

Although login sounds like a good alternative, I suspect it would repel some users due to the sensitivity of the images being sent. This way, the user images we keep are stripped of any other kind of PII.

For metadata I used Google Cloud Datastore, and for image storage I used Google Cloud Storage. Both are free of charge for prototypes with a low volume of requests, and neither requires any setup.
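For illustration, here is a minimal sketch of how such a backend could persist one submission with the official Python clients. The entity kind, bucket name, and field names are my assumptions, not the actual schema:

```python
from google.cloud import datastore, storage

datastore_client = datastore.Client()
storage_client = storage.Client()

def store_submission(image_bytes: bytes, submission_id: str, client_token: str) -> None:
    # Upload the raw photo to Cloud Storage (hypothetical bucket name).
    bucket = storage_client.bucket("snapscale-images")
    blob = bucket.blob(f"uploads/{submission_id}.jpg")
    blob.upload_from_string(image_bytes, content_type="image/jpeg")

    # Store the metadata in Datastore; the weight value is filled in later,
    # either by the model or by manual labelling.
    key = datastore_client.key("Measurement", submission_id)
    entity = datastore.Entity(key=key)
    entity.update({
        "client_token": client_token,  # lightweight anti-spam handshake token
        "image_path": blob.name,
        "weight_kg": None,
        "status": "pending",
    })
    datastore_client.put(entity)
```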

The natural next step is to move model execution to the client, which would remove the need for a backend and simplify the privacy policy.

Labelling UI

When a new image is submitted to the system and the model is not confident about the result (or, early on, when there was no model at all), an IFTTT notification hits my phone with a link to the Labelling UI. The Labelling UI shows the image, lets me translate and rotate it, and lets me label the digits. The digits are labeled with rectangles that all share y coordinates and have the same dimensions. This is possible because, prior to labelling the digits, the labeller rotates the image so that the display is upright.

Note that I made an assumption that the digits are always monospace. Well :) Assumption turns out to be the mother of all fuckups.

I have manually labeled ~450 images so far. Each image has 3–4 digits, and each digit takes two mouse clicks. The Labelling UI is implemented in HTML5, mainly using the Canvas API in JavaScript. The image transformation logic heavily uses affine transformation matrices (DOMMatrix).
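To make the shared-rectangle idea concrete, here is a sketch (in Python rather than the UI’s JavaScript) of a hypothetical compact label format and how it expands into per-digit boxes; the field names are made up for illustration:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CompactLabel:
    """Hypothetical compact label: all digits share y, width and height."""
    rotation_deg: float    # rotation that makes the display upright
    x_starts: List[float]  # left edge of each digit, in upright coordinates
    y_top: float
    digit_width: float
    digit_height: float
    digits: str            # e.g. "782" for a display showing 78.2

def to_boxes(label: CompactLabel) -> List[Tuple[str, float, float, float, float]]:
    # Expand the compact label into (class, xmin, ymin, xmax, ymax) rectangles.
    boxes = []
    for x, d in zip(label.x_starts, label.digits):
        boxes.append((f"digit_{d}", x, label.y_top,
                      x + label.digit_width, label.y_top + label.digit_height))
    return boxes
```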

The images without a visible weight measurement are discarded. All other images are labeled and split into training/test/validation sets with 70/10/20 ratios.

Having only valid images in the dataset, I defined accuracy to be my main metric:

  • accuracy: percentage of images in which all digits are correctly identified

If an image that shows 78.2 gets recognized as 78.3, it counts as a miss.
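A minimal sketch of the metric, assuming each reading is represented as a plain digit string:

```python
def accuracy(examples) -> float:
    """Fraction of images where every digit was recognized correctly.

    `examples` is assumed to be an iterable of (true_digits, predicted_digits)
    string pairs, e.g. ("782", "783") for a display showing 78.2.
    """
    examples = list(examples)
    correct = sum(1 for truth, pred in examples if truth == pred)
    return correct / len(examples)

# A reading of "78.3" for an image showing 78.2 counts as a full miss:
assert accuracy([("782", "782"), ("782", "783")]) == 0.5
```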

Image recognition

I started working on image recognition ~2 months into the project. At that point, I had a working application with 1–2 OCR requests per day, all labeled manually, and I was ready to implement the digit recognition submodule. I expected that I would find something that works out of the box, and that most of the engineering would be just plugging it in and evaluating accuracy.

Existing work

I found a bunch of existing related models that would, in theory, solve my problem. The reproducibility of those projects was low.

OpenCV approaches on GitHub:

OpenCV approaches articles:

Street View House Number Dataset related solutions:

Open Source Tensorflow Object Detection pipeline:

Documentation about Google Cloud object detection solutions as service:

My time investigating and trying out existing solutions was roughly split like this:

  • 30% Classical computer vision techniques (OpenCV and similar)
  • 30% SVHN dataset
  • 2% Google Cloud Vision API
  • 38% Tensorflow custom solutions and transfer learning

I implemented 5–6 different pipelines to take SnapScale images and labels and convert them to the formats the solutions above expect :) That’s a lot of hours spent debugging geometry and image processing code.

I didn’t expect to end up in a situation where I’d be (re)training new models, especially since I had only 450 images available.

Iterations with OpenCV

I spent 2–3 weeks trying to tune OpenCV algorithms to preprocess the images into binary images where the digits are distinguishable from the background. This is the usual preprocessing step in the related work from the first two sections above. Assuming this step is successful, the following phases run connected components to isolate each digit and then shape classification to recognize the final number.
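In code, the classical pipeline looks roughly like the sketch below (OpenCV in Python, with made-up parameter values; the tuning is the painful part):

```python
import cv2

def binarize_and_segment(path: str):
    """A rough sketch of the classical pipeline: blur, adaptive threshold,
    connected components. Parameter values here are illustrative only."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)

    # Adaptive threshold: digits become white blobs on a black background.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 31, 10)

    # Connected components isolate candidate digit blobs; tiny specks are dropped.
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    candidates = [stats[i] for i in range(1, num_labels)
                  if stats[i, cv2.CC_STAT_AREA] > 100]

    # Each candidate would then go through shape classification (not shown)
    # to decide which digit it is.
    return binary, candidates
```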

Here are some examples where OpenCV’s adaptiveThreshold was able to do the job correctly after a painful tuning process:

Unfortunately, a good number of images were more challenging to binarize:

My confidence deflated further when I realized that many of the images in the set are not sharp, which screws up both binarization and the later contour detection:

As such problematic images were not a minority of my training set, I ultimately gave up on OpenCV and started looking into ML models. My next step was the SVHN (Street View House Numbers) dataset, but I’ll skip the details. Long story short: SVHN numbers are very different from display numbers. Finally, I decided to train a custom model using the training set I had collected.

Spoiler alert: all images from the examples above were correctly classified by the Tensorflow Object Detection approach explained below.

Iterations with custom-trained object detection

From its documentation, Google Cloud AutoML Vision Object Detection looks too good to be true. The Web UI lets you upload images, train the model, and see evaluation results. The trainer uses transfer learning, which enables it to train something useful even with only a few hundred images. No setup, no config.

Naive as I am, I started by training a model that solves everything at once: read the digits from the raw user image. The input is the raw user image, and the outputs are labeled digit rectangles.

This hit two problems:

  • the 7-segment digits sometimes change meaning with rotation (think 2 and 5)
  • the digit was sometimes too small on the original image for the model to pick up

The rotation problem is amplified by the fact that a mobile phone doesn’t know how to orient a photo of something lying on the floor.

I decided to split the problem into two phases:

  • detecting the display in the big image,
  • detecting the digits in the display.

Each phase can be evaluated on its own, which is useful in loss analysis.

Display detection

My first attempt was predicting a single rectangle of a single class, display, on the input image. The training set consisted of upright-rotated images labeled with a single rectangle around the display. AutoML Object Detection learned this with very good precision. At prediction time, I would rotate the image 4 times and take the rotation with the maximum confidence score for the display detection. I assumed that the maximum confidence would be achieved when the display is in the upright rotation, just as it was in the training set.

I was wrong. I violated the property that the training input distribution needs to be aligned with the runtime input distribution. The model saw only correctly rotated images during training, so the confidences I was getting for incorrectly rotated images were rubbish: sometimes 0, sometimes 1.

The green detection box represents the input that was expected by the model, and the scores there are precise. The red detection boxes have scores that have no guarantees.

To fix this, I made the problem harder for the model. I expanded each image in the input into K=4 new images with different rotations. Rotations are done in 360/K = 90 degree increments, starting from the upright image. The display label on the upright image became display_rotated_0_degrees; on the next image it became display_rotated_90_degrees, then display_rotated_180_degrees and display_rotated_270_degrees. So the number of input images and the number of output classes both grew K times.
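A sketch of that expansion, using PIL and normalized box coordinates; the exact rotation direction and data format in my real pipeline may differ:

```python
from PIL import Image

K = 4  # number of rotation classes

def expand_with_rotations(image: Image.Image, display_box, k: int = K):
    """Yield (rotated_image, class_name, box) triples for one training example.

    `display_box` is (xmin, ymin, xmax, ymax) in normalized [0, 1] coordinates
    of the upright image. Each yielded box is expressed in the coordinates of
    the corresponding rotated image.
    """
    box = display_box
    for i in range(k):
        angle = i * (360 // k)
        yield image.rotate(angle, expand=True), f"display_rotated_{angle}_degrees", box
        # Rotate the box by another 90 degrees counter-clockwise (matching
        # PIL's rotation direction) for the next iteration.
        xmin, ymin, xmax, ymax = box
        box = (ymin, 1.0 - xmax, ymax, 1.0 - xmin)
```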

During prediction, I show the input image to the model. The output is the display location and a hint of how far it is rotated from the upright position.

This worked much better, but it still had some precision losses. The final slam dunk for display+rotation detection was a voting system during prediction. Because the training phase received K rotations of each image, it is now fine to present the model with K rotations of the input image at prediction time. Each of the K rotations gives an answer to where the display is and how it is rotated.

If the model were perfect, all K evaluations would point to the same display location, and the rotation hints would yield a different value (0, 90, 180, 270) for each of the K input rotations.

The final answer is given by subtracting the model’s rotation hint from the image rotation given to the model. Implementing a voting scheme here gave surprisingly good results.

Note that the hyper-parameter K doesn't need to be 4. If the model is perfect, the maximum angular distance between the hinted rotation and the correct rotation is (360 / K) / 2. A bigger K makes the display detection problem more difficult; on the other hand, the outputs are closer to upright, which makes the digit detection problem easier.
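Putting the prediction side together, the voting could look roughly like the sketch below; `detect_fn` is a stand-in for running the object detection model, and the exact voting rule I used may differ:

```python
from collections import Counter
from PIL import Image

K = 4

def predict_display_rotation(image: Image.Image, detect_fn, k: int = K) -> int:
    """Voting scheme over K rotated copies of the input image.

    `detect_fn` is an assumed callable that runs the object detection model on
    one image and returns the class name of the most confident display
    detection, e.g. "display_rotated_90_degrees". Returns the rotation (in
    degrees) to apply to the original image so that the display is upright.
    """
    votes = Counter()
    for i in range(k):
        applied = i * (360 // k)                # rotation we applied on top
        class_name = detect_fn(image.rotate(applied, expand=True))
        hinted = int(class_name.split("_")[2])  # rotation hinted by the model
        # As described above: subtract the model's hint from the rotation we
        # gave it to recover how the original image should be rotated.
        votes[(applied - hinted) % 360] += 1
    rotation, _ = votes.most_common(1)[0]
    return rotation
```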

Digit detection

Digit detection takes an upright display image as input and outputs rectangles labeled as digits: digit_{0-9}.
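Once the digit boxes are detected, assembling the reading is just sorting by horizontal position. A sketch, assuming detections come back as (class name, left edge) pairs and leaving decimal-point placement aside:

```python
def boxes_to_reading(detections) -> str:
    """Turn digit detections into a digit string, read left to right.

    `detections` is assumed to be a list of (class_name, xmin) pairs, e.g.
    [("digit_7", 0.10), ("digit_8", 0.35), ("digit_2", 0.60)] -> "782".
    Where the decimal point goes is decided outside this sketch.
    """
    ordered = sorted(detections, key=lambda d: d[1])
    return "".join(name.split("_")[1] for name, _ in ordered)
```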

I expected this to be simple for the AutoML Object Detection. Wrong again! Nothing is simple. I got many precision losses which I couldn’t easily explain.

Looking through the open-sourced Tensorflow Object Detection implementation on GitHub, I found a parameter called aug_rand_flip. Based on the docs, this parameter expands the training images with their horizontal flips. It can't be configured in AutoML Object Detection, but if it were set to true there, it would surely screw up 7-segment digit detection.

Google Cloud support answered my question on cloud-vision-discuss thread, which helped me understand that AI Platform Object Detection and AutoML Object Detection are quite different.

I submitted a training job to AI Platform (after building yet another pipeline to transform my data) and got excellent results. This was another surprise, because my training set had fewer than 400 images, and AI Platform Object Detection trains from scratch; it doesn’t do transfer learning.

Results

In the last iteration I used 468 images:

  • 317 images in training set
  • 50 images in test set
  • 101 images in validation set

I never saw debug visualizations or debug logs of the images in the validation set — the evaluation pipeline would never produce them.

The final accuracies (the weight value correctly recognized):

  • 98.74% in training set
  • 94.00% in test set
  • 93.07% in validation set

The detection latency is ~15 seconds, mostly because I optimized serving to be price efficient (free).

Conclusion

ML comes last

Don’t start your idea pitches with “We will use ML to do XYZ”; just say “We will do XYZ”.

Even if ML will be a backbone tool in your solution, the engineering is still dominated by all the other (less sexy) tasks:

  • building infrastructure for collecting the data
  • implementing evaluation frameworks
  • implementing debugging tools
  • building data transformation pipelines

My opinion is that great engineering practices can be shown in any engineering challenge. The sexiness or non-sexiness of problems is irrelevant (unless you’re recruiting new grads).

The state of Object Detection support

I am surprised by the simplicity and efficiency of Google Cloud AutoML Object Detection. When it comes to pricing, each account gets $300 of free quota, which is enough to train 5–10 model iterations. If you are lucky enough to get the result you need within that quota, everything else can be free: you can download the trained model as a Docker container, a TF.js bundle, or a TF Lite bundle.

Using this free quota I managed to train a successful weight scale measurement OCR system.

ML principles I violated and got burnt

  • The distribution of data presented to the model in production must align with the validation set distribution.

I thought that SVHN-trained models would do a good job of recognizing scale weight measurement numbers. Transfer learning from an SVHN model to my problem space also didn’t end up working.

In display orientation detection, I thought that the model trained on upright display images would give a low score to input images that are not upright. The score was not low — it was rubbish, sometimes high, sometimes low.

  • Be evaluation driven.

Early in the project, when I was evaluating the related work, I would try to run an existing model and then feed it an image or two to see how it works. Only when my custom quality iterations started did I build a pipeline that runs the model on all train/test/validation images and reports the metrics.

It would have been better if I had defined a container interface early on: one that listens for an image request over HTTP and returns a standardised JSON output. Then I would have a Docker image for every related-work model I managed to run successfully, persisted together with its evaluation results. Having not done so, I only have weak (empirical) evidence that the related-work models didn’t work on my problem space.
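For illustration, the contract could be as simple as the sketch below. The port, field names, and `run_model` hook are made up; the point is that every candidate model hides behind the same HTTP-plus-JSON interface:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class RecognizeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw image bytes from the request body.
        image_bytes = self.rfile.read(int(self.headers["Content-Length"]))
        digits, confidence = run_model(image_bytes)  # model-specific, per container
        body = json.dumps({"digits": digits, "confidence": confidence}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def run_model(image_bytes: bytes):
    # Each wrapped related-work model would implement this differently.
    raise NotImplementedError

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), RecognizeHandler).serve_forever()
```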

When it comes to running things from other research-y works, the dependencies pose a big challenge. Docker solves this problem completely, and I am looking forward to seeing Docker adoption in academia.

Parting words

On Monday, 2019-10-28, I was in a room in downtown Manhattan packed with excited engineers and PMs who were ready to take over the world. This was AWS Data Lake Day at the AWS Loft. I clearly remember one specific sentence; the speaker said confidently:

You’ve gotta use ML! If you don’t, chances are that your competitors will.

I hated it. On the other hand, in the audience I saw ripples of head-nodding folks getting high on agreement with the speaker.

I decided to embrace a different engineering principle — “Don’t go AI first — always stay user first”. ML is a tool, not a goal.

Even The Doors knew this when they played “Roadhouse Blues”. They sing:

Yeah, keep your eyes on the road, your hands upon the wheel.

By the road, they probably meant the user goal, and by the wheel, the tools: whatever tools are needed, a for-loop, a compiler, a database, or an ML model.
