What I’ve learned building a deep learning Dog Face Recognition iOS app
I’m a software engineer in between startups. I spent some time at Google building the Google Finance charts, Multiple Inboxes on Gmail, and Starring on Google Maps, and most recently I started a shopping company called Spring.
I’m a builder and in between gigs I like working on side projects.
A few months ago I set out to build a Face Filters For Dogs camera app. You point it at your dog and it puts filters on its face.
There are 92 million photos tagged #dogsofinstagram — so it might even find a few users out there — building things that people want is extra motivating.
To build it, I needed to:
- build a Deep Learning model that extracts Dog Face Features
- run it on an iPhone, on top of live video
- and use ARKit to show 3D filters on top (stickers are not as cool)
I went from no Deep Learning knowledge to a pretty decent app, and wanted to share the lessons I learned at each step of the process.
I hope others that are new to Deep Learning will find this useful.
Step 1: Deep Learning is mostly ready off the shelf, with some quirks
The first question I needed to answer was: “Is this even possible?” Is my problem tractable? Where do you even start?
My problem looked solvable (the kinds of results folks get were in the range of what I need), but nothing off the shelf was easily reusable for my use case. Trying to figure out how to modify existing tutorials felt daunting.
Frustrated with reading blog posts, I turned to more fundamental online courses to start with the basics. This proved to be a really good decision.
In this process I’ve learned that:
- Andrew Ng’s Coursera course on Convolutional Neural Networks (the third in a series of courses about Deep Learning) is a great place to learn the basic concepts and tools of Deep Learning applied to Computer Vision. I would not have been able to do anything without it.
- Keras is a high level API on top of TensorFlow, and it’s the easiest to use to play around with deep learning models. TensorFlow itself is too low level and confusing for a beginner. I’m sure Caffe / PyTorch are great too — but Keras really did the job for me.
- Conda is a great way to manage python environments as you play with this. Nvidia-docker is great too, but only necessary once you get a GPU.
When you start, go through the fundamentals first. It’s hard to learn the basic concepts from internet tutorials. Turn to a course (or a book, if that’s more your style) and learn the core concepts. It will make your life much easier moving forward.
Step 2: Figuring out how to implement Landmark Extraction
With my newly found basic knowledge, I set out to figure out how to implement my custom model.
“Object Classification” and “Object Detection” are off the shelf today. What I am trying to do is neither — turns out the term in the literature is “Landmark Detection”. Figuring out the right term for what I was doing really helped.
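To make “Landmark Detection” concrete: for my purposes it boils down to regressing K (x, y) coordinate pairs per image. Here’s a minimal sketch of the output format — the three-landmark layout and function names are my own illustration, not the exact model:

```python
import numpy as np

# Illustrative setup: a landmark-detection model regresses K (x, y)
# pairs, flattened into a single vector of 2K floats normalized to [0, 1].
K = 3  # e.g. left eye, right eye, nose

def to_pixel_coords(prediction, width, height):
    """Convert a flat, normalized prediction vector back to pixel (x, y) pairs."""
    points = prediction.reshape(K, 2)           # (K, 2) normalized coordinates
    return points * np.array([width, height])   # scale to the image size

# A made-up "prediction" for a 640x480 frame.
pred = np.array([0.25, 0.40, 0.75, 0.40, 0.50, 0.65])
print(to_pixel_coords(pred, 640, 480))
# → [[160. 192.] [480. 192.] [320. 312.]]
```

Once I saw the problem as “predict 2K numbers per image,” reusing a pretrained classification base with a small regression head on top stopped feeling daunting.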
Now — new questions. What kind of model would be good? How much data do I need? How should I label the data? How does training work? What’s a good minimal viable development workflow?
First goal was to get *something* working. I could work on quality later on. Walk before you run kind of thing.
What I have learned:
- Building your own data labeling UI is a really good idea. The off-the-shelf labelers either didn’t work for me, were Windows-only, or did too much. The flexibility proved really useful later on when I needed to change the data I was labeling (like adding new landmarks).
- Tagging speed matters. I got tagging down to about 300 images / hour — one image every 12 seconds. Tagging all 8,000 images took about 26 hours. Every second matters if you want to tag real amounts of data. Building my own tagger had an upfront cost, but really lowered this effort.
- Manually labeling data gives you a good sense of what goes into the model.
- Pre-processing images for training seemed like a detail at first, but turned out to be critical, and it took me a few days to understand how to fix it. Check out this Stack Overflow question — calling `preprocess_image` in the right place made the difference between working and completely not working. :facepalm:
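For context, this is the kind of transform the preprocessing step performs — a sketch assuming a MobileNet-style scaling to [-1, 1] (other base models subtract per-channel means instead; the exact scheme depends on the pretrained network you build on). The bug-prone part is applying it at training time but forgetting it at inference time, or vice versa:

```python
import numpy as np

# Sketch of the preprocessing a pretrained base network expects.
# Many ImageNet models want uint8 pixels in [0, 255] rescaled to
# floats in [-1, 1]. Whatever the scheme, it must be applied
# identically during training AND during inference.

def preprocess(image):
    """Scale uint8 pixels in [0, 255] to floats in [-1, 1]."""
    return image.astype(np.float32) / 127.5 - 1.0

raw = np.array([[0, 127, 255]], dtype=np.uint8)
print(preprocess(raw))  # → [[-1.  -0.00392157  1.]]
```

Feed the network raw [0, 255] pixels after training it on [-1, 1] inputs and the predictions are garbage — which is exactly the “completely not working” failure mode above.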
This feeling of a delicate black box — one that only works if the right details are in the right places — was consistent at almost every step of the way.
Tracking bugs, identifying issues, narrowing the problem down — natural tasks in normal software engineering — are just not that easy today in Deep Learning development.
For a beginner like me, figuring out the issue felt more like magic and chance than a deterministic process. It’s unclear to me if anybody in the industry knows how to do this well — it feels more like everybody is trying to figure this out.
After about three weeks I had something in place: I could label data, train a model on it, run that model in Jupyter Notebook on a photo and get real coordinates (with dubious placement :-) ) as output.
Step 3: Make sure the model runs on iOS
With a simple working model in hand, my next step was to ensure it can run on a phone, and run fast enough to be useful.
Keras/TensorFlow models do not run natively on iOS, but Apple has its own neural-net framework — CoreML. Running an `.mlmodel` on iOS can be done with tutorial code. I was blown away by how easy it was.
But even this simple translation step (from `.h5` to `.mlmodel`) was not without challenges.
- Apple’s tool to convert an `.h5` model into an `.mlmodel`, `coremltools`, is a work in progress. The version installed with `pip` didn’t work out of the box, and I had to build it from source, in a conda environment using `python2.5`. Duct tape duct tape duct tape. Hey — at least it worked. :-)
- Figuring out how to preprocess input images on the phone the way the model expects was weirdly not obvious. I asked on StackOverflow, I searched blog posts — nothing. I ended up finding my answer by cold-emailing Matthijs Hollemans, and to my amazement he was nice enough to help me out! He even had a blog post about it — though I would not have found it without him. Incredible!
As it turns out, this entire deep-learning toolchain is still in development. Things change fast in Deep Learning land.
On the other hand, I loved that the community was small, helpful, and active. If you’re stuck like I was, don’t hesitate to reach out and ask questions directly over email. Worst case, nobody answers. Best case, you find somebody as nice and helpful as Matthijs! Thank you!
My model was running, on an actual phone, at 19 Frames Per Second — it felt like an amazing milestone! With the basics in place, I could now focus on quality.
Step 4: Make the model perform well
Ooof, this took a while. How do I get production ready performance out of a deep learning model? More data? Different top layers? Different loss functions? Different activation parameters for the layers? Daunting!
Incremental steps seem best. Tweak, train, compare to previous runs, see what works. Start with small data, add slowly. Small data also means short training times. Once you have big data, waiting 24 hours for a run is not unusual, and that’s not really iterating “fast”.
Data Augmentation is code that can go wrong. Start without it, keep around a way to run without it, and then slowly add data augmentation. Make sure it’s solid. Your model will always only be as good as the data you feed into it.
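Here’s a sketch of what I mean by augmentation code that can go wrong — assuming a horizontal flip and a made-up three-landmark layout (left eye, right eye, nose, with coordinates normalized to [0, 1]). Mirroring the pixels is the easy part; keeping the landmark bookkeeping correct is where the bugs hide:

```python
import numpy as np

# Minimal horizontal-flip augmentation for landmark data.
# Illustrative layout: landmarks are (x, y) pairs normalized to [0, 1],
# with row 0 = left eye, row 1 = right eye, row 2 = nose.

def flip_horizontal(image, landmarks):
    """Mirror the image and keep the landmarks semantically correct."""
    flipped_image = image[:, ::-1]           # mirror pixels left-right
    flipped = landmarks.copy()
    flipped[:, 0] = 1.0 - flipped[:, 0]      # mirror the x coordinates
    flipped[[0, 1]] = flipped[[1, 0]]        # swap left/right eye labels
    return flipped_image, flipped

image = np.arange(12).reshape(3, 4)
landmarks = np.array([[0.2, 0.4], [0.7, 0.45], [0.5, 0.7]])
_, new_landmarks = flip_horizontal(image, landmarks)
print(new_landmarks)
# → [[0.3  0.45] [0.8  0.4 ] [0.5  0.7 ]]
```

Skip the last swap and the model trains on mirrored images where the “left eye” label sits on the right eye — a silent data bug that just shows up as mysteriously worse accuracy.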
Assume that time will be wasted. Assume that learning best practices will take time. You can’t learn from mistakes unless you go ahead and make them. Go ahead and make mistakes.
What I’ve learned while trying to do this:
- This might sound obvious — but using TensorBoard was an order of magnitude improvement in my development iterations.
- Debugging image data from my DataGenerator showed me image processing bugs that were affecting my model. Like that time when I was mirroring the image, but I wasn’t swapping the Left Eye with the Right Eye. :facepalm:
- Talking to people that are actively training models and have experience saved me a lot of time. A Romanian Machine Learning group and a few very generous friends (thank you Cosmin Negruseri, Matt Slotkin, Qianyi Zhou and Ajay Chainani) proved critical. Having somebody to ask when I was stuck was incredible.
- Doing anything that’s not the default was generally a bad idea. Like when I tried the top layers from this blog post on the fisheries competition that were using `activation='relu'` — the layers turned out to be good, but the `activation='relu'` was a bad idea. Or when I tried my own `L1 loss` function, which turned out to be worse than the more standard mean squared error.
- Writing a DataGenerator was necessary — data augmentation matters.
- When you run tutorials, learning, or training the first model on a few hundred images, a CPU is just fine. A GPU would have been a distraction.
- With a real data set (8,000 images) and a DataGenerator (80,000 images), training on a GPU becomes critical. Even then a training run takes 24 hours.
- Amazon’s GPUs are expensive for personal development. At 24 hours per iteration, and ~$1/hour, that very quickly adds up. Thank you Cosmin for letting me SSH into your PC and use your GPUs for free. ;-)
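To spell out the loss-function comparison: here’s a toy numpy computation of the L1 loss I tried versus mean squared error, the usual default for coordinate regression (the numbers are made up). L1 penalizes all errors linearly, while squaring makes large errors dominate the gradient:

```python
import numpy as np

# Toy comparison of two regression losses on flattened landmark
# coordinates. Values are illustrative, not from the real model.

truth = np.array([0.20, 0.40, 0.80, 0.40])   # ground-truth (x, y) pairs
pred  = np.array([0.25, 0.38, 0.70, 0.45])   # model prediction

l1_loss  = np.mean(np.abs(truth - pred))     # mean absolute error
mse_loss = np.mean((truth - pred) ** 2)      # mean squared error

print(l1_loss, mse_loss)  # → 0.055 0.00385
```

The point is not which number is bigger — losses aren’t comparable on absolute scale — but that they shape training differently, and for me the non-default choice trained worse.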
Despite not being perfect, the end result performed really well — well enough to build an app with it!
And I have a feeling that, were I to be a full time machine learning engineer, making it shine would be possible.
But like any product, the last 20% takes 80% of the time, and that work will have to wait for the next version.
If you’re not a little ashamed of what you’re shipping, you’ve probably taken too long to put it out, right? Especially true for side-projects.
Step 5: build the iOS app, the filters, and tie it all together
With a good enough model in hand, now onto Swift, ARKit and as it turns out, SpriteKit for 2D content. iOS and its frameworks continue to impress me. The kind of stuff you can do these days on a phone is really mind blowing if you put it in perspective.
The app itself is really basic — a big record button, some swiping to switch filters, a share button.
Most of the work was in learning ARKit, and (unfortunately) figuring out its limitations. How to pull 3D models in, how to remove and add them from scenes, lighting, animations, geometry.
What I learned in the process:
- ARKit is great until it isn’t. Yes, it’s super easy to add 3D content. Yes, it’s fun and the API is great. And yes, once you drop something into a scene and leave it there — it works.
- ARHitTestResult promises that, given a pixel in your image, it will give you back its 3D coordinates. It works, but it’s really imprecise — the results are in the right place about 70% of the time, and way off the other 30%. That really put a dent in my plans to attach nice filters to the face. :-(
- Backup plan: build 2D filters. SpriteKit, Apple’s 2D gaming engine, is really easy to use — with a built-in physics engine. Fun to play with and learn (albeit superficially).
For a first-generation technology, ARKit, combined with CoreML, blew my mind.
Within a few weeks, I was able to run my Deep Learning model on a live video feed from the camera, extract face landmarks, and show 3D content with ARKit and 2D content with SpriteKit — all with decent accuracy.
Only two years ago, Snapchat had to pay $150 million for a company with similar technology (on human faces). Now iOS ships human face landmark detection for free, and unlike my ARHitTestResult results, the precision is spot on. It’s crazy how quickly this kind of technology is being commoditized.
Give it another couple of years, once iPhones have infrared sensors on the back, and the 3D mapping of your environment should get really good.
I feel like I’ve gotten a solid sense of where Deep Learning is, what the AI hype is really about, and where iPhone capabilities are today — ARKit, SpriteKit, Swift, to name a few.
You can’t take a Deep Learning model off the shelf today for anything that isn’t basic, but that future is not far off.
If you jump through the necessary hoops, and the necessary duct tape — it feels to me like the tech is here to be used.
I didn’t have to get into the nitty gritty internals of neural networks, and I didn’t have to touch any TensorFlow directly.
High level Keras was more than enough. A week-long online course on the basics of Convolutional Neural Nets was all I needed. Of course, this doesn’t make me an expert — but it got me to a great minimal viable product.
I’ve come to believe that Apple must be working on Augmented Reality beyond the phone. When they launch their Magic-Leap-equivalent product, building AR for it will be incredibly easy — ARKit is already impressive.
After this exercise I’ve become bullish on Deep Learning, especially in computer vision. It feels like magic.
The kinds of structured data we will be able to extract from something as simple as a photo will be incredible. What dog toys you buy, how much dog food you have left, what kind of dog food you prefer, when you take your dog to the vet. Understanding everything about your relationship with your pet (or your baby, or your partner) will be possible from something as simple as your photos of them.
Thanks for reading, hope you found this useful! And if you have any suggestions please don’t hesitate — would love to make my app better!
Download it in the App Store and let me know what you think.
p.s. Huge thanks to Cosmin Negruseri, Matt Slotkin, Qianyi Zhou and Ajay Chainani for help with my efforts, and for reading this draft! Huge thanks to Andy Bons for having the original idea for this app!