How HBO’s Silicon Valley built “Not Hotdog” with mobile TensorFlow, Keras & React Native

The author’s development setup with the attached eGPU used to train Not Hotdog’s AI.
  1. The App
  2. From Prototype to Production
    V0: Prototype
    V1: TensorFlow, Inception & Transfer Learning
    V2: Keras & SqueezeNet
  3. The DeepDog Architecture
    Training
    Running Neural Networks on Mobile Phones
    Changing App Behavior by Injecting Neural Networks on the Fly
    What We Would Do Differently
  4. UX, DX, Biases & The Uncanny Valley of AI

1. The App

2. From Prototype to Production

V0: Prototype

Example image & corresponding API output from Google Cloud Vision’s documentation
  1. First and foremost, its accuracy in recognizing hotdogs was only so-so. While it’s great at recognizing a large number of things, it’s not so great at recognizing one thing specifically, and some very common examples failed during our experiments with it in 2016.
  2. Because of its nature as a cloud service, it was necessarily slower than running on device (network lag is painful!), and unavailable offline. The idea of images leaving the device could also potentially trigger privacy & legal concerns.
  3. Finally, if the app took off, the cost of running on Google Cloud could have become prohibitive.
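For context, the V0 prototype’s classification boiled down to a single label-detection call against the Cloud Vision API. The snippet below is a minimal sketch of that kind of call using the google-cloud-vision Python client, not the prototype’s actual code; the helper name and the 0.8 threshold are illustrative assumptions.

```python
# Minimal sketch of the kind of Cloud Vision label-detection call the V0
# prototype relied on; not the app's actual code. Assumes the
# google-cloud-vision client library and credentials are configured.
from google.cloud import vision

def looks_like_hotdog(image_path, threshold=0.8):
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        # vision.types.Image in older versions of the client library
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    # The API returns generic labels ("food", "hot dog", "fast food", ...)
    # with confidence scores; we simply look for a hotdog-ish label.
    return any(
        "hot dog" in label.description.lower() and label.score >= threshold
        for label in response.label_annotations
    )
```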

V1: TensorFlow, Inception & Transfer Learning

V2: Keras & SqueezeNet

SqueezeNet vs. AlexNet, the grand-daddy of computer vision architectures. Source: SqueezeNet paper.
  1. During the training phase, it’s much faster to train a smaller network. There are fewer parameters to keep in memory, which means you can parallelize your training a bit more (larger batch size), and the network will converge (i.e., approximate the idealized mathematical function) more quickly.
  2. In production, the model is much smaller and much faster. SqueezeNet would require less than 10MB of RAM, while something like Inception requires 100MB or more. The delta is huge, and particularly important when running on mobile devices that may have less than 100MB of RAM available to run your app. Smaller networks also compute a result much faster than bigger ones.
  1. A smaller neural architecture has less available “memory”: it will not be as efficient at handling complex cases (such as recognizing 20,000 different objects), or even at handling complex subcases (like, say, appreciating the difference between a New York-style hotdog and a Chicago-style hotdog).
    As a corollary, smaller networks are usually less accurate overall than big ones. When trying to recognize ImageNet’s 20,000 different objects, SqueezeNet will only score around 58%, whereas VGG will be accurate 72% of the time.
  2. It’s harder to use transfer learning on a small network. Technically, there is nothing preventing us from using the same approach we used with Inception & VGG: have SqueezeNet “forget” a little bit, and retrain it specifically for hotdogs vs. not hotdogs. In practice, we found it hard to tune the learning rate, and results were always more disappointing than training SqueezeNet from scratch. This could also be caused or worsened by the open-world nature of our problem.
  3. Supposedly, smaller networks rarely overfit, but this happened to us with several “small” architectures. Overfitting means that your network specializes too much, and instead of learning how to recognize hotdogs in general, it learns to recognize exactly & only the specific hotdog images you were training with. A human analogue would be visually memorizing exactly which of the images presented to you were of a “hotdog”, without abstracting that a hotdog is usually composed of a sausage in a bun, possibly with condiments, etc. If you were presented with a brand new hotdog image that wasn’t one of the ones you memorized, you would be inclined to say it’s not a hotdog. Because smaller networks usually have less “memory”, it’s easy to see why it would be harder for them to specialize. But in several cases, our small networks’ accuracy jumped up to 99% and they suddenly became unable to recognize images they had not seen in training. This usually disappeared once we added enough data augmentation: stretching/distorting input images semi-randomly, so that instead of being trained 1,000 times on each of the 1,000 images, the network is trained on meaningful variations of the 1,000 images. This makes it unlikely that the network will memorize exactly the 1,000 images; instead it has to learn to recognize the “features” of a hotdog (bun, sausage, condiments, etc.) while staying fluid/general enough not to get overly attached to the specific pixel values of specific images in the training set.
Data Augmentation example from the Keras Blog.
  • Batch Normalization helps your network learn faster by “smoothing” the values at various stages in the stack. Exactly why this works is seemingly not well-understood yet, but it has the effect of helping your network converge much faster, meaning it achieves higher accuracy with less training, or higher accuracy after the same amount of training, often dramatically so.
  • Activation functions are the internal mathematical functions determining whether your “neurons” activate or not. Many papers still use ReLU, the Rectified Linear Unit, but we had our best results using ELU instead.

3. The DeepDog Architecture

Design

  • We do not use Batch Normalization & Activation between depthwise and pointwise convolutions, because the Xception paper (which discussed depthwise convolutions in detail) seemed to indicate it would actually lead to less accuracy in architectures of this type (as helpfully pointed out by the author of the QuickNet paper on Reddit). This also has the benefit of reducing the network size.
  • We use ELU instead of ReLU. Just like with our SqueezeNet experiments, it provided superior convergence speed & final accuracy compared to ReLU.
  • We did not use PELU. While promising, this activation function seemed to fall into a binary state whenever we tried to use it. Instead of gradually improving, our network’s accuracy would alternate between ~0% and ~100% from one batch to the next. It’s unclear why this happened, and might just come down to an implementation error or user error. Fusing the width/height axes of our images had no effect.
  • We did not use SELU. A short investigation between the iOS & Android releases led to results very similar to PELU. It’s our suspicion that SELU should not be used in isolation as a sort of silver-bullet activation function, but rather — as the paper’s title implies — as part of a narrowly-defined SNN (self-normalizing neural network) architecture.
  • We maintain the use of Batch Normalization with ELU. There are many indications that this should be unnecessary; however, every experiment we ran without Batch Normalization completely failed to converge. This could be due to the small size of our architecture.
  • We used Batch Normalization before the activation (see the sketch after this list). While this is a subject of some debate these days, our experiments placing BN after the activation on small networks also failed to converge.
  • To optimize the network we used Cyclical Learning Rates and (fellow fast.ai student) Brad Kenstler’s excellent Keras implementation. CLRs take the guessing game out of finding the optimal learning rate for your training. Even more importantly, by adjusting the learning rate both up & down throughout training, they helped us achieve a final accuracy that was, in our experience, better than with a traditional optimizer. For both of these reasons, we can’t conceive of using anything other than CLRs to train a neural network in the future.
  • For what it’s worth, we saw no need to adjust the α or ρ values from the MobileNets architecture. Our model was small enough for our purposes at α = 1, and computation was fast enough at ρ = 1, and we preferred to focus on achieving maximum accuracy. However, this could be helpful when attempting to run on older mobile devices, or embedded platforms.
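To make those ordering choices concrete, here is a minimal Keras sketch of one depthwise-separable block laid out as described above: no BN or activation between the depthwise and pointwise convolutions, and Batch Normalization placed before an ELU activation. It is an illustrative approximation, not the actual DeepDog code; the filter counts and strides are placeholders.

```python
# Sketch of a MobileNets-style depthwise-separable block following the design
# choices above (ELU instead of ReLU, BN before the activation, and nothing
# between the depthwise and pointwise convolutions). Sizes are placeholders.
from tensorflow.keras import Input, Model, layers

def separable_block(x, filters, stride=1):
    # Depthwise conv: no BN or activation directly after it, per the
    # observation from the Xception paper for this type of architecture.
    x = layers.DepthwiseConv2D(kernel_size=3, strides=stride,
                               padding="same", use_bias=False)(x)
    # Pointwise (1x1) conv, then Batch Normalization *before* the ELU activation.
    x = layers.Conv2D(filters, kernel_size=1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ELU()(x)
    return x

# Example usage with a 224 x 224 RGB input (placeholder filter count):
inputs = Input(shape=(224, 224, 3))
outputs = separable_block(inputs, filters=32, stride=2)
model = Model(inputs, outputs)
```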

Training

  • Sourcing more images, and more varied images (height/width, background, lighting conditions, cultural differences, perspective, composition, etc.)
  • Matching image types to expected production inputs. Our guess was people would mostly try to photograph actual hotdogs, other foods, or would sometimes try to trick the system with random objects, so our dataset reflected that.
  • Give lots of examples of similar-looking things that may trip up your network. Some of the things that look most similar to hotdogs are other foods (such as hamburgers, sandwiches, or in the case of naked hotdogs, baby carrots or even cooked cherry tomatoes). Our dataset reflected that.
  • Expect distortions: in mobile situations, most photos will be worse than the “average” picture taken with a DSLR or in perfect lighting conditions. Mobile photos are dim, noisy, and taken at an angle. Aggressive data augmentation was key to countering this.
  • Additionally, we figured that users may lack access to real hotdogs, so they might try photographing hotdogs from Google search results, which led to its own types of distortion (skewing if the photo is taken at an angle, flash reflection on the screen, and visible moiré effects caused by taking a picture of an LCD screen with a mobile camera). These specific distortions had an almost uncanny ability to trick our network, not unlike recently published papers on Convolutional Networks’ (lack of) resistance to noise. Using Keras’ channel shift feature resolved most of these issues.
Example distortion introduced by moiré and a flash. Original photo: Wikimedia Commons.
  • Some edge cases were hard to catch. In particular, images of hotdogs taken with a soft focus or with lots of bokeh in the background would sometimes trick our neural network. This was hard to defend against as a) there just aren’t that many photographs of hotdogs in soft focus (we get hungry just thinking about it) and b) it could be damaging to spend too much of our network’s capacity training for soft focus, when realistically most images taken with a mobile phone will not have that feature. We chose to leave this largely unaddressed as a result.
  • We applied rotations within ±135 degrees — significantly more than average, because we coded the application to disregard phone orientation.
  • Height and width shifts of 20%
  • Shear range of 30%
  • Zoom range of 10%
  • Channel shifts of 20%
  • Random horizontal flips to help the network generalize
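In Keras terms, those settings map roughly onto an ImageDataGenerator along the following lines. This is a sketch rather than the production configuration: the units for shear and channel shift depend on the Keras version and on how pixel values are scaled, so treat the values as indicative.

```python
# Rough Keras translation of the augmentation settings listed above.
# Units are indicative: channel_shift_range depends on whether pixel values
# are in [0, 255] or [0, 1], and shear_range units changed across Keras
# versions, so these are not the exact production values.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=135,        # rotations within +/-135 degrees
    width_shift_range=0.2,     # height & width shifts of 20%
    height_shift_range=0.2,
    shear_range=0.3,           # shear range of 30%
    zoom_range=0.1,            # zoom range of 10%
    channel_shift_range=0.2,   # channel shifts of 20% (for [0, 1]-scaled images)
    horizontal_flip=True,      # random horizontal flips
)
# train_generator = datagen.flow(x_train, y_train, batch_size=128)
```

Training then proceeded in three CLR phases: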
  • Phase 1 ran for 112 epochs (7 full CLR cycles with a step size of 8 epochs), with a learning rate between 0.005 and 0.03, on a triangular 2 policy (meaning the max learning rate was halved every 16 epochs).
  • Phase 2 ran for 64 more epochs (4 CLR cycles with a step size of 8 epochs), with a learning rate between 0.0004 and 0.0045, on a triangular 2 policy.
  • Phase 3 ran for 64 more epochs (4 CLR cycles with a step size of 8 epochs), with a learning rate between 0.000015 and 0.0002, on a triangular 2 policy.
UPDATED: a previous version of this chart contained inaccurate learning rates.
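As an illustration of how these phases translate into code, Phase 1 could be expressed with Brad Kenstler’s CyclicLR Keras callback roughly as follows. In that implementation step_size is counted in iterations (batches) rather than epochs, so the 8-epoch step size has to be converted; the batch size and dataset size below are placeholders, not the production values.

```python
# Sketch of Phase 1 above using Brad Kenstler's CyclicLR Keras callback
# (github.com/bckenstler/CLR). step_size is counted in iterations (batches),
# so the 8-epoch half-cycle is converted explicitly.
from clr_callback import CyclicLR  # from the bckenstler/CLR repository

batch_size = 128                   # placeholder
num_train_images = 100_000         # placeholder dataset size
steps_per_epoch = num_train_images // batch_size

clr_phase1 = CyclicLR(
    base_lr=0.005,                   # Phase 1 minimum learning rate
    max_lr=0.03,                     # Phase 1 maximum learning rate
    step_size=8 * steps_per_epoch,   # 8-epoch half-cycle, in iterations
    mode="triangular2",              # max LR halved every full 16-epoch cycle
)

# model.fit(x_train, y_train, epochs=112, batch_size=batch_size,
#           callbacks=[clr_phase1])
```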

Running Neural Networks on Mobile Phones

  • Rounding the weights of our network helped compress the network to ~25% of its size (see the toy sketch after this list). Essentially, instead of using the arbitrary stock values derived from your training, this optimization picks the N most common values and sets all parameters in your network to these values, which drastically reduces the size of your network when zipped. This however has no impact on the uncompressed app size, or memory usage. We did not ship this improvement to production as the network was small enough for our purposes, and we did not have time to quantify how much of a hit the rounding would have on the accuracy of the app.
  • Optimizing the TensorFlow lib by compiling it for production with -Os.
  • Removing unnecessary ops from the TensorFlow lib: TensorFlow is in some respects a virtual machine, able to interpret a number of arbitrary TensorFlow operations: additions, multiplications, concatenations, etc. You can get significant weight (and memory) savings by removing unnecessary ops from the TensorFlow library you compile for iOS.
  • Other improvements might be possible. For example, unrelated work by the author yielded a 1MB improvement in Android binary size with a relatively simple trick, so there may be more areas of TensorFlow’s iOS code that can be optimized for your purposes.
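To illustrate the weight-rounding idea from the first bullet above: quantizing weights to N levels leaves the uncompressed size untouched, but makes the data dramatically more compressible. The toy numpy sketch below demonstrates the effect on random stand-in weights; it is not the tooling actually used to round the network’s weights.

```python
# Toy illustration of weight rounding: snapping weights to 256 levels keeps
# the raw size the same, but makes the zipped representation far smaller.
import zlib
import numpy as np

# Stand-in for a network's weights: 1M random float32 values (~4MB raw).
weights = np.random.randn(1_000_000).astype(np.float32)

def round_to_levels(w, num_levels=256):
    # Snap every weight to the nearest of num_levels evenly spaced values
    # between the minimum and maximum weight.
    lo, hi = float(w.min()), float(w.max())
    step = (hi - lo) / (num_levels - 1)
    return (lo + np.round((w - lo) / step) * step).astype(np.float32)

rounded = round_to_levels(weights)

print(len(zlib.compress(weights.tobytes())))  # random floats barely compress
print(len(zlib.compress(rounded.tobytes())))  # only 256 distinct values: compresses far better
```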

Changing App Behavior by Injecting Neural Networks on the Fly

What We Would Do Differently

  • More carefully tune our data-augmentation parameters.
  • Measure accuracy end-to-end, i.e. the final determination made by the app, abstracting away things like whether the app has 2 or many more categories, what the final threshold for hotdog recognition is (we ended up having the app say “hotdog” only when recognition confidence is above 0.90, as opposed to the default of 0.5), the effect of rounded weights, etc. (see the sketch after this list).
  • Build a feedback mechanism into the app — to let users vent frustration if results are erroneous, or actively improve the neural network.
  • Use a larger resolution for image recognition than 224 x 224 pixels — essentially using a MobileNets ρ value > 1.0
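For the end-to-end measurement point, the stricter decision rule mentioned above amounts to comparing the model’s hotdog confidence against a 0.90 cutoff instead of the default 0.5. A hypothetical sketch (the model handle and its output layout are assumptions, not the app’s actual code):

```python
# Hypothetical sketch of the stricter decision threshold mentioned above:
# only call it a hotdog when the model's confidence clears 0.90.
import numpy as np

HOTDOG_THRESHOLD = 0.90  # stricter than the default 0.5 cutoff

def classify(model, image_batch):
    # Assumes the model outputs a single sigmoid "hotdog probability" per image;
    # the real app's output layout may differ.
    probs = model.predict(image_batch)
    return np.where(probs[:, 0] > HOTDOG_THRESHOLD, "hotdog", "not hotdog")
```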

4. UX, DX, Biases & The Uncanny Valley of AI

Source: New Scientist.
