Developing and Deploying ResNet in a Web Application: CarNets from Google images

Xiwang Li
9 min read · Nov 16, 2018


From theory to deployment, step by step in an hour.

You can deploy your own image recognition ConvNet in an hour by following this blog. See what I got:

My CarNet: recognizing Volvo XC60

If you want to see my code directly, check my GitHub repo: (https://github.com/XiwangLi)

What are ConvNets (or Convolutional Neural Networks)?

There are many good introductory blogs, videos, and courses on CNNs. For example:

Stanford CS231 course site: http://cs231n.github.io/convolutional-networks/

Andrew Ng’s Coursera course: https://www.coursera.org/learn/convolutional-neural-networks

I am a newbie here, so I will try to explain it from a layman’s point of view. ConvNets are often used in computer vision (object recognition). I know they are also used in natural language processing to capture a word’s meaning together with the surrounding words (yes! N-grams). But I will only talk about their application in computer vision (image recognition).

What are images from computer vision’s point of view?

First, I always ask myself: what is an image from the computer’s point of view? Images are just numbers!! Pixel values. Three channels (red, green, and blue) for color images, and one channel for grey images. Each value is between 0 and 255. At the same time, image resolution is another important feature; the resolution is the count of pixels. Below is an illustration of the same image at different pixel resolutions:

Source: https://en.wikipedia.org/wiki/Image_resolution

So the computer sees a color image as a (3, n, m) matrix and a grey image as a (1, n, m) matrix. For example, the image below is an illustration of a grey image of the number “8”.

If we convert it to a (1, n, m) matrix, it will be:
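
You can check this yourself with a couple of lines of Python (Pillow and NumPy are used here only for illustration, and the file name is a placeholder):

import numpy as np
from PIL import Image

img = Image.open('eight.png').convert('L')  # 'L' = single-channel greyscale
pixels = np.array(img)                      # a 2D array of values from 0 to 255
print(pixels.shape)                         # e.g. (28, 28)
print(pixels[None].shape)                   # (1, 28, 28): the (1, n, m) form above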

ConvNet architecture

Now that we know images are just matrices (or tensors), ConvNets are simply regular neural networks that process these input matrices. The first concept we need to know about ConvNets is the filter (also called a kernel). A ConvNet slides a filter over the input image (the matrix of pixel values), multiplying the kernel element-wise with the patch of input it covers and summing the results to produce each output value. To keep the output the same size after the filter operation, we add padding and select an appropriate stride length. Please check Andrew Ng’s course (https://www.coursera.org/learn/convolutional-neural-networks) for the details about padding, stride, pooling, and more.

2D convolution with 1×1 zero border padding and 2×2 strides (source: deeplearning.net)
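
To see exactly what this operation computes, here is a naive NumPy sketch of 2D convolution with zero padding and a configurable stride (purely illustrative, with no claim to efficiency):

import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Zero-pad the border so the output size can be controlled
    if padding:
        image = np.pad(image, padding, mode='constant')
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel element-wise with the patch it covers, then sum
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * kernel).sum()
    return out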

As Andrew Ng explains in his course, we can use different filters to identify vertical/horizontal lines:

For example:

edge identification with filters
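
Reusing the conv2d sketch above, here is the classic toy example from the course: an image that is bright on the left and dark on the right, convolved with a vertical-edge filter:

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])

# A 6x6 toy image: bright (10) left half, dark (0) right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

print(conv2d(image, vertical_edge))
# The 4x4 output is large (30) exactly in the columns where brightness changes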

In a regular ConvNet, we stack filter layers (conv layers), pooling layers (max pool, average pool), and fully connected layers (regular neural networks) many times to get to the final answer. See the architecture of VGG, for example:

A visualization of the VGG architecture (source: https://www.cs.toronto.edu/~frossard/post/vgg16/)

ConvNet architectures

From the discussion above, we see that ConvNets are usually made up of three types of layers: convolution layers, pooling layers, and fully connected layers. Sometimes people also explicitly write the activation function as a layer (as in Keras: model.add(Activation('relu'))).

Now let’s have a closer look at the layers of VGG-16.

From the figure above, we can see it follows the general pattern as:

INPUT →[[conv →RELU]*n →POOL?]*m →[FC →RELU]*k →FC
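
To make this pattern concrete, here is a tiny sketch in Keras (mentioned above); the layer counts and sizes are arbitrary choices of mine, not VGG’s:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# INPUT -> [conv -> RELU]*2 -> POOL -> [conv -> RELU]*1 -> POOL -> FC -> RELU -> FC
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)),
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(10, activation='softmax'),  # final FC layer over 10 hypothetical classes
])
model.summary()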

For the details about ConvNet architectures, please check http://cs231n.github.io/convolutional-networks/. I always wonder about the number of parameters (weights) to be trained in these ConvNets; there is a very good example on the CS231n page (copied directly below):

VGG parameters (Stanford CS231n course site)
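
As a quick sanity check on two numbers from that table (weights only, following the table’s convention of ignoring biases):

# First conv layer of VGG-16: 64 filters, each 3x3 over 3 input channels
print(3 * 3 * 3 * 64)        # 1,728 weights

# First fully connected layer: the 7x7x512 feature map flattened into 4,096 units
print(7 * 7 * 512 * 4096)    # 102,760,448 weights -- the bulk of VGG's parameters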

From AlexNet (8 layers) to VGG (16 or 19 layers), CNN architectures became deeper and deeper, and we hope that prediction accuracy keeps improving as the network gets deeper. Then people came up with an even deeper architecture: a plain 34-layer network (the “plain” counterpart of ResNet-34). However, people (scientists) found that simply stacking layers together does not always improve accuracy. The reason behind this is the notorious vanishing/exploding gradient problem. For the details about vanishing and exploding gradients, and how to solve them, please check Andrew Ng’s course at https://www.coursera.org/lecture/deep-neural-network/vanishing-exploding-gradients-C9iQO

Therefore, scientists from Microsoft introduced a so-called “identity shortcut connection” that skips one or more layers in the ConvNet. This idea turned out to be very useful for improving image recognition capability: ResNet won the ImageNet competition in 2015 (GoogLeNet won in 2014, with VGG a close runner-up). It was also around this time that machine accuracy first surpassed human-level performance on this task.

A residual block (https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035)
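
Since fastai (used later in this post) sits on top of PyTorch, here is a bare-bones PyTorch sketch of the idea; it only illustrates the identity shortcut, and omits the batch normalization that the real ResNet blocks use:

import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions that preserve the spatial size (padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        # The identity shortcut: add the input back before the final activation,
        # so gradients can flow through the skip even if the conv path saturates
        return F.relu(out + x)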

After the invention of ResNet, quite a few variants of its original structure have appeared, such as ResNeXt. ResNeXt, in my layman’s opinion, just splits a ResNet layer into several parallel branches and merges them back together at the end. There is a good blog post on Towards Data Science about this topic: https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035

Other than ReLU, there are a few other activation functions. Sigmoid and softmax are usually used as the final layer for classification problems: sigmoid is often used for binary (two-class) classification, and softmax for multi-class (more than two classes) classification problems.
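
In plain NumPy, just to illustrate (not part of the project code):

import numpy as np

def sigmoid(z):
    # Squashes one score into (0, 1): a probability for the positive class
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # [0.659 0.242 0.099] (approximately)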

The training process for a ConvNet is exactly the same as for regular neural networks (or any other type of neural net). We define a loss function between the predicted targets (a vector) and the true targets (a vector); for classification problems, cross entropy is the common choice. Then we use gradient descent (stochastic gradient descent, mini-batch) to minimize this loss function by adjusting all the parameters (weights) in the model.

cross entropy as loss function
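
Since the formula itself appeared as an image above, here it is in text form (a standard statement of cross entropy; the notation in the original figure may differ). For C classes, with one-hot true labels y and predicted probabilities ŷ, the loss for one example is:

L(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log \hat{y}_c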

Another important hyper-parameter for gradient descent is the learning rate, and there is a lot of discussion online about how to choose a good one (it cannot be too large, and it cannot be too small). There is an interesting paper by Leslie Smith about the one-cycle learning rate policy (Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates). He proposed a dynamic schedule where the learning rate increases at the beginning and decreases afterward; this helps the optimization converge faster. Check that paper if you are interested in this topic. There are also a few existing implementations of this idea (fastai’s fit_one_cycle, used later in this post, is one of them).

Okay. I do not want to talk too much about optimization, since I only know a little about it. But there are a few types of gradient descent optimization algorithms (we also need to specify one when we develop our model), like in Keras:

keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)

Adam, RMSprop, Adagrad, etc., are all provided in Keras (and in every other deep learning framework). Check out their details (http://ruder.io/optimizing-gradient-descent/).
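
As a minimal usage sketch (the loss and metric here are my example choices, and model can be any Keras model, such as the small one sketched earlier):

import keras

# The optimizer object is passed to compile() together with a loss;
# categorical cross entropy matches the multi-class setup discussed above
model.compile(optimizer=keras.optimizers.Adam(lr=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])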

Creating my CarNet from Google images

After some random chat about my thoughts on ConvNets, I am going to talk about another toy project I finished last weekend.

The basic idea of this toy project is to develop an image recognition model and deploy it to a toy website (https://carnet-xiwang.now.sh). See the screenshot below; it was able to recognize a Mazda CX-5.

My CarNet deployment: recognizing a Mazda CX5

The basic logic of this toy project is:

  1. Directly download images from Google Images to an AWS instance; the search keywords serve as the labels for the images.
  2. Develop and train a ConvNet for image recognition.
  3. Deploy the trained ConvNet to a website for demo/playing 😄

You can find the detailed code in the notebook in my GitHub repo. I will briefly explain the key ideas here.

Google image downloading: data collection

  1. Getting the image URLs from Google Images

Adrian Rosebrock wrote an excellent article about how to create a dataset for image recognition from Google Images. The basic idea is to run a couple of lines of JavaScript (see the code below) in the browser console to collect the URLs of all the images that your Google Images search returns.

// Collect the original-image URL ('ou') from each Google Images result on the page
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
// Open the URL list as a downloadable text file
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
Google image url downloading

2. Uploading the URLs to AWS and downloading the images

The JavaScript code saves all the URLs into a txt file, which I then upload to my AWS instance. I then run the code below for “Mazda CX-5”; it downloads and saves all the images into a folder. The steps for the other cars are identical.

folder = 'mazdacx5'    # destination folder for this car model
file = 'mazdacx5.txt'  # the txt file of image URLs collected above

path = Path('data/Cars')
dest = path/folder

dest.mkdir(parents=True, exist_ok=True)

# download_images comes from fastai.vision (imported in the next section);
# it fetches up to max_pics images from the URL list into dest
download_images(path/file, dest, max_pics=200)

After all the images are downloaded, let’s have a quick peek at them:

checking car images
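
The loading-and-peeking step looks roughly like this in fastai v1 (a sketch; the validation split, image size, and seed here are illustrative choices, so see the notebook in my repo for the exact code):

np.random.seed(42)
# Build train/validation sets from the image folders; folder names become labels
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224,
                                  num_workers=4).normalize(imagenet_stats)
data.show_batch(rows=3, figsize=(7, 8))  # display a grid of sample images
print(data.classes)                      # labels inferred from the folder names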

Training the ConvNet model

Here, as I am taking the deep learning course from fast.ai, I am using the fastai deep learning package:

from fastai import *
from fastai.vision import *

The transfer learning concept is also used here: the model starts from a ResNet-34 pretrained on ImageNet:

learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(10, max_lr=slice(3e-5,3e-4))
Total time: 02:53
epoch train_loss valid_loss error_rate
1 1.037346 1.071718 0.367347 (00:18)
2 0.931142 0.970327 0.311224 (00:17)
3 0.830875 0.884162 0.290816 (00:17)
4 0.712374 0.762323 0.239796 (00:16)
5 0.622550 0.747793 0.260204 (00:16)
6 0.538034 0.683835 0.229592 (00:17)
7 0.460081 0.623278 0.204082 (00:17)
8 0.398304 0.586122 0.178571 (00:18)
9 0.354416 0.580828 0.173469 (00:16)
10 0.318548 0.570182 0.183673 (00:16)

For demo purposes, in this first round of the toy project, the accuracy was around 82% after 10 epochs. See the confusion matrix for the validation dataset below. It seems that 3 BMW X5s and 3 Ford Edges were predicted as Volvo XC60s (do they look similar? 😢), and 2 Toyota RAV4s and 2 Mazda CX-5s were predicted as BMW X5s (the price is three times higher, okay? 😰).
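
In fastai v1, this confusion matrix comes from the interpretation tools; a sketch, with the plot options as my choices:

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(8, 8))
interp.most_confused(min_val=2)  # the class pairs the model mixes up most often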

After a little research into why the accuracy could not be higher, I found a few reasons (I plan to fix them and see how accurate my model can get, so stay tuned for updates).

  1. Data size. I did not use very many images (I can use more).
  2. Data quality. A lot of the images are just pictures of tires, wheels, seats, or even transmissions 😆. I tried to remove some of them, but removing them manually is tooooo tedious; I cannot spend a whole day on it for a toy project. I noticed that there is a “Cars Dataset” from Stanford (https://ai.stanford.edu/~jkrause/cars/car_dataset.html). I will also try that in the future, but they saved all the metadata and labels in a Matlab .mat file (what? why? 😅), and I have not had a Matlab license since I graduated from school. Update: I found the data (images and labels) on Kaggle (https://www.kaggle.com/jutrera/training-a-densenet-for-the-stanford-car-dataset/data). I will try this ASAP.
  3. Model parameter tuning. I did not spend much time tuning the hyper-parameters either (I need to learn more and practice more on this for my next step).
  4. Maybe I should try ResNet-50 (a more complicated model).

Deployment

I deployed my CarNet through ZEIT (https://zeit.co/). The process is very straightforward:

  1. Install and set up the ZEIT server.
  2. Upload my model (saved as a file) to Google Drive or Dropbox, and get a download link for this file.
  3. Specify a server file (provided by ZEIT), including the download link for your model, your website alias, and the classes (for classification); see the sketch after this list. So for my case, the classes are:
['BMWX5',
'F150',
'Fordedge',
'GrandCherokee',
'ToyotaRAV4',
'acurardx',
'hondacrv',
'mazdacx5',
'subaruoutback',
'test',
'volvoxc60']

4. Then go to the alias (website) to test your model

5. Done
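
For reference, the part of the server file holding these values looks roughly like this (a hypothetical excerpt: the variable names follow the fast.ai/ZEIT template as I remember it, and the download link is a placeholder):

# Hypothetical excerpt of the ZEIT server file; replace YOUR_FILE_ID with
# the ID from your own Google Drive / Dropbox download link
model_file_url = 'https://drive.google.com/uc?export=download&id=YOUR_FILE_ID'
model_file_name = 'model'
classes = ['BMWX5', 'F150', 'Fordedge', 'GrandCherokee', 'ToyotaRAV4',
           'acurardx', 'hondacrv', 'mazdacx5', 'subaruoutback', 'test',
           'volvoxc60']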

Conclusion

This is a homework-style toy project covering ConvNet development from data collection to model deployment. I described what I did and how, and also shared my layman’s understanding of ConvNets. If you find this information useful and want to know more, I highly recommend the Fast.ai deep learning courses (https://course.fast.ai). Please also let me know if you have any comments or suggestions regarding my work. Let’s learn and grow together.
