A Beginner’s Tale of A First Computer Vision Project

Sindre E. de Lange
Aug 2, 2018

Making a robot able to see — Part 1


Highly sophisticated drawing of the components used in the project, and how they interact.

Documentation for the project can be found on GitHub. Check out the README file for a description of how to replicate the model training. Pictures of code and output are used throughout this article, and anybody who wishes can clone the repository and try it out for themselves. It is important to note that the “fast.ai” module is a Git submodule, which needs to be updated separately (for example with git submodule update --init).


As with any article about deep learning (DL), some technical terminology is unavoidable. The terms will not be explained in detail, and they are kept to a minimum, as the aim is to give a brief and simple overview of the project and the technology used. However, links are provided for those readers who are interested.

This project has a lot in common with the Emmy-nominated “Not Hotdog” app from the TV series Silicon Valley (except that this model can classify several classes, not just two), which is objectively cool. Another similar project, which served as inspiration, can be found here.

Data set

import os

def download_images(searchword, form="jpg", lim=100, directory="data"):
    # Notebook cell: lines starting with "!" run as shell commands (IPython)
    if not os.path.isdir(directory):
        ! mkdir $directory
    ! googleimagesdownload --keywords $searchword --format $form --limit $lim --output_directory $directory
    src_path = os.path.join(directory, searchword)
    if not os.path.isdir(src_path):
        ! mkdir $src_path

For the full method, please see the aforementioned GitHub project, which also splits the data into a training set and a validation set at a ratio of 70/30.
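A 70/30 split like the one described can be sketched in a few lines of plain Python. This is only an illustration and not the code from the GitHub project: the function name split_train_valid, the seed and the file names are made up for the example.

```python
import random

def split_train_valid(file_names, train_ratio=0.7, seed=42):
    """Shuffle file names and split them into training and validation lists."""
    shuffled = list(file_names)             # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names, only for illustration.
images = [f"hotdog_{i}.jpg" for i in range(10)]
train, valid = split_train_valid(images)
print(len(train), len(valid))
```

Shuffling before cutting matters, since downloaded images often arrive in search-rank order, and a plain slice would put all the "best" hits in the training set.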

Today’s DL models can obtain fairly good results with small data sets, in this case about 600 images for the training set and 200 for the validation set. This is helped by transfer learning, which improves the results significantly while also decreasing the training time. Furthermore, the DL framework used, fast.ai, offers built-in data augmentation, which changes the images slightly (rotating them, flipping them horizontally or vertically, etc.) while keeping the content the same. This effectively increases the size of the data set about 4 times (at runtime).
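To get a feeling for what "changing the image while keeping the content" means, here is a minimal sketch using flips on a tiny image stored as a list of rows. It is not how fast.ai implements augmentation (which works on real image arrays and applies random transforms during training); the helper names and the 2x3 "image" are assumptions made for the example.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Reverse the order of the rows (top to bottom)."""
    return image[::-1]

def augment(image):
    """Return the original image plus three flipped variants."""
    return [
        image,
        flip_horizontal(image),
        flip_vertical(image),
        flip_vertical(flip_horizontal(image)),
    ]

# A tiny 2x3 "image" of pixel values, made up for illustration.
tiny = [[1, 2, 3],
        [4, 5, 6]]
variants = augment(tiny)
print(len(variants))
```

Every variant contains exactly the same pixels, just arranged differently, so the label of the image stays valid while the model sees more variety.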

PATH and data cleaning


Many emotional movies, commercials, etc., ask the question “If you could tell your younger self anything, what would it be?” After about six months of practicing DL, one can safely state that the part that many people associate with DL, the training of a model, is really the easy part, while the hard part is first getting the data and then verifying and cleaning it.

For this project, this manifested as downloaded images that turned out not to be JPEG images, which is odd, because the wanted file type is specified when downloading. There are most certainly ways of converting to JPEG, but for lack of time, the easy fix was to delete every file whose detected type is not JPEG, regardless of its extension.

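import imghdr
import os
for path in file_paths:  # assumed: directories holding the downloaded images
    for file_name in os.listdir(path):
        file_path = os.path.join(path, file_name)
        # imghdr inspects the file contents, not the extension
        if imghdr.what(file_path) != 'jpeg':
            os.remove(file_path)  # delete anything that is not really a JPEG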
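One thing worth knowing is that the imghdr module was deprecated in Python 3.11 and removed in Python 3.13. On newer Python versions, a similar check can be sketched by reading the first bytes of the file, since JPEG files start with the bytes FF D8 FF. The function name looks_like_jpeg and the demo files below are made up for this example, and the check is deliberately simple: it only inspects the header, not the whole file.

```python
import os
import tempfile

def looks_like_jpeg(file_path):
    """Return True if the file starts with the JPEG magic bytes FF D8 FF."""
    with open(file_path, "rb") as handle:
        return handle.read(3) == b"\xff\xd8\xff"

# Demo on two throwaway files: one with a JPEG header, and one with a PNG
# header hiding behind a .jpg extension.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real.jpg")
    fake = os.path.join(tmp, "fake.jpg")
    with open(real, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 16)
    with open(fake, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    results = {
        "real.jpg": looks_like_jpeg(real),
        "fake.jpg": looks_like_jpeg(fake),
    }
print(results)
```

This also shows why the extension alone cannot be trusted: both files are named .jpg, but only one of them passes the check.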



Now that the data set is defined, separated and cleaned, the actual model training can start. The fast.ai framework makes this quite easy; 4 lines of code are all it takes:

tfms = tfms_from_model(arch, sz, aug_tfms=augmentation, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs, num_workers=1)
learn = ConvLearner.pretrained(arch, data, ps=0.4)
learn.fit(1e-3, 1)

Explaining every one of these parameters would require a whole new article, but at a high level, this is what the code does:

  1. Define a transforms object, which as mentioned earlier, increases the data set at runtime
  2. Define an image classifier object, which holds the data set and applies the data transforming
  3. Define a learn object, which as mentioned earlier uses transfer learning
  4. Run the data through the model, i.e. train the model.

There you go: done, terminado, completo! The model is now trained and ready for use. However, if one wishes to get anywhere near the potential of this framework and the models, hyperparameter tuning and more training are the next step.

Hyperparameter tuning (and more training)


There are quite a large number of hyperparameters one could optimize, but for this article the focus was mainly on learning rate and dropout. Batch size and image size were also adjusted, but mainly to get around CUDA out-of-memory errors (maxed-out GPU memory).

To find a good learning rate, the fast.ai framework offers a neat trick: the learning rate finder, which plots the relationship between different learning rates and their resulting loss. One quick thing to note is that this gets tricky for limited data sets like the one in this project, where the learning rate finder plots a pretty volatile curve. However, a general rule of thumb is that 1e-2 is a good place to start; one can then train the model with various other values, using that value as a reference point.
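The idea behind a learning rate finder can be sketched without any DL framework: take one training step per learning rate, let the rate grow exponentially, and record the loss until it blows up. The toy problem below (fitting a single weight so that (w - 3)^2 is minimized), the function name lr_range_test and the stopping threshold are all assumptions made for illustration; this is not fast.ai's implementation.

```python
def lr_range_test(start_lr=1e-4, end_lr=10.0, steps=100):
    """Take one gradient step per learning rate, growing the rate exponentially.

    Toy problem (an assumption for illustration): fit a single weight w so
    that the loss (w - 3)^2 is minimized.
    """
    w, target = 0.0, 3.0
    growth = (end_lr / start_lr) ** (1.0 / (steps - 1))
    lr, best_loss, history = start_lr, float("inf"), []
    for _ in range(steps):
        loss = (w - target) ** 2
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > 4 * best_loss:
            break                      # the loss is blowing up, stop early
        w -= lr * 2 * (w - target)     # gradient descent step
        lr *= growth
    return history

history = lr_range_test()
best_lr, lowest_loss = min(history, key=lambda pair: pair[1])
# Common heuristic: start roughly one order of magnitude below the minimum.
suggested_lr = best_lr / 10
print(f"lowest loss at lr={best_lr:.3g}, suggested start lr={suggested_lr:.3g}")
```

Plotting the (learning rate, loss) pairs in history on a logarithmic x-axis gives the kind of curve the learning rate finder shows. On a real, small data set each step sees a different mini-batch, which is one reason the curve gets as volatile as described above.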

Furthermore, a pretrained model in fast.ai starts out with two thirds of the layers in the network frozen, which essentially means that they will not be updated during training. One can unfreeze these in order to specialize the model to one’s data set, and possibly get a nicer learning rate curve.


Unfreezing can also lead to overfitting, which essentially shows up as the training loss dropping well below the validation loss, and this is where dropout comes in. Rule of thumb: when a model is overfitting, increase the dropout rate, and even try differential dropout (same principle as differential learning rate).
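For readers curious about what dropout actually does, here is a minimal sketch of the common "inverted dropout" variant: during training, each activation is zeroed with probability p, and the survivors are scaled up so the expected sum stays the same. The function name and the example values are made up, and real frameworks do this on tensors inside the network; the ps=0.4 argument in the training code above is presumably this kind of rate.

```python
import random

def dropout(activations, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

rng = random.Random(0)          # seeded, so the example is reproducible
activations = [1.0] * 10        # made-up activations for illustration
dropped = dropout(activations, p=0.4, rng=rng)
print(dropped)
```

A higher p removes more activations per step, which makes it harder for the network to memorize the training set. That is why increasing the dropout rate is a typical first response to overfitting.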

Lastly, the model is saved in order to be available for the fast.ai script from the image in the introduction:



With that said, 82.7% is a satisfactory number for making a robot “able to see”, which is what part two will focus on: using the trained and saved model to classify images streamed from a web camera.


Grensesnittet (DK: Grænseflade, EN: interface) is the place for professional communication between employees at Computas and others in the industry. The possibilities in technology are countless, and we know that we can use our knowledge to make a difference in people's lives.
