A Beginner’s Tale of A First Computer Vision Project

Making a robot able to see — Part 1


This article is part one of a two-part series describing a project that combines the extremely hot phenomenon that is Deep Learning (DL), specifically Computer Vision, with the Robot Operating System (ROS). The combination resulted in a pipeline for model training and object classification from a camera stream, pictured below. This first article covers the model training, namely the training of the model used by the “fast.ai” script in the image. Part two will cover the ROS side of the system, namely the initialization of its Pub/Sub feature and the rest of the setup in the image.

Highly sophisticated drawing of the components used in the project, and how they interact.

Documentation for the project can be found on GitHub. Check out the README file for a description of how to replicate this model training. Furthermore, pictures of code and output will be used in this article. Anybody who wishes can clone the repository and try it out for themselves. Note that the “fast.ai” module is a Git submodule, which needs to be updated separately from the main repository.


As with any article about DL, some technical terminology is unavoidable. The terms will not be explained in detail, and they are kept to a minimum, since the aim is to give a brief and simple overview of the project and the technology used. However, links are provided for readers who are interested.

This project has a lot in common with the “Emmy Award” nominated “Not Hotdog” software from Silicon Valley (except that this model can classify multiple labels, not just two), which is objectively cool. Another similar project, which served as inspiration, can be found here.

Data set

So, where does one start when training such a model? The first thing to do is to define a data set containing images and their labels, i.e. the correct classification of each image, since the model is trained with so-called supervised learning. There are several ways to build such a data set: one can manually download relevant images and put them in a folder named after their label, or one can find or write a script to do the job. The latter is far less work, more interesting, and more scalable. Here is an example of such a script, which, among other things, sets the folder structure to data/'label_name'.

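import os

# Jupyter notebook cell: lines starting with "!" run as shell commands
def download_images(searchword, form="jpg", lim=100, directory="data"):
    # Create the output directory if it does not exist yet
    if not os.path.isdir(directory):
        ! mkdir $directory
    # Download up to `lim` images for the search word
    ! googleimagesdownload --keywords $searchword --format $form --limit $lim --output_directory $directory
    # Make sure the folder for this label exists
    src_path = os.path.join(directory, searchword)
    if not os.path.isdir(src_path):
        ! mkdir $src_path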

For the full method, please see the aforementioned GitHub project, which also divides the data into a training set and a validation set with a 70/30 ratio.
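
As a rough illustration of such a split (not the project's actual code), a sketch could look like the following. The function name `split_dataset`, the fixed seed, and the `training`/`validation` folder names are assumptions made for this example.

```python
import os
import random
import shutil


def split_dataset(src_dir, dest_root, label, train_ratio=0.7, seed=42):
    """Move the files in src_dir into dest_root/training/<label> and
    dest_root/validation/<label>, with about train_ratio going to training."""
    files = sorted(
        f for f in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, f))
    )
    # Shuffle with a fixed seed so the split is reproducible
    random.Random(seed).shuffle(files)
    n_train = round(len(files) * train_ratio)
    splits = {"training": files[:n_train], "validation": files[n_train:]}
    for split_name, names in splits.items():
        target = os.path.join(dest_root, split_name, label)
        os.makedirs(target, exist_ok=True)
        for name in names:
            shutil.move(os.path.join(src_dir, name), os.path.join(target, name))
    # Report how many files ended up in each split
    return {split_name: len(names) for split_name, names in splits.items()}
```

Calling `split_dataset("data/cat", "data", "cat")` would move the downloaded cat images into `data/training/cat` and `data/validation/cat`.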

Today’s DL models can obtain fairly good results with small data sets, in this case about 600 images for the training set and 200 for the validation set. On top of that, we use transfer learning, which improves the results significantly while also decreasing the training time. Furthermore, the DL framework used, fast.ai, offers built-in data augmentation, which changes the orientation of the images (flipping them horizontally, vertically, etc.) while keeping the content the same. This increases the effective size of the data set about 4 times (at runtime).
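
To make the idea of augmentation concrete, here is a toy sketch on a tiny “image” stored as nested lists. It is not how fast.ai implements it (the framework applies random transforms to real image tensors during training); the `augment` function is purely hypothetical and only shows how flips turn one image into four.

```python
def augment(image):
    """Return the original image plus three flipped copies.

    `image` is a list of rows, each row being a list of pixel values.
    """
    horizontal = [row[::-1] for row in image]        # mirror left-right
    vertical = [row[:] for row in image[::-1]]       # mirror top-bottom
    both = [row[::-1] for row in image[::-1]]        # same as a 180-degree rotation
    return [image, horizontal, vertical, both]


variants = augment([[1, 2], [3, 4]])
print(len(variants))  # one image became four
```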

PATH and data cleaning

The aforementioned DL framework, fast.ai, is built on PyTorch. It uses default paths for the training and validation sets, namely “data/training” and “data/validation”, which is why the “image download” cell uses the same defaults and creates one folder per search string in both the training and validation folders. By following these conventions, one can easily tell the image classifier object where to look for the data.


Many emotional movies, commercials, etc., ask the question “If you could tell your younger self anything, what would it be?” After about six months of practicing DL, the answer is that the part many people associate with DL, training a model, is really the easy part, while the hard part is first getting the data and then verifying and cleaning it.

For this project, this manifested as downloaded images that turned out not to be JPEG images, which is odd, because one specifies the wanted file type when downloading. There are most certainly ways of converting them to JPEG, but for lack of time, the easy fix was to delete all files of any other type, i.e. anything that imghdr does not identify as 'jpeg'.

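import imghdr
import os
for path in file_paths:
    for files in os.listdir(path):
        file_path = os.path.join(path, files)
        # imghdr looks at the file contents, not the extension
        if imghdr.what(file_path) != 'jpeg':
            os.remove(file_path)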
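
One caveat: the `imghdr` module was deprecated in Python 3.11 and removed in Python 3.13. On newer versions, a similar check can be sketched by looking at the file signature directly, since JPEG files start with the bytes FF D8 FF. The helper names `is_jpeg` and `remove_non_jpegs` below are hypothetical and not part of the project.

```python
import os

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these three bytes


def is_jpeg(file_path):
    """Return True if the file's first bytes match the JPEG signature."""
    with open(file_path, "rb") as fh:
        return fh.read(3) == JPEG_MAGIC


def remove_non_jpegs(folder):
    """Delete every regular file in folder that is not a JPEG.

    Returns the names of the removed files.
    """
    removed = []
    for name in sorted(os.listdir(folder)):
        file_path = os.path.join(folder, name)
        if os.path.isfile(file_path) and not is_jpeg(file_path):
            os.remove(file_path)
            removed.append(name)
    return removed
```

Returning the removed names makes it easy to see how much of the downloaded data was thrown away.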



Now that the data set is defined, separated and cleaned, the actual model training can start. Using the fast.ai framework makes this quite easy; four lines of code are all it takes:

tfms = tfms_from_model(arch, sz, aug_tfms=augmentation, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs, num_workers=1)
learn = ConvLearner.pretrained(arch, data, ps=0.4)
learn.fit(1e-3, 1)

Explaining these parameters would require a whole new article, but at a high level, this is what the code does:

  1. Define a transforms object, which as mentioned earlier, increases the data set at runtime
  2. Define an image classifier object, which holds the data set and applies the data transforming
  3. Define a learn object, which as mentioned earlier uses transfer learning
  4. Run the data through the model, i.e. train the model.

There you go: done, terminado, completo! The model is now trained and ready for use. However, if one wishes to get anywhere near the potential of this framework and the models, hyperparameter tuning (and more training) is the next step.

Hyperparameter tuning (and more training)


There are quite a large number of possible hyperparameters to optimize, but for this article, the focus was mainly on learning rate and dropout. Batch size and image size were also adjusted, but mainly to avoid CUDA out-of-memory errors (maxed-out GPU memory).

In order to find the optimal learning rate, the fast.ai framework offers a neat trick, the learning rate finder, which plots the relationship between different learning rates and their resulting loss. One quick thing to note is that this gets tricky for limited data sets like the one in this project: the learning rate finder plots a pretty volatile curve. However, a general rule of thumb is that 1e-2 is a good place to start; one can then train the model with various values, using that value as a reference point.
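
The idea behind a learning rate finder can be sketched on a toy problem: increase the learning rate exponentially from step to step, record the loss at each step, and stop once the loss blows up. The code below is a simplified illustration on a one-dimensional function, not fast.ai's implementation. The names `lr_range_test` and `suggest_lr` are hypothetical, and picking a rate one order of magnitude below the minimum-loss point is only a commonly used heuristic.

```python
def lr_range_test(grad_fn, loss_fn, w0, lr_start=1e-5, lr_end=10.0, steps=100):
    """Run gradient descent with an exponentially growing learning rate.

    Returns the learning rates tried and the loss observed at each one.
    """
    factor = (lr_end / lr_start) ** (1 / (steps - 1))
    lrs, losses = [], []
    w, lr, best = w0, lr_start, float("inf")
    for _ in range(steps):
        loss = loss_fn(w)
        if loss > 4 * best:      # the loss has blown up, so stop early
            break
        best = min(best, loss)
        lrs.append(lr)
        losses.append(loss)
        w -= lr * grad_fn(w)     # one gradient step at the current rate
        lr *= factor             # exponentially increase the rate
    return lrs, losses


def suggest_lr(lrs, losses):
    """Heuristic: take the rate at the lowest loss and divide it by 10."""
    return lrs[losses.index(min(losses))] / 10


# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3)
lrs, losses = lr_range_test(lambda w: 2 * (w - 3), lambda w: (w - 3) ** 2, w0=0.0)
print(suggest_lr(lrs, losses))
```

Plotting `losses` against `lrs` on a logarithmic x-axis gives the kind of curve described above; on a real, noisy data set the curve is far less smooth than on this toy problem.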

Furthermore, a pretrained model in fast.ai starts out with its pretrained layers frozen (the first two of the network's three layer groups), which essentially means that they will not be updated during training. One can unfreeze these in order to specialize the model to one’s data set, and possibly get a nicer learning rate curve.


Unfreezing can also lead to overfitting, which essentially shows up as training loss < validation loss, and this is where dropout comes in. Rule of thumb: when a model is overfitting, increase the dropout rate, and even try differential dropout (same principle as differential learning rates).
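
For readers unfamiliar with dropout, the sketch below shows what a dropout layer does during training: each activation is zeroed with probability p, and the surviving ones are scaled up so the expected sum stays the same (so-called inverted dropout, which is also how PyTorch scales its dropout layers). The `dropout` function here is a plain-Python illustration, not the fast.ai layer.

```python
import random


def dropout(activations, p=0.4, rng=None):
    """Inverted dropout: zero each value with probability p and scale
    the survivors by 1 / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


# With p=0.4, roughly 40% of the values are dropped on each call
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0], p=0.4, rng=random.Random(0)))
```

A higher p removes more activations per pass, which makes it harder for the network to memorize the training set.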

Lastly, the model is saved so that it is available to the fast.ai script from the image in the introduction.



The first model from MultiClassClassification, Resnet34, obtained an accuracy of about 82.7% (on ~1000 images per label). This is a fair result, considering the quality of the data (have you tried going to page >20 on Google Images?), the amount of data, and the fact that hyperparameter tuning was done manually. In a perfect world, one would have >10k images of each class, would not have to delete any of them, and would tune the hyperparameters with either Grid Search or Random Search.
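
For the curious, the difference between the two search strategies can be sketched in a few lines. Everything here is illustrative: `grid_search`, `random_search`, and `toy_score` are hypothetical names, and `toy_score` merely stands in for “train the model with these hyperparameters and return the validation accuracy”.

```python
import itertools
import random


def grid_search(evaluate, grid):
    """Try every combination in grid; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(evaluate, grid, n_trials=5, seed=0):
    """Try n_trials randomly sampled combinations; return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(lr, dropout):
    """Stand-in for a real training run; peaks at lr=1e-2, dropout=0.4."""
    return -abs(lr - 1e-2) - abs(dropout - 0.4)


grid = {"lr": [1e-3, 1e-2, 1e-1], "dropout": [0.2, 0.4, 0.6]}
print(grid_search(toy_score, grid))
print(random_search(toy_score, grid))
```

Grid search evaluates all nine combinations here, while random search only evaluates as many as `n_trials` allows, which matters when every evaluation means training a model.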

With that said, 82.7% is satisfactory for making a robot “able to see”, which is what part two will focus on: using the trained and saved model to classify images streamed from a web camera.

Written by Sindre E. de Lange.
