Training Tensorflow for free: Pet Object Detection API Sample Trained On Google Colab

May 16, 2018

IMPORTANT: The information in this article is dated and won’t work without much tinkering. For a much better version, please see this article

As you might know, Google generously offers everyone access to a free, reasonably powerful computer with a free GPU (!) in their Colaboratory project.

It is basically a free lunch! However, like every free lunch… it comes with a few strings attached. Here is a quick summary:

You access the computer through Jupyter Notebook. No ssh, no X.
You have the instance for 12 hours. After 12 hours everything on this computer is wiped clean. You can still request a new instance, but you have to set everything up again.
The number of GPUs is limited, so you might get “no available GPU”. Even if you get a GPU, it does not seem to be exactly the same as the regular NVIDIA ones: it has less memory.
The CPU is single core, although reasonably powerful.

Still, a gift horse is a gift horse, and if you are looking to train your own models and all you have at home are Raspberry Pis and Core 2 computers with 2 GB of memory, this is really a lifesaver.

On the plus side…

The virtual machine comes pre-installed with TensorFlow, Keras and OpenCV
It has about 11 GB of RAM
Did I mention it was free?

Now, the thing with training, especially for object detection… you will find that most articles neglect this part, or their samples do not work. The reason is that training involves lots of operations, and things change very slightly between versions. After trying to run several samples from articles and getting all sorts of errors, I decided to start with something simple:

Quick Start: Distributed Training on the Oxford-IIIT Pets Dataset on Google Cloud (but not on the cloud)

I mean, what could possibly be simpler? It’s an example right out of Tensorflow!

A week later I actually managed to do it. I have much less hair now… Generally speaking that was not too bad. However, the error messages I got were… not even useless. So every time something failed I had to either read the code or google the message to understand what the problem really was.

OK, so… on with the practicalities. You can find the Jupyter notebook here. I highly recommend using the notebook as the code formatting features here don’t handle long lines well.

Register for Colab

There are a few posts about how to do it. The process is actually really painless.

Here is the short version:

Go to https://colab.research.google.com
Sign in with your google account
That’s it! You are in! Upload my notebook using file->upload notebook
Be sure to select a runtime with GPU (runtime->change runtime type)
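
To confirm that the runtime really got a GPU, a quick check like the one below may help. This is only a sketch: it assumes the nvidia-smi tool is on the PATH whenever a GPU is attached, and the helper name gpu_summary is made up for this example.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one 'name, total memory' string per visible GPU, or [] if none."""
    # Assumption: nvidia-smi is installed whenever a GPU is attached.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = gpu_summary()
print(gpus if gpus else "No GPU visible: check runtime->change runtime type")
```

An empty list here usually means the runtime type is still set to CPU, or that no GPU was available when the instance was created.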

Install the prerequisites

Just the things that are missing:

!apt-get install protobuf-compiler python-pil python-lxml python-tk
!pip install Cython
!pip install jupyter
!pip install matplotlib

Clone the Tensorflow models repository

!git clone https://github.com/tensorflow/models.git

Clone the COCO repository and install the COCO object detection api (this is actually needed only for eval but anyways, we follow the instructions)

!git clone https://github.com/cocodataset/cocoapi.git
!cd cocoapi/PythonAPI; make; cp -r pycocotools /content/models/research/

Set the environment for all future operations. You need to run this part if you restart your runtime.

%cd /content/models/research
!mkdir train eval
%set_env PYTHONPATH=/content/models/research:/content/models/research/slim
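
Since this step has to be repeated after every runtime restart, a small sanity check can tell you whether it is still in effect. The sketch below only uses the paths from this article; the function names are invented for illustration.

```python
import os

RESEARCH_DIR = "/content/models/research"  # path used throughout this article


def expected_pythonpath(research_dir=RESEARCH_DIR):
    """PYTHONPATH value from the setup step: research dir plus its slim subdir."""
    return ":".join([research_dir, research_dir + "/slim"])


def environment_problems(research_dir=RESEARCH_DIR, environ=os.environ):
    """Return a list of human-readable problems; empty means ready to go."""
    problems = []
    if environ.get("PYTHONPATH") != expected_pythonpath(research_dir):
        problems.append("PYTHONPATH is not set as expected")
    for sub in ("", "slim", "train", "eval"):
        path = os.path.join(research_dir, sub)
        if not os.path.isdir(path):
            problems.append("missing directory: " + path)
    return problems


print(environment_problems() or "environment looks ready")
```

If it reports a problem, rerunning the setup cell above should fix it.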

Compile the model definitions

!protoc object_detection/protos/*.proto --python_out=.

Test that everything we need is installed

!python object_detection/builders/model_builder_test.py

Get the Oxford pets dataset

This is a lengthy part. If you plan to play with this dataset, it might be worthwhile to load it into your Google Drive, as transfer speed from Drive is much faster. Anyway, we follow the example:

!wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
!wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz
!tar -xvf images.tar.gz
!tar -xvf annotations.tar.gz

Build the TFRecord files

!python object_detection/dataset_tools/create_pet_tf_record.py \
 --label_map_path=object_detection/data/pet_label_map.pbtxt \
 --data_dir=`pwd` \
 --output_dir=`pwd`
!ls *.record

The ls *.record is there to show that the files that were created actually have different names from the expected output in the example.
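
If you prefer to do the same check from Python (for example to see the file sizes as well), something along these lines should work. It is a minimal sketch; list_records is a made-up helper name.

```python
import glob
import os


def list_records(directory="."):
    """Return (file name, size in MB) for every *.record file in directory."""
    paths = sorted(glob.glob(os.path.join(directory, "*.record")))
    return [(os.path.basename(p), os.path.getsize(p) / 1e6) for p in paths]


for name, size_mb in list_records():
    print("{:40s} {:10.1f} MB".format(name, size_mb))
```

Run it from /content/models/research, the same directory the ls command uses.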

Get the pre-trained model for transfer learning

!wget http://storage.googleapis.com/download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_11_06_2017.tar.gz
!tar -xvf faster_rcnn_resnet101_coco_11_06_2017.tar.gz
!cp faster_rcnn_resnet101_coco_11_06_2017/model.ckpt.* .

Get and edit the config file for the model

We copy the config file from the model directory and fix all the mistakes in it…

!cp object_detection/samples/configs/faster_rcnn_resnet101_pets.config .
!sed -i "s|PATH_TO_BE_CONFIGURED|/content/models/research|g" faster_rcnn_resnet101_pets.config
!sed -i "s|/content/models/research/pet_label_map.pbtxt|/content/models/research/object_detection/data/pet_label_map.pbtxt|g" faster_rcnn_resnet101_pets.config
!sed -i "s|/content/models/research/pet_train.record|/content/models/research/pet_train_with_masks.record|g" faster_rcnn_resnet101_pets.config
!sed -i "s|/content/models/research/pet_val.record|/content/models/research/pet_val_with_masks.record|g" faster_rcnn_resnet101_pets.config
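
To make it clearer what the four sed commands do, here is roughly the same logic in Python, applied to a string instead of the file. It is a sketch for illustration only: the sample line is hand-written, not copied from the real config, and patch_config is a made-up name.

```python
ROOT = "/content/models/research"

# (old, new) pairs applied in order, mirroring the four sed commands.
SUBSTITUTIONS = [
    ("PATH_TO_BE_CONFIGURED", ROOT),
    (ROOT + "/pet_label_map.pbtxt",
     ROOT + "/object_detection/data/pet_label_map.pbtxt"),
    (ROOT + "/pet_train.record", ROOT + "/pet_train_with_masks.record"),
    (ROOT + "/pet_val.record", ROOT + "/pet_val_with_masks.record"),
]


def patch_config(text):
    """Apply every substitution to the config text and return the result."""
    for old, new in SUBSTITUTIONS:
        text = text.replace(old, new)
    return text


# Hand-written sample line, only to show the effect of the substitutions.
sample = 'input_path: "PATH_TO_BE_CONFIGURED/pet_train.record"'
print(patch_config(sample))
```

The order matters: the placeholder has to be replaced first, because the later patterns match the already expanded paths. Running the function twice on the same text changes nothing the second time.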

Tensorboard

Running Tensorboard is a bit tricky on Colab. It can also cause resource exhaustion and make the machine hang. I do not recommend using it. However, if you do want to run it, here is how:

First, “install” ngrok. This needs to be done only once until your runtime gets wiped

! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
! unzip ngrok-stable-linux-amd64.zip

Run tensorboard in the background

get_ipython().system_raw('tensorboard --logdir /content/models/research --host 0.0.0.0 --port 6006 &')

Create a ngrok tunnel

get_ipython().system_raw('./ngrok http 6006 &')

Find the tunnel external interface URL

! curl -s http://localhost:4040/api/tunnels | python3 -c \
 "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
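
The same lookup can be done in a plain Python cell, which avoids the quoting headaches of the one-liner. This is a sketch under the assumption that ngrok is running and serving its local API on port 4040 as in the curl call; the function names and the sample URL are invented.

```python
import json
import urllib.request

NGROK_API = "http://localhost:4040/api/tunnels"  # same endpoint as the curl call


def first_public_url(payload):
    """Extract the first tunnel's public URL from the JSON text, or None."""
    tunnels = json.loads(payload).get("tunnels", [])
    return tunnels[0]["public_url"] if tunnels else None


def fetch_public_url(api_url=NGROK_API):
    """Query the local ngrok API; this only works while ngrok is running."""
    with urllib.request.urlopen(api_url, timeout=5) as response:
        return first_public_url(response.read().decode("utf-8"))


# Parsing demo on a hand-written payload (the URL is a made-up placeholder).
print(first_public_url('{"tunnels": [{"public_url": "https://example.ngrok.io"}]}'))
```

In the notebook you would call fetch_public_url() a few seconds after starting ngrok. A result of None means the tunnel is not up yet.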

Train the model

At last!!!! We run the training in the background so we can check on its progress

!mkdir -p /content/log
get_ipython().system_raw('python object_detection/train.py \
 --logtostderr \
 --pipeline_config_path=/content/models/research/faster_rcnn_resnet101_pets.config \
 --train_dir=/content/models/research/train \
 > /content/log/tb.log 2>&1 &')

Now, the Tensorflow train script runs forever and does not output anything to the console while doing that. You can either use Tensorboard to monitor its progress or look at the ./train subdirectory and see the checkpoint files written. For some reason only “loss” records were written to the event file and not accuracy. I have no idea why.
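
A third option is to peek at the end of the log file. The sketch below assumes the training output ends up in /content/log/tb.log, which is what the command above is meant to do; tail is a small helper written for this example.

```python
import os
from collections import deque


def tail(path, n=20):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, "r", errors="replace") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


LOG = "/content/log/tb.log"  # log path from the training command
if os.path.exists(LOG):
    print("\n".join(tail(LOG)))
else:
    print("No log file yet at", LOG)
```

The deque with maxlen keeps only the last n lines in memory, so this stays cheap even when the log grows large.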

If you decide not to use Tensorboard, the train subdirectory contains files like these:

-rw-r--r-- 1 root root 441468952 May 10 07:47 model.ckpt-16192.data-00000-of-00001
-rw-r--r-- 1 root root 40519 May 10 07:47 model.ckpt-16192.index
-rw-r--r-- 1 root root 8827824 May 10 07:47 model.ckpt-16192.meta
-rw-r--r-- 1 root root 441468952 May 10 07:57 model.ckpt-16991.data-00000-of-00001
-rw-r--r-- 1 root root 40519 May 10 07:57 model.ckpt-16991.index
-rw-r--r-- 1 root root 8827824 May 10 07:58 model.ckpt-16991.meta
-rw-r--r-- 1 root root 441468952 May 10 08:07 model.ckpt-17807.data-00000-of-00001
-rw-r--r-- 1 root root 40519 May 10 08:07 model.ckpt-17807.index
-rw-r--r-- 1 root root 8827824 May 10 08:08 model.ckpt-17807.meta
-rw-r--r-- 1 root root 441468952 May 10 08:17 model.ckpt-18616.data-00000-of-00001
-rw-r--r-- 1 root root 40519 May 10 08:17 model.ckpt-18616.index
-rw-r--r-- 1 root root 8827824 May 10 08:17 model.ckpt-18616.meta
-rw-r--r-- 1 root root 441468952 May 10 08:27 model.ckpt-19421.data-00000-of-00001

The train job writes a checkpoint every ten minutes. The nnnnn in model.ckpt-nnnnn.index tells you which step it is. For me it did 5000–6000 steps per hour. I stopped the job after 18616 steps; training it for 40000 steps would probably be wiser. You have 12 hours, so assuming setup and wrap-up take 1 hour, you have 11 hours, which is enough for almost 60k steps.
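
The time budget is easy to redo for your own run. As a worked example, the snippet below takes the step numbers from the file listing above and assumes the checkpoints are exactly ten minutes apart.

```python
# Step numbers of the five checkpoints in the listing, written ten minutes apart.
steps = [16192, 16991, 17807, 18616, 19421]
minutes_between_checkpoints = 10

elapsed_minutes = minutes_between_checkpoints * (len(steps) - 1)
steps_per_hour = (steps[-1] - steps[0]) * 60 / elapsed_minutes

training_hours = 11  # 12-hour instance minus about 1 hour of setup and wrap-up
total_steps = steps_per_hour * training_hours

print("steps per hour: {:.0f}".format(steps_per_hour))
print("steps in {} hours: {:.0f}".format(training_hours, total_steps))
```

For this particular stretch of the run the rate comes out at roughly 4,800 steps per hour, or about 53,000 steps in 11 hours, so it is safer to treat the 60k figure as an upper bound.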

To see the loss values without using Tensorboard, use (replace the XXXXXXXXX and YYYYYYYYY with the proper values from the most recent file in the train directory):

import tensorflow as tf
import re

for event in tf.train.summary_iterator('train/events.out.tfevents.XXXXXXXXXX.YYYYYYYYYY'):
    for value in event.summary.value:
        if value.tag == 'Losses/Loss/RPNLoss/objectness_loss':
            if value.HasField('simple_value'):
                print(value.simple_value)

Or use re.search with 'loss' on value.tag to find all loss events.
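
The re.search variant could look like the sketch below. It is shown on a plain list of tag names so it runs without TensorFlow; only the first tag comes from this article, the other two are hypothetical examples.

```python
import re


def is_loss_tag(tag):
    """True if the summary tag mentions 'loss' anywhere, ignoring case."""
    return re.search("loss", tag, re.IGNORECASE) is not None


# Only the first tag is taken from this article; the others are hypothetical.
tags = [
    "Losses/Loss/RPNLoss/objectness_loss",
    "Losses/TotalLoss",
    "global_step/sec",
]
print([tag for tag in tags if is_loss_tag(tag)])
```

In the event loop, the exact comparison on value.tag would be replaced by is_loss_tag(value.tag). Ignoring case is a choice made here so that tags spelled with a capital L are picked up too.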

Prepare and download the trained models

Convert the last checkpoint into a model (replace NNNNN with the last checkpoint in the train directory):

!rm -r exported_graphs; mkdir exported_graphs
!export CHECKPOINT_NUMBER=NNNNN; python object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path faster_rcnn_resnet101_pets.config \
--trained_checkpoint_prefix train/model.ckpt-${CHECKPOINT_NUMBER} \
--output_directory exported_graphs
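
Finding NNNNN by eye works, but it can also be read from the file names. A minimal sketch, assuming the checkpoint files follow the model.ckpt-NNNNN.index pattern shown earlier (latest_checkpoint_step is a made-up helper):

```python
import os
import re

CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(file_names):
    """Return the highest step among model.ckpt-N.index names, or None."""
    matches = (CKPT_INDEX.match(name) for name in file_names)
    steps = [int(m.group(1)) for m in matches if m]
    return max(steps) if steps else None


train_dir = "train"  # relative to /content/models/research, as in this article
if os.path.isdir(train_dir):
    print(latest_checkpoint_step(os.listdir(train_dir)))
```

Keying on the .index file is deliberate: in the listing earlier, step 19421 only has its .data file, which may mean that checkpoint was still being written, and such a step is skipped here.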

Zip it

!zip -r exp_g.zip exported_graphs

As the file is very big, Colab won’t let you download it using its snippets, so you have to use Google Drive:

# Install the PyDrive wrapper & import libraries.
# This only needs to be done once in a notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Now copy the file to your drive

# Create & upload a file.
uploaded = drive.CreateFile({'title': 'exp_g.zip'})
uploaded.SetContentFile('exp_g.zip')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))

From Drive, it should be easy to download it to your computer and test it. You can also test it on Colab. The GitHub repository contains a simple program to test it, named test1.py (creative naming is my middle name!). Clone the file into the working directory and run it on some image of pets you wget from somewhere. The utility cells at the end of the notebook have the code needed to view out.png.

That’s all for today! This is my first ever post, so any comments and constructive criticism will be most welcome, as well as claps….

If this post works well, I’ll write another one that uses transfer learning on my own set on the smaller and faster mobilenet SSD!