Manage your Image Dataset with Google’s AutoML

Bastian Schoettner, REWE Digital
Sep 6, 2019

Over the course of the last few months, we worked on multiple computer vision projects, and one major question popped up with every project: how do we manage the dataset? That includes topics such as how to format labels and where to store images.

Moving training to the cloud is unavoidable at this point to achieve reasonable training times. Since we decided to work with the Google Cloud Platform, the only solution was to store files in a Storage bucket. This, too, is not optimal, because there is no preview of images to evaluate the dataset on the go. Additionally, label files have to be maintained with custom tooling.

After a few projects, we felt the need for a dedicated dataset management tool and came across Google’s labeling tools. With most of our data already labeled, we had no use for Data Labeling. The other option is Google’s AutoML Vision. Its main purpose is to make model training easily accessible, but it also offers functionality to manage datasets via the browser. You can add (labeled and unlabeled) images, preview them, and label them. Data is split into training, evaluation, and test sets automatically if you do not assign the sets manually. This allows researchers and engineers to easily evaluate datasets at any point in time, and people with no ML knowledge can maintain those datasets as well. This is especially helpful if you have datasets that are not suitable for crowd-labeling, but only for internal use.

However, what if you don’t want to train/deploy with AutoML, but only use the dataset management part? Of course you can do that! We will walk through an example of how to train a custom single-label image classification model with a dataset managed by AutoML. All code can be found on GitHub: https://github.com/ri-rewe-digital/automl-training.

Export your dataset from AutoML

We assume that you already have an AutoML Vision dataset for image classification. First, we need to create a service account that can access both AutoML and Storage. Store the key file in a location of your choice.

Instead of exporting the dataset via the web UI every time, we want to automate this process using a REST interface. Fortunately, Google provides a client library in several languages that allows us to do so.

Export a dataset from AutoML
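A minimal sketch of how this could look with the google-cloud-automl (v1beta1; API details vary between library versions) and google-cloud-storage client libraries. The project, dataset, and bucket identifiers, as well as the key file path, are placeholders for your own values.

```python
# Sketch: export an AutoML dataset as CSV and download it. All identifiers
# below are placeholders, not values from the original article.
import os

from google.cloud import automl_v1beta1 as automl
from google.cloud import storage

# Point the client libraries at the service account key created earlier.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account-key.json"

PROJECT_ID = "my-gcp-project"      # placeholder
DATASET_ID = "ICN1234567890"       # placeholder AutoML dataset id
BUCKET_NAME = "my-dataset-bucket"  # placeholder
EXPORT_PREFIX = "automl-export"

# Trigger the export; AutoML writes one or more CSV files into the bucket.
client = automl.AutoMlClient()
dataset_name = client.dataset_path(PROJECT_ID, "us-central1", DATASET_ID)
output_config = {
    "gcs_destination": {"output_uri_prefix": f"gs://{BUCKET_NAME}/{EXPORT_PREFIX}"}
}
client.export_data(dataset_name, output_config).result()  # block until done

# Download the exported CSV file(s).
storage_client = storage.Client()
bucket = storage_client.bucket(BUCKET_NAME)
for blob in bucket.list_blobs(prefix=EXPORT_PREFIX):
    if blob.name.endswith(".csv"):
        blob.download_to_filename("dataset.csv")
```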

This exports your dataset as a CSV file into your bucket and downloads it after the export is complete. The CSV file has three columns: the set (‘TRAIN’, ‘VALIDATION’, ‘TEST’), the filename (the path to the file in the bucket), and the label (formatted as a string).

Create dataset iterators

Next, we read the CSV file into a Pandas dataframe and convert its labels to numeric values. This is required to later one-hot encode the labels. Pandas’ category dtype allows converting the string labels to numbers.

Read an exported dataset with Pandas
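A minimal sketch of this step; the column names follow the export format described above, and dataset.csv is the file downloaded in the previous snippet.

```python
import pandas as pd

# The exported CSV has no header row, so we name the columns ourselves.
df = pd.read_csv("dataset.csv", names=["set", "path", "label"])

# Convert the string labels to numeric codes for later one-hot encoding.
df["label"] = df["label"].astype("category").cat.codes
label_count = df["label"].nunique()
```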

To train, evaluate, and test, we split the dataframe into separate iterators. Evaluation and test have different requirements and are created with a slightly different configuration.

Create a TensorFlow Data API iterator
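A sketch of the input pipeline under these assumptions: BATCH_SIZE is a value of your choosing, and one_hot_encode and load_image are the pre-processing functions defined in the next two snippets.

```python
import tensorflow as tf

BATCH_SIZE = 32  # placeholder

def create_dataset(dataframe, training=True):
    dataset = tf.data.Dataset.from_tensor_slices((
        dataframe["path"].values,
        # Cast the Pandas category codes to int32 for tf.one_hot.
        dataframe["label"].values.astype("int32"),
    ))
    if training:
        # Shuffle and repeat only the training set; evaluation and test
        # sets are iterated in order and only once per pass.
        dataset = dataset.shuffle(buffer_size=len(dataframe))
        dataset = dataset.repeat()
    dataset = dataset.map(one_hot_encode,
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.map(load_image,
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.batch(BATCH_SIZE)
    return dataset.prefetch(tf.data.experimental.AUTOTUNE)
```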

We extract the pre-processing logic into separate functions to achieve better readability. At this point we need two additional pre-processing steps. The first one is one-hot encoding of the labels,

One-hot encode numeric labels
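A minimal sketch of the encoding step; label_count comes from the Pandas snippet above.

```python
def one_hot_encode(path, label):
    # Turn the numeric label into a one-hot vector of length label_count.
    return path, tf.one_hot(label, label_count)
```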

and the second step consists of image loading, normalization, and resizing.

Load image files
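A sketch of the loading step, assuming JPEG images and Xception’s default 299×299 input size.

```python
IMAGE_SIZE = 299  # Xception's default input size

def load_image(path, label):
    image = tf.io.read_file(path)  # also works with gs:// paths
    image = tf.image.decode_jpeg(image, channels=3)  # assumes JPEG input
    image = tf.image.convert_image_dtype(image, tf.float32)  # scale to [0, 1]
    image = (image - 0.5) * 2.0  # scale to [-1, 1], as Xception expects
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
    return image, label
```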

This is only very basic pre-processing; it is recommended to include additional functionality like augmentation. We now have a re-initializable iterator with shuffle, repeat, and prefetch.

Since tf.io.read_file can access images from Google buckets, we do not download the whole image set before we start processing. Keep in mind that this is only to keep the code shorter for this tutorial. Streaming data is very slow compared to pre-downloading all images!

Build your model

To simplify the code, we apply transfer learning on a pre-trained model provided by Keras.

Build a Xception classification model with transfer learning
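A minimal sketch of the model; note the use of tf.keras rather than plain Keras, and that freezing the base model is one common transfer-learning choice, not necessarily the article’s exact configuration.

```python
def build_model(num_classes):
    base_model = tf.keras.applications.Xception(
        weights="imagenet",
        include_top=False,  # drop the original dense/softmax top layer
        input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
    )
    base_model.trainable = False  # freeze the pre-trained weights

    # Add a new classification head for our own labels.
    x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs=base_model.input, outputs=outputs)
```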

As the base model we use Xception, pre-trained on ImageNet, but we do not include the top layer (a dense layer with a softmax activation). You can replace this model with any other you prefer. The only limitation is that you have to use Keras from TensorFlow (tf.keras.*) instead of plain Keras. The reason is that plain Keras does not work with iterators out of the box; you would have to wrap the iterator in a generator, which is not covered here.

Train your model

With data and model ready, we can start training. First, load the CSV file, split the data, and create your three set iterators.

Create train, evaluation and test iterator
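A sketch of this step, splitting the dataframe by the exported set column and reusing the create_dataset helper from above.

```python
# Split the dataframe by the exported set column.
train_df = df[df["set"] == "TRAIN"]
validation_df = df[df["set"] == "VALIDATION"]
test_df = df[df["set"] == "TEST"]

# Only the training set shuffles and repeats.
train_dataset = create_dataset(train_df, training=True)
validation_dataset = create_dataset(validation_df, training=False)
test_dataset = create_dataset(test_df, training=False)
```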

Then create and compile the model and add some basic callbacks to support and visualize the process. We use a regular Adam optimizer with categorical cross-entropy as the loss function. This is a very basic setup that mostly leads to reasonable results without much tweaking.

Create and compile your model
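A minimal sketch with two common callbacks, TensorBoard and checkpointing; the log directory and checkpoint path are placeholders.

```python
model = build_model(label_count)
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir="logs"),          # placeholder path
    tf.keras.callbacks.ModelCheckpoint("model.h5",           # placeholder path
                                       save_best_only=True),
]
```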

And finally train and evaluate the model (both done in model.fit). After training, we test it (model.evaluate) with the test iterator.

Since Keras cannot infer the iterator size itself, set the steps for each iterator manually by computing them based on your chosen batch size.

Train, evaluate and test the Keras model
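A sketch of the final step; the step counts are derived from the split sizes and the batch size, and the epoch count is a placeholder.

```python
# Keras cannot infer these from the iterators, so compute them manually.
train_steps = len(train_df) // BATCH_SIZE
validation_steps = len(validation_df) // BATCH_SIZE
test_steps = len(test_df) // BATCH_SIZE

# Train and evaluate in one call ...
model.fit(
    train_dataset,
    epochs=10,  # placeholder
    steps_per_epoch=train_steps,
    validation_data=validation_dataset,
    validation_steps=validation_steps,
    callbacks=callbacks,
)

# ... then test on the held-out test set.
model.evaluate(test_dataset, steps=test_steps)
```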

You have now trained a custom model with your dataset managed by AutoML.

Conclusion

Managing datasets in AutoML Vision offers an easy-to-use way to maintain datasets in the cloud. However, AutoML’s currently limited training functionality did not satisfy all our requirements. Therefore, we came up with a way to export the dataset to train custom models.

The code here only provides a simplistic version of what you can do. Parallel pre-downloading of images instead of streaming, caching, and image augmentation are just a few improvements you can add.

References

Code sections were adapted from publicly available tutorials:

https://www.tensorflow.org/beta/tutorials/load_data/images
https://www.tensorflow.org/guide/datasets
https://cs230-stanford.github.io/tensorflow-input-data.html

Photo by Jon Tyson on Unsplash
