Building a Serverless Dataset and AI Model Management Tool

11 min read · Nov 10, 2018

In my last article, I illustrated how we built a completely serverless application for brand detection in videos on the Google Cloud Platform (GCP). This time, I outline how we added an admin tool that lets the operators of the app manage the brand detection engine itself. I would have loved to call it “Episode II: The Server Strikes Back”. But it doesn’t. This is as serverless as the main application.

Motivation

Which serverless GCP product haven’t we used so far? Ah right, Cloud Datastore! So let’s do something with Datastore, shall we?

Admittedly, the reasoning behind that admin tool was somewhat different: When iterating through the training of our brand detection model based on the TensorFlow Object Detection API, we relied on a very hands-on approach. Which is in this case just a euphemism for a lot of manual work…

The Object Detection API gave us nice and almost easy access to pre-trained neural networks and hence a jump-start on our own brand detection model. However, the training of the models still required a lot of manual effort: managing local conda environments, annotating images by drawing bounding boxes around logos, creating TFRecords data files on our local machines, editing Object Detection API config files, copying files to and from Cloud Storage, launching training and evaluation jobs on Cloud ML Engine, copying files back and forth again, exporting trained models, copying files back and forth again and again, and finally deploying models as a service on Cloud ML Engine.

All in all, a terrible mix of local and cloud-based data storage and code execution, and a lot of typing Cloud SDK commands such as gsutil -m cp ... and gcloud ml-engine jobs submit .... I even created a list of all the necessary commands and their correct order. At some point, I had written a couple of bash scripts to automate at least some of the steps, but still: unacceptably tedious.

The worst thing besides the lousy user experience was, however, the mess it created. If, like me, you’re not endowed with accountant-esque discipline and hence do not embrace thorough bookkeeping, you’ll quickly find yourself juggling a lot of different datasets and model checkpoints, losing track of which model was trained with which dataset, based on which checkpoint, for what reason, and producing what results. Even worse, as we were lacking proper checkpoint management, we found ourselves overwriting older model checkpoints with newer ones. No reproducibility, no traceability. And certainly far away from best practice.

As a GCP evangelist (yeah, I hate that expression, too, but look: pun ahead!), in such situations, I often ask myself WWGD (what would Google do)? Well, I guess they would probably let machines do the annoying stuff and automate humdrum things away. And wrap everything into a proper app. So, what would it take to app-ify this whole process of handling data and models?

The Requirements

To be more precise, we wanted our admin tool to allow us to

upload and annotate training data;
build TFRecords datasets;
launch training jobs on Cloud ML Engine;
deploy models to the Cloud ML Engine service;
keep track of the model checkpoint history and hierarchy;
have a cloud-only solution with a browser-based UI.

Our main app already had an Angular frontend and a Python backend running on App Engine. Also, we had all the Python code necessary for the TensorFlow-related tasks like the dataset generation. So it all came down to deploying our existing code in a reasonable way, adding a database solution, using the APIs to interact with Cloud ML Engine and other GCP components, and integrating the admin tool into the main app. That seemed pretty doable.

We decided to use Datastore as the database because it is ready to use a.k.a. serverless and perfectly suited since all we want to store is basically key-value pairs. Let’s have a quick look at the data model before we go more into the details of some of the components.

The Data Model

Datastore is very much at the centre of the tool. Every object we manage — brands, training images, label maps, datasets, models — is represented by an entity, organised in six kinds:

Brands and Logos: First, we need a notion for the objects we want to detect. In our case, these are brands or, to be more precise, logos. Oftentimes, a brand has more than one logo, e.g. a word mark and a figurative mark or even multiple logos. Therefore, we distinguish between logos (the things we actually train our models to detect) and brands (what our analytics are based on) and introduce a one-to-many parent-child relation in Datastore.
Images: By design, neural networks are trained by showing them a lot of examples to learn from. In our case, each training example consists of an image file and an annotation defining the logos (IDs from our Logos kind) and their locations in the image (coordinates of the bounding boxes). Annotations are represented as an array of embedded entities (think of a JSON object). The image files themselves go to Cloud Storage.
LabelMaps: Since classifier models work with numeric classes rather than labels (“Adidas”), we use label maps to — surprise, surprise — map labels to class ID integers. In our Datastore kind, a label map is represented as an entity containing an array of logo IDs. The array index plus 1 corresponds to the class ID. We can create label maps for every set of brands we want to train a separate model with, e.g. one for the sponsors of a football league and another one for alpine skiing.
Datasets: To ingest the training images and corresponding annotations efficiently into a TensorFlow model, we want to bundle them in a single TFRecords dataset file. Of course, we only include images that contain logos relevant for a particular training run. We’ve defined such a set of relevant logos just before when creating a label map. By including the label map ID in the dataset entity, we keep track of which logos the corresponding TFRecords file contains.
ModelCheckpoints: One advantage of using the TensorFlow Object Detection API is that you don’t have to build and train your object detection model from scratch. Instead, you start with a pre-trained model and fine-tune its weights with your data, which saves training time. If you later want to improve your model further, e.g. because you have new data, you can start from your last model run. Hence, in our case, every model has an ancestor. We mirror this model hierarchy in Datastore using a one-to-many parent-child relation. Therefore, every ModelCheckpoints entity has a parent property along with a property for the ID of the dataset that was used for this particular training run. The TensorFlow model checkpoint files themselves are stored in Cloud Storage, of course.

Datastore connects the bits and pieces. The model checkpoint “knows” which dataset it was trained with; the dataset knows which label map it is based on; the label map knows which logos it contains. When a video is being analysed based on a particular model in the main app, it’s easy to reconstruct how the numeric class IDs returned by the model have to be mapped to logos and hence which brand names have to be written to BigQuery.
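To make that chain a bit more tangible, here is a minimal Python sketch of the lookup. Plain dictionaries stand in for the Datastore entities, and all IDs, names, and helper functions are made up for illustration. Only the rule that the class ID equals the array index plus 1 comes from the label map description above.

```python
# Plain dicts stand in for Datastore entities; all IDs and names are illustrative.
BRANDS = {"brand-adidas": "Adidas", "brand-rivella": "Rivella"}

# Each logo points to its parent brand (one brand, many logos).
LOGOS = {
    "logo-adidas-wordmark": "brand-adidas",
    "logo-adidas-stripes": "brand-adidas",
    "logo-rivella": "brand-rivella",
}

# A label map is an ordered array of logo IDs.
LABEL_MAP = ["logo-adidas-wordmark", "logo-adidas-stripes", "logo-rivella"]


def class_id_for_logo(label_map, logo_id):
    """Class ID = array index + 1, so class IDs start at 1."""
    return label_map.index(logo_id) + 1


def brand_for_class_id(label_map, class_id):
    """Map a numeric class ID returned by the model back to a brand name."""
    if not 1 <= class_id <= len(label_map):
        raise ValueError(f"class ID {class_id} is not in this label map")
    logo_id = label_map[class_id - 1]
    return BRANDS[LOGOS[logo_id]]
```

With this toy data, both Adidas logos resolve to the same brand name, which is the reason for keeping logos and brands as separate kinds.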

Here is how Datastore is connected to the rest of the admin tool architecture:

Let’s now have a closer look at some of the components. I skip the brand/logo and label map management components as they are quite trivial; it’s really just more or less creating Datastore entities.

Image Upload and Annotation

As mentioned in the requirements, the image management component is supposed to do more than just upload images to Cloud Storage and list them. It comes with fully fledged annotation functionality so we don’t have to use third-party tools on our local machines anymore. Instead, we can select and display images after the upload, draw bounding boxes around logos in the browser, and define which logo each box contains (by assigning a logo ID to it). It also allows us to modify or delete existing bounding boxes.

Image annotation tool (training image taken from a YouTube video)

Annotating images is still a heavily manual task of course, but at least it is now integrated into one single tool and process. Unlike before, it uses tool-wide consistent IDs for the logo classes. Also, images are queryable, e.g.: Which images have no annotations yet? How many examples with the Adidas logo do we have? As a nice side effect, we can easily add examples of the logos to our brand/logo management section by just querying the bounding box information from Datastore and cropping the images accordingly:

Brand and logo management with logo samples cropped from training images
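The two example questions and the cropping step can be sketched in a few lines of Python. This is not our actual backend code: the records are plain dicts rather than Datastore entities, the field names are invented, and the sketch assumes that bounding boxes are stored as fractions of the image size.

```python
# Illustrative image records; in the tool these would be Datastore entities.
# Assumption: bounding boxes are stored as fractions of the image size.
IMAGES = [
    {"id": "img-1", "width": 1280, "height": 720, "annotations": [
        {"logo_id": "logo-adidas", "xmin": 0.25, "ymin": 0.5, "xmax": 0.5, "ymax": 0.75},
    ]},
    {"id": "img-2", "width": 640, "height": 480, "annotations": []},
]


def unannotated(images):
    """IDs of images that have no annotations yet."""
    return [img["id"] for img in images if not img["annotations"]]


def count_examples(images, logo_id):
    """Number of bounding boxes that carry the given logo ID."""
    return sum(
        1 for img in images for ann in img["annotations"] if ann["logo_id"] == logo_id
    )


def crop_box(image, annotation):
    """Pixel box (left, upper, right, lower) for cropping a logo sample."""
    return (
        round(annotation["xmin"] * image["width"]),
        round(annotation["ymin"] * image["height"]),
        round(annotation["xmax"] * image["width"]),
        round(annotation["ymax"] * image["height"]),
    )
```

The tuple order (left, upper, right, lower) is the one Pillow's `Image.crop` expects, so the result could be passed straight to an image library to cut out the sample.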

So far, so good, but I reckon that a serverless solution somehow lacks coolness without Cloud Functions. Luckily, we could mitigate this flaw: To ensure every image conforms to the requirements of the TensorFlow models (in particular, they need to have three colour channels), we extended the upload process with a Cloud Function that automatically converts every image to RGB if necessary. At the same time, it also extracts metadata like the dimensions.

After deploying this, we thought it might be fun to use Google’s Vision API to add a bit of context to every image (usually sports scenes). Therefore, we wrote another Cloud Function that sends every image through the Vision API and adds the detected labels (e.g. “ice hockey”, “skiing”, or… “galliformes”. Er, what?) to the Datastore entity. We’re not sure yet how useful this feature really is, but since almost everything on GCP is just an API request away and writing a Cloud Function is super easy and fun, why not? Let’s just try it out and see whether it proves to be useful. And so, images are now searchable by brand and context. Summarising, here’s what the whole image upload and processing pipeline looks like:

Dataset Generation

Once the annotated images are ready together with the label maps that define sets of logos to train a model with, we can pack the training data into a TFRecords file. First, we query the information we need from Datastore: We start with the ID of the label map of interest, query its constituents (logo IDs) and then the images containing these logos together with the bounding box information. Second, we pull the required image files from Cloud Storage, pack everything into one TFRecords file, and ship it back to Cloud Storage. All with a single click on a button on the frontend and with the backend running recycled Python code from before when we still did all of this on our local machines.
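The selection step (the first of the two) might look roughly like the sketch below. It works on plain dicts instead of Datastore query results, the record layout is an assumption, and the actual serialisation into a TFRecords file with TensorFlow is left out.

```python
# Illustrative records; the field names are assumptions, not the real schema.
IMAGES = [
    {"id": "img-1", "annotations": [
        {"logo_id": "logo-a", "box": (0.1, 0.2, 0.3, 0.4)},
        {"logo_id": "logo-x", "box": (0.5, 0.5, 0.6, 0.6)},
    ]},
    {"id": "img-2", "annotations": [{"logo_id": "logo-x", "box": (0.0, 0.0, 0.2, 0.2)}]},
    {"id": "img-3", "annotations": [{"logo_id": "logo-b", "box": (0.3, 0.3, 0.9, 0.9)}]},
]
LABEL_MAP = ["logo-a", "logo-b"]


def select_examples(images, label_map):
    """Keep images with at least one relevant logo and attach class IDs."""
    # Class ID = position in the label map + 1.
    class_ids = {logo_id: index + 1 for index, logo_id in enumerate(label_map)}
    examples = []
    for image in images:
        boxes = [
            {**ann, "class_id": class_ids[ann["logo_id"]]}
            for ann in image["annotations"]
            if ann["logo_id"] in class_ids
        ]
        if boxes:  # skip images without any logo from this label map
            examples.append({"image_id": image["id"], "boxes": boxes})
    return examples
```

Note that boxes for logos outside the label map are dropped even when the image itself is kept, so the resulting dataset only ever contains classes the model is supposed to learn.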

Model Training and Deployment

Having the datasets at hand, we are now ready to perform the actual model training. As mentioned, our models have a hierarchy, forming a tree.

Every tree starts with a pre-trained architecture from the Object Detection API model zoo at its root. Uploading such a model creates a representation of the model in the form of a Datastore entity as well as a directory on Cloud Storage with the checkpoint files. From here, one can create a child model by forking it. Every fork creates another model entity and a copy of the checkpoint files in its own directory. To train it, click the respective button, choose a dataset, set the training parameters, give it a description, and the training can start. The backend will write the necessary Object Detection API config files to Cloud Storage and then launch a training job on Cloud ML Engine (CMLE). Depending on the machine types requested, the model architecture, and how much new information the dataset brings, the job may run for a couple of hours. Therefore, the status of the job is queried periodically via the CMLE API to check whether it is still running or has completed.
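The periodic status check boils down to a polling loop like the following sketch. `get_state` is a hypothetical callable that would wrap the CMLE API request in a real backend, and the interval and poll limit are arbitrary. A blocking loop is used here purely for illustration; in a web backend each check would more likely be triggered by a separate request or a scheduled task.

```python
import time

# Job states reported by the Cloud ML Engine jobs API that end a job.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, interval_seconds=60, max_polls=1000, sleep=time.sleep):
    """Poll get_state() until the job reaches a terminal state.

    get_state is a hypothetical callable returning the job's current state
    string; in a real backend it would query the CMLE jobs API.
    """
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(interval_seconds)  # still queued, preparing, or running
    raise TimeoutError("job did not finish within the polling budget")
```

Passing `sleep` as a parameter keeps the loop easy to try out without actually waiting.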

Not happy yet with the accuracy of the model? Just fork again and continue with the training. Since every forked model starts with the weights of its parent, it can build upon what its ancestors have learned and, therefore, usually converges faster.
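Because every model entity stores its parent and the dataset it was trained with, the full history of a model can be reconstructed by walking up the tree. A minimal sketch, with dicts in place of ModelCheckpoints entities and invented IDs and property names:

```python
# Illustrative ModelCheckpoints entities: each has a parent and a dataset ID.
MODELS = {
    "zoo-model": {"parent": None, "dataset": None},
    "run-1": {"parent": "zoo-model", "dataset": "dataset-football-v1"},
    "run-2": {"parent": "run-1", "dataset": "dataset-football-v2"},
}


def lineage(models, model_id):
    """Return the chain of model IDs from the root down to model_id."""
    chain = []
    current = model_id
    while current is not None:
        chain.append(current)
        current = models[current]["parent"]
    return list(reversed(chain))


def training_history(models, model_id):
    """Datasets used along the lineage, oldest first (the root has none)."""
    return [
        models[m]["dataset"] for m in lineage(models, model_id) if models[m]["dataset"]
    ]
```

For `run-2`, this yields the pre-trained root, the first fork, and the second fork, together with the two datasets they were trained on.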

Once the model has reached satisfactory accuracy, one can deploy it so that it can be used in production. With a single click, the backend takes care of exporting the model to the right format and of triggering the deployment on CMLE. This actually creates a service on CMLE that can be used by the main application via simple API requests.

Conclusion

In contrast to what we had before, most of the hassle has now disappeared behind a frontend with some buttons, input fields, and lists. Tedious manual steps have been replaced by a couple of clicks, which saves us time and nerves. Everything now happens in the cloud rather than partly on our laptops. And we have gained structure and traceability. At the same time, being serverless makes the tool very cost-effective to operate.

And the development costs? Well, the most time-consuming part was the frontend. Everything backend — designing the data model, modifying and deploying the existing code, and adding API requests to it — was a matter of a couple of days. We see it as yet another great example of how a serverless cloud infrastructure helps us develop solutions quickly and with little effort.

Epilogue

But, but… did we do all of this just for this one logo detection use case? Isn’t that a bit over-engineered? Well, yes and no. Of course, it felt great to take the hassle out of the whole data and model handling process. But we also had two other things in mind.

First, we planned to introduce a completely new type of object detection model. One that takes a two-step approach, separating object localisation from classification. The first step would just learn to distinguish the logos from the background, basically placing bounding boxes around what seems to be a logo. For each of these, the second step would then calculate a feature representation and compare it to a set of known feature representations to classify it as one particular logo. This has the advantage that you can handle untrained logos way better because it allows ex-post classification in case the classifier step couldn’t find a match among the known logos.

Say you have a video featuring the Rivella logo, which our model has never seen before (probably like you, unless you are Swiss). The localisation model would still recognise it as a logo because it has the abstract features of a logo. It just doesn’t know which logo, so it saves its feature representation and flags it as an unclassified logo. The admin can then go through all video frames with unclassified logos and assign the right logo to the proposed bounding box. Not only for this frame but also for all past and future ones, since different examples of the same logo should produce very similar feature representations. See how this integrates nicely into the admin tool with its application-wide logo management and annotation feature?
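The classification step of such a two-step model can be pictured as a nearest-neighbour lookup. The sketch below is only an illustration of the idea, not the planned implementation: it assumes cosine similarity as the comparison, an arbitrary threshold of 0.8, and tiny made-up feature vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify(features, known, threshold=0.8):
    """Return the best-matching logo ID, or None if nothing is similar enough."""
    best_id, best_score = None, threshold
    for logo_id, reference in known.items():
        score = cosine_similarity(features, reference)
        if score >= best_score:
            best_id, best_score = logo_id, score
    return best_id


# Toy reference vectors; real feature representations would be much longer.
KNOWN = {"logo-adidas": [1.0, 0.0, 0.0], "logo-other": [0.0, 1.0, 0.0]}
```

A detection that returns `None` would be stored as unclassified. Once the admin assigns a logo to it, its feature vector is added to the known set, and similar detections in past and future frames match from then on.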

Second, we see it as a sort of blueprint for future projects and products. It is by no means restricted to brands and logos or to this kind of object detection model. With a few tweaks, it will be applicable in other contexts. Hence, we see it as a general tool for other use cases and well worth the exercise.