Aquarium: Better Models Through Better Data

Published in Sequoia Capital Publication · 3 min read · Feb 24, 2021

We’re excited to be working with Peter Gao, Quinn Johnson and the team at Aquarium! Aquarium is a data management platform for Machine Learning (ML) that enables ML teams to manage and curate their training datasets.

Machine Learning is clearly one of the most important technology trends of our time. After the hype, ML is finally starting to creep into a wide range of industries — manufacturing, agriculture, energy and, of course, automotive.

While ML is increasingly ubiquitous, most people still have a relatively simplistic mental model for how to build an ML model:

Step 1: Collect and label as much data as possible
Step 2: Train a model with that data
Step 3: Fiddle with some knobs (i.e., hyperparameter tuning)
Step 4: Deploy

If you’re just training a hot dog detector, then this might be all you need, but most tasks are more complicated.

Consider any autonomous vehicle (e.g., a car, tractor or drone). A single camera recording 30 frames/second captures about 2.6M images per day. Now imagine multiple cameras on multiple vehicles operating 24/7: the amount of raw data a fleet captures is immense. Practically speaking, it’s impossible to label all this data, let alone train on it.
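To make the scale concrete, here is a quick back-of-the-envelope sketch of that arithmetic. The 30 frames/second figure comes from the paragraph above; the fleet size and cameras-per-vehicle numbers are made-up assumptions for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def frames_per_day(fps: int, cameras: int = 1, vehicles: int = 1) -> int:
    """Frames captured in one day of continuous recording."""
    return fps * SECONDS_PER_DAY * cameras * vehicles


# One camera at 30 frames/second, as in the text.
single_camera = frames_per_day(30)

# Hypothetical fleet: 100 vehicles with 6 cameras each (assumed numbers).
fleet = frames_per_day(30, cameras=6, vehicles=100)

print(f"single camera: {single_camera:,} frames/day")
print(f"assumed fleet: {fleet:,} frames/day")
```

One camera alone yields 2,592,000 frames per day, which is where the "about 2.6M" figure comes from. Under the assumed fleet size the total runs past a billion frames per day.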

The simplest solution would be to randomly pick images to label, but this is suboptimal: common things would still be common and rare things would still be rare. Ideally, you want to build a data set that captures all of the situations you might encounter, not just the common ones.
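A toy sketch of why random picking falls short. It assumes every frame already carries a cheap scene tag ("common" or "rare"), which is a hypothetical simplification; in practice finding the rare frames is the hard part. The 99/1 split and the labeling budget of 100 are also invented for illustration.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical pool of unlabeled frames: 99% common scenes, 1% rare ones.
pool = ["common"] * 9900 + ["rare"] * 100
budget = 100  # how many frames we can afford to label


def balanced_sample(pool, budget, rng):
    """Spend the labeling budget evenly across scene tags."""
    by_tag = {}
    for index, tag in enumerate(pool):
        by_tag.setdefault(tag, []).append(index)
    per_tag = budget // len(by_tag)
    picked = []
    for indices in by_tag.values():
        picked.extend(rng.sample(indices, min(per_tag, len(indices))))
    return [pool[i] for i in picked]


# Uniform random picking mirrors the pool: rare scenes stay rare.
uniform = rng.sample(pool, budget)
# Curated picking deliberately over-represents the rare scenes.
balanced = balanced_sample(pool, budget, rng)

print("uniform: ", Counter(uniform))
print("balanced:", Counter(balanced))
```

The uniform sample contains only a handful of rare frames at best, while the balanced sample splits the budget evenly between the two tags.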

In reality, the mental model above is too simplistic. Collecting data is not enough. You have to be thoughtful and careful about what data you train on. A better model would be:

Step 1: Collect as much data as possible
Step 2: Curate an initial training data set
Step 3: Refine your model
— Step 3a: Label any unlabelled data in the training set
— Step 3b: Train a model
— Step 3c: Observe failures in that model
— Step 3d: Update your training data set
— Step 3e: Repeat until satisfied
Step 4: Deploy
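The refinement loop in step 3 can be sketched in a few lines. This is a deliberately toy illustration, not Aquarium's API: the "model" is just the set of situations it has seen, and every function name and scenario label here is hypothetical.

```python
def train(training_set):
    """Step 3b (toy): the 'model' only handles situations seen in training."""
    return set(training_set)


def find_failures(model, field_data):
    """Step 3c (toy): a failure is any situation the model has not seen."""
    return [situation for situation in field_data if situation not in model]


def refine(training_set, field_data, max_rounds=10, per_round=2):
    """Steps 3a-3e: train, observe failures, update the data set, repeat."""
    training_set = list(training_set)
    for round_no in range(1, max_rounds + 1):
        model = train(training_set)
        failures = find_failures(model, field_data)
        if not failures:  # Step 3e: satisfied, stop iterating
            return model, round_no
        # Steps 3d and 3a: add (and, in reality, label) a few distinct
        # failing situations rather than more of what already works.
        distinct = list(dict.fromkeys(failures))
        training_set.extend(distinct[:per_round])
    return train(training_set), max_rounds


# Hypothetical field data: mostly highway driving plus a long tail.
field_data = (["highway"] * 50 + ["rain"] * 5 + ["construction"] * 3
              + ["night"] * 2 + ["animal"])
model, rounds = refine(["highway"], field_data)
print(f"covered {sorted(model)} after {rounds} rounds")
```

Each pass adds the failing situations to the training set instead of more of the common case, which is the whole point of curating rather than just collecting.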

Enter Aquarium. Aquarium is effectively a platform for managing steps 1–3. It enables ML engineering and operations teams to manage and curate their training data and then iterate on their models until they’re ready for production.

While it’s tempting to fixate on the latest and greatest models coming out of OpenAI or FAIR (and that work is great!), the hard work for most companies is this constant curation of training data. This work is where models go from theory to practice.

Peter and Quinn both know this problem well. They spent years at Cruise Automation, where they worked on the ML models for self-driving cars. It was there they realized that the core work of surpassing human-level performance and safety is this constant re-curation of training data. They founded Aquarium to bring those lessons and tools to the broader engineering community. Less than a year after its founding, Aquarium is already working with over a dozen customers, including Sterblue and AMP Robotics, to help them manage their training sets and, ultimately, improve their model performance in production.

As ML technology becomes increasingly accessible, Aquarium’s ultimate goal is to enable non-technical domain experts to build and refine models over time, whether that’s radiologists focused on early cancer detection or botanists focused on increasing crop yield. We are grateful to be on this journey with Peter, Quinn and team! If you want to learn more, check out https://www.aquariumlearning.com/.


Written by Mike Vernal