Wrangle your scattered datasets and pipelines with Airtable
Level up your data platform: track all of your datasets in Airtable and hook them up to your pipelines through the API.
My field is bioinformatics, which is essentially data science and engineering for genomics and next generation sequencing (NGS). There are tons of bespoke commercial and open-source tools for managing genomics data and workflows, but I tend to avoid the many domain-specific, all-in-one bioinformatics platforms, preferring instead to build a stack of best-in-class solutions from the world of IT and data science. A system of components that do one thing well tends to be more flexible and long-lived than a monolithic, one-size-fits-all solution that tries to do everything.
I’ve written before about what I believe are the major components of a technology stack for scalable data pipelines. However, my proposed stack has had one component for which I hadn’t given any specific recommendations: the “metadata store,” i.e., the database at the center of your data platform. This was particularly ironic, because the metadata store was probably the most crucial piece in allowing our biotech startup to scale up the amount of data we could comfortably manage. Our solution was a custom Laboratory Information Management System (LIMS), but really any database could work. Unfortunately, databases are rarely used for this purpose, at least in the context of small- to medium-scale genomics projects within small biotech, academia, and (occasionally) big pharma. Instead, metadata is often scattered in Excel files, emails, and folder hierarchies throughout the company.
Our metadata database was probably the most crucial component in allowing our startup to scale the amount of data we could handle
What is metadata?
Metadata means, literally, data about data. Where did the data come from? How was it generated? Where does it live? What is the format? In the case of bioinformatics, what samples were in a given experiment and what were the different treatment conditions? If the data are stored on Amazon S3 or a shared filesystem, what are the filepaths or URIs?
If you’re a good data scientist, you probably already keep your metadata somewhere. Maybe it lives in a README or Excel file within your data folder, or maybe there is a csv or JSON file that you load and use with your data analysis, for example, to define groups for comparison. In bioinformatics, your metadata might link filepaths and sample IDs to treatment groups, like:
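A minimal, made-up example of such a table (sample IDs and paths are hypothetical):

```
sample_id,treatment,filepath
S-001,control,s3://my-bucket/bams/S-001.bam
S-002,drug_A,s3://my-bucket/bams/S-002.bam
S-003,drug_A,s3://my-bucket/bams/S-003.bam
```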
The purpose of metadata is to quickly get an idea of what the data is about, without having to dig into the dataset itself. In my field, metadata is particularly crucial, because datasets can come from all over — internal projects, public data repositories, supplemental data from publications, and more — but we have to be able to analyze and cross-compare them all using the same data pipelines.
The metadata store is the heart of your data platform
There have been a few distinct points in my career when the amount and variety of data in our hands had begun to become unwieldy. Each time, the turning point for getting it under control was when we harmonized our metadata and stored it in a central database. A metadata store:
- lives in a single, centralized location, visible to everyone
- persists forever
- is easily browsable and searchable
- assigns a unique identifier to every dataset
A metadata database provides a comprehensive, standardized registry of all of the datasets at your disposal. Data that is registered in the metadata store is “internalized”; everything else is “external.” We had strict rules for this, and we wouldn’t peer-review a data analysis until the metadata was in the database. This forced all scientists to use a shared ontology, since it could be enforced by the schema. For example, although cell line nomenclature varies widely, our analyses were forced to use the same name, because the “cell lines” column links to a specific record in the “cell lines” table. If an API to the metadata is available, your analysis code can fetch the names from the database to use as labels in figures and tables. This standardization becomes incredibly useful as your repository of datasets grows, and adds precision when you need to re-analyze the data months later.
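For instance, once records come back from the metadata API, mapping them to canonical labels for plotting is trivial. Here is a minimal sketch (the record shape matches what the Airtable API returns; the field name “Name” is an assumption about your schema):

```python
def figure_labels(records, field="Name"):
    """Map record IDs to canonical names for use as figure labels.

    records: list of record dicts as returned by the Airtable API,
             e.g. {"id": "rec123", "fields": {"Name": "HeLa"}}
    """
    return {rec["id"]: rec["fields"][field] for rec in records}
```

With this, every plot in every analysis uses exactly the spelling stored in the database, never a hand-typed variant.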
Another benefit of a metadata store is that record identifiers become useful, unambiguous shortcuts to refer to a particular piece of data. For example, we would often have multiple, slightly different sequencing experiments with very similar samples and conditions, so communicating about these results was often confusing and ambiguous. But once we gave every dataset its own ID, we naturally began referring to an experiment or sample by its number (“experiment E-17”, or “sample S-2542”), and there was no ambiguity.
Having metadata at our fingertips allowed us to stop auto-detecting data formats. Much of the effort in building a one-size-fits-all pipeline is spent writing conditionals and auto-detection code to allow different data types as input. For example, our RNA-seq quantification tool needs to know if the data is stranded or unstranded, and what the read length is. You could have the code inspect the data directly to gather this information, but it’s much easier just to include it in the metadata record and query the database for it. As a result, we were able to streamline (i.e., throw out) tons of code.
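As a sketch of the idea, tool options can be built straight from the metadata record instead of sniffing the file. The field names (“Stranded”, “Read length”) and the downstream tool’s flags are made up here; adjust them to your own schema and tools:

```python
def quant_options(sample_record):
    """Build command-line options for a quantification tool from a
    sample's metadata record, rather than inspecting the data itself."""
    fields = sample_record["fields"]
    opts = []
    # the "Stranded" checkbox field replaces strandedness auto-detection
    opts.append("--stranded" if fields.get("Stranded") else "--unstranded")
    # the "Read length" number field replaces scanning the reads
    opts += ["--read-length", str(fields["Read length"])]
    return opts
```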
Finally, metadata is useful for completely automated pipelines. Even if you have a fully automated script, it still needs to know where to find the input files. A database can store all of the samples and file paths for a given experiment. Then, all the user needs to input is the experiment ID, and the pipeline can do the dirty work of querying the database to retrieve the file paths. Even better, database records can be created before the data are generated (right after the experimental design, for example). Then, a workflow manager like Luigi or make can be run in a cron job to periodically check for the existence of the input files, running the full pipeline automatically as soon as they arrive.
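The readiness check such a cron job would run can be a pure function over the expected and existing paths. This is a hypothetical sketch, not part of the original script; the argument shapes are assumptions:

```python
def ready_experiments(expected, existing):
    """Return the experiment IDs whose expected input files all exist.

    expected: dict mapping experiment ID -> list of input paths/URIs,
              as retrieved from the metadata database
    existing: set of paths known to exist (e.g. from an S3 bucket listing)
    """
    return [expt_id for expt_id, paths in expected.items()
            if all(path in existing for path in paths)]
```

A cron job could call this with a fresh S3 listing and launch the full pipeline for each experiment ID it returns.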
This task of batch-processing a wide variety of unstandardized data sources, formats, and filepaths may seem strange to data scientists in many other industries, but it is extremely common in bioinformatics. Also, the concept of using a database for metadata may be obvious to software engineers, but it’s important to note that bioinformaticians are often of a different pedigree. Many are great programmers, but most are either self-taught or were educated on a curriculum that trod lightly on the subject of relational databases, if it covered them at all. The truth is that databases can be viewed as a dark art by those on the data analysis side. Data scientists in big enterprises may live by SQL, but I would bet 9 out of 10 bioinformatics scientists in biotech and academia manage their metadata exclusively via emails, Excel, directory structure, and READMEs, rather than databases.
Let’s fix that.
There is a good reason why few in my field use databases — it’s a steep learning curve. You have to figure out how to model the data, deploy the database, populate it with existing data, administer backups and updates, and possibly write an ORM layer if your SQL is rusty. And that’s just for it to be useful to the developers. For it to be useful to non-technical users, you also have to add a web layer with CRUD (create, read, update, delete) capabilities. As a result, non-technical users will generally prefer to keep data in an Excel file on a shared drive, and there is a huge activation energy to overcome when deciding between Excel and a database.
Airtable moves the needle. After one day with Airtable, your “Excel or database?” decisions will emphatically land on database, with Airtable as the tool. It provides a web-based interface that is as easy to use as Excel, but has the soul of a database and a beautiful web API as well. Your organization may still eventually want to migrate to a full-fledged PostgreSQL database and custom web app for production, but Airtable will get you a fully functional prototype and web GUI within minutes, with 95% of the features you need. With less than a day’s work, you’ll be able to hook up your existing pipeline scripts to the Airtable API for a level of automation and integration that rivals commercially available data platforms.
Stateless data pipelines
The metadata database is a useful place to store output summary metrics, closing the loop between metadata, a stateless analysis pipeline and output results.
When dealing with big datasets, the best practice for a data pipeline is to keep your pipeline code as stateless as possible. That way, you can easily migrate to a Docker container and/or a server in the cloud, without having to change anything in your code. Many bioinformatics pipelines use cloud-based object stores such as Amazon S3 to allow data to be accessed anywhere, with high availability and unlimited capacity.
However, S3 is not the most convenient place to access structured outputs such as summarized data, QC metrics, and figures. The metadata database is a useful place to store these structured outputs, closing the loop between metadata, a stateless analysis pipeline and results.
In a perfect world, all of the stateful aspects of the pipeline would reside in either a database (for metadata and structured data) or S3 (for big, unstructured datasets), while your stateless pipeline steps access all types of data in exactly the same way (via HTTP/S API calls), from any computer with internet access.
Example: a toy Airtable/S3 pipeline
Now I’ll show you how I built a simple, one-step QC pipeline for NGS data using Airtable and Amazon S3. This requires an account for Airtable and Amazon Web Services, as well as access to a Linux terminal (I have a Mac).
Set up the Airtable database
I created a database — or “Base” as Airtable calls it — named “NGS experiments”, and created tables for Experiments and Samples. I then added columns for Platform (either RNA-seq or whole exome), Description, Vendor, and Date received to Experiments, and specified the correct data types using the “Customize field types” dropdown. Then, I created a new Auto Number field called ID, and gave Name (the primary field) a formula that adds the prefix “EXPT-” to each ID. Later I hid the ID field, but it shows up in the screenshot below for clarity. Then I made up some data.
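Assuming the Auto Number field is called ID, the formula for the primary field would be along these lines (Airtable formulas use `&` for string concatenation):

```
"EXPT-" & ID
```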
Next, I created some fields for the Samples table and populated it with some made-up data, although I tried to make this toy example representative of the diversity of data types and filepaths that a bioinformatics group might typically see. For example, TCGA, or The Cancer Genome Atlas, is a publicly available repository of cancer genomics data gathered from thousands of patients. For simplicity, we assume the reads have already been aligned, so each sample has only one data file (the BAM), although in reality raw sequencing data usually comes as a pair of FastQ-formatted files.
Collect metrics from the data and store in the Airtable record
Below are the functions for a simple, one-step pipeline that downloads a file from S3, runs samtools stats on the temporary local file, and updates the Airtable record with the sequence count. The update_sample function, along with some of the initial setup, is omitted for now.
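Since the embedded code may not render here, the following is a rough sketch of such a step, not the original script. The bucket/key arguments and the use of boto3 are assumptions; writing the count back to Airtable is left to the update_sample helper:

```python
import re
import subprocess
import tempfile

def parse_total_reads(stats_text):
    # samtools stats prints a summary line like:
    #   SN    raw total sequences:    12345
    match = re.search(r"raw total sequences:\s+(\d+)", stats_text)
    return int(match.group(1)) if match else None

def run_qc(bucket, key):
    """Download a BAM from S3, run samtools stats on the temporary
    local file, and return the total read count."""
    import boto3  # imported here so parse_total_reads works without AWS deps
    s3 = boto3.client("s3")
    with tempfile.NamedTemporaryFile(suffix=".bam") as tmp:
        s3.download_fileobj(bucket, key, tmp)
        tmp.flush()
        result = subprocess.run(
            ["samtools", "stats", tmp.name],
            capture_output=True, text=True, check=True)
    return parse_total_reads(result.stdout)
```

Note that the temporary file is cleaned up automatically when the `with` block exits, so no filepath bookkeeping survives the function call.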
This code relies on helper functions to query the Airtable API. The API endpoint is available through the API docs generated specifically for each Base. With the help of the Python requests library and the beautiful Airtable API, this part is a cinch. Note that the sample_id variable (e.g., “SAMPLE-1”) is actually the “Name” field, and not the same as the “id” of the Airtable record, which is an autogenerated primary key.
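The original helpers live in the Gist; here is a minimal sketch of what they could look like. The table name (“Samples”), the field names, and the AIRTABLE_API_KEY environment variable are assumptions carried over from the setup above:

```python
import os

AIRTABLE_API = "https://api.airtable.com/v0"

def filter_by_name(name):
    # Airtable's filterByFormula query parameter lets us look up the
    # record whose primary "Name" field equals the given sample ID
    return {"filterByFormula": f"{{Name}} = '{name}'"}

def _headers():
    # assumes the API key is set in the AIRTABLE_API_KEY env variable
    return {"Authorization": "Bearer " + os.environ["AIRTABLE_API_KEY"]}

def get_sample(base_id, sample_id):
    """Return the first Samples record whose Name matches sample_id."""
    import requests  # imported here so the pure helpers work without it
    url = f"{AIRTABLE_API}/{base_id}/Samples"
    resp = requests.get(url, headers=_headers(),
                        params=filter_by_name(sample_id))
    resp.raise_for_status()
    records = resp.json()["records"]
    return records[0] if records else None

def update_sample(base_id, record_id, fields):
    """PATCH only the given fields on an existing Samples record."""
    import requests
    url = f"{AIRTABLE_API}/{base_id}/Samples/{record_id}"
    resp = requests.patch(url, headers=_headers(), json={"fields": fields})
    resp.raise_for_status()
    return resp.json()
```

The PATCH verb matters: it updates only the fields you send, leaving the rest of the record untouched.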
The full script is available as a Github Gist.
Now, if you add the “Total reads” column to the Samples table, and upload a BAM file to S3, running the command:
$ python pipeline.py SAMPLE-1
should download the file, run samtools, and automatically update Airtable. And it worked! Check it out in the screenshot below.
Free your code from the filesystem
You may have noticed that, unlike many data pipeline scripts, there is very little code to deal with folders and filepaths. Notice that I didn’t even import shutil! I only used the os module to get environment variables. I’m not saying you’ll never need them, but the Airtable/S3 framework allows you to avoid them as much as possible.
Dealing with folders and files is the worst part of building a data pipeline, and distracts from the high-value, domain-specific code. When dealing with 3rd-party Linux tools, you can’t abstract the filesystem completely, but using S3, the Python tempfile module, and Airtable gets you pretty close. From a deployment perspective, the code in this example only needs to ensure that your environment has the right dependencies installed (a solved problem when using Docker), that a few environment variables are set, and that the /tmp folder used by tempfile is big enough to handle your data. If all these are met, then this code is completely portable.
A bit more advanced, but not by much, is to upload an image to S3 and use the API to link to the Airtable record as an attachment. Airtable has nice native capabilities for image attachments, for example, automatically making thumbnails for beautiful in-line display in the records table.
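Concretely, the Airtable API accepts attachments as a list of objects containing a `url` it can fetch the file from (the URL must be publicly readable or pre-signed); Airtable then stores its own copy and generates the thumbnails. A sketch of building the update payload, with a hypothetical field name:

```python
def attachment_fields(image_url, field="QC plot"):
    """Build the PATCH body that attaches an image hosted at image_url
    to an Airtable record's attachment field."""
    return {"fields": {field: [{"url": image_url}]}}
```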
In addition to being a shortcut to fancy Airtable features, this example framework can serve as a building block for much more complex pipelines, as you start to incorporate other pieces of infrastructure such as a workflow manager (Luigi), environment management (Docker), and job scheduling (AWS Batch). Below is a list of possible steps to harden and “productionize” our toy data pipeline (aside from domain-specific stuff, like more advanced preprocessing, QC, and downstream analysis).
- Image attachments in Airtable
- Docker to manage environment dependencies
- Luigi to break into an idempotent workflow of interdependent tasks
- AWS Batch to submit as horizontally scalable compute jobs