This week we attended DataEngConf at Columbia University in New York City. We’ve previously written about how leading ML algorithms require modern compute, data pipelines, and workflows, and the conference underscored that these software advancements have put pressure on infrastructure teams to innovate. DataEngConf had three tracks: data engineering, data science and analytics, and AI products. This piece covers the key topics from the data engineering track, including data growth, ETL, containers/Kubernetes, and schedulers.
Data keeps growing, but growth spurts aren’t easy. IDC’s paper “Data Age 2025” forecasts that the global datasphere will grow tenfold, from 16.1 ZB in 2016 to 163 ZB in 2025. Businesses often describe data as their crown jewels, but handling data is hard. Jean-Mathieu Saponaro of Datadog highlighted numerous data challenges, including 1) highly diverse data sources (formats, types); 2) ever-evolving data sources (internally generated, third party); 3) constant demand for new sources; 4) varying levels of data sensitivity and access; 5) long-term persistence; and 6) backfilling.
Another data challenge echoed across talks was data lineage: the lifecycle of a piece of data, including its origins, what happens to it, and where it moves over time. Willy Lulciuc of WeWork noted that metadata can capture a dataset’s origin, its owner, its change frequency, and the pipeline stages it passed through. His talk dug into Marquez, a service for collecting, aggregating, and visualizing a data ecosystem’s metadata.
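To make the kinds of fields Lulciuc described concrete, here is a minimal sketch of a lineage metadata record in plain Python. This is not Marquez’s actual data model; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Illustrative only: these field names are assumptions, not Marquez's schema.
@dataclass
class DatasetMetadata:
    name: str                   # dataset identifier
    origin: str                 # source system the data came from
    owner: str                  # team responsible for the dataset
    change_frequency: str       # e.g. "hourly", "daily"
    pipeline_stages: list = field(default_factory=list)  # stages the data passed through

meta = DatasetMetadata(
    name="rides",
    origin="postgres://prod/rides",   # hypothetical source
    owner="data-platform",
    change_frequency="hourly",
)
meta.pipeline_stages.append("normalize")  # record each stage as the data moves
```

A service like Marquez collects records along these lines from every pipeline run, so engineers can trace where a dataset came from and what happened to it.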
ETL? More like ET “Hell.” Most businesses use Extract Transform Load (ETL), the established pattern in which an operator pulls data, performs transformations in one giant piece of code, and loads the output into a data warehouse. Saponaro argues that “ET Hell” isn’t an ideal solution: the pipeline has low resilience to change, so evolving data sources force data engineers to adjust the whole pipeline; task dependencies are nightmarish, with one failure bringing down the entire pipeline; and backfilling takes a long time.
Instead, Datadog uses an Extract Tiered Transform Load (ETTL) process, an augmented form of ETL that addresses many of these challenges. ETTL breaks transformations into three steps, each of which persists its data: bronze, silver, and gold. The bronze step is a raw extract that brings data from all sources into one place. The silver step is a normalization layer that allows 1:1 mapping of objects; it supports filtering, data cleaning, column selection, renaming, and type casting. Finally, the gold step is the analytics layer, which leverages Spark to create the final objects loaded into the data warehouse.
ETTL solves many traditional ETL problems. The bronze step makes it easy to incorporate diverse data sources: bronze executes one base task per source, and Datadog built an abstraction for adding new datasets. Silver mitigates issues from changing and newly added sources because objects stay separate until the silver stage, which transforms them into a common form and handles column evolution. Tiering means that if logic is incorrect or objects need adjusting, data engineers only backfill from the tier affected by the change rather than fixing the entire pipeline. Data also persists, since the bronze stage and data warehouse tables can be backed up, and sensitive data can be controlled through IAM solutions and/or BI tool permissions.
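The tiered idea can be sketched in a few lines of plain Python. Datadog’s real pipeline runs on Spark; the functions, field names, and sample records below are illustrative assumptions, not their actual code.

```python
# Minimal sketch of a bronze/silver/gold pipeline. Each tier's output would
# be persisted in a real system, so a bug downstream only requires
# re-running from the affected tier.

def bronze(raw_sources):
    """Raw extract: gather every source into one place, untouched."""
    return [record for source in raw_sources for record in source]

def silver(bronze_records):
    """Normalization: column selection, renaming, type casting, cleaning."""
    return [
        {"user_id": int(r["uid"]), "amount": float(r["amt"])}
        for r in bronze_records
        if r.get("uid") is not None    # data cleaning: drop malformed rows
    ]

def gold(silver_records):
    """Analytics layer: build the final objects loaded into the warehouse."""
    totals = {}
    for r in silver_records:
        totals[r["user_id"]] = totals.get(r["user_id"], 0.0) + r["amount"]
    return totals

# Two hypothetical sources with slightly messy, string-typed data.
sources = [
    [{"uid": "1", "amt": "9.5"}, {"uid": None, "amt": "3.0"}],
    [{"uid": "1", "amt": "0.5"}, {"uid": "2", "amt": "7.0"}],
]
b = bronze(sources)   # persisted tier 1
s = silver(b)         # persisted tier 2
g = gold(s)           # final objects for the warehouse
```

Because each tier persists its output, fixing a bug in the gold logic only means re-running gold over the stored silver data, which is the backfilling benefit described above.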
Containers and Kubernetes extend reach. Data platform teams, like Lyft’s, are beginning to leverage containers and Kubernetes to create self-serve, end-to-end ML platforms. Containers solve dependency management because they are self-contained, and they are framework-independent and portable, letting teams stay cloud-agnostic and leverage the best hardware advancements across clouds, from TPUs to Nvidia GPUs, at the most effective price point.
Kubernetes provides rich APIs for launching complex workloads and can scale out. Saurabh Bajaj of Lyft noted that “Kubernetes has blown up,” but it isn’t without drawbacks: it is complex, with low-level APIs that most users don’t want to touch, so Lyft created abstractions to make launching workloads easier.
Lyft’s bleeding-edge platform uses containers and Kubernetes, but speed was initially an issue: launching a notebook, a new job, or an instance could each take over 5 minutes. According to the New York Times, an ML engineer can be paid $300,000 to $500,000 annually. The Pew Research Center found that in 2015 the average work week was 38.7 hours and the average individual worked 46.8 weeks that year. Applied to the $500,000 ML engineer salary, every wasted minute is roughly a $4.60 loss to the business, and that lost productivity adds up quickly across people and teams.
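The $4.60 figure follows directly from the salary and work-week numbers above; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the cost-per-minute figure.
salary = 500_000          # top-end ML engineer salary (NYT figure)
hours_per_week = 38.7     # Pew Research Center, 2015
weeks_per_year = 46.8     # Pew Research Center, 2015

minutes_per_year = hours_per_week * weeks_per_year * 60
cost_per_minute = salary / minutes_per_year
print(round(cost_per_minute, 2))  # -> 4.6
```

So a 5-minute launch wait costs on the order of $23 in engineer time, each time it happens.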
One of the biggest challenges with ML images is their size, which can slow everything down. A typical notebook image runs 12–15 GB because it often contains Nvidia CUDA drivers, framework binaries, and more, so downloading images can bottleneck new jobs. To solve this, Lyft keeps a warm cache on every node in the cluster and pre-caches the most commonly used images.
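To see why a cold image pull hurts, consider rough transfer times for a 15 GB image at a couple of assumed link speeds (the bandwidth figures here are illustrative, not Lyft’s numbers):

```python
# Rough cold-pull times for a 15 GB image at assumed network speeds.
image_gb = 15
# rate in Gbit/s -> seconds; multiply GB by 8 to convert to gigabits
pull_seconds = {rate: image_gb * 8 / rate for rate in (1, 10)}
for rate, secs in pull_seconds.items():
    print(f"{rate} Gbit/s link: {secs / 60:.1f} min per cold pull")
```

Even on a fast link, repeated cold pulls across a cluster waste real time, which is what the per-node warm cache eliminates.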
While Kubernetes has autoscaling, starting a new instance can still take a while. Lyft uses Packer to generate AMIs; pre-baking the images cuts down build time. Kubernetes priority classes let Lyft create lower-priority jobs that reserve extra capacity and can be preempted when higher-priority work arrives. These techniques let Lyft set up notebooks fast, making data scientists and ML engineers more productive.
Schedulers are key. Throughout the conference, speakers emphasized the need for schedulers, with Luigi and Airflow referenced most. Spotify developed Luigi, a Python package for building batch-job pipelines; it holds the logic between tasks and chains them together. Maxime Beauchemin created Airflow, which uses operators as the unit of abstraction for tasks and defines workflows as directed acyclic graphs (DAGs). GitHub star growth suggests strong community support for both, with Airflow exhibiting impressive recent adoption.
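The core idea both tools share, tasks declaring upstream dependencies and a scheduler running them in dependency order, can be sketched in a few lines of plain Python. This is not Luigi’s or Airflow’s API; the function and task names are illustrative, and real schedulers add retries, state tracking, and cycle detection.

```python
# Minimal sketch of DAG-style task chaining: run each task only after
# its upstream dependencies have completed.

def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: [upstream names]}."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)          # ensure upstreams complete first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

# Hypothetical three-task pipeline: extract -> transform -> load.
results = []
order = run_dag(
    {
        "extract": lambda: results.append("E"),
        "transform": lambda: results.append("T"),
        "load": lambda: results.append("L"),
    },
    {"transform": ["extract"], "load": ["transform"]},
)
```

Luigi expresses the dependency edges via each task’s `requires()` method, while Airflow wires operators together into an explicit DAG object; the execution-ordering principle is the same.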
DataEngConf reinforced that algorithmic advances are driving data platform innovation. Data keeps growing, and data integration and management remain hard, but multi-step ETL processes, containers/Kubernetes, and schedulers are making data engineers’ lives easier.