Strata NYC 2017 and the State of Data Engineering

Published in Red Ventures Data Science & Engineering

5 min read · Oct 10, 2017

One of the great things about working for Red Ventures is the investment they make in the development of their employees. Engineers are provided with plenty of resources for learning and development, as well as the opportunity to attend conferences. I was lucky enough to attend the O’Reilly Strata conference in New York last week, and came back with some great insights into the current state of the world of data engineering.

If you haven’t been to the conference before, or have never heard of it until now, Strata NYC is an annual conference on all things data. With over 180 sessions over the course of two days, there was an incredible amount of content to absorb. The sessions I attended followed a data-engineering track, with a focus on data warehousing and architecture, streaming systems, and cloud. Across the sessions, keynotes, and vendor demonstrations, some common themes emerged that paint a picture of the current state of data engineering and point to where the industry is going. In this post, I’ll go over some of these themes and the related sessions, and provide some resources for further reading.

Cloud

Cloud was one of the threads that tied together nearly every session I attended, and it is becoming the foundation of modern data engineering infrastructure. Any startup founded in the last five years has almost certainly adopted a cloud-first architecture, and large, established enterprises are quickly migrating major workloads to the cloud. John Hitchingham, the director of performance engineering at the Financial Industry Regulatory Authority (FINRA), described their transition from an on-premises data lake to a cloud data lake on AWS. The driving force behind FINRA’s move to the cloud was the high cost of scaling on-premises infrastructure to handle peak data volumes. AWS (and other cloud platforms) provide some key advantages over on-premises architectures, the most notable of which include the following:

Separation of compute from storage
Easy scale-out to handle peak data volumes
Unified access to all corporate data

One of the more recent trends in cloud computing is the adoption of serverless technologies. Ben Snively, a solutions architect at Amazon, gave a session on how AWS can be used to implement a serverless architecture for data storage and processing. In his talk, he first described the evolution of cloud architecture from virtualized servers, to managed servers, and finally to serverless. As its name implies, serverless tools do not require the user to provision or manage server resources, and one of the key benefits is that the user does not pay for time when the tools are not in use. AWS has extensive support for serverless analytic workloads, including object storage in S3, streaming in Kinesis, and compute using Lambda functions.
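To make the serverless model a little more concrete, here is a minimal sketch of a Python Lambda handler that consumes one batch of Kinesis records. The event layout (base64-encoded payloads under Records[].kinesis.data) is the standard shape of a Kinesis-triggered Lambda event. The JSON payload, the "page" field, and the make_record helper are assumptions made up for illustration and do not come from the talk.

```python
import base64
import json
from collections import Counter


def handler(event, context):
    """Count page views per page in one batch of Kinesis records."""
    counts = Counter()
    for record in event.get("Records", []):
        # Kinesis delivers each record payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # "page" is a hypothetical field used only for this example.
        counts[message["page"]] += 1
    return dict(counts)


def make_record(message):
    """Build a fake Kinesis record for local testing (hypothetical helper)."""
    raw = json.dumps(message).encode("utf-8")
    return {"kinesis": {"data": base64.b64encode(raw).decode("ascii")}}
```

Because the handler is a plain function, it can be exercised locally by passing it a dictionary built with make_record, with no servers to provision either locally or in AWS.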

One of the most intriguing applications of serverless patterns to data engineering workloads is in ETL. The major cloud providers have created numerous tools that take the pain out of ETL. Amazon recently released Glue, a fully managed service for running ETL jobs in the AWS ecosystem.

SQL

One of the more subtle but profound themes of the conference was the fact that after over 40 years, SQL is still the dominant tool for data analysis. A recent post by Timescale CEO Ajay Kulkarni described the resurgence of SQL in a world where it seemed that NoSQL systems were destined to take over. His argument centered on some of the key limitations of NoSQL, namely the lack of a standardized, cross-platform language. This trend was extremely evident at Strata.

One particularly interesting trend is the use of SQL to analyze streaming data. Tyler Akidau, a software engineer at Google and co-author of the O’Reilly book Streaming Systems, gave a fascinating talk on the foundations of streaming SQL. In the talk, he provided an excellent overview of the difference between querying tables and querying streams. The key point was that tables represent data at rest, while streams represent data in motion. While we can use the same SQL constructs to query both, there is a fundamental adjustment that must be made when querying streaming data. Akidau distilled the difference quite nicely:

Tables capture a point in time snapshot of a time-varying relation. Streams capture the evolution of a time-varying relation over time.

The application of SQL to these new paradigms demonstrates just how flexible SQL is.
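The table/stream distinction can be illustrated with a small Python sketch. It models a time-varying relation as a list of timestamped changes, then exposes it both as a stream (every change, in order) and as a table (a snapshot at a chosen time). The names as_stream and as_table and the (timestamp, user, score) layout are assumptions chosen for this example, not anything taken from Akidau's talk or from a streaming SQL engine.

```python
from typing import Dict, Iterable, Iterator, Tuple

# One change to the relation: (timestamp, user, new_score).
Change = Tuple[int, str, int]


def as_stream(changes: Iterable[Change]) -> Iterator[Change]:
    """Stream view: every change to the relation, in time order."""
    yield from sorted(changes, key=lambda change: change[0])


def as_table(changes: Iterable[Change], at: int) -> Dict[str, int]:
    """Table view: a snapshot of the relation as of time `at`."""
    table: Dict[str, int] = {}
    for timestamp, user, score in as_stream(changes):
        if timestamp > at:
            break
        # A later change for the same user overwrites the earlier value.
        table[user] = score
    return table
```

Calling as_table with different values of `at` over the same changes returns different snapshots, while as_stream always returns the full history. That is the sense in which a table is one point-in-time view of what the stream records over time.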

Further Reading

Presentation slides from “Foundations of Streaming SQL”
Kinesis Analytics (SQL on Kinesis streams)
KSQL (SQL on Kafka streams)

Streaming

As far as trending technologies in the data ecosystem go, streaming is arguably the most prevalent. Kafka, the open-source stream processing platform, is used by numerous companies to ingest billions of events per day. Michelle Ufford, a data engineer at Netflix, spoke on how Netflix drives automation in data engineering workflows. In her talk, she said that Netflix uses Kafka to process hundreds of billions of events every day. Obviously not every company has Netflix-sized data, and there are plenty of use cases for stream processing that make sense for smaller data sets. For example, one increasingly common use of stream processing is as a replacement for traditional GUI-based ETL tools in an effort to achieve near-real-time analytics.

Streaming systems have applications beyond analytic workloads, and are often used for core business operations. Gwen Shapira, an architect at Confluent, gave an excellent talk titled “The Three Realities of Modern Programming: The Cloud, Microservices, and the Explosion of Data”. In the talk, she described how modern infrastructures should be built using microservices that communicate with one another using a stream of events. Often referred to as event sourcing, this pattern enables an application to generate an immutable stream of events that can be used for in-depth historical analysis. Although implementing a streaming architecture has its advantages, it also carries high management and operational overhead, which is why Shapira recommended using managed services when possible.
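As a rough illustration of the event sourcing idea, the Python sketch below keeps an append-only log of immutable events and derives current state by replaying them. The Event fields, the EventLog class, and the account-balance example are all hypothetical and chosen for brevity. In a real system the log would typically live in something like Kafka, not in an in-memory list.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)  # frozen makes each event immutable once created
class Event:
    account: str
    kind: str  # "deposit" or "withdrawal"
    amount: int


class EventLog:
    """Append-only log: events can be added and read, never changed."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def events(self) -> Tuple[Event, ...]:
        # Return a tuple so callers cannot mutate the log itself.
        return tuple(self._events)


def balances(events: Iterable[Event]) -> Dict[str, int]:
    """Derive current state by replaying events from the beginning."""
    state: Dict[str, int] = {}
    for event in events:
        sign = 1 if event.kind == "deposit" else -1
        state[event.account] = state.get(event.account, 0) + sign * event.amount
    return state
```

Because state is always derived from the log, replaying only a prefix of the events reconstructs the state as of any earlier point, which is what makes the pattern useful for historical analysis.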

Final Thoughts

Data engineering is an incredibly exciting field to be in right now, and spending a few days learning about how experts in the field are solving problems and leveraging the latest technologies was invaluable. While many aspects of the industry are evolving, some things are staying exactly the same. ETL is one of the main areas undergoing momentous change, as complexity and overhead have decreased significantly with managed cloud resources and serverless architectures. Although the nature of data is expanding from data at rest in tables (both RDBMS and NoSQL) to dynamic streaming data, SQL is still the dominant language for data analysis more than 40 years after its inception.

In addition to the sessions, both days of keynotes were excellent, and there were some great networking opportunities as well. To top it all off, Cloudera sponsored a party at a midtown Manhattan rooftop garden that had a dance floor and an open bar.

Written by Weston Sankey