What We Didn’t Solve at DataEngConf

Takeaways from San Francisco

Chris Merrick
4 min read · Apr 15, 2016

Last week I joined 250 other data scientists and data engineers at DataEngConf — a two-day, two-keynote event in San Francisco. It brought together data practitioners from tech companies big and small, mostly from the Bay Area. An awesome feature of the event was “office hours,” small-group sessions held with speakers post-presentation. I found this to be particularly valuable as a way to draw parallels between talk topics and the problems I solve day to day.

Here are a few of my main takeaways from the event.

The interface between data science and engineering is blurry

Even though the theme of the conference was bridging the gap between data science and data engineering, it was clear right from the first keynote that we wouldn't solve that problem completely in just two days. Josh Wills from Slack got a lot of head nods when he introduced the "Infinite Loop of Sadness" to describe the interplay between data engineering, data science, ops, and business.

While we didn’t solve the problem, I think we made some headway:

  • Josh described his attempt to find common ground at the ETL layer. He is building a JavaScript framework for defining transformations that emphasizes usability for SQL-oriented users, along with portability to multiple stream processing engines like Spark Streaming and Flink. Why JavaScript? Because his team knows it well, and Java 8 introduced high-performance native JavaScript support in the JVM (the Nashorn engine), where most stream processing engines run.
  • Nathan Towery of Netflix shared his experience creating better personal dynamics between data engineering and data science teams. These problems aren't solved with technology; they're solved with communication and alignment around shared goals. His advice was to build a team culture that is "highly aligned/loosely coupled." Highly aligned/highly coupled teams get good results, but work slowly. Loosely aligned/loosely coupled teams are fast, but chaotic. Highly aligned/loosely coupled teams have clear, shared goals, but they're free to figure out the "how" of achieving them. While these rules aren't specific to data teams, it was fascinating to hear Nathan explain how they apply to Netflix's data teams.
  • Tommy Guy shared the "Data Asserts" system that his data science team at Microsoft uses to set clear data-quality expectations with their engineering team (a rough sketch of the idea follows this list). My favorite quote from his talk was, "Data scientists are terrible at engineering, we don't do code reuse very well. Writing tests helps force us to modularize our code." As a data engineer with a soft spot for analysts and scientists, I'm incredibly excited about improving this interface with tools and workflows borrowed from software engineering. I recently launched an open-source project with a colleague, aimed at helping analysts modularize, package, and test their analytics code.
From Tommy Guy’s presentation
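
To make the idea concrete, here is a minimal sketch of what a data assert might look like. This is my own illustration, not Microsoft's actual system: the function names, the pandas-based checks, and the sample events table are all hypothetical, but they capture the spirit of writing explicit, testable expectations about the data an engineering team delivers.

```python
import pandas as pd

def assert_no_nulls(df: pd.DataFrame, column: str) -> None:
    # A column we depend on downstream should never be null.
    null_count = int(df[column].isnull().sum())
    assert null_count == 0, f"{column} has {null_count} null values"

def assert_min_rows(df: pd.DataFrame, minimum: int) -> None:
    # A suspiciously small extract usually means an upstream job failed.
    assert len(df) >= minimum, f"expected at least {minimum} rows, got {len(df)}"

# Hypothetical daily events extract, validated before it feeds any analysis.
events = pd.DataFrame({"user_id": [1, 2, 3],
                       "event_type": ["open", "click", "open"]})
assert_no_nulls(events, "user_id")
assert_min_rows(events, minimum=1)
```

Checks like these can run as a gate in the pipeline, so a broken feed fails loudly instead of quietly skewing a model or a dashboard.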

Our tools will be different next year

Many of the technology-specific talks were about products that barely existed a year ago, like Kudu and TensorFlow. Moreover, there were questions about whether relatively well-established technologies like Spark could soon be superseded by newer offerings like Flink. Wherever you land on those questions, it's clear that the universe of tools is evolving every day. Consequently, many of the best talks focused on approach over implementation.

As a general observer, I could see the landscape evolving into an ecosystem of composable tools at each layer: distributed file stores (HDFS, S3, Kudu), storage formats (Parquet), wire protocols (Thrift, Avro), and computation engines (Spark, Flink). Projects like Apache Beam or the aforementioned JavaScript transformation framework may even provide a common language for defining operations across computation engines. The sheer volume of products with crazy names can be overwhelming, but it helps me sleep at night knowing that most of them interoperate with each other.
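
As a rough illustration of that "common language" idea, here is a minimal Apache Beam pipeline using the Python SDK (which postdates the conference). The word count is trivial by design; the point is that the same transformation graph can be handed to different runners, such as the local direct runner, Spark, or Flink, without rewriting the logic. Treat it as a sketch, not a recommendation.

```python
import apache_beam as beam

# The runner is a pipeline option (DirectRunner locally, or the Spark/Flink
# runners on a cluster); the transformations below stay the same either way.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.Create(["spark", "flink", "spark", "beam"])
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```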

We need each other

Above all else, it was clear that everything about the data science and engineering space is still maturing, from our organizational structures to our tools and techniques. That’s why the best moments of the two days were the opportunities to talk shop — and occasionally commiserate — with others facing the same problems.

Although anyone looking for silver bullets was probably out of luck, it was comforting to hear that even larger companies — Netflix, Slack, Uber — are still figuring out how to deal with some fundamental problems around the interface of data science and engineering.

Kudos to Pete Soderling and Hakka Labs for creating an excellent environment for the community to come together. I’m looking forward to the next one.
