Themes from the Subsurface Data Lake Conference

And coming to terms with a never-ending stream of cheesy aquatic metaphors

Paul Singman
Whispering Data
4 min read · Sep 15, 2020


Losing sight of the larger trends shaping the data ecosystem is easy when you're down in the trenches of the daily work grind. Because of this, I love attending conferences that serve as an inspirational footstool from which one can easily glimpse the full landscape.

The first-ever Subsurface Cloud Data Lake Conference, held virtually in July, was a great opportunity for this, as it featured an impressive lineup of speakers who thoughtfully contextualized where we are with data in 2020.

Talk #1: The Future is Open — The Rise of the Cloud Data Lake

Tomer Shiran, Co-founder & CPO, Dremio

The whole event was put together by Dremio and began with an awesome opening keynote by one of its founders, who gave an overview of how data architectures have evolved over the last 10 years.

The main idea is this: we’ve gone from proprietary, monolithic analytic architectures anchored by an expensive Oracle-licensed database or Hadoop cluster… to architectures defined by flexible, increasingly open source technologies across the four main layers of a data stack: the storage layer, data layer, compute layer, and client layer.

The base of the stack — the storage layer — is supported by the crucial development of cloud technologies like S3 and ADLS that offer infinitely scalable, highly available, globally distributed, easy-to-connect-to, and outrageously cheap cloud storage.

The ability to agnostically use these storage blobs to separate data storage from specialized compute engines (like Spark, Snowflake, and Athena) is the dominant architectural trend that nearly every talk mentioned.

So if nothing else, leave this article with that concept clear in your mind.
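To make that concrete, here's a minimal sketch of the pattern, assuming a hypothetical bucket that already holds Parquet files: the data sits once in object storage, and any compute engine that speaks Parquet and S3 can scan the same objects.

```python
# A minimal sketch of storage/compute separation.
# "my-data-lake" and the events/ prefix are hypothetical.
import pyarrow.dataset as ds

# The data lives exactly once, as plain Parquet objects in cloud storage.
events = ds.dataset("s3://my-data-lake/events/", format="parquet")

# One engine (a lightweight local PyArrow scan) reads it directly...
table = events.to_table(columns=["user_id", "event_type"])
print(table.num_rows)

# ...while Spark could read the identical path with
#   spark.read.parquet("s3://my-data-lake/events/")
# and Athena or Snowflake could define an external table over it,
# with no copies of the data needed.
```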

Talk #2: Apache Arrow: A New Gold Standard for Dataset Transport

Wes McKinney, Director, Ursa Labs

Wes began by explaining the problem that arises when you have a bunch of different systems handling data, each with potentially its own storage and transport protocols. The result is what he calls a “combinatorial explosion of pairwise data connectors,” which manifests as costly-to-implement custom connectors that developers must build if they want to move data efficiently through their pipelines or applications.

This is one of a few issues highlighted as the inspiration for the Apache Arrow project. The others being:

  1. Unnecessary CPU time spent serializing & de-serializing data
  2. Expensive writes to disk/blob storage as an intermediary
  3. Decreased performance due to executor node bottlenecks in distributed systems

And so Wes continued by explaining some of the technical concepts behind Arrow’s solution to these problems.

The end result is a mostly behind-the-scenes library that makes operations like converting between Spark and Parquet more efficient, and all of us more productive in the long run.
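One concrete place Arrow already shows up is Spark's Arrow-accelerated handoff to pandas. Here's a minimal sketch, assuming PySpark 3.x and a local session; the DataFrame itself is just a throwaway example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Ask Spark to use Arrow as the in-memory interchange format when handing
# data to pandas (config name is for Spark 3.x; older versions used
# spark.sql.execution.arrow.enabled).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000).withColumnRenamed("id", "user_id")

# With Arrow enabled, toPandas() transfers columnar batches instead of
# serializing rows one by one across the JVM/Python boundary.
pdf = df.toPandas()
print(len(pdf))
```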

The image that became clear in my mind is how Arrow aims to be an in-memory intermediary between systems, the same way a lot of folks use S3 (for lack of a better option) for that purpose.
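To make that picture a bit more tangible, here's a hedged PyArrow sketch of a table round-tripping through Arrow's IPC stream format, the kind of in-memory/wire handoff that might otherwise mean staging intermediate files in S3 (the column names and values are invented).

```python
import pyarrow as pa

# Build an Arrow table in the standard columnar memory layout.
table = pa.table({"user_id": [1, 2, 3], "spend": [9.99, 0.0, 42.5]})

# Ship it over Arrow's IPC stream format: the bytes on the wire mirror the
# in-memory layout, so there is no per-row serialize/deserialize step to pay.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

buf = sink.getvalue()

# Any Arrow-aware system (Spark, pandas, Dremio, ...) can read these bytes
# back into the same columnar structure.
reader = pa.ipc.open_stream(buf)
roundtripped = reader.read_all()

assert roundtripped.equals(table)
```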

Talk #3: Functional Data Engineering: A Set of Best Practices

Maxime Beauchemin, CEO and Founder, Preset

Lastly, Maxime gave an interesting talk on how functional programming principles can be applied to the data engineering discipline to create reliable data pipelines.

The three principles are:
1. Pure Functions — Same input = Same Output
2. Immutability — Never changing the value of variables once assigned
3. Idempotency — The ability to repeat an operation without changing the result

Taken in a data engineering context, Beauchemin recommends writing ETL tasks that are “pure”: given the same input data, they will output the same data partition, which you can INSERT OVERWRITE into your data lake (an idempotent operation, compared to a mutable UPSERT).
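Here's a hedged PySpark sketch of what such a “pure,” idempotent daily task might look like; the table paths and column names are invented, and dynamic partition overwrite is just one way to get INSERT OVERWRITE semantics.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pure-etl").getOrCreate()

# Only replace the partitions this job actually writes, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

def build_daily_orders(ds: str):
    """Pure-ish task: the same input date always yields the same partition."""
    raw = (
        spark.read.parquet("s3://my-data-lake/raw/orders/")  # hypothetical path
        .where(F.col("order_date") == ds)
    )
    daily = raw.groupBy("order_date", "customer_id").agg(
        F.sum("amount").alias("total_spend")
    )
    # Overwriting the partition makes re-runs idempotent: running the task
    # twice for the same date leaves the table in the same state.
    (
        daily.write.mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3://my-data-lake/marts/daily_orders/")  # hypothetical path
    )

build_daily_orders("2020-09-15")
```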

Without going into too much detail, I'm inspired to leverage these concepts to add structure to the way I think about my ETL tasks, so each one becomes a well-defined operation rather than a tangled mess of logic whose output I barely understand.

Thank you for reading, and look out for a recap of the equally interesting Future Data Conference next week!
