Subsurface Day 1 Takeaways

Matt Weingarten
3 min read · Mar 4, 2023
A narwhal as a mascot? Now you’ve got my attention

Introduction

Dremio hosted their wonderful Subsurface Conference earlier this week, so I’m just now making my way through all the on-demand sessions (shoutout to all the conferences that do this). It’ll certainly be difficult to summarize considering how many informative sessions there were (especially since we’re currently doing a lot of research on Iceberg), but I’ll try my best.

Migrating To Iceberg

Check out this talk for a good overview of Iceberg migration.

I assume we’re all familiar with the classic Hive table format that you see in many big data applications these days. As useful as these tables can be, they still have their cons, such as inefficiencies with smaller updates, long wait times for directory listings, and a lack of schema evolution. By migrating to a newer format like Iceberg, you can take advantage of partition evolution (no need to rewrite the table when the partitioning scheme changes) and hidden partitioning (no need to know a table’s physical layout to write performant queries).
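
To make those two features concrete, here’s a rough sketch in PySpark of what they look like with Iceberg’s Spark SQL extensions. The catalog and table names are placeholders, and it assumes the Iceberg Spark runtime jar is on the classpath:

```python
# Rough sketch: hidden partitioning and partition evolution with Iceberg.
# "demo" and "db.events" are placeholder names.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/warehouse")
    .getOrCreate()
)

# Hidden partitioning: the table is partitioned by a transform of ts, so a
# query filtering on ts gets partition pruning without the user ever
# referencing a partition column.
spark.sql("""
    CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Partition evolution: change the layout for future writes, with no rewrite
# of the existing data.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD bucket(16, id)")
```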

The actual migration itself isn’t even that difficult. An in-place migration reuses a table’s existing data files (e.g., Parquet), creating Iceberg metadata on top of them to build a new Iceberg table or update an existing one, without rewriting any data. A shadow migration rewrites all the data into a new Iceberg table using a CTAS statement, which allows optimizations to be applied during the migration effort.
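
Here’s roughly what both paths look like using Iceberg’s built-in Spark procedures, reusing the session from the sketch above (table names are again placeholders):

```python
# Rough sketch of both migration paths via Iceberg's Spark procedures.
# "db.hive_logs" is a placeholder Hive (Parquet) table.

# In-place: snapshot() builds a trial Iceberg table over the existing files,
# leaving the source table untouched; migrate() replaces the Hive table with
# an Iceberg one that reuses the same Parquet files.
spark.sql("CALL demo.system.snapshot('db.hive_logs', 'db.hive_logs_test')")
spark.sql("CALL demo.system.migrate('db.hive_logs')")

# Shadow: rewrite everything into a new Iceberg table with CTAS, applying
# optimizations (partitioning, sorting) along the way.
spark.sql("""
    CREATE TABLE demo.db.logs_iceberg
    USING iceberg
    PARTITIONED BY (days(event_ts))
    AS SELECT * FROM db.hive_logs
""")
```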

Our processing, which deals with a scale of around a billion records per day, is definitely struggling with some of these Hive drawbacks, and that’s only going to get worse with more data. Migration will be necessary at some point in the future (trust me, we have it on the roadmap), so seeing that it can be done easily with Iceberg is an encouraging sign.

Product Analytics

Check out this talk for a good overview of product analytics.

Product analytics can be defined as the process of gathering and transforming user-level data into insights that reveal how customers interact with products and services. Wayfair has done a really nice job of turning guiding principles, such as building analytic maturity in product teams and owning metrics through self-service dashboarding and analysis tools, into a standardized product analytics pipeline. This pipeline handles the ingestion, collation, curation, presentation, and visualization of data end-to-end.

The push for self-service BI tooling is the new focus of data-driven organizations, and I’m all here for it. This will in turn lead to an increased understanding of the data across the entire team, which is certainly a good thing. Processes like the ones Wayfair is putting in place are a great way to automate that work and make it more streamlined.

Data As Code

Check out this talk for a good overview of DaC.

I’m a very strong advocate of infrastructure as code (IaC), so could DaC, or data as code, be the next big thing? Having it in place ensures isolation (experimenting with data without impacting other users), version control (recovering from any point in time), and governance, which is key. Tools like lakeFS are geared towards making this more of a reality.
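
To make that concrete, here’s a minimal sketch of the branch-and-commit loop using the lakeFS Python SDK (lakefs-client); the endpoint, credentials, and repository name are all placeholders:

```python
# Minimal sketch of the data-as-code loop with the lakefs-client SDK.
# Endpoint, credentials, and repo name are placeholders.
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

conf = lakefs_client.Configuration(host="http://localhost:8000/api/v1")
conf.username = "ACCESS_KEY_ID"      # placeholder credentials
conf.password = "SECRET_ACCESS_KEY"
client = LakeFSClient(conf)

# Isolation: branch the lake and experiment without touching main.
client.branches.create_branch(
    repository="analytics",
    branch_creation=models.BranchCreation(name="experiment", source="main"),
)

# Version control: commit the changes so this exact state of the data is
# recoverable later.
client.commits.commit(
    repository="analytics",
    branch="experiment",
    commit_creation=models.CommitCreation(message="test new dedup logic"),
)
```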

Data is only continuing to grow, and having proper controls in place is going to be the only way to ensure that data is handled properly. Hopefully, DaC can experience growth in the future like IaC has in the last decade. For those who are still skeptical, I recommend checking out this blog for more on the importance of DaC.

Conclusion

So far, I have really enjoyed the sessions that Dremio has put together for Subsurface. I’ll be posting my day 2 takeaways soon, so stay tuned.
