The 3 Most Interesting Ideas From the Future Data Conference
We’re talking Snowflake vs AWS, Data Lakehouses, and Data’s Disempowerment of Decision Makers.
Idea #1: Snowflake and Databricks’ Number 1 Competitor is… AWS
The dynamic between AWS and companies like Databricks and Snowflake is one I’ve wondered about before. It’s hard not to, since AWS has its own products that compete directly with both Databricks (EMR) and Snowflake (Redshift), while at the same time Databricks and Snowflake are both built largely atop the AWS cloud platform. Surely there must be tension there, made even more palpable by the recent stunning Snowflake IPO.
It was nice to hear Ben Horowitz, one of the most successful VC investors, give his thoughts here. Essentially, data warehouses and distributed compute environments are extremely difficult products to build, and it would be naive to assume that AWS’s expertise in building scalable and reliable cloud services automatically transfers to building more feature-rich, user-friendly SaaS products (even ones built using the AWS cloud).
Historically there has always been room above a platform (think Oracle servers atop the Microsoft Windows operating system) for independent products to exist, and the cloud platform is no exception.
Ben explained that he looks for three specific criteria when evaluating products that may compete with AWS. He asks whether there is:
1. A big enough category for the product, with enough depth to it.
2. A strategic reason for a user to want the product to be independent of AWS.
3. A company building it with great leadership, engineering teams, and the ability to iterate quickly.
When all three are present, the result is the biggest software IPO of all time.
Idea #2: Data Warehouse → Data Lake → Data… Lakehouse?
Speaking of Databricks, they are at the forefront of a movement to transform the modern data stack from a data warehouse-centric model to a data lake-centric one. Matei Zaharia, co-founder & CTO of Databricks, gave a compelling explanation of this vision and the problems it aims to solve.
The current analyst-empowering model of dbt-orchestrated SQL transformations inside a warehouse has been wildly successful in allowing data teams to make use of their data to better understand their business.
Where this solution falls short is in supporting consumers of data with greater performance or latency needs than a BI report or daily dashboard. A data science model that makes predictions from user data, for example, would need to either duplicate ETL logic performed inside the data warehouse or inefficiently source from periodic exports of warehouse tables.
To get around this problem, Databricks is building features to enrich lake-centric solutions. The promise is to bring the usability of data lakes up to match the upside of warehouses, without the cost and proprietary downsides. In this new model, the warehouse moves from the star of the analytics show to simply another consumer of the data lake.
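The shared-consumer idea can be sketched in miniature. Below is a hypothetical, stdlib-only toy (the table, fields, and `clean` function are all illustrative, not Databricks APIs): one cleaned table in the lake serves both a BI-style aggregate and an ML-style feature extractor, rather than the ML pipeline re-implementing ETL logic or reading periodic warehouse exports.

```python
# Toy "lake table" of user events (all names and fields are illustrative).
events = [
    {"user": "a", "action": "click", "value": 3},
    {"user": "a", "action": "buy",   "value": 40},
    {"user": "b", "action": "click", "value": 1},
]

def clean(rows):
    """Shared transformation logic, applied once at the lake layer."""
    return [r for r in rows if r["value"] > 0]

table = clean(events)  # single source of truth for every consumer

# Consumer 1: a BI-style aggregate, the kind a warehouse or dashboard serves.
revenue = sum(r["value"] for r in table if r["action"] == "buy")

# Consumer 2: ML-style per-user features, read from the very same table,
# with no duplicated cleaning logic and no stale export in between.
features = {}
for r in table:
    features[r["user"]] = features.get(r["user"], 0) + 1

print(revenue)   # 40
print(features)  # {'a': 2, 'b': 1}
```

The point of the sketch is the shape, not the code: both consumers sit downstream of one transformation, which is the role the lakehouse model assigns to the data lake.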
Are the benefits of this architecture worth the cost of adopting it? I think the usability and performance of the products Databricks is building will be the ultimate determinant of that.
Idea #3: The Modern Data Stack Disempowers Decision Makers
When you step back and think about it, the purpose of what we do is to help decision makers make better decisions with data. Of course, when data volumes are large and latency requirements are real, this is not so simple, which is why data specialists using specialized tools to analyze data has become “a thing”.
For all the effort companies have invested in analytic capabilities, it’s not clear the impact on the majority of decision makers has been positive. Many now have to navigate a data team or a Looker dashboard instead of receiving a weekly aggregated dump of data in Excel to toy around with, a workflow that was fragile and inefficient, but often effective enough.
This is one of the thoughts Tristan Handy offers in his forward-thinking talk on The Modern Data Stack: Past, Present, and Future. He argues that many people are now cut off from data they once had access to, and that new tools should be developed to better democratize data within companies (he prognosticates the return of the spreadsheet).
More generally, I think it is important for people in data roles (especially at larger companies) not to get too lost in the day-to-day grind of keeping ETL jobs running and report numbers accurate, and to maintain a constant awareness of what decision making the data they provide is actually improving.
If you have this sort of visibility in your role, appreciate and learn from it. If you do not, stay genuinely curious and keep asking those types of holistic questions until you do.
To view all the talks at this year’s Future Data Conference, click here. Thank you for reading! Have a nice rest of your day.