Published in Whispering Data

Thoughts on the Future of the Databricks Ecosystem

Databricks has come a long way since growing out of UC Berkeley's AMPLab in 2013 with an open-source distributed computing framework called Spark.

Fast forward eight years and, in addition to the core Spark product, there are a dizzying number of new features in various stages of public preview within the Databricks platform. In case you haven’t been keeping up, these include:

Taken together, these features provide functionality comparable to the set of tools commonly referred to as the “Modern Data Stack”.

The result is a noticeably consolidated data stack, almost entirely contained within the Databricks ecosystem.

Some people cheer for this type of consolidation, tired of spending time fitting together pieces of an analytics puzzle that don’t necessarily want to get along. Others believe an unbundled architecture is preferable, allowing users to mix and match tools specialized for a specific purpose.

In truth, there’s no clear answer as to who is right. It depends largely on the execution of the different companies competing in the space. For its part, lakeFS is largely agnostic in this battle, as it fits at a foundational level with nearly any stack.

Given its positioning, Databricks sees value in growing the data lake ecosystem, which includes lakeFS. Consequently, we’ve started to collaborate more closely with members of the Databricks team, on both content and product.

Data + AI Online Meetup Recap

One of the first outcomes of this collaboration is a joint meetup presentation by Denny Lee and me.

The Topic: Multi-Transactional Guarantees with Delta Lake and lakeFS.

The Key Takeaway: The version-controlled workflows enabled by lakeFS allow you to expose new data from multiple datasets in one atomic merge operation. This prevents a consumer of the data from seeing an inconsistent view, which can lead to incorrect metrics.
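
To make the takeaway concrete, here is a minimal sketch of the idea in plain Python. It is a toy in-memory model, not the lakeFS API: the `ToyRepo` class, the branch name `etl-run`, and the dataset names are all hypothetical, and the merge shown is a simple fast-forward. The point it illustrates is that a reader of `main` sees either the old version of every dataset or the new version of every dataset, never a mix.

```python
from types import MappingProxyType


class ToyRepo:
    """Toy in-memory model of branches. This is NOT the lakeFS API."""

    def __init__(self, initial):
        # Each branch points at one read-only snapshot: {dataset: version}.
        self._branches = {"main": MappingProxyType(dict(initial))}

    def create_branch(self, name, source="main"):
        # A new branch starts at the same snapshot as its source.
        self._branches[name] = self._branches[source]

    def write(self, branch, dataset, version):
        # A write builds a new snapshot for this branch only.
        updated = dict(self._branches[branch])
        updated[dataset] = version
        self._branches[branch] = MappingProxyType(updated)

    def read(self, branch):
        return dict(self._branches[branch])

    def merge(self, source, target="main", checks=()):
        candidate = self._branches[source]
        # Optional validation before the merge; any failure aborts it.
        for check in checks:
            if not check(candidate):
                raise ValueError("merge rejected by validation check")
        # One reference swap exposes every dataset change at once
        # (fast-forward only; a real merge also handles diverged branches).
        self._branches[target] = candidate


repo = ToyRepo({"orders": "v1", "customers": "v1"})
repo.create_branch("etl-run")
repo.write("etl-run", "orders", "v2")

# Mid-pipeline, consumers of main still see a consistent v1 view.
before = repo.read("main")

repo.write("etl-run", "customers", "v2")
repo.merge("etl-run")

# After the merge, both datasets move to v2 together.
after = repo.read("main")
```

Without the branch, a consumer reading between the two writes would have joined `orders` at v2 against `customers` at v1, which is exactly the inconsistent view described above.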

After showing how to configure a Spark cluster to read from and write to a lakeFS repo, I hopped into a demo of running a data validation check with a Databricks Job and a lakeFS pre-merge hook.
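
For readers who want a feel for the configuration step, the sketch below shows one common approach: pointing Spark's Hadoop S3A client at the S3-compatible endpoint that lakeFS exposes, then addressing data as `s3a://<repo>/<branch>/<path>`. This is an assumption about the setup, not a transcript of what the talk used. The helper functions, the endpoint URL, the repo name, and the credential placeholders are all hypothetical.

```python
def lakefs_s3a_conf(endpoint, access_key, secret_key):
    """Build Spark settings that route s3a:// traffic to a lakeFS endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style access keeps the repo name in the path, not the hostname.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def lakefs_uri(repo, branch, path):
    """Address a path on a branch: the repo acts as the bucket name."""
    return f"s3a://{repo}/{branch}/{path.lstrip('/')}"


# Hypothetical values for illustration only.
conf = lakefs_s3a_conf(
    "https://lakefs.example.com", "<access-key-id>", "<secret-access-key>"
)
uri = lakefs_uri("example-repo", "main", "tables/events")

# With pyspark installed, the settings could be applied like this:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
#   df = spark.read.parquet(uri)
```

Because the branch is part of the path, switching a job from `main` to an isolated branch is a one-string change, which is what makes validating data on a branch before merging practical.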

Check out the full talk below!

Interested in learning more?

Originally published on the lakeFS blog.




Paul Singman

DevRel @lakeFS. Ex-ML Engineering Lead @Equinox. Whisperer of data and productivity wisdom. Standing on the shoulders of giants.
