Leena Bejoy
Brillio Data Science
2 min read · Jun 27, 2019


Introducing Delta Lake — On Data Lake

Databricks Delta Lake runs on top of your existing Data Lake and provides ACID transactions, scalable metadata handling, schema enforcement, Time Travel (data versioning), and unified streaming and batch data processing.

To create a Delta Lake table, you can use your existing Spark code and simply change the format to delta:

dataframe
  .write
  .format("delta")
  .save("/data")
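Reading the table back uses the same DataFrame API; a minimal sketch, assuming the /data path written above and a Spark session with Delta available:

val df = spark.read
  .format("delta")
  .load("/data")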

Alternatively, you can create a Delta table directly in SQL:

CREATE TABLE events
USING delta
AS SELECT *
FROM json.`/data/events/`
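As a quick illustration of the schema enforcement mentioned above, Delta is expected to reject a write whose schema does not match the table's. This is a hedged sketch, not the only failure mode; the column names are hypothetical and it assumes the /data table written earlier:

import org.apache.spark.sql.AnalysisException
import spark.implicits._

// Hypothetical columns that do not match the existing table's schema
val badDf = Seq(("oops", 42)).toDF("unexpected_col", "another_col")

try {
  badDf.write.format("delta").mode("append").save("/data")
} catch {
  // Delta rejects the mismatched append instead of silently corrupting the table
  case e: AnalysisException => println(s"Write rejected: ${e.getMessage}")
}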

One of Delta's most distinctive characteristics is its time travel capability. Built on top of Apache Spark, Delta automatically versions the big data you store in your Data Lake, and you can access any historical version of that data. This temporal data management simplifies your data pipeline by making it easy to audit changes, roll back data after accidental bad writes or deletes, and reproduce experiments and reports.

As you write into a Delta table or directory, every operation is automatically versioned. You can access the different versions of the data in two different ways:

  1. Using a timestamp

Scala syntax:

You can provide a timestamp or date string as an option to the DataFrame reader:

val df = spark.read
  .format("delta")
  .option("timestampAsOf", "2019-01-01")
  .load("/path/to/my/table")

SQL syntax:

SELECT count(*) FROM my_table TIMESTAMP AS OF '2019-01-01'
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF '2019-01-01 01:30:00.000'
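If you are not sure which timestamps are valid, Delta keeps an audit log of every operation on a table. A hedged sketch using the DESCRIBE HISTORY command from the Scala shell, assuming the same path as above:

// Show the table's history: version, timestamp, operation, and more
spark.sql("DESCRIBE HISTORY delta.`/path/to/my/table`").show(false)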

  2. Using a version number

In Delta, every write has a version number, and you can use the version number to travel back in time as well.

Scala syntax:

val df = spark.read
  .format("delta")
  .option("versionAsOf", "5238")
  .load("/path/to/my/table")

// Equivalently, embed the version in the path with the @ syntax
val df = spark.read
  .format("delta")
  .load("/path/to/my/table@v5238")

SQL syntax:

SELECT count(*) FROM my_table VERSION AS OF 5238
SELECT count(*) FROM my_table@v5238
SELECT count(*) FROM delta.`/path/to/my/table@v5238`
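Because each version is a consistent snapshot, you can also compare versions directly. A minimal sketch, assuming version 5237 immediately precedes the 5238 used above:

val v1 = spark.read.format("delta").option("versionAsOf", "5237").load("/path/to/my/table")
val v2 = spark.read.format("delta").option("versionAsOf", "5238").load("/path/to/my/table")

// Rows present in version 5238 but not in version 5237
val addedRows = v2.except(v1)
addedRows.show()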

Delta Lake time travel allows you to query an older snapshot of a Delta Lake table. Time travel has many use cases, including:

  • Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This could be useful for debugging or auditing, especially in regulated industries.
  • Writing complex temporal queries.
  • Fixing mistakes in your data (see the rollback sketch after this list).
  • Providing snapshot isolation for a set of queries on fast-changing tables.
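For the rollback use case, one way to undo a bad write is to overwrite the table with an earlier snapshot. A minimal sketch, not the only approach; the version number is hypothetical:

// Read the last known-good snapshot (version number is hypothetical)
val goodState = spark.read
  .format("delta")
  .option("versionAsOf", "5237")
  .load("/path/to/my/table")

// Overwrite the current state of the table with that snapshot
goodState.write
  .format("delta")
  .mode("overwrite")
  .save("/path/to/my/table")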

Well, it's time to go for Delta on your Data Lake to make your analytics way simpler.
