Welcome back to the second part of the series on data versioning in Delta Lake.
In the first part of the series, we learned how to create, read, and update a Delta table, and how to modify its schema.
What is Time Travel in Delta Lake?
Delta Lake provides snapshots of data, enabling developers to access and revert to earlier versions for audits, rollbacks, or reproducing experiments.
Delta automatically versions the big data that you store in your data lake, and you can access any historical version of that data. This temporal data management simplifies your data pipeline by making it easy to audit, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports. Your organization can finally standardize on a clean, centralized, versioned big data repository in your own cloud storage for your analytics.
Getting started with Time Travel in Delta Lake:
First, I load a CSV file with three records into a Delta table.
Next, I append a few more records to the table using append mode.
As you write into a Delta table or directory, every operation is automatically versioned. You can access the different versions of the data in two ways:
1. Using a version number
In Delta, every write produces a new version number, which you can use to travel back in time.
2. Using a timestamp
You can provide a timestamp or date string as an option to read the table as of that point in time:
Audit data changes:
Auditing data changes is critical, both for data compliance and for simple debugging to understand how data has changed over time. Organizations moving from traditional data systems to big data technologies and the cloud often struggle in such scenarios.
Using Delta, we can see the audit log of each table:
Time travel in Delta improves developer productivity. It helps:
- Data engineers simplify their pipelines and roll back bad writes
- Data analysts report on historical data easily
In the next part of this Delta Lake series, I will explain partitioned data lakes with examples.
Thanks for reading!
See you soon :)