Managing stale data in the cloud
The Big Data Platform team at Hotels.com builds products to support the needs of analysts, data engineers and data scientists across the business. We know that maintaining healthy datasets is a fundamental aspect of any data lake, and an important part of this is managing data that is no longer used.
Many of our data teams utilise Apache Hive for accessing data. Hive is a tool that allows analysts to query data using SQL, and it also abstracts users away from where the actual data is stored (e.g. S3, HDFS, etc.). Because the data's layout is managed separately from its metadata, data can exist in a file system yet be inaccessible via a Hive query: dropping a partition on an external Hive table, for example, removes the partition's metadata but leaves the underlying files in place.
Datasets left unreferenced by Hive in this way become stale and unused. We've found that such orphaned datasets not only create uncertainty for users, but also incur needless storage costs.
Keeping data fresh
To address the issue of stale data, we developed a service called Beekeeper.
The original inspiration for a data deletion tool came from another of our open source projects called Circus Train. At a high level, Circus Train replicates Hive datasets. The datasets are copied as immutable snapshots to ensure strong consistency and snapshot isolation, only pointing the replicated Hive Metastore to the new snapshot on successful completion. This process leaves behind snapshots of data which are now unreferenced by the Hive Metastore, so Circus Train includes a Housekeeping module to delete these files later.
Circus Train’s Housekeeping module has been rewritten and updated to create a new project called Beekeeper, though the concept remains the same: Beekeeper is a standalone service for deleting stale data that is no longer referenced by the Hive Metastore.
How does it work?
Beekeeper comprises two separate Spring-based Java applications. One application schedules paths for deletion in a shared database, and the other performs deletions — simple really!
A deeper dive
Beekeeper makes use of Apiary — an open source federated cloud data lake, built by a collaboration of different teams at Expedia Group — to detect changes in the Hive Metastore. One of Apiary’s components, the Apiary Metastore Listener, captures Hive events and publishes these as messages to an SNS topic.
The first Beekeeper application, Path Scheduler, consumes the messages emitted to this topic via an SQS queue. Path Scheduler parses the Hive events in these messages and adds the data locations (the paths) to a database. It does so for drop table/partition events, and for alter table/partition events that are more than a metadata update, i.e. where the table or partition location has changed.
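The scheduling decision above can be sketched as follows. This is a minimal illustration with hypothetical names, not Beekeeper's actual code: drop events always leave data unreferenced, while alter events only do so when the location has actually changed.

```java
public class SchedulingRule {

    public enum EventType { DROP_TABLE, DROP_PARTITION, ALTER_TABLE, ALTER_PARTITION }

    /** Returns true if the old path should be scheduled for deletion. */
    public static boolean shouldSchedule(EventType type, String oldLocation, String newLocation) {
        switch (type) {
            case DROP_TABLE:
            case DROP_PARTITION:
                return true;  // dropped data is always left unreferenced
            case ALTER_TABLE:
            case ALTER_PARTITION:
                // a metadata-only update keeps the same location, so skip it
                return oldLocation != null && !oldLocation.equals(newLocation);
            default:
                return false;
        }
    }
}
```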
The second Beekeeper application, Cleanup, is a job which runs periodically. It queries the database for paths which are ready for deletion, and subsequently performs the delete.
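Conceptually, each cleanup run is a filter over the scheduled rows whose retention period has elapsed. A simplified sketch, with an in-memory list standing in for the shared database and hypothetical field names:

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class Cleanup {

    /** A scheduled row, roughly as Path Scheduler might store it. */
    public record ScheduledPath(String path, Instant deletionDue) {}

    /** Returns the paths whose scheduled deletion time has passed. */
    public static List<String> readyForDeletion(List<ScheduledPath> rows, Instant now) {
        return rows.stream()
                .filter(row -> !row.deletionDue().isAfter(now))  // due now or earlier
                .map(ScheduledPath::path)
                .collect(Collectors.toList());
    }
}
```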
To switch Beekeeper on, all a Hive table needs is a table parameter:
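A snippet along these lines enables it (property name as per the Beekeeper documentation; `db.table` is a placeholder):

```sql
-- Opt a table in to unreferenced data deletion
ALTER TABLE db.table SET TBLPROPERTIES ("beekeeper.remove.unreferenced.data" = "true");
```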
If this parameter is not present, events will be ignored by Path Scheduler and no data will be scheduled for deletion.
By default, Beekeeper will wait for three days before deleting the data at a path. This period can be changed by setting another parameter on a Hive table. For example, to configure Beekeeper to wait 7 days before performing a delete:
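A snippet along these lines sets the retention period (property name as per the Beekeeper documentation; the value is an ISO-8601 duration, and `db.table` is a placeholder):

```sql
-- Retain unreferenced data for 7 days before deletion
ALTER TABLE db.table SET TBLPROPERTIES ("beekeeper.unreferenced.data.retention.period" = "P7D");
```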
Deploying Beekeeper is simple. The applications run in their own Docker containers, and so can be deployed anywhere Docker runs; we are currently using ECS Fargate. You will also need the Apiary Metastore Listener installed in your data lake so that Path Scheduler can consume the events emitted by your Hive Metastore.
If you would like a closer look, we have a separate Git repository containing Beekeeper’s Terraform module, with everything needed to spin up Beekeeper on ECS.
Why is Beekeeper important for your data lake?
For Hive users, Beekeeper takes care of deleting orphaned data, a task you may previously have neglected, or handled with custom solutions that you had to build and maintain yourself. For example, in the past we’ve seen teams using snapshot isolation store orphaned locations in a text file and use a cron job to delete them on a predefined schedule.
For Circus Train users, Beekeeper removes all the fuss of managing the Housekeeping database to keep track of your stale snapshots.
We think Beekeeper is a more scalable solution: simply add a parameter to a Hive table and let Beekeeper handle the rest, freeing teams to spend their time on data processing logic and new features. Beekeeper is a data deletion tool built for cross-team usage, and by centralising this function we hope to reduce costs through improved efficiency.
In future iterations of Beekeeper, we’re hoping to introduce new features which are summarised here:
REST API: We want to allow users to interact with the service more easily: to schedule paths for deletion, remove a path from the deletion schedule, see which deletions are coming up, check the status of a deletion, update paths, and more. We hope that this will open up usage of the service beyond Hive users.
HDFS Support: Beekeeper currently only supports S3; however, we have architected it so that support for more filesystems can be added. In the near future we hope to add support for HDFS.
S3 Optimisations: We are looking at ways to integrate more deeply with S3 features. For example, instead of deleting data, Beekeeper could move it to a cheaper storage tier. This could be useful for stale data which shouldn’t be deleted, for example because it is legally required to be kept.
Kubernetes: Finally, we would like Beekeeper to be cloud agnostic, and so are looking at building support for Kubernetes.
Interested in using Beekeeper?
Once Apiary and Beekeeper are installed in your data lake, managing orphaned data is as easy as adding a parameter to the Hive tables you would like Beekeeper to manage. Please take a look at the Beekeeper repo for more information on getting started.
Thanks for reading!