Introducing Beekeeper

Managing stale data in the cloud

Max Jacobs
Oct 22, 2019

The Big Data Platform team at Hotels.com builds products to support the needs of analysts, data engineers and data scientists across the business. We know that maintaining healthy datasets is a fundamental aspect of any data lake, and an important part of this is managing data that is no longer used.

Many of our data teams use Apache Hive to access data. Hive allows analysts to query data using SQL, and abstracts users away from where the data is actually stored (e.g. S3, HDFS, etc.). Because the underlying files are managed separately from the metadata, data can exist in the file system yet be inaccessible via a Hive query; for example, dropping a partition from an external Hive table removes the partition's metadata but leaves the underlying files in place.
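As a concrete sketch (the table name and S3 location here are hypothetical), this is how a partition's files end up orphaned:

-- Hypothetical external table whose data lives in S3
CREATE EXTERNAL TABLE my_db.page_views (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (event_date STRING)
LOCATION 's3://my-bucket/page_views/';

-- Dropping the partition removes it from the Hive Metastore only;
-- the files under s3://my-bucket/page_views/event_date=2019-10-01/
-- remain in S3, unreachable by any Hive query.
ALTER TABLE my_db.page_views DROP PARTITION (event_date = '2019-10-01');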

Datasets left unreferenced by Hive in this way become outdated and unused. We've found that these orphaned datasets not only add uncertainty for users, but also incur needless storage costs.

Keeping data fresh

The original inspiration for a data deletion tool came from another of our open source projects called Circus Train. At a high level, Circus Train replicates Hive datasets. The datasets are copied as immutable snapshots to ensure strong consistency and snapshot isolation, only pointing the replicated Hive Metastore to the new snapshot on successful completion. This process leaves behind snapshots of data which are now unreferenced by the Hive Metastore, so Circus Train includes a Housekeeping module to delete these files later.

Circus Train's Housekeeping module has been rewritten and updated to create a new project called Beekeeper, though the concept remains the same: Beekeeper is a standalone service for deleting stale data that is no longer referenced by the Hive Metastore.

How does it work?

A deeper dive

[Diagram: Beekeeper architecture]

Hive Metastore events, such as a table or partition being dropped or altered, are published to an SNS topic by a metastore event listener. The first Beekeeper application, Path Scheduler, consumes these messages from the topic via an SQS queue. Path Scheduler parses the Hive events and adds the data locations (the paths) to a database. It does this for drop table/partition events, and for alter table/partition events which are more than just a metadata update, i.e. where the table or partition location has changed.
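For example, either of the following statements (again with hypothetical names) would leave a path unreferenced and cause Path Scheduler to record it:

-- A drop event: the table's data location becomes unreferenced.
DROP TABLE my_db.page_views;

-- An alter event that changes the location (not just metadata):
-- the old partition directory becomes unreferenced.
ALTER TABLE my_db.page_views PARTITION (event_date = '2019-10-01')
SET LOCATION 's3://my-bucket/page_views_v2/event_date=2019-10-01/';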

The second Beekeeper application, Cleanup, is a job which runs periodically. It queries the database for paths which are ready for deletion, and subsequently performs the delete.
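Conceptually, the query it runs looks something like the following sketch; the table and column names are illustrative rather than Beekeeper's actual schema:

-- Illustrative only: find paths whose retention period has elapsed.
SELECT path
FROM housekeeping_path
WHERE housekeeping_status = 'SCHEDULED'
  AND cleanup_timestamp <= CURRENT_TIMESTAMP;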

To switch Beekeeper on, all a Hive table needs is a table parameter:

beekeeper.remove.unreferenced.data=true

If this parameter is not present, events will be ignored by Path Scheduler and no data will be scheduled for deletion.
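Table parameters can be set when a table is created or added to an existing table; for example, to enable Beekeeper on a hypothetical existing table:

-- Enable Beekeeper for this table (table name is hypothetical)
ALTER TABLE my_db.page_views
SET TBLPROPERTIES ('beekeeper.remove.unreferenced.data' = 'true');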

By default, Beekeeper waits three days before deleting the data at a path. This period can be changed by setting another parameter on the Hive table. For example, to configure Beekeeper to wait seven days before performing a deletion:

beekeeper.unreferenced.data.retention.period=P7D
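The value is an ISO-8601 duration. To apply the seven-day period above to an existing table:

-- Retain unreferenced data for 7 days before deletion (hypothetical table)
ALTER TABLE my_db.page_views
SET TBLPROPERTIES ('beekeeper.unreferenced.data.retention.period' = 'P7D');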

Deployment

Beekeeper's Terraform module lives in a separate Git repository, if you would like to take a closer look. That repository contains all of the Terraform needed to spin up Beekeeper on ECS.

Why is Beekeeper important for your data lake?

For Circus Train users, Beekeeper removes all the fuss of managing the Housekeeping database to keep track of your stale snapshots.

We also think Beekeeper is a more scalable solution: teams simply add a parameter to a Hive table and let Beekeeper handle the rest, freeing them to focus on their data processing logic and on building features. Beekeeper is a data deletion tool designed for cross-team usage, and by centralising this function we hope to reduce costs through improved efficiency.

Roadmap

REST API: We want to allow users to interact with the service more easily: to schedule paths for deletion, remove a path that is scheduled for deletion, see which deletions are coming up, check the status of a deletion, update paths, and more. We hope this will open up usage of the service beyond Hive users.

HDFS Support: Beekeeper currently only supports S3; however, we have architected it so that support for other filesystems can be added. In the near future we hope to add support for HDFS.

S3 Optimisations: We are looking at ways to integrate more deeply with S3 features. For example, instead of deleting data, Beekeeper could move it to a cheaper storage tier. This could be useful for stale data which shouldn't be deleted, for example data that is legally required to be kept.

Kubernetes: Finally, we would like Beekeeper to be cloud agnostic, so we are looking at building support for Kubernetes.

Interested in using Beekeeper?

Thanks for reading!
