EXPEDIA GROUP TECHNOLOGY — DATA

Introducing Beekeeper Time-To-Live (TTL)

Automate clean up of temporary Hive tables in the data lake

Vedant Chokshi
Expedia Group Technology


A beekeeper tending to a beehive box
Photo by Annie Spratt on Unsplash

In 2019, we announced Beekeeper, our open-source automated data cleanup service. At Expedia Group™, we use Beekeeper to delete unreferenced data snapshots left behind by data processing tools that follow the snapshot isolation pattern.

Temporary Data in a Large Data Lake

Beekeeper handles frequently restated data well, but what about temporary data? In a large data lake, many tables are created for temporary analytical use cases and are only needed for a short period.

Since Beekeeper only cleans up orphaned paths, users need to remember to drop these tables manually after use so that the paths become unreferenced and can be picked up by Beekeeper. This process is tedious, time-consuming and easily forgotten. As a result, junk data accumulates in the data lake and adds unnecessary storage cost.

Beekeeper Time-to-Live (TTL)

To tackle this problem, we have released Beekeeper 3.1.0 with support for cleaning up temporary tables. A time-to-live (TTL) can be applied at the table level to create a “temporary” table in the data lake. Beekeeper then monitors this table and ensures that both the table metadata and the underlying data are removed once the TTL is met.

A flow diagram showing how a Hive metastore event flows through Beekeeper, with the Beekeeper metadata cleanup and Beekeeper path cleanup modules.
Beekeeper Architecture Overview

The Beekeeper scheduler application now also schedules TTL events. These events are processed by a new metadata cleanup application, which removes both the table metadata and the underlying data.

For partitioned tables, the TTL is applied to each partition at the time it is created. A partition is dropped once its TTL is met, and the table itself is only dropped when no partitions remain. This is useful for expiring data in a rolling window, e.g. keeping data for the last 30 days and deleting everything older.

For unpartitioned tables, the TTL is applied to the table, and the entire table is dropped when the TTL is met. This is useful for temporary tables, e.g. tables created for testing.
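The two rules above can be illustrated with a small Python sketch. This is not Beekeeper's implementation: the `Table` and `Partition` classes and the `clean_up` function are hypothetical names used only to model the behaviour described here. The sketch also assumes that an empty partitioned table is only dropped once its own TTL has elapsed, which the text above does not specify.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Partition:
    name: str
    created_at: datetime


@dataclass
class Table:
    name: str
    created_at: datetime
    ttl: timedelta
    partitioned: bool = False
    partitions: List[Partition] = field(default_factory=list)


def is_expired(created_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """An item expires once its creation time plus the TTL has passed."""
    return created_at + ttl <= now


def clean_up(table: Table, now: datetime) -> Tuple[List[str], bool]:
    """Return (names of partitions dropped, whether the table is dropped)."""
    if not table.partitioned:
        # Unpartitioned: the whole table goes once its TTL is met.
        return [], is_expired(table.created_at, table.ttl, now)

    # Partitioned: each partition expires independently (rolling window).
    dropped = [p.name for p in table.partitions
               if is_expired(p.created_at, table.ttl, now)]
    table.partitions = [p for p in table.partitions
                        if not is_expired(p.created_at, table.ttl, now)]

    # The table is only dropped when no partitions remain.
    # Assumption (not from the article): the table's own TTL must also be met,
    # so a newly created, still-empty table is not dropped straight away.
    table_dropped = (not table.partitions
                     and is_expired(table.created_at, table.ttl, now))
    return dropped, table_dropped
```

With a 30-day TTL, a partition created on 15 December is dropped by the end of January while one created on 20 January survives, and the table stays in place until the last partition has gone.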

How To Use It?

A user can simply create a table in the data lake and have Beekeeper monitor it by setting the following Hive table parameter:

beekeeper.remove.expired.data=true

By default, Beekeeper sets the TTL to 30 days. This period can be changed by setting another table parameter:

beekeeper.expired.data.retention.period=X

where X is an ISO 8601 duration, e.g. P7D (7 days) or PT3H (3 hours).
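As a rough illustration, the Python sketch below checks a retention period before rendering a Hive `ALTER TABLE ... SET TBLPROPERTIES` statement that sets both parameters. The helper names `parse_retention` and `ttl_statement` and the table name `my_db.tmp_table` are hypothetical. The parser only accepts the day, hour, minute and second fields of an ISO 8601 duration, which is an assumption about the subset that is useful here, not a statement of what Beekeeper itself accepts.

```python
import re
from datetime import timedelta
from typing import Optional

# Simplified ISO 8601 duration: days plus optional time part (assumption:
# weeks, months and years are not handled in this sketch).
_DURATION = re.compile(
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_retention(value: str) -> timedelta:
    """Parse a duration such as 'P7D' or 'PT3H' into a timedelta."""
    match = _DURATION.fullmatch(value)
    # Reject strings with no fields at all ("P", "PT") or a dangling "T".
    if not match or value.endswith("T") or not any(match.groupdict().values()):
        raise ValueError(f"not a supported ISO 8601 duration: {value!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)


def ttl_statement(table: str, retention: Optional[str] = None) -> str:
    """Render HiveQL that enables the TTL on `table`.

    If `retention` is omitted, only the enabling parameter is set and the
    default period of 30 days applies.
    """
    props = {"beekeeper.remove.expired.data": "true"}
    if retention is not None:
        parse_retention(retention)  # validate before emitting SQL
        props["beekeeper.expired.data.retention.period"] = retention
    rendered = ", ".join(f"'{key}'='{val}'" for key, val in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({rendered})"


if __name__ == "__main__":
    print(ttl_statement("my_db.tmp_table", "P7D"))
```

The same properties can also be supplied in the `TBLPROPERTIES` clause of a `CREATE TABLE` statement, so a table can be marked as temporary from the moment it is created.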

Interested in Using or Knowing More About Beekeeper?

To learn more about getting started with Beekeeper, read our original announcement, which takes a deep dive into the project, or see the documentation on our GitHub.

Feedback

Already using Beekeeper? We would love to hear your feedback (if it's good 😄) on this new feature, as well as any requests for other features.

Thank you for reading!
