Federating Hive with Waggle Dance

Hotels.com Engineering
The Hotels.com Technology Blog
Nov 30, 2017
Figure-Eight-Shaped Waggle Dance of the Honeybee (Apis mellifera) by Chittka L is licensed under CC BY 2.5

Hotels.com has recently contributed the Waggle Dance project to the open-source community. The unusual name is taken from a dance that bees perform for the rest of their colony, helpfully directing them to sources of food. This is pertinent because the purpose of our project is to direct Apache Hive-based data consumers to sources of data.

At its core, Waggle Dance is a request routing proxy that allows datasets to be concurrently accessed across multiple Hive deployments. It was created to tackle the dataset silos that arose as our large organization gradually migrated from monolithic on-premises big data clusters to cloud-based platforms. As our brands and teams raced to the cloud, they often set up their own Hive infrastructure to retain their agility and maintain their pace. However, the data that these groups then produced could not easily be discovered or accessed by others, potentially limiting its value.

A fundamental component of the Hive data warehousing system is the metastore, a service that is responsible for maintaining metadata for all the datasets in the warehouse. This ‘data-about-data’ typically describes dataset schemas, encodings, the file store locations of the raw data, and also optimization hints such as table and column statistics. Data consumers rely on this metadata to find, understand, and efficiently process their target dataset. The metastore has proved to be so convenient that its use is not limited to Hive, and there are mature integrations with many other popular data processing platforms (Spark, Flink, Cascading). In cloud environments the metastore provides another benefit: delivering consistent data access patterns over eventually consistent file stores. However, in our cloud of multiple Hive deployments, the monolithic and isolated nature of the metastore is extremely limiting. Our consumers need to access and analyse data maintained in multiple metastores from a single Hive instance, and Hive’s architecture does not support this.

Waggle Dance solves this by providing a unified endpoint with which you can describe, query, and join tables that may exist in multiple distinct Hive deployments. Such deployments may exist in disparate regions, accounts, or even clouds (security, network, and the laws of physics permitting). Dataset access is not limited to the Hive query engine, and should work with any Hive metastore-enabled platform. For example, we have been using it successfully with Apache Spark.
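To make the routing idea concrete, here is a minimal Python sketch of a proxy that picks a backing metastore based on the database named in a request. It is only an illustration of the general technique, not Waggle Dance's actual implementation: the `Metastore` and `RoutingProxy` classes, the database and table names, and the `thrift://` URIs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metastore:
    """Hypothetical stand-in for one Hive metastore deployment."""
    name: str
    uri: str               # illustrative address, not a real host
    tables: frozenset      # fully qualified "database.table" names it holds


class RoutingProxy:
    """Toy router: sends each lookup to the metastore that owns the database."""

    def __init__(self, metastores):
        # Build a database-name -> metastore index once, up front.
        self._by_database = {}
        for metastore in metastores:
            for qualified in metastore.tables:
                database = qualified.split(".", 1)[0]
                self._by_database[database] = metastore

    def resolve(self, qualified_table):
        """Return the metastore that should answer a request for this table."""
        database = qualified_table.split(".", 1)[0]
        if database not in self._by_database:
            raise LookupError(f"no metastore serves database {database!r}")
        return self._by_database[database]


# Two separate deployments presented behind one endpoint (names are made up).
on_prem = Metastore("on_prem", "thrift://onprem.example:9083",
                    frozenset({"bookings.orders"}))
cloud = Metastore("cloud", "thrift://cloud.example:9083",
                  frozenset({"reviews.ratings"}))
proxy = RoutingProxy([on_prem, cloud])

# A join across both tables would need metadata from both deployments.
targets = [proxy.resolve(t).name for t in ("bookings.orders", "reviews.ratings")]
```

In this sketch the consumer only ever talks to `proxy`; which deployment actually holds a table is resolved behind the scenes, which is the property that lets a single query span several metastores.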

We also use Waggle Dance to apply a simple security layer to cloud-based platforms such as Qubole, Databricks, and EMR. These currently provide no means to construct cross-platform authentication and authorization strategies. Therefore, we use a combination of Waggle Dance and network configuration to restrict writes and destructive Hive operations to specific user groups and applications.
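One way to picture the write-restriction half of this is a proxy that rejects mutating metastore calls unless the endpoint has been granted write access. The following Python sketch is an assumption-laden illustration rather than how Waggle Dance is actually configured: `AccessMode`, `check_request`, and the prefix list are hypothetical, the operation names are merely styled after metastore calls such as `get_table` and `drop_table`, and the network configuration side is not modelled at all.

```python
from enum import Enum


class AccessMode(Enum):
    """Hypothetical access levels for a proxied metastore endpoint."""
    READ_ONLY = "read_only"
    READ_WRITE = "read_write"


# Assumed naming convention: operations with these prefixes change metadata.
MUTATING_PREFIXES = ("create_", "alter_", "drop_", "add_")


def is_mutating(operation):
    """Return True if the operation name looks like a write or destructive call."""
    return operation.startswith(MUTATING_PREFIXES)


def check_request(operation, mode):
    """Pass the operation through, or refuse it on a read-only endpoint."""
    if is_mutating(operation) and mode is AccessMode.READ_ONLY:
        raise PermissionError(f"{operation} is not allowed on a read-only endpoint")
    return operation


# Reads pass on any endpoint; writes pass only where write access is granted.
allowed_read = check_request("get_table", AccessMode.READ_ONLY)
allowed_write = check_request("drop_table", AccessMode.READ_WRITE)
```

Which users and applications can reach a read-write endpoint would then be decided outside the proxy, for instance by the network configuration mentioned above.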

Please check out our project on the Hotels.com GitHub account: https://github.com/HotelsDotCom/waggle-dance.

If you find the project useful, be sure to join the Waggle Dance user group: https://groups.google.com/forum/#!forum/waggle-dance-user
