Old but gold: implementing a Hive Metastore Infrastructure

Rafael Ribaldo
Blog Técnico QuintoAndar
5 min read · Apr 9, 2021
Image by Datadog

Background

In first-generation data platforms (i.e., data warehouses), the metadata (i.e., schemas, tables, privileges, and so on) was managed by classical RDBMSs. At the beginning of the second generation of data platforms (i.e., data lakes), we did not have a way to manage the metadata of data stored in distributed filesystems (e.g., HDFS). Moreover, such data was written to and retrieved from the distributed filesystem using Java (see details in Hadoop: The Definitive Guide), which is not very accessible considering the broad variety of users who need to get value from data (e.g., data analysts, managers, etc.).

Given the need to analyze huge amounts of data stored in HDFS in an accessible way (i.e., using a SQL-like language), Facebook developers created the well-known Hive project, which was first released in 2010.

Old but gold

Although Hive could be considered an old-fashioned tool in our ever-changing tech scenario, it is clearly a successful project, being used by several modern data tools (e.g., AWS Glue, Trino, and Databricks). In addition, virtually all analytics tools have built-in connectors for Hive.

What is Apache Hive?

In short, Hive is the data warehouse software of the big data ecosystem, built on top of Hadoop. Its success is largely owed to the simplicity of managing and processing large amounts of data stored in distributed filesystems using a SQL-like language (a.k.a. HiveQL, or HQL). Basically, it abstracts away all the complexity of reading/writing data from/to these distributed filesystems behind HQL.

The Hive architecture is composed of two main components: HiveServer and Hive Metastore.

Hive Architecture Overview

HiveServer is responsible for processing data using distinct engines under the covers, such as MapReduce, Spark, Tez, and more recently MR3. HiveServer also provides an endpoint for interactive queries (e.g., via the Beeline client).

Hive Metastore (a.k.a. HMS) is responsible for managing and persisting metadata in a relational database (under the hood it uses the DataNucleus ORM). HMS also provides a Thrift server for client connections, such as the Python client we recently published in the QuintoAndar repository.

As we decided to implement and manage our own query engine layer using Trino, we chose Hive Metastore as our data catalog. Moreover, “beginning in Hive 3.0, the Metastore can be run without the rest of Hive being installed. It is provided as a separate release to allow non-Hive systems to easily integrate with it” (see here). This decoupling also gives us a way to run Hive Metastore as a stateless microservice in our Kubernetes infrastructure.

Below we describe how Hive Metastore can be implemented as a scalable and secure Kubernetes (a.k.a. K8s) application.

Implementation

Hive Helm Chart

Our applications are created in K8s using Helm Charts, which are “a collection of files that describe a related set of Kubernetes resources”. As HMS has some peculiarities compared with our other applications, we needed to create a specific Hive chart. Below we describe the resources and configurations of this chart.

The chart uses a customized Docker image and the following Kubernetes resources (the complete resource configuration is available in this repository; a simplified values sketch is shown right after the list):

  • Deployment
  • Service
  • Configmap
  • Horizontal Pod Autoscaler
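To make the values referenced in the next sections concrete, here is a hypothetical, simplified values.yaml for such a chart; every key name and default below is an illustrative assumption, not necessarily what the actual chart in the repository uses:

    # Hypothetical values.yaml sketch -- key names and defaults are illustrative.
    image:
      repository: quintoandar/hive-metastore   # assumption: custom image name
      tag: "3.0.0"

    service:
      type: ClusterIP
      port: 9083                # Thrift port exposed by the metastore

    env:
      HIVE_DB_EXTERNAL: true    # false -> no ConfigMap; embedded Derby is used instead

    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 2Gi

    hpa:
      minReplicas: 1
      maxReplicas: 5
      targetCPUUtilizationPercentage: 70
      targetMemoryUtilizationPercentage: 80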

Docker

There are several Hive Docker images available; however, we created a new one because we are using the standalone Hive Metastore. We also created a customized entrypoint.sh that parses the Hive config file template (see Configmap below) to set specific parameters, such as the database credentials, the metastore URI, and other desirable settings (e.g., plugins, hooks, listeners).

Note: Specifically in our case, database credentials are retrieved from Vault when the container is created, and the entrypoint script renders them into the config file from the container’s environment variables using gomplate.
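Just as an illustration of that flow, an entrypoint along these lines could render the template and start the standalone metastore; the paths, template name, and database type below are assumptions, and the actual entrypoint.sh in the repository may differ:

    #!/usr/bin/env bash
    # Hypothetical entrypoint sketch -- paths and flags are illustrative.
    set -euo pipefail

    # Render the Hive XML config template, filling database credentials and
    # other parameters from the container's environment variables.
    gomplate \
      -f /opt/hive-metastore/conf-templates/hive-site.xml.tpl \
      -o /opt/hive-metastore/conf/hive-site.xml

    # Initialize the metastore schema in the external database if needed
    # (dbType depends on the backend: postgres, mysql, ...).
    /opt/hive-metastore/bin/schematool -dbType postgres -initSchema || true

    # Start the standalone Hive Metastore (Thrift server, port 9083 by default).
    exec /opt/hive-metastore/bin/start-metastore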

Deployment

Deployment is the main resource because it ties the Pods, ReplicaSet, and other resources together into a manageable release of the application. Below we can see the Deployment configuration (available here):
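The snippet below is only a simplified, hypothetical sketch of such a Deployment (the image reference, labels, and mount paths are assumptions that tie back to the values sketch above), not the exact manifest from the repository:

    # Hypothetical Deployment sketch for the standalone Hive Metastore.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hive-metastore
      labels:
        app: hive-metastore
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: hive-metastore
      template:
        metadata:
          labels:
            app: hive-metastore
        spec:
          containers:
            - name: hive-metastore
              # Custom image built from the chart's Dockerfile (see Docker above).
              image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
              ports:
                - name: thrift
                  containerPort: 9083          # Thrift server port to be exposed
              volumeMounts:
                # Path where the Hive XML config template is mounted.
                - name: hive-config-template
                  mountPath: /opt/hive-metastore/conf-templates
          volumes:
            - name: hive-config-template
              configMap:
                name: hive-metastore-config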

The main configurations in the above example are (the remaining configs are relatively common to K8s apps):

  • image: name of the Docker image, which must be built locally using this Dockerfile
  • volumes: maps the path to the Hive XML config file
  • ports: the Thrift server port to be exposed

Service

This resource exposes the application to consumers through the Thrift port 9083, as in the following example:
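A hypothetical Service for this would look roughly like the following (names are illustrative):

    # Hypothetical Service sketch exposing the metastore Thrift port.
    apiVersion: v1
    kind: Service
    metadata:
      name: hive-metastore
    spec:
      type: ClusterIP            # see the discussion of Service types below
      selector:
        app: hive-metastore
      ports:
        - name: thrift
          port: 9083
          targetPort: 9083
          protocol: TCP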

There are different Service types depending on the way your application will be exposed. Here we’re using ClusterIP, but in a production environment you could use the LoadBalancer type behind a DNS service provided by some cloud vendor. This load balancer will distribute the workload between pods when the application scales up (see HPA below).

Configmap

This is a critical resource of this chart because it maps our Hive XML template file, which contains the Hive database credentials and other customizable parameters (see the entrypoint.sh of the Docker image):
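A hypothetical rendered ConfigMap could look like the snippet below; the property names are standard Hive metastore settings, but the template file name and the gomplate placeholders are assumptions (in the actual Helm template the gomplate braces would also need to be escaped so Helm does not try to render them):

    # Hypothetical ConfigMap sketch holding the Hive XML config template.
    # In the chart, this resource would only be created when env.HIVE_DB_EXTERNAL
    # is true (see the note below).
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: hive-metastore-config
    data:
      hive-site.xml.tpl: |
        <configuration>
          <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>{{ getenv "HIVE_DB_JDBC_URL" }}</value>
          </property>
          <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>{{ getenv "HIVE_DB_USER" }}</value>
          </property>
          <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>{{ getenv "HIVE_DB_PASSWORD" }}</value>
          </property>
          <property>
            <name>hive.metastore.uris</name>
            <value>thrift://hive-metastore:9083</value>
          </property>
        </configuration>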

Note that if the value env.HIVE_DB_EXTERNAL is false, K8s will not create a configmap and Hive will use its default embedded Apache Derby database instead of an external database.

Horizontal Pod Autoscaler (HPA)

This resource is responsible for (auto)scaling the application (i.e., increasing the number of pods) based on resource thresholds (i.e., memory and CPU):
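A hypothetical HPA along these lines (the autoscaling/v2 API is assumed; older clusters used autoscaling/v2beta2) would scale the Deployment on CPU and memory utilization:

    # Hypothetical HPA sketch; thresholds come from values.yaml (see the sketch above).
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: hive-metastore
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: hive-metastore
      minReplicas: {{ .Values.hpa.minReplicas }}
      maxReplicas: {{ .Values.hpa.maxReplicas }}
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: {{ .Values.hpa.targetCPUUtilizationPercentage }}
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: {{ .Values.hpa.targetMemoryUtilizationPercentage }}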

Here, the thresholds are defined in the values file (values.yaml), together with the other variables of the Hive instance.

Applying the chart in K8s

You can install the chart with Helm or simply template it and apply the rendered manifests to your K8s cluster:
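For example, assuming the chart lives in a local hive-metastore directory (the chart path and namespace are assumptions), either of the following works with Helm 3:

    # Install the chart directly with Helm...
    helm install hive-metastore ./hive-metastore --namespace data-platform --create-namespace

    # ...or render the manifests and apply them with kubectl:
    helm template hive-metastore ./hive-metastore | kubectl apply -f -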

Monitoring

One great benefit of deploying Hive in our K8s infrastructure is monitoring. Our monitoring stack is composed essentially of Prometheus and Grafana. Through Prometheus and its Alertmanager we can define rules to alert our team about Hive pods’ resource usage, scale-up events, pod restart reasons, and so on. These alerts can be sent to Slack or to our on-call/incident management tool.
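For illustration only (these are not our actual rules), a PrometheusRule for the Prometheus Operator could alert when metastore pods restart too often; the metric comes from kube-state-metrics, while the label values and thresholds are assumptions:

    # Hypothetical alerting rule sketch (Prometheus Operator CRD).
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: hive-metastore-alerts
    spec:
      groups:
        - name: hive-metastore
          rules:
            - alert: HiveMetastorePodRestarting
              # More than 2 container restarts over the last 30 minutes.
              expr: increase(kube_pod_container_status_restarts_total{container="hive-metastore"}[30m]) > 2
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "Hive Metastore pod is restarting frequently"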

Additionally, using Grafana dashboards we can understand the pods’ behavior over time; for example, we can see in which periods we have peaks of resource (i.e., memory and CPU) usage.

Conclusion

Hive is a mature project, and its metastore is the natural catalog choice when using Trino (formerly PrestoSQL). Although it is obviously not mandatory to deploy the standalone HMS in K8s, using a Helm chart to configure/customize our deployment makes it easy to reuse the infrastructure and architecture in different business domains, which in turn also improves the maintainability of our infra-as-code strategy.

Thanks Kenji and Amom Mendes for all the discussions and hard work!
