Hive-Metastore-Client: the beekeeper

Communicating with the standalone Hive Metastore service

Felipe Miquelim
Blog Técnico QuintoAndar
8 min read · May 17, 2021

Photo by Deans Court on Unsplash

At QuintoAndar, we recently implemented the standalone version of Hive Metastore as our unified metadata repository. We have previously explained in detail how the service is a perfect fit for our scenario and how Thrift helps developers implement it as a standalone solution.

One of the challenges of exploring this implementation is communicating directly with the Hive Metastore. The interaction can be customized via the Thrift framework; however, it is not trivial.

Happy Life GIF by POKOPANG on Giphy

Working with a Hive Metastore can be as tricky as working with a beehive… nothing better than a beekeeper to assist us. And thus was born QuintoAndar’s Hive-Metastore-Client, our open-source library to handle metadata in a standalone Hive Metastore service.

🍯 Standalone Hive Metastore

As we have explained in detail before, the Hive Metastore service can be detached from the original Hive stack (since Hive 3.0). This is an alternative way to use Hive's world-class metastore feature without having to adopt its coupled query engine. That widens the range of possibilities, letting you pick the query engine that best fits your scenario and stack. In our case, we chose Trino to pair up with the Hive Metastore.

📚 Hive Metastore Libraries

When working with a metadata architecture based on the standalone Hive Metastore, an alternative way to interact with it is required, since there is no native query engine (HiveServer2). One could use Trino or another query engine to manage metadata directly. However, at QuintoAndar we focus on decoupled services with well-defined scopes, so we do not make the query engine responsible for managing metadata. That responsibility should belong to another entity that interacts with the Hive Metastore via an API, client, or another communication interface.

Sadly, the standalone Hive Metastore is not widely used by the community yet. The most common architecture is still based on the default implementation, with both the HiveServer2 and Metastore services. As a consequence, there is neither an official Python library nor a renowned third-party one.

Thrift Based Library

On the other hand, the Hive project provides, in its repository, the Thrift mapping for the Hive Metastore methods and their respective attributes, which allows developers to easily generate the classes for a Python project (as in our case).

Externalizing those classes to a Python project makes it possible to override or extend the default methods. This opens a whole new range of possibilities for custom implementations!
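To make the idea of extending generated classes concrete, here is a minimal sketch. `GeneratedMetastoreClient` is a hypothetical in-memory stand-in for a Thrift-generated class (not the real generated code), and `create_databases` is an illustrative extension, not a method taken from any library.

```python
class GeneratedMetastoreClient:
    """Hypothetical stand-in for a Thrift-generated client class."""

    def __init__(self):
        self._databases = {}

    def create_database(self, name):
        # Mimics a raw generated call: fails if the database already exists.
        if name in self._databases:
            raise ValueError(f"database {name!r} already exists")
        self._databases[name] = {}

    def get_all_databases(self):
        return sorted(self._databases)


class FriendlyMetastoreClient(GeneratedMetastoreClient):
    """Subclass that adds a convenience method on top of the generated ones."""

    def create_databases(self, names):
        # Extension: one call for many databases, reusing the inherited method.
        for name in names:
            self.create_database(name)


client = FriendlyMetastoreClient()
client.create_databases(["sales", "marketing"])
print(client.get_all_databases())
```

Because the subclass only adds behavior, every inherited generated method remains available unchanged.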

Inspiration

Some libraries in the community had already followed that road, offering Python projects with the files auto-generated from Thrift for the Hive Metastore. However, they barely modified the base files, so they did not completely fit our scenario.

A suitable library for us would have methods that abstract server calls, handle errors, apply default parameters for custom calls, and so on. Although those libraries did not offer the more advanced features we needed, we based ourselves on them to implement our own tailor-made one!

🐝 Hive-Metastore-Client

We created our own Python library to communicate directly with a Hive Metastore and execute abstracted DDL operations based on Thrift: the hive-metastore-client.

Benefits

Having our own library empowers us to override pre-existing features and add new ones as we see fit. It also gives us the means to update our implementation as new versions of Hive and Hive Metastore are released.

Besides, by creating a dedicated project we can focus on code quality:

  • Usability oriented: implement easy-to-use methods, in-code documentation, and a bunch of usage examples:
Communicating and easily creating a database with hive-metastore-client
  • Open-source mentality: iterative development driven by community input and feedback.
  • Responsibility decoupling: externalize the Thrift code, segregating the responsibilities between the core functionalities and their communication protocol.
  • Programming best practices: applying code standards and software development frameworks.

“[…] the ratio of time spent reading vs. writing [code] is well over 10:1. We are constantly reading old code as part of the effort to write new code. […] Of course there’s no way to write code without reading it, so making it easy to read actually makes it easier to write.” — Robert C. Martin (Uncle Bob) in Clean Code: A Handbook of Agile Software Craftsmanship

Main concerns

Our goal was to create an easy-to-use client that abstracts Hive itself, which is easily achievable via the Thrift framework. But we also wanted to abstract Thrift, especially its limitations and its not-so-easy-to-use methods. Thrift is great for mapping all the communication details; however, it is not user-friendly and lacks finesse for the final user.

The client implements software engineering patterns both to simplify usage and to decouple Thrift-specific objects from the final user. The one in the spotlight here is the Builder design pattern:

Builder is a creational design pattern that lets you construct complex objects step by step. The pattern allows you to produce different types and representations of an object using the same construction code. — Refactoring.guru

By supplying builders for the Hive Metastore objects (databases, tables, columns, and so on), the client spares the end user from having to understand the complex types that Thrift auto-generates. Compound types, such as Tables (yes, they are quite tricky), can be created at a slower pace, step by step, keeping your code clean and organized:

hive-metastore-client Builder Pattern Example
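As an illustration of the pattern itself, the following is a self-contained sketch. `SimpleTableBuilder`, `Table`, `Column`, and the storage path are all hypothetical names made up for this example; they are not the library's actual builders or the Thrift-generated types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    col_type: str


@dataclass
class Table:
    database: str
    name: str
    columns: List[Column] = field(default_factory=list)
    partition_keys: List[Column] = field(default_factory=list)
    location: str = ""


class SimpleTableBuilder:
    """Hypothetical builder: collects the parts step by step, then builds a Table."""

    def __init__(self, database: str, name: str) -> None:
        self._table = Table(database=database, name=name)

    def with_column(self, name: str, col_type: str) -> "SimpleTableBuilder":
        self._table.columns.append(Column(name, col_type))
        return self  # returning self allows the calls to be chained

    def partitioned_by(self, name: str, col_type: str) -> "SimpleTableBuilder":
        self._table.partition_keys.append(Column(name, col_type))
        return self

    def stored_at(self, location: str) -> "SimpleTableBuilder":
        self._table.location = location
        return self

    def build(self) -> Table:
        # Validation lives in one place instead of being spread across callers.
        if not self._table.columns:
            raise ValueError("a table needs at least one column")
        return self._table


table = (
    SimpleTableBuilder("sales", "orders")
    .with_column("id", "bigint")
    .with_column("amount", "double")
    .partitioned_by("dt", "string")
    .stored_at("s3://bucket/sales/orders")  # made-up path
    .build()
)
print(table)
```

The caller never touches the nested structure directly; each step names one concern, and `build()` is the single point where the final object is validated and handed over.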

We also made adoption easier by supplying a variety of example code, ranging from using the custom Builders to calling tricky base Thrift methods.

Features

The client seeks to enhance the features that Thrift brought to the table, and much more:

— Tougher methods: New custom methods that improve the firepower of the base thrift methods:

  • Updating an object: When updating a table (adding, removing, or altering columns), one must provide the whole Table object to Hive, not only the column object to be changed. The hive-metastore-client implements methods, such as <drop_columns_from_table> and <add_columns_to_table>, that only expect the Column object (in some cases only its identifier); behind the scenes, the client fetches the current Table definition and reuses it.
    No need for headaches while managing complex objects anymore 💊.
  • If not exists: The ‘if not exists’ command is very common in the SQL universe and it may come in handy when working with DDL commands. Some methods encapsulate this behavior for the end-user, such as <add_partitions_if_not_exists> and <create_database_if_not_exists>.
  • Bulk operations: Some Thrift methods only accept a single value per call, but when you are working with more complex metadata you may want to call a method only once for multiple values. The client also handles those kinds of operations to speed things up.
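The three ideas above can be sketched with an in-memory stand-in. `FakeMetastore` and `FriendlyClient` are hypothetical classes; the method names mirror the ones mentioned in the list, but their signatures and bodies here are assumptions for illustration, not the library's actual code.

```python
class AlreadyExistsError(Exception):
    """Raised by the fake metastore when an object already exists."""


class FakeMetastore:
    """In-memory stand-in for the low-level API: whole-object calls only."""

    def __init__(self):
        self.databases = set()
        self.tables = {}  # (db, table) -> list of column names

    def create_database(self, name):
        if name in self.databases:
            raise AlreadyExistsError(name)
        self.databases.add(name)

    def get_table(self, db, table):
        return list(self.tables[(db, table)])

    def alter_table(self, db, table, columns):
        # Like the base call, it expects the complete new definition.
        self.tables[(db, table)] = list(columns)


class FriendlyClient:
    def __init__(self, metastore):
        self._ms = metastore

    def create_database_if_not_exists(self, name):
        try:
            self._ms.create_database(name)
        except AlreadyExistsError:
            pass  # already there: nothing to do

    def add_columns_to_table(self, db, table, new_columns):
        # Fetch the current definition, extend it, send the whole thing back.
        current = self._ms.get_table(db, table)
        self._ms.alter_table(db, table, current + list(new_columns))

    def drop_columns_from_table(self, db, table, names):
        to_drop = set(names)
        current = self._ms.get_table(db, table)
        self._ms.alter_table(db, table, [c for c in current if c not in to_drop])


ms = FakeMetastore()
ms.tables[("sales", "orders")] = ["id", "amount"]
client = FriendlyClient(ms)

client.create_database_if_not_exists("sales")
client.create_database_if_not_exists("sales")  # second call is a no-op
client.add_columns_to_table("sales", "orders", ["currency", "country"])  # bulk add
client.drop_columns_from_table("sales", "orders", ["amount"])
print(ms.get_table("sales", "orders"))
```

The caller only passes the columns that change; the read-modify-write cycle against the full table definition stays hidden inside the client.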

— Abstraction to users: Functions created to simplify usage for developers by automating the most common and most time- and processing-intensive operations.

  • An interesting case is partition management, which must be done through the Thrift Table object. Each partition must be created with its respective full path within the storage location. When a table has lots of partitions, defining them one by one can be quite annoying. The library removes this need, since it handles the partitions' paths dynamically on its own.
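A minimal sketch of how such paths can be derived, assuming the conventional Hive layout of `key=value` directories under the table location. The function names and the storage path are hypothetical; this is not the library's own implementation.

```python
from typing import Dict, List


def partition_location(table_location: str, partition: Dict[str, str]) -> str:
    """Build '<table_location>/<key>=<value>/...' following the Hive layout."""
    # Dict insertion order is preserved, so keys appear in the order given.
    suffix = "/".join(f"{key}={value}" for key, value in partition.items())
    return f"{table_location.rstrip('/')}/{suffix}"


def partition_locations(
    table_location: str, partitions: List[Dict[str, str]]
) -> List[str]:
    """Derive the location of every partition from the table location."""
    return [partition_location(table_location, p) for p in partitions]


paths = partition_locations(
    "s3://bucket/sales/orders/",  # made-up table location
    [{"year": "2021", "month": "05"}, {"year": "2021", "month": "06"}],
)
for path in paths:
    print(path)
```

With a helper like this, the user supplies only the partition values, and the full path for each partition follows from the table's own location.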

— Communication Handling: Internal methods that work behind the scenes to simplify the end-user interactions.

  • Error Handling is such a pain when working with any service communication, so we worked to catch the silent and tricky errors returned by Hive Metastore and present them as more readable and direct error messages.
  • Protocol Abstraction functions handle the dynamic opening and closing of a connection around calls. This approach reduces duplicated code and follows the best practices of client communication.
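Both points can be sketched with a context manager. Everything here is a hypothetical stand-in: `FakeTransport` only mimics a connection that opens and closes, and `MetastoreError` is an illustrative exception name, not one taken from the library.

```python
class FakeTransport:
    """Stand-in for a Thrift transport; only tracks whether it is open."""

    def __init__(self):
        self.is_open = False

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False


class MetastoreError(Exception):
    """Readable error raised to the user instead of a low-level one."""


class ManagedClient:
    def __init__(self, transport):
        self._transport = transport
        self._tables = {("sales", "orders"): ["id", "amount"]}

    def __enter__(self):
        self._transport.open()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._transport.close()  # always runs, even when the body raised
        return False  # do not swallow exceptions

    def get_table(self, db, table):
        try:
            return self._tables[(db, table)]
        except KeyError as err:
            # Translate the low-level failure into a direct message.
            raise MetastoreError(f"table {db}.{table} was not found") from err


transport = FakeTransport()
message = ""
try:
    with ManagedClient(transport) as client:
        client.get_table("sales", "missing")
except MetastoreError as err:
    message = str(err)

print(message)
print(transport.is_open)
```

The `with` block guarantees that the connection is closed whether or not a call fails, so callers never have to repeat open/close code around each operation.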

Thrift decoupling

Another advantage of hive-metastore-client is that it allows the Thrift files to be regenerated. The library stores the base Thrift files plus detailed instructions on how to generate the Python classes, if needed. This really helps, since it is not easy to find tutorials online on how to rebuild the Thrift files. The library offers the treasure map on a silver platter.

The client also abstracts the Thrift components by decoupling the generated classes from the user-facing methods. This offers safe ground: if something in the Hive Metastore Thrift files changes, usability is not compromised. The client relies on a series of tests to guarantee that its interface with the Thrift methods and objects it uses does not break when an upgrade is needed (of either the Metastore or its Thrift files).
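One way such a safeguard could look is a check that the generated class still exposes every method the wrapper depends on. This is only a sketch: `GeneratedClient` is a hypothetical stand-in for the regenerated class, and the list of required methods is an assumption chosen for illustration.

```python
class GeneratedClient:
    """Hypothetical stand-in for the class generated from the Thrift files."""

    def get_table(self, db, table): ...

    def alter_table(self, db, table, new_table): ...

    def create_database(self, database): ...


# Methods of the generated class that the wrapper relies on (assumed list).
REQUIRED_METHODS = ("get_table", "alter_table", "create_database")


def missing_methods(cls, required):
    """Return the required method names that the class no longer provides."""
    return [name for name in required if not callable(getattr(cls, name, None))]


# In a real project this assertion would live in a unit test that runs
# after every regeneration of the Thrift files.
missing = missing_methods(GeneratedClient, REQUIRED_METHODS)
print(missing)
```

If an upgrade renames or removes one of these methods, the check fails right away instead of surfacing later as a runtime error in user code.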

“We should avoid letting too much of our code know about the third-party particulars. It’s better to depend on something you control than on something you don’t control, lest it end up controlling you.” — Robert C. Martin (Uncle Bob) in Clean Code: A Handbook of Agile Software Craftsmanship

Software development best practices

The creation of hive-metastore-client was guided by the best practices of open-source and general software development, always focusing on the quality of the Python codebase.

An excellent bedside read that shaped our language-oriented thinking is PEP 20, The Zen of Python. It distills Python's design principles into 20 aphorisms (19 of which were actually written down) focused on clean coding and design, such as “Beautiful is better than ugly”; “Explicit is better than implicit”; and “Readability counts”.

Some approaches we took, aiming for clear and concise code, were:

— In-code Documentation (and typing): We followed the PEP 8 guidelines to improve code quality and readability. Furthermore, adding type hints and checking them with mypy was exceptional. It improved code readability and added type validation to boot. It saved us some time while developing the client and made the unit tests much cleaner, since they did not need to care about type checking.

— Unit tests: Unit testing pushed us to define the goal of each piece of code clearly, leading the software’s architecture to a very decoupled format.

Conclusion

GIF by Late Night with Seth Meyers on Giphy

With the possibility of using the Hive Metastore as a standalone service, and the simultaneous lack of an official or well-established tool to easily handle metadata operations, the hive-metastore-client comes to play this part.

For sure, it does not yet supply everything we wish for, but with the help and interaction of the community, we can grow the project much more! We have already had thoughtful interactions with some early users, and their feedback is helping to make the project tougher and tougher.

🙏 Contributing to the project is encouraged and can be done in a bunch of different ways! Just check our contributing guide and get started!

🌟 Have you had any problems using the lib? Can you think of new features that would be amazing? Do you want to discuss some new ideas? Please open an issue or get in touch and help us get better.

Thanks to the client’s co-creators Lucas Fonseca and Juliana Freire; and Ribaldo and Kenji for the arch discussions.

Felipe Miquelim

Data Engineer @ QuintoAndar. ❤ Data, Football, NBA, NFL & Gaming Enthusiast ❤