Python RESTful API Server for Apache Hive

Rain Wu
Random Life Journal
3 min read · Apr 24, 2020

Continuing from the previous note, Associate Apache Hive Client with HDFS, this time I will share the design and implementation of a RESTful API server, which acts as an interface that lets users access Apache Hive without writing raw SQL commands.

Environment

One of the packages we will use is PyHive, the most popular open-source tool for connecting to HiveServer2. It relies on several other packages to run, including thrift, sasl, and thrift-sasl; below is the Pipfile.
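A minimal sketch of what such a Pipfile could contain (the unpinned versions and the Python version here are my assumptions, not the article's actual file):

```toml
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[packages]
pyhive = "*"
thrift = "*"
sasl = "*"
thrift-sasl = "*"
flask = "*"
flask-restful = "*"
sqlalchemy = "*"

[requires]
python_version = "3.7"
```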

Most of the time, installing any of the sasl-related packages above will throw an error if you’re using Ubuntu. They need some system tools to work correctly, and this dependency chain is a total mess.

libsasl2-dev is always necessary, while libsasl2-modules is only needed for reliable access to Hive over the network: if you connect to localhost (a.k.a. 127.0.0.1) directly, without a public or private IP reachable from outside the host, you can skip it.

$ sudo apt-get install -y libsasl2-modules libsasl2-dev

These tools can only be obtained with apt-get install, which means a slim base image without apt-get cannot be used when packing the Docker container image. And since the container connects over a private IP inside the LAN, I need both of them.

Here’s the Dockerfile; you can set up the environment based on it.
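A sketch of what that Dockerfile could look like, assuming a Debian-based Python image, Pipenv for dependency installation, and an `app.py` entry point (all of which are my assumptions):

```dockerfile
# Debian-based image: apt-get is available, unlike e.g. Alpine
FROM python:3.7

# System packages required by the sasl-related Python packages
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        libsasl2-dev libsasl2-modules && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies from the Pipfile
COPY Pipfile Pipfile.lock ./
RUN pip install pipenv && pipenv install --system --deploy

COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
```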

In my experience, I would not use PyHive if there were another reliable choice, but there isn’t. Besides the complex dependencies that make packaging difficult, its error messages are often a pile of unstructured broken logs, which makes debugging very inefficient.

Server

After setting up the environment, we can implement each part of our server. I chose Flask as the web framework and SQLAlchemy for initializing the database schema this time.

The first thing I started with was encapsulating a client class, making sure the database is connectable and providing the necessary foundation for subsequent features. Meanwhile, I followed the principle of high cohesion and implemented the CRUD operations inside it.
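As a sketch of what such an encapsulated client could look like (the class name, method names, table layout, and connection parameters are my own illustrative assumptions, not the article's actual code):

```python
class HiveClient:
    """Thin wrapper around a HiveServer2 connection with CRUD helpers.

    The connection is opened lazily, so the class can be constructed
    (and unit-tested) without a live Hive server.
    """

    def __init__(self, host="127.0.0.1", port=10000, database="default"):
        self.host = host
        self.port = port
        self.database = database
        self._conn = None

    def _connection(self):
        # Imported lazily: PyHive pulls in thrift/sasl, which may not be
        # installed in every environment that imports this module.
        from pyhive import hive
        if self._conn is None:
            self._conn = hive.Connection(
                host=self.host, port=self.port, database=self.database
            )
        return self._conn

    def execute(self, sql, fetch=False):
        cursor = self._connection().cursor()
        cursor.execute(sql)
        return cursor.fetchall() if fetch else None

    # --- CRUD helpers, kept together for high cohesion ---
    def create(self, table, values):
        placeholders = ", ".join(repr(v) for v in values)
        return self.execute(f"INSERT INTO {table} VALUES ({placeholders})")

    def read(self, table, limit=100):
        return self.execute(f"SELECT * FROM {table} LIMIT {int(limit)}",
                            fetch=True)

    def update(self, table, assignment, condition):
        # UPDATE/DELETE require ACID support enabled in hive-site.xml
        return self.execute(f"UPDATE {table} SET {assignment} "
                            f"WHERE {condition}")

    def delete(self, table, condition):
        return self.execute(f"DELETE FROM {table} WHERE {condition}")
```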

Next are the handler functions for each endpoint. An extension package for Flask called Flask-RESTful is used here; it provides an agile way to register handling logic.

Then comes the essence of the Flask app. Some basic error handling is recommended; it speeds up communication during development.
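The kind of basic error handling meant here can be sketched with Flask's `errorhandler` decorator (the route and messages are illustrative assumptions):

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.errorhandler(404)
def not_found(error):
    # Return JSON instead of Flask's default HTML error page,
    # so API consumers always get a machine-readable body.
    return jsonify(error="resource not found"), 404


@app.errorhandler(500)
def internal_error(error):
    return jsonify(error="internal server error"), 500


@app.route("/health")
def health():
    # A trivial endpoint, useful for container health checks
    return jsonify(status="ok")
```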

I still use pytest for unit testing and invoke for task management, but that’s not the major part of the API server, so I won’t demonstrate it here. By the way, if you are not familiar with shell script, invoke might be a better choice than a Makefile.

Deployment

The easiest part: bring up your .env file, export the ports you used, then start it with docker-compose.
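A minimal docker-compose.yml for this setup could look as follows (the service name, port variable, and port number are assumptions):

```yaml
version: "3"
services:
  hive-api:
    build: .
    env_file: .env
    ports:
      # API_PORT comes from .env; falls back to 5000 if unset
      - "${API_PORT:-5000}:5000"
    restart: unless-stopped
```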

$ docker-compose up -d

Reference

Most of the issues I encountered are related to PyHive; one of the trickiest is that the Update and Delete operations still look like they are in beta. Some modifications to the hive-site.xml we mentioned in the previous note are needed if you want to enable those two operations.
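For reference, the properties usually set in hive-site.xml to enable ACID transactions (which UPDATE and DELETE depend on) look like this; verify the exact values against your Hive version:

```xml
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
```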

Another is about the Ubuntu package libsasl2-modules mentioned above: after I switched to a private IP for connections within the Docker container, I hit an error rejecting insecure access to Hive, and installing that package was the best solution.


Rain Wu
Random Life Journal

A software engineer specializing in distributed systems and cloud services, desire to realize various imaginations of future life through technology.