Phorm: Helpshift’s simple ORM for Apache Phoenix

Abhishek Gupta
helpshift-engineering
7 min read · Jul 29, 2019

This is the second instalment in a two-part series on Helpshift’s story of building an Analytics API using Apache HBase + Phoenix. If you haven’t read Part 1 of this series, I recommend checking that out first!

Introduction

We learnt in Part 1 of this series that one of the biggest advantages of using Apache Phoenix is that it provides access to HBase using an abstraction most programmers are familiar with, namely SQL and JDBC. Moreover, Phoenix provides a relational schema abstraction and data modelling capabilities on top of a columnar datastore like HBase. This further helps programmers easily model their data tables in a simple relational schema for their business use-cases.

We also established how SQL on HBase using Phoenix empowered us to develop a big query platform for Helpshift’s Analytics APIs. This query platform involves data scans, filters and aggregations at big-data scale, and at low latency. Phoenix makes this possible by orchestrating distributed query execution in parallel and bringing the computation to the data: i) pushing the predicates in the WHERE clause down to server-side filters (a mechanism known as “Predicate Pushdown”), and ii) executing aggregations through server-side hooks (called coprocessors in HBase).

SQL and JDBC are all fine, so why do we need an ORM?

ORM, or Object-relational mapping, as you may know, is a programming technique for converting data between incompatible type systems, for example mapping database records to objects in an object-oriented programming language like Java. Hibernate and MyBatis are some popular ORM frameworks used with RDBMSes like MySQL and PostgreSQL.

ORM frameworks are known to have a scary reputation, primarily for reasons like:

  • developers needing to learn yet another framework
  • their tendency to be slow and bloated
  • a lack of understanding of what the framework is doing under the hood
  • difficulties in modelling complex queries at times

Some of these problems may be justified in certain scenarios, while others may be completely unwarranted in those same scenarios. Hence, it is important to weigh the advantages of using them against the complexities and hurdles they bring to your own use-case.

Enter Phorm

Phorm or Phoenix ORM is a simple ORM library written in Java. It is used by our Analytics API server (Sunbird) to run queries on Apache Phoenix using the ORM technique as described above.

It is responsible for translating annotated POJO (plain old Java object) classes, called Entity classes, into Phoenix JDBC API requests, and in turn for mapping the JDBC result-set responses back to Entity objects. It uses the javax.annotation API to inject table schema and query-generation behaviour into the Entity classes, and to map SQL result-sets obtained from the data-store back to those classes.

It embeds the Apache Phoenix JDBC driver, which communicates with the Phoenix Query Server (PQS) over HTTP/Protobuf. PQS is an implementation of the “JDBC over HTTP” server protocol interface provided by Apache Calcite.
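For context, here is a minimal sketch, assuming a hypothetical PQS endpoint, of how a client opens a JDBC connection through the Phoenix thin driver (the host and query below are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThinClientSketch {
    public static void main(String[] args) throws Exception {
        // The Phoenix "thin" driver talks to the Phoenix Query Server over HTTP,
        // serializing requests and responses with Protobuf.
        // "pqs-host" is a placeholder; 8765 is the default PQS port.
        String url = "jdbc:phoenix:thin:url=http://pqs-host:8765;serialization=PROTOBUF";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

Phorm abstracts this connection handling away from application code.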

At the time of building Phorm, there was an open-source ORM framework for Phoenix made available by eHarmony, called pho. Its APIs are very similar to those of Hibernate and JPA (Java Persistence API), with a flavour of the Spring Framework’s XML-based bean configurations. However, it was more difficult to deploy, debug and monitor, and for the purpose of building a minimum viable infrastructure we chose to build our own ORM library.

Our exercise of weighing the advantages of using the ORM technique over vanilla JDBC against the complexities of integrating a fairly complicated ORM framework led us to build our own custom ORM library, i.e. Phorm. The following are some of the reasons that led us to take this approach:

  • We are on Hortonworks Data Platform (HDP 2.6.x) and the library needed tighter integration with HDP’s dependencies
  • Tighter control over the implementation and performance, with the need for only basic ORM capabilities instead of the full-fledged feature set of a standard ORM framework
  • Abstracting JDBC connection management and transaction commits
  • Eliminating repetition of all the JDBC API and result-set to entity mappings code
  • Abstracting query generation using annotated entity classes
  • Abstracting the result-set pagination with the Hybrid pagination approach for Analytics API as described in Part 1

Following are some of the disadvantages had we not built an ORM library and instead just used plain JDBC:

  • Lose out on the benefits of the “static typing” that comes with annotated Java entity classes. Raw SQL statements are prone to mistyping, and everyone prefers the compiler to catch such mistakes rather than finding a nasty SQL error later
  • Repetitive duplication of similar SQL and JDBC code
  • End up mapping database result-sets to entity classes manually, repeating a lot of code
  • Miss out on providing a simple abstraction for developers to easily implement the data-access routines on Phoenix for churning out new Analytics APIs quickly

How we built a simple ORM library

In the following sections, we go through the process of designing a simple ORM library that can be used to build a query platform for running both aggregated and non-aggregated queries on SQL data-stores like Phoenix.

The complete source code for this is available on GitHub at https://github.com/abhilater/simple-orm. It uses the same idea on which Phorm is built.

Following is a representation of the code flow in Phorm. It provides an overview of how a developer interacts with Phorm’s API to access Phoenix table data as paginated entity objects.

Figure 1: Sequence diagram describing the code flow in Phorm

In order to access table data a developer needs to

  1. Write an Entity class representing the data access for the corresponding table. The Entity classes use custom data annotations to define the table schema and query behaviour, and can be used to perform both aggregated and non-aggregated queries on the table.
  2. Use the DataAccessor API for the above entity class to define custom filtering, aggregation and pagination criteria for the data access.

The framework is responsible for abstracting away

  1. Entity class translation to SQL query using the QueryGenerator
  2. JDBC connection and transaction management, JDBC prepared statements and bindings, JDBC query requests, pagination management using the DataAccessor manager
  3. Automatically mapping JDBC results back to Java entity objects using the ResultSetToEntityMapper.

Let’s go a little deeper into the workings of each of these components.

Data Annotations

Data annotations are used to inject database table schema and query behaviour into the Entity classes (described later). These behaviours are injected at both the class level and the field level.

The class-level annotation, @Table, specifies which database table an entity corresponds to. The field-level annotation, @ColumnField, does a whole lot more, as described in the following code (the documentation on each field of the ColumnField interface describes its purpose).

Figure 2: Field level annotation for the ORM Entity classes
Figure 3: Class level annotation for the ORM Entity classes
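The figures above carry Phorm’s actual annotation definitions. As a rough, hypothetical sketch (the element names below are illustrative, not Phorm’s real API), such annotations could be defined along these lines:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Class-level annotation: binds an Entity class to a Phoenix table. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Table {
    String name();  // Phoenix table this entity corresponds to
}

/** Field-level annotation: binds an entity field to a table column. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ColumnField {
    String name();                       // column name in the table
    FieldType type();                    // drives SQL-to-Java type conversion
    boolean isPrimaryKey() default false;
    int pkIndex() default -1;            // position within the composite primary key
    boolean isGroupBy() default false;   // include this column in the GROUP BY clause
    String function() default "";        // scalar/aggregate/UDF applied to the column, e.g. "SUM"
}

/** Supported column data types (illustrative subset). */
enum FieldType { INTEGER, BIGINT, VARCHAR, TIMESTAMP }
```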

Phoenix DB Schema

Following is a representation of the table schema for which the Entity classes need to be defined.

Figure 4: Phoenix table schema with sample data

Entity Class

Entity classes are Plain Old Java Object (POJO) classes with some behaviour injected using the ORM annotations described above. These classes are Java object representations of the database rows, queried using the query behaviours added in the field-level annotations.

Following is an Entity class definition for agent_metrics_table, which is used to run aggregate group-by queries on the table for servicing the Agent Analytics API.

The first three fields are part of the primary key of the table, as defined by the @ColumnField annotation along with their primary-key index positions. This positioning property is used by the QueryGenerator (described later) to order the group-by columns.

The field-type enum defined on each field is responsible for the type conversion from SQL types to Java types during the ResultSetToEntityMapper phase (described later).

The columns that are grouped by in the resulting aggregate query are also marked using the field annotation. Finally, the annotation also specifies the function (scalar, aggregate or user-defined function) that is to be run on a column field, as shown in the following code.

Figure 5: Entity definition for agent_metrics_table
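Since the entity definition above is embedded as an image, here is a hypothetical sketch of such an aggregated entity built with the annotation sketch from the previous section (the column names and types are assumptions, not the actual schema):

```java
// Hypothetical aggregated entity; column names and types are illustrative only.
@Table(name = "AGENT_METRICS_TABLE")
public class AgentMetrics {

    // The first three fields form the composite primary key; pkIndex fixes their
    // ordering, which the QueryGenerator reuses when ordering the GROUP BY columns.
    @ColumnField(name = "APP_ID", type = FieldType.VARCHAR,
                 isPrimaryKey = true, pkIndex = 0, isGroupBy = true)
    private String appId;

    @ColumnField(name = "AGENT_ID", type = FieldType.VARCHAR,
                 isPrimaryKey = true, pkIndex = 1, isGroupBy = true)
    private String agentId;

    @ColumnField(name = "EPOCH", type = FieldType.BIGINT,
                 isPrimaryKey = true, pkIndex = 2)
    private Long epoch;

    // Aggregated measure: the annotation attaches the aggregate function that the
    // QueryGenerator wraps around the column, e.g. SUM(RESOLVED_COUNT).
    @ColumnField(name = "RESOLVED_COUNT", type = FieldType.BIGINT, function = "SUM")
    private Long resolvedCount;

    // Getters and setters omitted for brevity.
}
```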

Similarly, following is an Entity class definition for issue_metrics_table, which is used to run non-aggregated queries on the table for servicing the Issues Analytics API.

Figure 6: Entity definition for issue_metrics_table
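Likewise, a minimal hypothetical sketch of a non-aggregated entity, with no group-by columns or aggregate functions, might look like:

```java
// Hypothetical non-aggregated entity; rows map one-to-one to result objects.
@Table(name = "ISSUE_METRICS_TABLE")
public class IssueMetrics {

    @ColumnField(name = "APP_ID", type = FieldType.VARCHAR, isPrimaryKey = true, pkIndex = 0)
    private String appId;

    @ColumnField(name = "ISSUE_ID", type = FieldType.VARCHAR, isPrimaryKey = true, pkIndex = 1)
    private String issueId;

    @ColumnField(name = "FIRST_RESPONSE_TIME", type = FieldType.BIGINT)
    private Long firstResponseTime;

    // Getters and setters omitted for brevity.
}
```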

DataAccessor

This is the primary API interface for the developer to access the table data. This interface is used to define
i) The Entity class for the data access
ii) The custom query filter criteria
iii) The custom group by or aggregation criteria
iv) The pagination criteria

Following is how we define and use the DataAccessor API for querying IssueMetrics and AgentMetrics entities.

Figure 7: Demonstrating the use of the DataAccessor API for querying the table data entities
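As an illustrative sketch of this usage (the builder method names and PagedResult accessors below are assumptions rather than Phorm’s exact API), querying the aggregated agent metrics could look roughly like this:

```java
// Hypothetical builder-style usage; method names are illustrative only.
DataAccessor<AgentMetrics> agentAccessor =
        new DataAccessor.Builder<>(AgentMetrics.class)
                .filter("APP_ID", "=", "my-app")             // custom filter criteria
                .filter("EPOCH", ">=", 1561939200000L)
                .filter("EPOCH", "<", 1564531200000L)
                .groupBy("APP_ID", "AGENT_ID")               // group-by / aggregation criteria
                .pageSize(1000)                              // pagination criteria
                .build();

// First page; the PagedResult carries a next_page_token when more rows remain.
PagedResult<AgentMetrics> page = agentAccessor.query();
for (AgentMetrics row : page.getResults()) {
    // consume the aggregated agent metrics
}

// Fetch the next page, if any.
if (page.getNextPageToken() != null) {
    PagedResult<AgentMetrics> nextPage = agentAccessor.query(page.getNextPageToken());
}
```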

And here is the DataAccessor manager class implementation.

This class uses the Builder pattern to allow developers to build their custom filter, group-by and pagination criteria for querying the table entity objects. It can be used with any entity class via the Java Generics API.

It makes use of the QueryGenerator to build the query, and of the ResultSetToEntityMapper and EntityFieldsCache to map the SQL records to Entity objects.

It is responsible for abstracting the database connection management and the JDBC API code that pulls the query results from the database table. It is also responsible for abstracting pagination over the query results and sending a next_page_token to the client along with the paged response, i.e. PagedResult<T>, if more rows are to be queried.
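For intuition, here is a stripped-down sketch, not Phorm’s actual implementation, of the query-execution path inside such an accessor once the SQL has been generated:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Hypothetical core of the accessor's query path; names are illustrative only.
public class QueryExecutor {

    /**
     * Binds the filter values, runs the generated SQL and maps one page of rows
     * back to entity objects (the mapper is sketched in the next section).
     */
    public static <T> List<T> queryPage(Connection conn, String sql, List<Object> bindValues,
                                        Class<T> entityClass, int pageSize) throws Exception {
        List<T> results = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < bindValues.size(); i++) {
                ps.setObject(i + 1, bindValues.get(i));  // JDBC bind parameters are 1-based
            }
            try (ResultSet rs = ps.executeQuery()) {
                int rows = 0;
                while (rs.next() && rows++ < pageSize) {
                    results.add(ResultSetToEntityMapper.map(rs, entityClass));
                }
            }
        }
        return results;
    }
}
```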

ResultSetToEntityMapper

This component is responsible for mapping the JDBC result-set rows to Java Entity objects. It makes use of the Java Reflection API to create and populate the objects at runtime, and uses the field annotations to pick up the data-type properties needed for type conversion from SQL types to Java types.
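A minimal reflection-based mapper along these lines, assuming the hypothetical @ColumnField annotation sketched earlier, might look like:

```java
import java.lang.reflect.Field;
import java.sql.ResultSet;

// Hypothetical reflection-based mapper; annotation and type names follow the earlier sketches.
public class ResultSetToEntityMapper {

    /** Creates an entity instance and populates its annotated fields from the current row. */
    public static <T> T map(ResultSet rs, Class<T> entityClass) throws Exception {
        T entity = entityClass.getDeclaredConstructor().newInstance();
        for (Field field : entityClass.getDeclaredFields()) {
            ColumnField column = field.getAnnotation(ColumnField.class);
            if (column == null) {
                continue;  // not a mapped column
            }
            field.setAccessible(true);
            // The annotation's type hint drives the SQL-to-Java conversion.
            switch (column.type()) {
                case INTEGER:   field.set(entity, rs.getInt(column.name()));       break;
                case BIGINT:    field.set(entity, rs.getLong(column.name()));      break;
                case VARCHAR:   field.set(entity, rs.getString(column.name()));    break;
                case TIMESTAMP: field.set(entity, rs.getTimestamp(column.name())); break;
                default: throw new IllegalStateException("Unsupported type: " + column.type());
            }
        }
        return entity;
    }
}
```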

The detailed source for all the other components is available at https://github.com/abhilater/simple-orm. Please check it out, and feel free to contribute and make changes.

Final Thoughts

The takeaway from working on this simple ORM library, and later reaping the rewards of its simplicity, was a reinforcement of Rich Hickey’s message from “Simple Made Easy”: “We should aim for simplicity because simplicity is a prerequisite for reliability”.

Thank you for reading.

Abhishek

Data Engineering is an exciting space right now, especially as we scale up our data pipelines to work for more and more data every quarter. We have many opportunities to solve real problems with deep focus on computer science and provide great value to the business. If this type of work sounds appealing to you, reach out to Kiran to get the ball rolling.

If you haven’t read our earlier post on building Analytics APIs using Apache Phoenix and HBase, you can read it here: Engineering Analytics API with HBase, Phoenix and SQL at Helpshift.
