Fighting with Apache Phoenix. Secondary Indexes

This investigation cost me the first grey hair.

We’re all used to work with Relational Database Management Systems and it is great experience. All these databases like Oracle, PostgreSQL provides so much versatility in usage, developers are able to find any record by any criteria with almost no effort. One of the most important aspects with RDBMS is high performance indicators and it comes with just a basic tuning.

Things are changing when we trying to organise high scalability and availability for your services by introducing distributed architecture or micro-services. Obviously, you choose Apache Hadoop and HBase as a filesystem and data storage which is a great decision and technological standard by itself. However, HBase does not provide such advance API for working with database objects as RDBMS does. Performance for scanning and traversing by random field value is still much lower due to its distributed filesystem and column oriented structure. In a race for continuous improvement we introducing one more tool to our technological zoo — Apache Phoenix, as SQL wrapper and more specifically OLTP for HBase.

Right here real challenges appear.

For out basic performance and explain plan overview we are going to use 3 tables:

  • USER_DATA_150K (150 000 records)
  • USER_DATA_15M (15 000 000 records)
  • USER_DATA_30M (30 000 000 records)

These tables have completely identical structure:

USER_DATA_150K / USER_DATA_15M / USER_DATA_30M Tables Structure

Select by Primary Key (REC_KEY)

SELECT REC_KEY, REC_STATE, REC_CREATE_DATE, REC_EMAIL, REC_ FIRST_NAME, REC_LAST_NAME FROM TEST.USER_DATA_XXX WHERE REC_KEY = 'xxx'
Performance Metrics — Retreive by Primary Key

We may notice already that performance indicators for these three cases within the margin. The most impressive fact to me is good read operation performance results out of box which Apache Phoenix shows, although it is more in regards of HBase.

Filtering and Sorting by Table Fields (REC_LAST_NAME and REC_CREATE_DATE)

It’s time to test our performance by generating select statements to our tables fields. It is going to be challenging task for HBase and Apache Phoenix due to their internal data structure.

SELECT REC_KEY, REC_STATE, REC_CREATE_DATE, REC_EMAIL, REC_ FIRST_NAME, REC_LAST_NAME FROM TEST.USER_DATA_XXX WHERE REC_LAST_NAME = 'xxx' ORDER BY REC_MODIFY_DATE
Performance Metrics — Filtering and Sorting by Table Fields

Metrics which are show here are frightening as for one of the most popular solutions on the market. Let’s figure out what exactly happened and went wrong here during our tests:

  1. We have added 2 criteria to the statement.
  2. These criteria are not a part of primary key, just a regular table fields values.
  3. Apache Phoenix does not have any ideas where specific record(s) corresponding to this criteria might be stored, so it starts scan thought entire table instead of making a scan over the region.

It’s time to introduce Secondary Indexes to our tables which can help with increasing filtering and sorting queries performance indicators. There are 3 ways to make Phoenix’s Global Indexes efficient:

  • Include all fields into index. Pros: guarantee that searching operation will go by index table. Cons: MAJOR performance downgrades for UPSERT and DELETE operations, index’s table size grows and by default it’s wrong architectural decision unless your perform a search by all table fields.
  • Index hinting. Pros: lightweight approach, haven’t been noticed problems with table out of sync (critical bug in version 5.0.0 and below: PHOENIX-4455). Cons: doesn’t guarantee traversal through index table.
  • Covered index. Pros: guarantee that searching operation will go by index table. Cons: index’s table size grows.

Now, we can apply these 3 Global Index strategies separately to our to USER_DATA_15M table and compare results.

All these manipulations had a positive impact on 15M table and decreased the time for retrieval in 20(!!!) times.

At this moment, Apache Phoenix version 5.0.0 has few extremely vulnerable issues (PHOENIX-4455 and PHOENIX-3845). Due to these and other similar bugs we can’t use Global Indexes in a full range. These issues are much more dangerous than might seem, I’ve been observing them by myself and trust me, this kind of routine like debugging a database and observing that for some reason/at some point index and table are out of sync is horrible. I can only imagine that somebody's Production Environment might have this bugs.

Bad stuff.

From the other side I’m here to help you. So, there several promising ways to workaround this.

Transactions. Perhaps, this one is the most controversial solution.

  • This is not Big Data way of doing things, by using transactions you start loosing write performance and violating WORM Big Data principle — “Write ones, read many“.
  • You have to include Apache Tephra as a transactional service to already exciting ecosystem, which is not a bad idea by itself, but there is one but: this is Apache’s incubator project and no major work was done since year 2017, so forget about any type of support.

Local Indexes. Phoenix had changed the structure of Local Indexes in version 4.8 and above, they are not going to have inconsistencies within table and index, however, Local Indexes are not perfect. Think twice before using them all over the place due to low read intenvity.

These are several important notes I’d like to share with you which will make your experience with Apache Phoenix much productive:

  • Global Index will not be used by Phoenix unless all of the columns referenced in the query are contained in the index. A Local Index can be an alternative to a Global Index.
  • After including all fields into Secondary Index and introducing Covered Indexes the index’s table size increased in 2 times comparing to Index Hinting strategy. It depends mostly on amount and type of fields of your tables. To check the index’s table size run the following command on Hadoop’s Namenode:
$ hadoop fs -du -h hdfs://{PATH_TO_HBASE}/data/data/{SCHEMA}/{INDEX}
  • An index fields ordinal position should be chosen in a way that aligns with the common query patterns — choose the most frequently queried column or column which is going to stand always in front of every filter criteria as the first one and so on.

Apache Phoenix is a great tool, it enables so much versatility to Apache HBase. However, it takes an effort to setup it correctly for your project/business needs especially while having open blocker issues even with the latest Phoenix version 5.0.0.

Take it easy!

Engineering technologist, full-time lucky man https://twitter.com/vkrava4