Dynamic Columns in Phoenix: A Tale of Two Use Cases
By: Yanan Valencia
Apache HBase is a NoSQL database, or to be more specific, a wide-column store that allows the name and format of columns to vary from row to row in the same table. It also supports the notion of column families, where columns that have similar access patterns are stored together.
Apache Phoenix is a sql layer on top of HBase. Phoenix enables us to run sql queries like RDBMS while maintaining horizontal scalability of HBase. At TrueCar, we use HBase/Phoenix as horizontally scalable OLTP database to work in conjunction with Spark, Spark Streaming and MapReduce. Since HBase has auto-sharding, we store most of our historical data in HBase. This setup freed us from the typical restrictions of relational databases, and enabled us to design schemas in a creative way that can be used for special use cases.
In this blog, I would like to share two use cases of how we utilize the dynamic columns in Phoenix to achieve the following:
- Looking up incentives with postal codes.
- Storing price and odometer updates of used vehicles.
Incentive lookup with postal code
“Buy a car, get a check,” the history of car incentives started with this famous line in a Chrysler commercial during the halftime of Super Bowl IX on the day of January 12, 1975. It was a $200 cash back at that time, but now car buyers can benefit from many different types of incentives from 0 percent financing to membership programs, with amount up to thousands of dollars. TrueCar lets users easily select applicable incentives from a list when checking pricing online, and the accuracy of the time-sensitive incentives is what we always keep in mind.
The incentives data, collected from manufacturers, is exported using Sqoop from a relational database to S3 in small pieces: one table with type and amount, one table with incompatible ids, one table with applicable regions and their postal codes, etc. Our pipeline joins all the tables together to provide the flattened dataset to other downstream processes. And one of the bigger challenges was to store postal codes associated with each incentive record.
As a result, we created a few column families of postal codes to store the flag whether an incentive is applicable in the area. The table definition looked like this in the beginning, separating postal codes into two columns families, even and odd:
And later on, to further optimize the performance, we mapped the postal codes into four groups based on the residual value after dividing the number by four.
With the above sample table, to check if any vehicle has incentives in postal region 90069, one can simply run the below statement and get “vehicle 1” in the output:
In this case, using Bloom filters on the row and column family level (ROWCOL) helped us reduce unnecessary reads by scanning only the column families containing the postal codes we are searching for. With close to 40K dynamic columns (postal codes) in this table, we reduced the read time by half using four column families compared to two as was originally designed.
Used vehicle price and odometer history
Another example is when we wanted to create a history of used vehicles, and allow Data Scientists to analyze the change of selling price and odometer. Instead of having one new record per update, the idea of assigning dynamic columns for price as well as odometer came to our minds, keeping all the history within one row for each vehicle. The schema for that is:
A Data Science model was then built on top of this data set, providing a daily update of price ratings of all used vehicles in the inventory of TrueCar dealers. This gives both consumers and dealers a deeper insight into the pricing of these vehicles in comparison to the market average.
Dynamic columns are usually used when defining a static schema is not feasible at table creation, and when new columns will be added later on as the schema evolves. As demonstrated by the two use cases above, we utilized the elasticity of column definition varying among different rows, and made storing different numbers of the same type of fields possible.
Hopefully this blog provided some insight into some of the fun engineering problems that we solve here as part of the TrueCar Data Engineering team. We will continue to share similar interesting learnings in the near future.