[GSoC][LibreHealth] Working With Cassandra for FHIR Analytics
As I discussed previously on my blog posts, LibreHealth is going to adopt FHIR specification to the LibreHealth platform. I’m researching on FHIR Analytics capabilities which we need to provide in the LibreHealth platform.
As my mentors suggest, I have research more on using Cassandra solely for FHIR analytics. Here I’m summarizing my findings about my research.
Cassandra is a NoSQL database which is different type of data storage when comparing to RDBMS databases. Cassandra is a write optimized database which doesn’t worry too much about the normalized data. According to Cassandra community, it’s totally fine to have redundant data in single table if it helps to speedup the data reads from the Cassandra.
According to article in, it’s required to carefully modeled the data into the Cassandra table. Because if proper indexes aren’t there, CQL will reject the execution of specified queries with WHERE clauses.
There are four different ways that single column can behave. If we take a single Cassandra table, it will consist of partition keys, clustering keys, secondary indexes and regular columns.
Partition columns : Partition columns are one or group of columns which used to identify the partition which particular data is reside in the cluster. When user executes a query with where clause, it will first look for the partition key to identify the partition which requested data is reside.The partition key columns support only two operators: = and IN
Clustering columns: Clustering columns are used to group data within a single partition. This key can form with single column or group of columns. Clustering key will be use to identify the clustered data sets within a partition which will speed up the filtering specified in WHERE clause of the statement. Clustering columns support the =, IN, >, >=, <=, <, CONTAINS and CONTAINS KEY operators in single-column restrictions and the =, IN, >, >=, <= and < operators in multi-column restrictions.
Secondary Indexes : Cassandra provide direct support for the searching on secondary indexed columns using =, CONTAINS or CONTAINS KEY only. If user wants to filter a regular column or secondary indexed column using =, >, >=, <= and <, CONTAINS and CONTAINS KEY operators, then ALLOW FILTERING OPTION should be specified. However ALLOW FILTERING should be carefully used as it can impact performance severely as in . IN operator restrictions do not supported for secondary indexed queries.
CREATE TABLE numberOfRequests (
PRIMARY KEY ((cluster, date), datacenter, hour, minute))
CREATE INDEX ON numberOfRequests (numberOfRequests);
In this table cluster and date columns used as a partition key
Columns after partition key which are datacenter, hour and minute used as clustering keys
Last it’s created a secondary indexed on column numberOfRequests.
According to my analysis, using CQL only for the FHIR analytics will require several columns to be indexed. Cassandra also not encourage to have large number of column indexes as it’s again affects to write performance. With CQL, dynamic query builder using FHIR resource attributes is highly challenging task.
FHIR resource is more of a self contain resource. But single FHIR resource constructed using complex type of sub elements. With our storing model, we are planning to store most of the sub elements as texts. For example, address in patient resource will be completely stored as text irrespective of it constructed using several fields. Hence with Cassandra it’s highly challenge to filter based on elements of address field.
Most of the big data forums and posts suggest, use Cassandra along with Spark which enable more flexible query options which operated on distributed manner. With spark, user can map data into Spark model and use Spark SQL efficiently to filter and write many complex queries.
It’s always good to read about the things that are new to my knowledge. I’m really thankful for my mentors, LibreHealth and GSoC for this opportunity.