Apache Cassandra for HealthCare Data Analytics

Dr. GP Pulipaka
Sep 26, 2016
Apache Cassandra and Apache Pig components for Hadoop and MapReduce ecosystem

An industry such as healthcare requires health information analytics capable of processing real-time streaming data to resolve distributed-computing conundrums. Health information analytics systems deal with several types of data sources, such as drug databases, electronic medical records of the subjects, and anonymized historical clinical values obtained from clinical trials, to perform exploratory analysis. The data flows into the healthcare industry at large-scale volume and high velocity. The data also moves in rapidly from sensors and monitors, requiring instant dashboards and KPIs on the subjects to understand patterns and trends. Such a distributed environment calls for a Hadoop framework with a NoSQL database to process either document stores or key-value pairs. This may require a combination of technologies such as Apache Storm for real-time streaming and a NoSQL database for data ingestion, storage, and processing. An Apache Storm cluster consists of three main node types, known as ZooKeeper, Nimbus, and Supervisor. Nimbus coordinates the cluster through ZooKeeper, and the Supervisor nodes register with ZooKeeper to receive their work assignments. Apache Storm requires persistence, and adding either MongoDB or Apache Cassandra to the topology in a distributed environment can work for health information data analytics (Saxena, 2015).
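To make the spout-and-bolt dataflow concrete, here is a minimal sketch of the pattern in plain Python. This is not the Storm API; the patient readings, field names, and threshold are all hypothetical, and each function simply plays the role its Storm counterpart would play in a topology.

```python
from collections import Counter

# Hypothetical vital-sign readings as they might stream in from patient monitors.
readings = [
    {"patient_id": "p1", "heart_rate": 72},
    {"patient_id": "p2", "heart_rate": 131},
    {"patient_id": "p1", "heart_rate": 128},
    {"patient_id": "p3", "heart_rate": 65},
]

def spout(source):
    """Plays the role of a Storm spout: emits tuples from a data source."""
    for record in source:
        yield record

def alert_bolt(stream, threshold=120):
    """Plays the role of a bolt: filters the stream for abnormal readings."""
    for record in stream:
        if record["heart_rate"] > threshold:
            yield record

def count_bolt(stream):
    """A downstream bolt that aggregates alerts per patient."""
    counts = Counter()
    for record in stream:
        counts[record["patient_id"]] += 1
    return dict(counts)

alerts = count_bolt(alert_bolt(spout(readings)))
print(alerts)  # each patient's number of out-of-range readings
```

In a real deployment the spout would pull from a queue such as RabbitMQ or Kafka, and the final bolt would persist results to Cassandra or MongoDB rather than returning a dictionary.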

Apache Cassandra and Apache Pig integration can provide great results for MapReduce jobs by compiling Pig Latin scripts into MapReduce jobs in the Hadoop ecosystem. Apache Cassandra ships with native support for Apache Pig through the built-in package org.apache.cassandra.hadoop.pig (Mishra, 2014).
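The kind of GROUP/COUNT that a Pig Latin script expresses can be sketched as the map, shuffle, and reduce phases it ultimately compiles down to. The rows below are hypothetical prescription records, standing in for data a Pig LOAD might pull out of a Cassandra column family.

```python
from collections import defaultdict

# Hypothetical (patient, drug) rows loaded from a Cassandra column family.
rows = [
    ("p1", "metformin"), ("p2", "insulin"),
    ("p1", "insulin"), ("p3", "metformin"), ("p1", "aspirin"),
]

# Map phase: emit (key, 1) pairs, as a GROUP BY patient would.
mapped = [(patient, 1) for patient, _drug in rows]

# Shuffle phase: group intermediate pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the counts per patient, i.e. COUNT(prescriptions).
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'p1': 3, 'p2': 1, 'p3': 1}
```

A two-line Pig Latin script expressing the same GROUP and COUNT would be translated by Pig into exactly this kind of MapReduce job, without the analyst writing the phases by hand.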

Apache Cassandra in combination with Apache Storm is well suited when health information data analytics requires key-value access: Cassandra is a columnar (wide-column) store, and compared to Hadoop-based stores such as Apache Hive, it offers greater scalability and higher write speed to the database store. Apache Cassandra is designed to serve health information data analytics without a single point of failure, and read and write operations do not conflict with each other. It can scan through millions of healthcare transactions, from electronic health records to billing systems, at blazing speed. Under the CAP theorem, Cassandra trades strict consistency for availability and partition tolerance, offering tunable, eventual consistency so that health information analytics dashboards can display the most up-to-date and accurate data (Saxena, 2015).

In a large-scale distributed computing environment for a health information data analytics system, it is possible to have multiple data centers in the cloud-computing environment. In this scenario, Cassandra clusters have to be set up across the data centers to keep the system failure-resilient and avoid a single point of failure and centralized disasters. Apache Cassandra has to be installed on each node of the data centers, with an IP address allocated per node in the cluster. Apache Cassandra in such an environment can be accessed through multiple client APIs such as the Thrift protocol, Hector, the DataStax Java driver, and Astyanax.

In some cases of health information analytics, hospital emergency departments dealing with epidemics and outbreaks of diseases will sift through the call detail records (CDRs) generated from telephone calls to understand the patterns, trends, and geographical locations of the incidents. This can be achieved through an Apache Storm topology in combination with Cassandra. The CDR data flows from the telecom provider to the ER units of the hospitals through a real-time funnel in the Apache Storm topology layer, transfers through a RabbitMQ middleware broker, and reaches the spouts of Apache Storm. The data is further processed by bolts that decipher the CDR records, and is finally written out by a Cassandra writer bolt. The data gets written, leveraging the Hector API, to large-scale Cassandra clusters installed on the nodes. When a node joins a Cassandra cluster for the first time, it announces itself to the ring; this process is dubbed bootstrapping. Apache Cassandra handles error and failure scenarios by identifying a node or cluster that is not performing and recovering the process. Scaling an Apache Cassandra cluster does not require downtime, and Cassandra also allows the replication factor to be updated for fault tolerance as the number of nodes grows exponentially.
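The ring and bootstrapping behavior described above can be illustrated with a minimal consistent-hashing sketch in plain Python. This is not Cassandra's actual partitioner (Cassandra uses Murmur3 tokens and virtual nodes); the IP addresses and the MD5-based token function here are illustrative assumptions only.

```python
import hashlib
from bisect import bisect_left

def token(value):
    """Hash a value onto a 0..2**32 ring (illustrative, not Cassandra's Murmur3)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 2**32

class Ring:
    """Minimal token ring: each node owns the arc of keys up to its token."""
    def __init__(self, nodes):
        self.tokens = sorted((token(n), n) for n in nodes)

    def bootstrap(self, node):
        """A joining node announces its token to the ring (bootstrapping)."""
        self.tokens = sorted(self.tokens + [(token(node), node)])

    def owner(self, key):
        """The first node clockwise from the key's token owns the key."""
        ts = [t for t, _ in self.tokens]
        i = bisect_left(ts, token(key)) % len(self.tokens)
        return self.tokens[i][1]

# Hypothetical three-node cluster; a fourth node bootstraps with no downtime.
ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
before = ring.owner("patient-42")
ring.bootstrap("10.0.0.4")
after = ring.owner("patient-42")
```

Only the keys falling on the new node's arc change owners when it bootstraps, which is why the rest of the cluster keeps serving reads and writes while the ring rebalances.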

The MongoDB NoSQL database is best suited for healthcare information analytics when there is a need for document stores. MetLife has implemented the MongoDB NoSQL database. MetLife has more than 70 disparate healthcare systems, claims management systems, and a number of other healthcare data sources, spanning structured, unstructured, and semi-structured databases from different software vendors alongside home-grown database systems. It required a NoSQL data ingestion process to glean the data out of all these systems without building schema designs and data warehousing architectures through extraction, transformation, and loading processes. MetLife must also meet FDA compliance regulations and holds large-scale images of death certificates and health information records. MetLife stored all the customer health information records in JSON (JavaScript Object Notation) format. MongoDB could extract all the data as JSON documents, thus allowing MetLife to bring a unified view of individual-policy and group-policy documents together from raw data without applying cardinality rules such as data normalization. MetLife leverages a service-oriented architecture with web services. The prototype was performed on two million documents, and the entire implementation from project start to go-live took 90 days (Henschen, 2013).
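The unified-view idea can be sketched with plain JSON documents. The customer, policy fields, and employer name below are hypothetical; the point is that one document per customer can embed records from disparate source systems without first normalizing them into a shared relational schema.

```python
import json

# Hypothetical policy records from two disparate source systems.
individual_policy = {"customer_id": "c-100", "type": "individual",
                     "coverage": {"dental": True}}
group_policy = {"customer_id": "c-100", "type": "group",
                "employer": "Acme Corp"}

# A single JSON document per customer unifies both views; no joins or
# normalization across the source systems are required first.
unified = {"customer_id": "c-100",
           "policies": [individual_policy, group_policy]}

document = json.dumps(unified)   # the string a document store would persist
restored = json.loads(document)  # round-trips intact
print(len(restored["policies"]))  # 2
```

In MongoDB this document would be inserted into a collection as-is, and the per-customer view is a single-document read rather than a multi-table join.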

References

Henschen, D. (2013). MetLife Uses NoSQL For Customer Service Breakthrough. Retrieved July 20, 2016, from http://www.informationweek.com/software/information-management/metlife-uses-nosql-for-customer-service-breakthrough/d/d-id/1109919?

Mishra, V. (2014). Beginning Apache Cassandra Development (1st ed.). New York City, NY: Apress.

Saxena, S. (2015). Real-time Analytics with Storm and Cassandra. Birmingham, UK: Packt Publishing.
