Machine Learning algorithms for Healthcare Data analytics (Part 1)

Data Science of healthcare data analytics

There has been an explosion of big data in the healthcare field. Traditional technologies used to analyze genomics, DNA, and cancer through trial-and-error methods, as in the Human Genome Project, took more than a decade to understand and analyze the composition of DNA and the patterns in the data. Big data analytics introduced new tools and techniques for analyzing chronic diseases for prevention and cure. Genome sequencing has been used to understand the potential root causes of the tumor growth that leads to cancer. Data volumes have grown from terabytes to exabytes, and the imaging data from X-rays, CT scans, and MRI has grown rapidly as well. Big data analytics makes it possible to compare a patient’s records against a global population, separate the signal from the noise, and identify trends in tumor growth, which was not possible earlier, and thereby speed up diagnosis and treatment. Although several theories and techniques can be applied to the diagnosis of illnesses, this paper briefly reviews some of the key techniques.

Electronic Health Records

The electronic health record (EHR) maintains the entire history of a patient’s records so that the data can also be analyzed in the future. The EHR contains significant big data in the form of X-rays and physicians’ key observations on fitness and vital signs. Modern big data EHR systems draw on disparate data sources, including pharmacy, nursing, radiology units, and hospitals, over a connected network. In a conventional clinic setting, registration requires a number of forms to be filled in as administrative work. An EHR can automate a large portion of this work, including patient admission and discharge forms, billing and invoicing at check-in and checkout, and patient demographics. Laboratory systems, pharmacy systems, computerized physician order entry systems, radiology systems, and the coding systems that organize healthcare data into categories for efficient analysis are integrated into centralized EHR systems as well. EHR data can be both structured and unstructured, so processing it combines RDBMS and NoSQL techniques.

Phenotyping algorithms through machine learning for diagnosing diseases

Phenotyping algorithms can be run on EHR data from hospital disease samples to diagnose diseases. The unstructured data contains a large amount of text from physicians’ notes, diagnostics, and vital-sign records. A phenotyping algorithm sifts through a number of clinical data points: billing codes from the coding systems, radiology results, and natural language processing of the physicians’ free text. Machine learning algorithms such as support vector machines can be applied to identify rheumatoid arthritis, and combining them with patients’ prescription records improves the accuracy of predictive disease models. For example, a prescription for hypoglycemic agents suggests a pre-existing condition of diabetes. Phenotyping algorithms can also be applied to cataract surgery, and EHR data is also combined with biological data banks. With the aid of coding systems, machine learning phenotyping algorithms can be applied to detect conditions and phenotypes such as atrial fibrillation, dementia, clopidogrel metabolizer status, Type 2 diabetes, sclerosis, and Crohn’s disease.
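As a rough illustration of how these evidence sources might be combined, the following Python sketch flags a probable Type 2 diabetes case when at least two of three sources agree: a billing code, a prescribed hypoglycemic agent, or a keyword in the physician’s notes. The code prefix, drug list, keywords, and the PatientRecord structure are hypothetical and chosen only for this example. It is a simple rule-based sketch, not a validated phenotype definition and not the support vector machine approach mentioned above.

```python
from dataclasses import dataclass, field

# Hypothetical evidence lists for illustration only; a real phenotype
# definition would use a validated and much longer set of codes and drugs.
T2DM_ICD9_PREFIX = "250"  # ICD-9-CM category for diabetes mellitus
HYPOGLYCEMIC_AGENTS = {"metformin", "glipizide", "glyburide"}
NOTE_KEYWORDS = ("type 2 diabetes", "t2dm")


@dataclass
class PatientRecord:
    patient_id: str
    billing_codes: list = field(default_factory=list)
    medications: list = field(default_factory=list)
    notes: str = ""


def phenotype_evidence(record):
    """Report which of the three evidence sources support the phenotype."""
    has_code = any(c.startswith(T2DM_ICD9_PREFIX) for c in record.billing_codes)
    has_drug = any(m.lower() in HYPOGLYCEMIC_AGENTS for m in record.medications)
    has_note = any(k in record.notes.lower() for k in NOTE_KEYWORDS)
    return {"code": has_code, "drug": has_drug, "note": has_note}


def is_probable_case(record, min_sources=2):
    """Flag a patient when at least `min_sources` evidence types agree."""
    return sum(phenotype_evidence(record).values()) >= min_sources
```

Requiring agreement between sources is one common way to reduce false positives from a single stray billing code. The keyword match here is deliberately naive: it does not handle negation ("no history of type 2 diabetes"), which a real natural language processing step would need to address.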

Genetic variants can be studied for diagnosing illnesses through univariate and multivariate analysis in genome-wide association studies, using machine learning algorithms based on disease-phenotype attributes. The predictive model has to be tuned to avoid both overfitting and underfitting by choosing the statistical model with the best prediction accuracy. The methods applied are supervised machine learning methods: a model that predicts the phenotype from the genotype is built from a labeled training set, and genetic interactions are identified by analyzing high-dimensional genetic datasets. Risk models are built for genetic risk prediction. Bayesian statistics can be combined with machine learning techniques to calculate the posterior distribution. Other methods such as linear regression, logistic regression, and Elastic Net, along with variants of the support vector machine, can be applied for modeling phenotype attributes, with linear regression suited to continuous attributes and logistic regression to binary (case/control) ones.
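To make the posterior calculation concrete, here is a minimal sketch under the simplest possible assumption: the disease risk among carriers of a variant is given a Beta prior and updated with observed counts (a conjugate Beta-Binomial model). The function names and the numbers are illustrative assumptions, not taken from any of the cited studies.

```python
def beta_posterior(prior_a, prior_b, cases, non_cases):
    """Update a Beta(prior_a, prior_b) prior with binomial count data.

    With a Beta prior and a binomial likelihood, the posterior is again a
    Beta distribution whose parameters are the prior parameters plus the
    observed counts.
    """
    return prior_a + cases, prior_b + non_cases


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: 12 of 40 variant carriers have the disease.
# Starting from a uniform Beta(1, 1) prior:
post_a, post_b = beta_posterior(1, 1, cases=12, non_cases=28)  # (13, 29)
posterior_risk = beta_mean(post_a, post_b)                     # 13 / 42
```

Realistic genetic risk models involve many variants and rarely have a closed-form posterior, so they usually rely on sampling or approximation; the conjugate case is shown only because the update can be checked by hand.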

Decision trees in the healthcare field

Decision trees are heavily used in the diagnosis of illnesses in the healthcare field. In certain cases, such as autonomic neuropathy, the diagnosis requires constant monitoring. Sensors continuously collect data from the subject so that machine learning algorithms can identify patterns in the data sets. Identifying cardiovascular autonomic neuropathy from sensor data is key to understanding the vital signs of diabetes patients. This data can be analyzed with decision trees and ensemble methods, and the analysis helps to provide advanced diet and treatment plans for the subject. In the research study, data was gathered from mobile devices and the following decision tree and ensemble methods were applied.

· ADTree: This technique builds an alternating decision tree for two-class classification problems using boosting.

· J48: This C4.5 decision tree classifier can generate both pruned and unpruned trees.

· NBTree: This technique generates a decision tree with Naïve Bayes classifiers at the leaves.

· SimpleCart: This classifier generates a decision tree using minimal cost-complexity pruning.

A large number of ensemble methods are applied, such as bootstrap aggregation (bagging), which trains each classifier on a random resample of the labeled data drawn with replacement. Boosting trains classifiers in sequence on the training data, with the results of each classifier feeding into the training of the next one in succession. Wagging, MultiBoosting, and AdaBoost are a few other methods that were applied in this research study.
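The bagging idea can be sketched in a few lines of Python. This is a minimal illustration using one-split decision trees (stumps) and majority voting, written against the standard library only. It is not the implementation used in the study, and the function names and toy data are assumptions made for this example.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Fit a one-split decision tree (stump) on numeric features.

    Tries every feature and every observed value as a threshold and keeps
    the split with the fewest training errors.
    """
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    _, best_feature, best_threshold, best_left, best_right = best
    return lambda row: best_left if row[best_feature] <= best_threshold else best_right


def fit_bagging(rows, labels, n_estimators=25, seed=0):
    """Train stumps on bootstrap resamples and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_estimators):
        # Bootstrap resample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))

    def predict(row):
        votes = Counter(model(row) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Synthetic one-feature data: label 1 marks the "at risk" class.
rows = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
labels = [0, 0, 0, 1, 1, 1]
predict = fit_bagging(rows, labels)
```

Because each stump sees a different resample, individual stumps may place the threshold differently; the majority vote smooths out those differences, which is the variance reduction that bagging is used for.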

International classification of diseases

The World Health Organization, as part of the United Nations’ efforts, officially maintains coding standards to classify a number of chronic diseases, epidemics, morbidity statistics, and viruses, and to integrate hospital systems across the globe through connected network systems. ICD-9 in particular integrates with US healthcare systems.

The main challenge surrounding ICD-9 is that its large-scale classification and categorization scheme is running out of space to specify the complexity of conditions, healthcare costs, and KPIs. It also poses a number of integration challenges, because other countries such as Australia and Canada have adopted different codes. The utilization groups and diagnosis groups were able to leverage this code; however, non-PPS groups in the healthcare industry were unable to. There are both promises and perils in using this code framework. Its promises include the following:


· Primarily effective for reimbursement systems that use claims information systems.

· The code can be used to track and measure the efficiency, quality, and safety of the healthcare system.

· The code is also highly efficient in setting up the delivery system and policy framework for the healthcare industry.

· The code provides healthcare consumers with information on several outcomes of the diagnosis.

· As it is integrated with the World Health Organization, it can potentially identify the cause of an epidemic and disseminate the data through international channels, raising awareness among the public and hospitals to help control epidemics.

· It helps identify trends and practices in healthcare.

· It aids in conducting clinical trials for big pharma industries, reducing cost by sharing clinical results and shortening the new drug discovery process.


Although there are several advantages, such as reducing the cost of pharmaceutical drugs, helping to standardize drug development, and providing a collaborative framework for research and development among multiple organizations, sharing the health information of clinical trial participants from electronic medical records through this code raises a privacy concern, as it could potentially imperil the identity of healthcare consumers. According to research by Loukides, Denny, and Malin (2010), this code does not adequately address the privacy concerns. An alternative methodology should be proposed along with the policy framework.

Machine learning algorithms

Machine learning algorithms are applied to large-scale, high-dimensional healthcare datasets. Principal component analysis, a dimensionality reduction technique, is applied when creating training models to help identify false positive cases. Unsupervised methods can be combined with principal component analysis for the identification of carotid plaques in detecting cardiovascular diseases. Pattern finding can be performed through supervised and unsupervised methods. In the supervised case, the algorithm learns patterns from labeled training data and builds a classifier that predicts the outcome for test data. In the unsupervised case, methods such as K-means clustering, or unsupervised deep learning methods, identify the patterns and clusters in the data autonomously, without requiring labeled training data.
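To show what clustering without labels looks like in practice, here is a minimal K-means sketch in plain Python. It assumes the caller supplies the initial centroids, which keeps the example deterministic; practical implementations usually pick them with a randomized scheme and several restarts. The function names and the two-feature toy data are assumptions made for this illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Cluster `points` by alternating assignment and centroid update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Synthetic two-feature measurements forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignments = kmeans(points, initial_centroids=[points[0], points[3]])
```

No labels are passed in at any point: the grouping comes purely from the distances between measurements. The result does depend on the starting centroids, and a poor start can leave K-means in a bad local optimum, which is why restarts are common.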

Data mining of sensor data in medical information systems

In the medical field, large-scale big data is generated by sensors. Several sources of sensor data flow into medical information systems, such as contextual sensors, wearables, physiological sensors, and human sensors. The tools and techniques for diagnosing diseases through data mining of sensor data fall into broad categories: data collection; preprocessing of the data by separating the noise from the signal; data transformation through ETL; data modeling by applying association rules, knowledge discovery algorithms, classification models, clustering methods, and regression models; and final summarization of the KPIs obtained through the data mining. Presenting the results on business intelligence dashboards can aid physicians in diagnosing illnesses and monitoring vital signs.
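The preprocessing and summarization stages can be illustrated with a small sketch. It assumes a single stream of heart-rate samples and uses a sliding median filter to suppress isolated spikes before computing summary figures; the filter choice, window size, and function names are assumptions made for this example rather than a method prescribed by the sources.

```python
from statistics import mean, median


def median_filter(samples, window=3):
    """Replace each sample by the median of its neighbourhood.

    Isolated spikes (for example, a sensor glitch) are suppressed while the
    underlying level of the signal is preserved. Near the edges the window
    is truncated to the samples that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        neighbourhood = samples[max(0, i - half): i + half + 1]
        smoothed.append(median(neighbourhood))
    return smoothed


def summarize(samples):
    """Reduce a cleaned signal to a few figures suitable for a dashboard."""
    return {"mean": mean(samples), "min": min(samples), "max": max(samples)}


# Synthetic heart-rate readings with one implausible spike.
raw = [72, 71, 180, 73, 72]
clean = median_filter(raw)
kpis = summarize(clean)
```

A median is used instead of a moving average because a single outlier shifts an average but leaves the median of its neighbourhood almost unchanged. A real pipeline would also need to decide whether a spike is noise or a clinically meaningful event before discarding it.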

Bayesian networks

Big data analytics can aid in identifying global outbreaks such as flu based on individuals’ anonymized electronic health records. Researchers at the Department of Defense’s Science and Technology in Victoria, Australia developed the analytic tools EpiDefend and EpiAttack to identify and target outbreaks occurring globally through Bayesian network machine learning algorithms. The data is collected from large-scale environmental data on influenza, hazardous biological agents, and various other outbreaks, and the results are drawn through the probabilistic approach of Bayesian networks. The Bayesian network takes the time series of the electronic health records into consideration to track the patterns and trends of epidemics. The researchers used a Markovian dynamic Bayesian network approach together with text mining, which sifts through keywords from telephone calls, to detect outbreaks such as anthrax. Particle filtering, dynamic Bayesian network, and subject-level Bayesian network algorithms are applied to a large population to determine the outbreaks of epidemics. The text mining is applied to time series WSARE (What’s Strange About Recent Events) data sets from the emergency departments that collected the data for the preliminary investigation of the outbreak.
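The way a dynamic Bayesian network tracks an outbreak over time can be sketched with its simplest instance, a two-state hidden Markov model (outbreak or no outbreak) filtered day by day. This is an illustrative sketch only and not the EpiDefend or EpiAttack method: the function name, the binary "high count" observation, and all probabilities below are assumptions made for this example.

```python
def forward_filter(observations, prior_outbreak, p_stay_outbreak,
                   p_start_outbreak, p_high_given_outbreak, p_high_given_normal):
    """Return the probability of an outbreak after each daily observation.

    Each observation is True when the day's case count is unusually high.
    Every step first predicts the outbreak state one day ahead using the
    transition probabilities, then corrects the prediction with Bayes' rule.
    """
    belief = prior_outbreak
    history = []
    for high in observations:
        # Prediction: an outbreak either persists or newly starts.
        predicted = belief * p_stay_outbreak + (1 - belief) * p_start_outbreak
        # Likelihood of today's observation under each hidden state.
        like_outbreak = p_high_given_outbreak if high else 1 - p_high_given_outbreak
        like_normal = p_high_given_normal if high else 1 - p_high_given_normal
        # Correction: posterior probability of an outbreak given the observation.
        numerator = like_outbreak * predicted
        belief = numerator / (numerator + like_normal * (1 - predicted))
        history.append(belief)
    return history


# Two quiet days followed by three days of unusually high counts.
beliefs = forward_filter(
    [False, False, True, True, True],
    prior_outbreak=0.01,
    p_stay_outbreak=0.9,
    p_start_outbreak=0.05,
    p_high_given_outbreak=0.8,
    p_high_given_normal=0.1,
)
```

With these assumed numbers, the outbreak probability stays low on the quiet days and rises with each consecutive high-count day, since one unusual day is weak evidence but a run of them is not. Particle filtering applies the same predict-and-correct cycle when the hidden state is too complex to track exactly.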


Auffray, C., Balling, R., Barroso, I., Bencze, L., Benson, M., Bergeron, J., … Bock, C. (2016, June 23). Making sense of big data in health research: Towards an EU action plan. US National Library of Medicine National Institute of Health, 8–71.

Dawson, P., Gailis, R., & Meehan, A. (2015). Detecting disease outbreaks using a combined Bayesian network and particle filter approach. Retrieved July 17, 2016, from

Hazelwood, A. (2003). ICD-9 CM to ICD-10 CM: Implementation Issues and Challenges. Retrieved July 17, 2016, from

Kelarev, A. V., Stranieri, A., Yearwood, J. L., & Jelinek, H. F. (2012, September 28). Empirical Study of Decision Trees and Ensemble Classifiers for Monitoring of Diabetes Patients in Pervasive Healthcare. IEEE, 441–446.

Loukides, G., Denny, J. C., & Malin, B. (2010). The disclosure of diagnosis codes can breach research participants’ privacy. PMC, 17(3), 322–327.

Okser, S., Pahikkala, T., Airola, A., Salakoski, T., Ripatti, S., & Aittokallio, T. (2014, November 13). Regularized Machine Learning in the Genetic Prediction of Complex Traits. PLOS Genetics.

Reddy, C. K., & Aggarwal, C. C. (2015). Healthcare Data Analytics (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series). Boca Raton, Florida: Chapman and Hall/CRC.