Data scientists solve problems by bridging the gap with domain experts. Read how they accomplish this.

Sven Van Poucke
9 min read · Sep 5, 2020


Data scientists escaped from under a rock

The moment data scientists came out of the closet, they were no longer considered the geeks surviving in the most dilapidated offices at the end of the corridor. Before that moment, nobody knew exactly how they spent their time. They were often seen as moonshiners whose computer screens were never off. That was the time when data scientists creatively played around with any relevant data they could get their hands on.

Specific domain-related problems arriving at the data scientist's office were scarce, not to say absent.

Data scientists have been around for centuries as mathematicians, physicists, biologists and economists, long before business people or domain experts in general took an interest in them. For decades, they prepared for something that was not considered imaginable: a society with an explosive growth of available datasets and computational power as a commodity service.

From the moment data scientists started solving problems for domain experts, their work evolved from research to deliverables and accomplishments. A new approach arose in which both the domain expert and the data expert were required to understand each other's core business.

Human Accomplishment by Charles Murray

Domain experts speak a language covering domain-specific problems; physicians, for example, only think within the boundaries of patient-related data. Data scientists need to translate this heterogeneous data into data problems.

In my professional environment, I work as a data scientist, anesthesiologist and emergency physician. I wear both hats, as quality improvement and benchmarking have become the key performance tasks these days. I constantly ping-pong between the two worlds, trying to find the net in the middle. In a PLOS ONE paper we described this observation earlier:

“The critical care sector generates bountiful data around the clock, which can paradoxically complicate the quest for information, knowledge, and ‘wisdom’. The accumulation of clinical data has outpaced the capacity for effective aggregation and analysis aiming to support clinical quality, patient safety and integrated patient care. Intelligent data analysis promises a more efficient representation of the complex relations between symptoms, diseases and treatment. Additionally intelligent data analysis hopes for a reduction of cost of care and faster design and implementation of clinical guidelines. In this respect, the secondary use of clinical and operational data could support comparative effectiveness research, data mining, and predictive analytics. Commonly used data analysis platforms in clinical practice, frequently only provide support for data integration and monitoring, leaving all the analysis and decision taking to the clinical end-users. The clinical end-user is not in the position to constantly monitor and process the large amounts of data generated by patient monitoring and diagnostics. The potential of predictive analytics is to provide the clinical end-user with validated medical decision support and ultimately leading to more Predictive, Preventive and Personalized Medicine — PPPM. PPPM is an integrative concept in health care that enables to predict individual predisposition before onset of the disease, to provide targeted preventive measures and create treatment algorithms tailored to the person. PPPM relies on the potential of large amounts of heterogeneous data collected in medical environments (electronic health records, medical texts and images, laboratory tests etc), but also from external data of increasingly popular wearable devices, social media etc. Data driven predictive algorithms often fail to provide self explanatory models due to high-dimensionality and high-complexity of the data structure leading to unreliable models.
Also, successful predictive analytics and application of cutting edge machine learning algorithms often demands substantial programming skills in different languages (e.g. Python or R). This migrates modeling from the domain expert to the data scientist, often missing the necessary domain expertise, and vice versa, domain experts are not able to perform ad hoc data analyses without the help of experienced analysts. This leads to slow development, adoption and exploitation of highly accurate predictive models, in particular in medical practice, where errors have significant consequences (for both patients and costs). In this paper, we address this problem by exploring the potential of visual, code free tools for predictive analytics. We also review the potential of visual platforms (RapidMiner, Knime and Weka) for big data analytics. As a showcase, we integrated the MIMIC-II database in the RapidMiner data analytics platform. Data extraction and preparation was performed on a Hadoop cluster, using RapidMiner’s Radoop extension. Further, we used RapidMiner Studio in order to develop several processes that allow automatic feature selection, parameter optimization and model evaluation. The process compared several learning methods (Decision Stump, Decision Tree, Naive Bayes, Logistic Regression, Random Forest, AdaBoost, Bagging, Stacking, Support Vector Machine) in association with feature weighting and selection quantitatively assessed in terms of Correlation, Gini Selection, Information Gain and ReliefF.”
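The learner comparison described in the quoted paper was built in RapidMiner on MIMIC-II, but the same idea can be sketched in scikit-learn. The snippet below is illustrative only: it uses synthetic data (not MIMIC-II), three of the learners named above, and mutual information as a stand-in for the information-gain weighting mentioned in the quote.

```python
# Sketch: compare a few learners combined with a simple feature-weighting
# step, on synthetic data. Illustrative only; not the paper's pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)

learners = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in learners.items():
    # Keep the 10 features with the highest mutual information (an
    # information-gain analogue), then fit the learner.
    pipe = make_pipeline(SelectKBest(mutual_info_classif, k=10), clf)
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: AUC = {auc:.3f}")
```

The point of the pipeline is that feature selection happens inside each cross-validation fold, avoiding the leakage that occurs when features are selected on the full dataset first.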

It is the responsibility of the data scientist to convert these ambiguous domain expert problems into data problems, as a well-defined problem is half the battle won.

Problem Statement

More than ever, it is essential to follow a structured, stepwise approach to translate problems as defined by domain experts into data problems. Companies such as Analytics Vidhya provide courses where you learn to translate the problem description of the domain expert (e.g. in business, medicine, etc.) into a data science problem.

Data Science Project Lifecycle

The first two steps in the data science life cycle (problem statement and hypothesis generation) are often skipped, or at least deserve more attention than they usually get.

Translating the actual problem (business, medical) into a data problem can follow a stepwise approach summarized by the acronym “TOSCAR” (Trouble, Owner, Success criteria, Constraints, Actors, References). In this example I will demonstrate how “TOSCAR” can be applied prior to any analysis.

The context is the following: the APACHE IV score is widely used as an illness severity score for decision support in the critically ill. Data sources are publicly available for more than 200K ICU admissions of more than 139K unique patients across the United States. The question arises whether these databases are well calibrated for use outside the US, and whether the APACHE IV score better predicts outcome for the global population or for specific patient subgroups. Finally, it would be very helpful to unravel the attributes used for the score, where dimensionality reduction could reveal the most relevant attributes predicting outcome in the critically ill patient population.

  1. Trouble:

Outcome prediction for multi-morbidity patients is a very complicated task. Critical illness is a life-threatening multi-system process that can result in significant morbidity or mortality. In most patients, critical illness is preceded by a period of physiological deterioration, but evidence suggests that the early signs of this are frequently missed. All clinical staff have an important role to play in implementing an effective ‘Chain of Response’ that includes accurate recording and documentation of vital signs, recognition and interpretation of abnormal values, patient assessment and appropriate intervention.

The trigger responsible for a patient becoming critically ill can be obvious (e.g. major surgery, multi-trauma, sepsis), but for patients with co-morbidities (the elderly, immunosuppressed patients) the path to the ICU can also be intangible.

The data sources in this case are siloed and can be considered big data with respect to variety, volume, veracity and variability.

Missing values and outliers are frequently encountered while collecting data. The presence of missing values reduces the data available to be analyzed, compromising the statistical power of the study, and eventually the reliability of its results. In addition, it causes a significant bias in the results and degrades the efficiency of the data. Outliers significantly affect the process of estimating statistics (e.g., the average and standard deviation of a sample), resulting in overestimated or underestimated values.
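As a minimal illustration of the missingness and outlier issues just described, the sketch below uses a hypothetical two-column vital-signs table; the column names and plausibility ranges are made up for the example and are not clinical recommendations.

```python
# Sketch: quantify missingness, remove implausible values, then impute.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "heart_rate": [80, 95, np.nan, 310, 72, 88],   # 310 bpm is an artifact
    "lactate":    [1.1, np.nan, 2.4, 1.8, 9.5, 1.3],
})

# 1. Quantify missingness before choosing between imputation and exclusion.
missing_fraction = df.isna().mean()

# 2. Replace physiologically implausible values with NaN; these ranges are
#    illustrative assumptions, not clinical reference values.
plausible = {"heart_rate": (20, 250), "lactate": (0.0, 30.0)}
for col, (lo, hi) in plausible.items():
    df[col] = df[col].where(df[col].between(lo, hi))

# 3. Median imputation is less sensitive to remaining skew than the mean;
#    multiple imputation is preferable when missingness is substantial.
df_clean = df.fillna(df.median())
print(df_clean)
```

With only six rows a z-score rule would never fire, which is why the sketch uses plausibility ranges; on real ICU data, statistical outlier rules and domain-driven range checks complement each other.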

Although illness severity scores are regularly employed for quality improvement and benchmarking in intensive care units (ICU), poor generalization performance, particularly with respect to probability calibration, has limited their use for decision support.
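To make the calibration point concrete, here is a hedged sketch on synthetic data: a naive Bayes model, which is often poorly calibrated out of the box, is recalibrated with isotonic regression, and both versions are scored with the Brier score, a calibration-sensitive metric. No APACHE IV data is involved.

```python
# Sketch: measure and improve probability calibration on synthetic data.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated baseline: GaussianNB tends to push probabilities to extremes.
raw = GaussianNB().fit(X_train, y_train)
brier_raw = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])

# Recalibrate with isotonic regression via internal cross-validation.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
brier_cal = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])

print(f"Brier score, raw: {brier_raw:.3f}, recalibrated: {brier_cal:.3f}")
```

A lower Brier score indicates probabilities closer to observed outcome frequencies; this is exactly the property a severity score needs before it can support decisions in a new population.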

2. Owner:
It is interesting to note that several regions in the world consider the patient as the lawful owner of the data. Ethical committees and clinical trial units act as review committees to analyze study protocols. In Europe, the analysis follows GDPR regulations. The owner, in the context of this problem statement, is a large general hospital in Belgium.

Today, improving patient safety and the quality of health care in the international community is essential. This is pursued by offering education, publications, advisory services, and international accreditation and certification, by promoting rigorous standards of care, and by providing solutions for achieving peak performance and better patient outcomes.

3. Success Criteria

For this case study, success can be defined as successfully building, deploying and monitoring a model able to predict survival of critically ill patients using the APACHE IV score.

Does this approach result in a resilient predictive model?
Are there other ways to achieve the same result?

4. Constraints

The following questions can be asked:

Why is this problem relevant today?
What is in scope? What is out of scope?
Is there a budget to build, test, deploy and monitor a predictive model?
How could a return on investment be defined?
How balanced is the dataset to represent the entire population in terms of race, gender, age?

etc…
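The balance question above can be checked directly once the data is loaded. A minimal sketch, assuming a hypothetical admissions table with `gender` and `age_group` columns:

```python
# Sketch: cohort proportions per demographic variable; these would be
# compared against census or reference-population figures.
import pandas as pd

# Hypothetical admissions table; column names and values are assumptions.
admissions = pd.DataFrame({
    "gender": ["F", "M", "M", "F", "M", "M", "M", "F"],
    "age_group": ["65+", "45-64", "65+", "18-44", "65+", "65+", "45-64", "65+"],
})

proportions = {
    col: admissions[col].value_counts(normalize=True)
    for col in ["gender", "age_group"]
}
for col, p in proportions.items():
    print(col, dict(p.round(2)))
```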

5. Actors/Stakeholders
Patients, family, physicians (critical care physicians, emergency physicians, respiratory care physicians, nephrologists, etc.), psychologists, social workers, the CIO, CEO, Chief Medical Officer, and others all are actors playing a different role in the data science problem.

6. Reference
Was the problem already solved in the past in our population? Or in another population?

Once we understand the domain expert (medical) problem, we should frame the problem statement and then break the medical problem into smaller problems. Only then can these smaller problems be converted into data problems, and solutions found for them.

Convert business (domain expert) problem to data problem.

In this case, the breakdown into smaller problems could be:

  1. Are the databases sufficiently calibrated, and are they validated for use outside the US?
  2. Does the APACHE IV score better predict outcome for the global population or for specific subgroups?
  3. Are all attributes of the APACHE IV score required to predict outcome, or should the list of attributes be enlarged or reduced? Could a reduced APACHE IV score provide an equally valuable outcome prediction in the critically ill population?
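Question 3 (a reduced attribute list) can be prototyped by comparing a model trained on all attributes against one restricted to the top-ranked attributes. The sketch below uses synthetic data as a stand-in for the APACHE IV attributes:

```python
# Sketch: does a reduced feature set predict as well as the full set?
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 40 attributes, only a handful truly informative.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=6,
                           random_state=0)

# Full model: all 40 attributes.
full_auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=5, scoring="roc_auc").mean()

# Reduced model: keep only the 8 attributes ranked highest by an F-test.
reduced = make_pipeline(SelectKBest(f_classif, k=8),
                        LogisticRegression(max_iter=1000))
reduced_auc = cross_val_score(reduced, X, y, cv=5, scoring="roc_auc").mean()

print(f"full AUC: {full_auc:.3f}, reduced AUC: {reduced_auc:.3f}")
```

If the reduced model matches the full one, the dropped attributes add little predictive value; on real data this comparison should also account for data-collection cost per attribute.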

Hypothesis Generation

In contrast to a non-hypothesis-driven data analysis, a hypothesis-driven data analysis creates a list of hypotheses BEFORE looking at any data. This yields smaller, specific pieces to work with and provides an UNBIASED view.

A hypothesis is a possible view or assertion of a data scientist about the problem he or she is working on. It may or may not be true.

Problem statement, hypothesis levels, exhaustive set of hypotheses.

Mapping hypotheses to domain functions covers competitor analysis, demographics, behavior and seasonality trends. Comprehensive hypothesis testing requires reaching out to multiple people and departments, so it is essential to involve all the relevant people and teams in the brainstorming process. In this case, the physicians working on the project, the data engineer, the CEO and the medical director should at least be part of the discussion. A sample list of hypothesis questions is provided below.

Demographics
Would a sequential modeling approach, wherein an initial regression model assigns risk and all patients deemed high risk then have their risk quantified by a second, high-risk-specific regression model, result in a model with superior calibration across the risk spectrum?
How can one be assured that patients at risk are globally similar or is there a race difference?
Is food an influencer as observed in studies covering cardiovascular risk stratification?
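The first question above describes a two-stage design. A rough sketch of the idea, on synthetic data and with an arbitrary 0.5 risk threshold (both are assumptions for illustration):

```python
# Sketch: stage 1 assigns risk to everyone; stage 2 refits probabilities
# for the high-risk stratum only. Synthetic data, illustrative threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=12, n_informative=4,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Stage 1: initial risk for everyone.
stage1 = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
high = stage1.predict_proba(X_tr)[:, 1] > 0.5   # "deemed high risk"

# Stage 2: refit on the high-risk stratum only.
stage2 = LogisticRegression(max_iter=1000).fit(X_tr[high], y_tr[high])

# At prediction time, high-risk patients get stage-2 probabilities.
p_te = stage1.predict_proba(X_te)[:, 1]
high_te = p_te > 0.5
p_te[high_te] = stage2.predict_proba(X_te[high_te])[:, 1]
```

Whether this actually improves calibration across the risk spectrum is exactly what the hypothesis asks; it would be assessed with calibration curves or the Brier score on held-out data.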

Behaviour
What is the inter-, and intra-observer variability for scoring APACHE IV?

Seasonality trends
Is seasonality impacting the dataset, in other words, are seasons responsible for different patient outcomes in terms of mortality (30, 90 days; ICU mortality; hospital mortality)?
Do weekdays and admission times influence the APACHE IV score for the initial 24h after ICU admission?
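The seasonality hypothesis can be tested with a chi-square test on a season-by-outcome contingency table. The counts below are invented purely for illustration:

```python
# Sketch: chi-square test of independence between season and ICU outcome,
# on a hypothetical contingency table (made-up counts).
from scipy.stats import chi2_contingency

# Rows: winter, spring, summer, autumn; columns: survived, died.
table = [
    [420, 58],
    [435, 44],
    [441, 39],
    [428, 51],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# A small p-value would suggest mortality differs across seasons.
```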

Conclusion

Machine learning models can be deconvoluted to generate novel understanding of how ICU patient features from long-term and short-term events interact with each other. Explainable machine learning models are key in clinical settings, and the results emphasize how to progress towards transforming advanced models into actionable, transparent, and trustworthy clinical tools. It is essential, however, to start your data science life cycle by transforming the problem as presented by the domain expert into a data problem, followed by hypothesis building, as demonstrated in this text.

Sven Van Poucke, MD, PhD
Data Scientist, xAI Invoker, ML Geek, Anesthesiologist, Emergency Physician
Just-Aspi.com
ZOL, Genk, Belgium
