How to Put your Data to Work in Healthcare

Romain Doutriaux
It’s a data world
4 min read · Jan 12, 2016


Hi there!

Quantified Self, Population Health, Patient Engagement, Telehealth, Interoperability… The Healthcare IT industry is buzzing with opportunities but still lacks a few basic standards to implement them.
That’s why I wanted to share with you a complete methodology to develop data projects in the healthcare industry.

Let’s get rid of it, bro

Step One: Order Out of Chaos, Collecting & Making Sense of Data

1) Define your Goal(s)

In order to keep costs within budget and to deliver feasible results, it is necessary to define the project goal precisely. For this example, our goal is to score the likelihood of patient no-shows in real time. The scoring would be used to identify high-risk patients and schedule the best time slots for them in order to decrease the likelihood of subsequent no-shows.

2) Collect Historical Data (Appointment Dataset)

In order to create an algorithm, the predictive analytics solution needs to work with data. If possible, provide 3 months’ worth of historical show/no-show data; if not possible, you may need to collect this data for 3 months before beginning the predictive modeling process.
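
As a rough illustration (the column names here are ours, not from the whitepaper), a minimal appointment dataset could be as simple as one labeled row per booked slot:

```python
import pandas as pd

# Hypothetical minimal appointment dataset: one row per booked slot,
# with a binary label recording whether the patient showed up.
appointments = pd.DataFrame({
    "patient_id": [101, 102, 101, 103],
    "slot_start": pd.to_datetime([
        "2015-10-05 09:00", "2015-10-05 09:30",
        "2015-10-12 14:00", "2015-10-13 08:30",
    ]),
    "showed_up": [1, 0, 1, 0],  # 1 = show, 0 = no-show
})
print(appointments.dtypes)
```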

3) Gather Workable and Clean Datasets

Next, we need to determine the datasets that will be used to establish patient scoring: in other words, the factors that will determine whether or not a patient is likely to appear for a given time slot.

Some possibilities include (a loading sketch follows the list):
• Appointment Dataset: historical data of shows and no-shows;
• Patient Datasets: age, location, health problems, diseases, children, status…
• External Sources: social mapping of the geographic area, transportation data, disease classification (i.e., the effect of a disease on the patient's lifestyle: wheelchair use, mobility, capabilities, limitations), bank holiday calendars, weather, and so on.
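
Here is a minimal pandas sketch of how these sources might be pulled together; the file and column names are illustrative assumptions, not the whitepaper's:

```python
import pandas as pd

# Hypothetical files; each source is keyed by patient_id or by date.
appointments = pd.read_csv("appointments.csv", parse_dates=["slot_start"])
patients     = pd.read_csv("patients.csv")        # age, location, status...
weather      = pd.read_csv("weather.csv", parse_dates=["date"])

# Join patient attributes onto each appointment row.
df = appointments.merge(patients, on="patient_id", how="left")

# Join daily weather onto the appointment date.
df["date"] = df["slot_start"].dt.normalize()
df = df.merge(weather, on="date", how="left")
```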

Just like Sherlock, ask yourself the right questions

Some key questions to answer: How frequently are these datasets updated? Are the updates automated? Is accurate, up-to-date data available?

4) Combine and Clean your Sources

Combine all data sources, clean the data, delete empty or incorrect fields, and ensure that the same level of detail, in terms of granularity, is applied across all data points (e.g., weather data may be available daily while appointment sheets are created on a weekly basis).
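
A minimal pandas sketch of that alignment step, assuming hypothetical daily weather columns (temp_c, precip_mm) that must be aggregated up to the weekly cadence of the appointment sheets:

```python
import pandas as pd

# Hypothetical sources: daily weather vs. weekly appointment sheets.
appts   = pd.read_csv("appointments.csv", parse_dates=["slot_start"])
weather = pd.read_csv("weather.csv", parse_dates=["date"])

# Drop rows whose label is missing or obviously wrong.
appts = appts.dropna(subset=["showed_up"])
appts = appts[appts["showed_up"].isin([0, 1])]

# Align granularity: aggregate daily weather to the week level, then
# key appointments by the same weekly period before merging.
weather["week"] = weather["date"].dt.to_period("W")
weekly = weather.groupby("week")[["temp_c", "precip_mm"]].mean().reset_index()

appts["week"] = appts["slot_start"].dt.to_period("W")
df = appts.merge(weekly, on="week", how="left")
```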

Datasets cleaning, Shutters polishing: same fight

It is common for datasets to be available in different formats (xls, calendar files…), so one of the challenges of data collection will be shaping them all into a common, processing-friendly format.
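
One possible, much-simplified normalization sketch, assuming an Excel sheet and an .ics calendar as inputs (real calendar files deserve a proper parser):

```python
import pandas as pd

# Hypothetical conversion step: pull every source into one tabular,
# CSV-friendly format before any processing.
frames = []

# Excel appointment sheets (pd.read_excel needs an engine such as openpyxl).
frames.append(pd.read_excel("clinic_a_appointments.xlsx"))

# A very simplified .ics reader: keep only DTSTART lines as slot times.
slots = []
with open("clinic_b_calendar.ics") as f:
    for line in f:
        if line.startswith("DTSTART"):
            slots.append(line.strip().split(":", 1)[1])
frames.append(pd.DataFrame({"slot_start": pd.to_datetime(slots)}))

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("appointments_normalized.csv", index=False)
```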

Step Two: A Predictive Model to Test your Hypothesis

1) Highlight and Pinpoint Distinct Features

The process of building a predictive model involves a series of normalization and optimization steps designed to determine model accuracy. Some key steps in this process include feature normalization, testing and optimization of models, determination of model accuracy, and the specification of a user strategy. After the model is defined, the data scientist needs to fit the model (guarding against overfitting), evaluate it, and ultimately validate it in order to isolate the relevant features.
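
A minimal scikit-learn sketch of the normalization and validation loop; the random data and feature count below are placeholders for the real extract:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix X (age, distance, weekday, ...) and
# labels y (1 = no-show). Random data stands in for the real extract.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)

# Normalize features, then fit a simple baseline classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())

# Cross-validation gives a first read on accuracy before any tuning.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Cross-validated scores like these are only a baseline; the testing and optimization step then tunes the model against them.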

The determination of accuracy is done by testing the underlying strategy in practice; for example, given patients who are likely to appear for a given time slot, do they actually show up as expected? How accurate is the time-slot scoring for patients who do appear? If overbooking is implemented, is it being applied correctly? These questions all need to be addressed in order to determine the accuracy of the underlying analytical model, which involves comparing real-world results with the relevant predictions. This level of additional analysis will enable a data analytics solution to further refine the model's accuracy, if needed.
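
A small sketch of that comparison, assuming we have last period's predictions and the observed outcomes (the arrays here are made up for illustration):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical arrays: what the model predicted for last week's slots
# versus what actually happened (1 = no-show, 0 = show).
predicted_no_show = [1, 0, 0, 1, 0, 1, 0, 0]
actual_no_show    = [1, 0, 1, 1, 0, 0, 0, 0]

# The confusion matrix answers the questions above directly: of the
# patients scored as likely shows, how many actually appeared, and
# how many flagged no-shows really stayed away?
print(confusion_matrix(actual_no_show, predicted_no_show))
print(classification_report(actual_no_show, predicted_no_show,
                            target_names=["show", "no-show"]))
```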

Use DSS, Save a Data Scientist

Of course, if you are using an advanced analytics software solution, many of the above steps would be automated: it would be able to clean datasets, isolate specific features, and automatically score the likelihood of patient no-shows.

2) Train Machine Learning Models on Test Datasets

If new features are added, then the models need to be re-trained. Additionally, the features should be visualized in order to determine whether they are actually relevant.
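
One cheap way to check feature relevance after a re-train is to plot importances; here is a hypothetical sketch with scikit-learn and matplotlib (data and feature names are ours):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data again; re-fit from scratch whenever a feature is added.
rng = np.random.default_rng(1)
feature_names = ["age", "distance_km", "weekday", "rain_mm", "prior_no_shows"]
X = rng.normal(size=(500, len(feature_names)))
y = rng.integers(0, 2, size=500)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# A quick bar chart of feature importances shows whether a newly
# added feature carries any signal at all.
plt.bar(feature_names, model.feature_importances_)
plt.ylabel("importance")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```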

Like what you read? Keep going! The rest is available in Dataiku's whitepaper “Advanced Analytics for Efficient Healthcare. Data Driven Scheduling to Reduce No-Shows”.

In this ebook we highlight a specific issue, no-show appointments, and show how healthcare institutions can leverage predictive analytics to discover real-world solutions to a multi-billion-dollar problem. Go and Enjoy your Read!

A wonderful Ebook

Take Care.
