High Volume Prediction at Hospital Clinics

Avi Paz
Analytics Vidhya
8 min read · May 20, 2019


Recently one of my customers came up with an idea. He wanted to build an advanced analytics and AI solution that measures and predicts high load at hospital clinics. The motivation was to reduce clinic queue lengths and total patient waiting time. The clinic operations manager would be able to respond early enough to prevent or reduce high-load situations by making data-driven decisions for better human-resource allocation and utilization in real time.

[For example: the system can recommend in advance to reinforce sub-department #2 by moving two nurses, one doctor, and one secretary from sub-department #3 two hours before the high load occurs, thereby preventing very long waiting times for sub-department #2's patients.]

PAIN/PROBLEM DEFINITION + POTENTIAL FEATURES:

Many hospital clinics experience unexpectedly heavy patient loads. There are often situations where patients face long waiting times while some of the personnel are idle.
In many cases, reallocation takes place too late. After several days of on-site investigation inside the clinics, we identified the following reasons:

  1. Missing Doctors — In many cases doctors are unexpectedly called to emergency rooms, which disrupts the pre-planned allocation and introduces high-volume bottlenecks.
  2. Patients’ Profile — Age, previous diagnoses, admission history, native language: these parameters seem to strongly affect the total duration of stay in the clinic.
  3. “Uninvited Patients“ — These make up more than 10% of total visits per day, which prolongs “starvation” time and adds to the overall clinic load and queue delay.
  4. Complex Flow — Many clinics include a nontrivial sequence of stations (e.g. registration => prep => reference check 1 => doctor check => reference check 2 => back to doctor => next-visit registration).
  5. Weather — Can delay the arrival of patients and service providers.
  6. Public Transportation — Can delay the arrival of patients and service providers.
  7. Various service providers and expensive equipment (doctors, nurses, technicians, secretaries).

To sum up, we concluded that these are the major unexpected, nondeterministic factors that make the pre-planning process and the ad-hoc realtime response so challenging for the clinic operations manager, and we decided to start by validating these factors as candidate features for our machine learning model.

PROPOSED SOLUTION:
A realtime visualization system with high-volume prediction (60–120 minutes before occurrence) and advanced analytics tracking of the most significant measurements.

HIGH VOLUME DEFINITION SUGGESTION:
We cannot escape it. We must clarify what high load means, so let’s:

  • Decide our prediction times in minutes and mark it as pt = [60, 90, 120]
  • Define waiting stations and mark it as st = [reg, ref1..3, doctor, total]
  • Minimum duration of stay in waiting lists as wt = [0…720]

Now we can mark our prediction results as p(n, pt, st, wt)

For example: p(n, 120, ‘ref2’, 0) is the predicted number of patients in the ‘reference-2’ waiting list 120 minutes from now that are expected to wait at least 0 minutes (i.e., all waiting patients at that point in time).

Please note: we haven’t defined ’n’ yet, as this parameter will be a composition of our realtime feature vectors and will be determined only after the data exploration and data preparation phases.

Now, if at least one of the following is true, we mark the prediction as high load:

  • The p(n, x1, x2, 0) prediction result is higher than the historical average plus three standard deviations for that hour (now + x1) and that station (x2).
  • The p(n, x1, x2, 60) result shows that more than 25% of the x2 waiting list has reached at least 60 minutes of latency.
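As a sketch, the two rules above can be written as a small predicate. This is illustrative, not the production code — the function and parameter names are my own, and the thresholds are the ones stated above:

```python
from statistics import mean, stdev

def is_high_load(pred_all, pred_60min_wait, history, total_waiting):
    """Flag a prediction as high load.

    pred_all        -- p(n, x1, x2, 0): predicted queue size at now + x1 for station x2
    pred_60min_wait -- p(n, x1, x2, 60): predicted patients waiting >= 60 minutes
    history         -- historical queue sizes for the same hour and station
    total_waiting   -- predicted total waiting-list size for station x2
    """
    # Rule 1: predicted queue exceeds the historical mean by 3 standard deviations.
    rule1 = pred_all > mean(history) + 3 * stdev(history)

    # Rule 2: more than 25% of the waiting list has waited at least 60 minutes.
    rule2 = total_waiting > 0 and pred_60min_wait / total_waiting > 0.25

    return rule1 or rule2
```

Either rule alone is enough to raise the flag, which is why they are combined with `or`.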

PROOF OF CONCEPT:
We selected one of the clinics we had inspected, then took the following steps:

  1. High Level Architecture & Data Ingestion.
  2. Data Exploration.
  3. Data Prep and Feature Engineering.
  4. AutoML — Regression Model Selection & Training.
  5. DeployML — Regression Model Deployment.
  6. Visualization — Streaming dataset visualization in realtime, including ML results.

High level architecture and Data Ingestion:

Our customer has applications hosted in on-premises environments. As a first step, we wanted to replicate the data to their Azure subscription continuously.

We suggested uniting all the relevant data sources into a dedicated SQL Server in a DMZ environment and then replicating the data to Azure SQL.

Azure Data Factory and SQL Data Sync might be better choices for this synchronization job, as they are managed services.

The second step was to stream the data into a visualization dataset in (near) realtime. We provisioned a recurrence-triggered Logic App that queries the data and streams it into Event Hub.

At this point we created a new Azure Stream Analytics job to connect the dots and give the visualization platform access to the streaming dataset from the Event Hub.
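To make the flow concrete, here is a sketch of the kind of JSON event such a pipeline might push into Event Hub. The field names and function are my own illustration; the real schema is whatever the Stream Analytics query and Power BI dataset expect:

```python
import json
from datetime import datetime, timezone

def build_queue_snapshot(station, waiting, longest_wait_min):
    """Build one JSON event describing a station's current queue state."""
    return json.dumps({
        "station": station,                 # e.g. "reg", "ref2", "doctor"
        "waitingPatients": waiting,         # current queue length
        "longestWaitMinutes": longest_wait_min,
        "timestampUtc": datetime.now(timezone.utc).isoformat(),
    })

# A producer (e.g. the azure-eventhub SDK) would send this string as the
# event body, once per station per polling interval.
```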

Data Exploration:

We uploaded historic queueDB and DemographicsDB data to Azure Blob Storage for data exploration. I chose Azure Databricks, which is, in my opinion, the default managed platform for big-data exploration and preparation.

At this point I considered time-series services such as Kusto, but I chose Databricks for simplicity.

Another major advantage of this platform is its polyglot programming support. As can be seen in the attached notebook snippets, I wrote several parts in SQL, other parts in Python, and of course in my preferred programming language for Spark — Scala.
Here is Matei Zaharia’s (creator of Spark and CTO at Databricks) response about why he chose Scala over other programming languages when creating Spark.

Exploration charts (from the attached notebook): planned visits by day of week; actual visits by day of week (aggregated hourly); actual visits by hour with day-of-week comparison; clinic load heat-map; anomalies in total duration of stay by date; high duration of stay by clinic working hours.

Data Preparation and Features Engineering:

Now that we have ingested and explored the data, the next step is feature engineering.

Feature engineering is essentially the technique of deriving new features from the currently available data. There are several ways to do it; more often than not, it is simply a matter of common sense.

At this stage we decided to divide the working hours of each day (Sunday–Friday) into 15-minute buckets, marking each bucket 0 or 1 to indicate whether the patient was waiting.

actual checkin-checkout joined by patients metadata

The next step was to group by the date column and aggregate by summing up each bucket.

data divided to 15 minutes buckets
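A minimal pure-Python sketch of this bucketing and aggregation (the real pipeline used Spark; the function names, 720-minute working day, and column layout are illustrative):

```python
def visit_to_buckets(checkin_min, checkout_min, day_minutes=720, bucket_size=15):
    """Return a 0/1 vector marking the 15-minute buckets in which a patient waited.

    checkin_min / checkout_min are minutes since the clinic opened.
    """
    n_buckets = day_minutes // bucket_size
    return [
        1 if checkin_min < (i + 1) * bucket_size and checkout_min > i * bucket_size else 0
        for i in range(n_buckets)
    ]

def aggregate_day(visits):
    """Sum the per-patient bucket vectors for one date: patients waiting per bucket."""
    vectors = [visit_to_buckets(ci, co) for ci, co in visits]
    return [sum(col) for col in zip(*vectors)]
```

Grouping by date and summing per bucket is exactly what the Spark group-by does at scale; each resulting column is the queue length for one 15-minute slot.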

Now we would like to add to each bucket a vector of patient “demographics” data, such as: sex, age, duration of stay, previous diagnoses, previous admission days, etc.

adding the vectors for each feature

Now again group by date and sum up all these vectors.

group by all vectors by date for each bucket (column)

At this point I wanted to flatMap each vector within a bucket so that each column is split into 8 columns (one per cell in the vector). After the separation we need to rearrange the columns to organize the features f1, .., fn with the relevant label.
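The flattening step, sketched in plain Python (the notebook does this with Spark; the vector length of 8 comes from the feature vector described above, and the helper name is my own):

```python
def flatten_buckets(row):
    """Split each per-bucket vector (e.g. length 8) into scalar columns.

    row is a list of bucket vectors, e.g. [[...8 values...], [...8 values...], ...];
    the result is one flat feature list f1, .., fn ready to pair with the label.
    """
    return [value for bucket in row for value in bucket]
```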

I divided the dataset and rearranged it with pandas (switching from a Spark DataFrame to a pandas DataFrame is quite easy, as you only need to import pandas and invoke df.toPandas()).

toPandas() converts a Spark DataFrame to a pandas DataFrame

Model Training & AutoML:

So, after arranging the dataset into features and a label, we can try several classic supervised learning algorithms and compare the performance of these regression techniques.

For that mission I tried the Azure Automated ML UI, which is now an integrated part of the Azure Machine Learning service workspace. You can read more about the AutoML process in general.

Azure Automated ML:

Automated machine learning performance scope
generate regression models list ordered by performance

I also tried RapidMiner:

predicted results vs real results

Azure ML Studio:

In this step I uploaded the prepared dataset to Azure ML Studio and, without much effort, created the experiment and ran it.

predictions results versus reality results (label)

ML Deployment:

For deployment, we published the prediction experiment as a web service on AKS (Azure Kubernetes Service).

After the ML deployment was done, we combined the online data with the predicted data into a single SQL view using another Logic App, then streamed the view as a streaming dataset into the previously mentioned flow
(Logic App => Event Hub => Stream Analytics => Power BI).

60 and 120 minutes forward predications results flow

Another thing worth mentioning is that, together with the customer, we built another view that holds the last eight snapshots of the required parameters, one row per 15-minute bucket.

Realtime Dashboard — Power BI:

Realtime dashboard plus predicted results
predictions comparison

Next steps:

Going forward we will integrate APIs from Open Weather and Moovit. This will enrich our data with weather and public transportation information, which may improve the model accuracy.

Open Weather is quite simple, and you can make 60 calls per minute at no charge.
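As a sketch, a current-weather request against OpenWeather's free tier looks roughly like this (the coordinates and API key are placeholders; check the OpenWeather documentation for the exact parameters of your plan):

```python
from urllib.parse import urlencode

OPENWEATHER_BASE = "https://api.openweathermap.org/data/2.5/weather"

def build_weather_url(lat, lon, api_key):
    """Build a current-weather request URL for the clinic's coordinates."""
    query = urlencode({"lat": lat, "lon": lon, "units": "metric", "appid": api_key})
    return f"{OPENWEATHER_BASE}?{query}"

# urllib.request.urlopen(build_weather_url(32.08, 34.78, "YOUR_API_KEY"))
# returns a JSON document with temperature, wind, and precipitation fields,
# which can be joined to the visit data as additional features.
```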

Full Source code:
Databricks notebook source code can be found here.

I would like to thank my colleagues: Kfir Gur Ari, who assisted with the Power BI visualization parts, and Catalin Esanu and Guy Bertental, who assisted and advised me on the high-level architecture.

Avi Paz,
Cloud Solution Architect — Data & AI
Avi.Paz@microsoft.com
