Process Mining with Python tutorial: A healthcare application — Part 1

c3d3
Wonderful World of Data Science
6 min read · Jul 29, 2020

This article is the first of a tutorial series made up of the following parts:

  • Part 1 (this article): Introduction to process mining, data preprocessing and initial data exploration.
  • Part 2: Primer on process discovery using the PM4Py (Python) library to apply the Alpha Miner algorithm.
  • Part 3: Other process discovery algorithms and model representations.
  • Part 4: More holistic models which integrate control flow, time (e.g. bottlenecks, wait times), resources (e.g. personnel capacity and performance, inter-personnel relationships, department/ward capacity and performance), case attributes (e.g. patient demographics, clinical condition).

We will be working through a series of exercises (originally intended for process mining software) which come from this Process Mining in Healthcare course. You can find the complete source code and data for this tutorial series here.

For a more general introduction to process mining in Python, you might want to check out this easy-to-follow article.

Stethoscope next to a laptop.

Process mining in healthcare

In recent years, the digitisation of healthcare records and the systematisation of healthcare processes have resulted in the generation of more and more data from complex healthcare processes. There has also been growing interest in using process mining techniques to optimise and debug these processes to improve the quality and efficiency of care. Such techniques have been used to:

  • Discover processes and characterise them in process models. Different graphical languages, such as Petri nets, directly-follows graphs and Business Process Model and Notation (BPMN), might be used to represent such models.
  • Discover bottlenecks and identify opportunities for improving efficiency by analysing throughput and the time spent on each event.
  • Determine to what extent real processes adhere to those in good practice guidelines and treatment pathways.

What is a process?

In process mining, a process is a sequence of events (discrete actions) that are executed to reach a particular goal or outcome. For example, we can characterise each patient’s hospital journey as a process, starting from when they are admitted to when they are discharged. Everything that happens to them in the hospital is an event that makes up this process.

Establishing cases, events and resources

For the purposes of process mining, an event log must contain:

  • case identifiers, which track entities through the events of a process, e.g. patient id, doctor id;
  • event names/descriptions/identifiers, which give us a handle on the event types that feature in the processes;
  • time stamps (associated with each event), which allow us to determine the order and duration of events.

Optionally, you will also have resources associated with each event, e.g. the equipment or medical professional involved. Event attributes other than timestamps and resources are also possible, e.g. transactional information, costs. Attributes are also possible for cases, e.g. patient age, diagnosis, outcome, flow time, contextual attributes.
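To make these requirements concrete, here is a minimal sketch (with made-up values and a hypothetical helper function) that checks a pandas DataFrame has the three mandatory columns and a properly parsed timestamp column:

```python
import pandas as pd

# Hypothetical helper: verify that an event log has case, event and
# timestamp columns, and that the timestamps are real datetimes.
def check_event_log(df, case_col='patient', event_col='action',
                    time_col='datetime'):
    for col in (case_col, event_col, time_col):
        if col not in df.columns:
            raise ValueError(f'missing required column: {col}')
    if not pd.api.types.is_datetime64_any_dtype(df[time_col]):
        raise TypeError(f'{time_col} must be parsed as datetimes')
    return True

# A tiny made-up event log with the three required columns.
log = pd.DataFrame({
    'patient':  ['patient1', 'patient1'],
    'action':   ['First consult', 'Blood test'],
    'datetime': pd.to_datetime(['2020-01-06 09:00', '2020-01-06 10:30']),
})
print(check_event_log(log))
```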

The first thing to establish is what we are trying to find out. From this, you can decide which column to treat as the case, which to treat as the event, and (optionally) which to treat as the resource.

Here we give a worked example, which you can follow yourself by downloading the event log data here.

Here are the first few items:

Extract of the event log of hospital cases
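Before the preprocessing steps below, the event log needs to be read into a pandas DataFrame with parsed timestamps. In practice you would point pd.read_csv at the downloaded file; in this sketch a small inline extract (made-up values, with the column names used throughout this article) stands in for it so the snippet runs on its own:

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV file (hypothetical rows).
raw = io.StringIO(
    "patient,action,datetime,resource\n"
    "patient1,First consult,2020-01-06 09:00,Dr Anna\n"
    "patient1,Blood test,2020-01-06 10:30,Lab 1\n"
    "patient2,First consult,2020-01-07 09:15,Dr Ben\n"
)
# parse_dates ensures the time arithmetic later in the tutorial works.
events = pd.read_csv(raw, parse_dates=['datetime'])
print(events.shape)  # (3, 4)
```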

Let’s say our objective is to better understand the different journeys that patients go through. Then cases would be associated with the patient column, events with the action column, and resources with the resource column. (With a different objective, we might need a different mapping; e.g. we might treat the resource column as the case if we were studying the processes from the perspective of medical practitioners rather than patients.)

Preparing and exploring the event log

A first step in preparing the event log is to calculate the relative time of each event, which is the time at which the event occurs measured from the beginning of its case.

# Create a pivot table of the start (minimum) and end (maximum) timestamps associated with each case:
case_starts_ends = events.pivot_table(index='patient', aggfunc={'datetime': ['min', 'max']})
case_starts_ends = case_starts_ends.reset_index()
case_starts_ends.columns = ['patient', 'caseend', 'casestart']
# Merge with the main event log data so that for each row we have the start and end times.
events = events.merge(case_starts_ends, on='patient')
# Calculate the relative time by subtracting the process start time from the event timestamp
events['relativetime'] = events['datetime'] - events['casestart']
# Convert relative times to more friendly measures
## seconds
events['relativetime_s'] = events['relativetime'].dt.seconds + 86400*events['relativetime'].dt.days
## days
events['relativedays'] = events['relativetime'].dt.days
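As a cross-check, the same relative times can be computed without the pivot/merge round trip by using groupby with transform, which broadcasts each case's start time back onto its rows (a sketch with made-up timestamps; total_seconds() is equivalent to the dt.seconds + 86400*dt.days arithmetic above for positive timedeltas):

```python
import pandas as pd

# Made-up log with two patients.
events = pd.DataFrame({
    'patient': ['patient1', 'patient1', 'patient2', 'patient2'],
    'datetime': pd.to_datetime(['2020-01-06 09:00', '2020-01-08 10:30',
                                '2020-01-07 09:15', '2020-01-07 12:00']),
})

# transform('min') returns a Series aligned with the original rows,
# so no merge is needed.
events['casestart'] = events.groupby('patient')['datetime'].transform('min')
events['relativetime'] = events['datetime'] - events['casestart']
events['relativetime_s'] = events['relativetime'].dt.total_seconds()

print(events['relativetime_s'].tolist())  # [0.0, 178200.0, 0.0, 9900.0]
```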

The dotted chart

To get an initial feel for what processes ‘look like’, it is useful to visualise the events associated with each case over (relative) time. This can be done with what is known in the process mining community as a ‘dotted chart’. You can implement this using a scatter plot or strip plot in Python, which plots the event sequences of each case against time.

## Get an array of patient labels for the y axis - for graph labelling purposes
patientnums = [int(e) for e in events['patient'].apply(lambda x: x.strip('patient'))]
## Plot a scatter plot of patient events over relative time
ax = sns.scatterplot(x=events['relativetime_s'],
y=events['patient'], hue=events['action'])
## Set y axis ticks so that you only show every 5th patient - for readability
plt.yticks(np.arange(min(patientnums), max(patientnums)+1, 5))
Dotted chart of event occurrences within each case (patient)

This is quite difficult to read. To make the plot easier to read, we should order the cases by overall process lengths. This also gives you a better feel for the distribution of process durations.

## Compute each case's overall length, then order by it
events['caselength'] = events['caseend'] - events['casestart']
ordered = events.sort_values(by=['caselength', 'patient', 'relativetime_s'])
ax = sns.scatterplot(x=ordered['relativetime_s'], y=ordered['patient'], hue=ordered['action'])
plt.yticks(np.arange(min(patientnums), max(patientnums)+1, 5))
plt.show()
Ordered dotted chart of event occurrences within each case (patient)

The dotted chart can also be used to get an idea of the event flow over absolute time, e.g. if cases are coming in regularly, if there are weekly or daily trends.

ax = sns.scatterplot(x=events['datetime'], y=events['patient'], hue=events['action'])
plt.yticks(np.arange(min(patientnums), max(patientnums)+1, 5));
Dotted chart of cases over (absolute) time

This shows a steady flow of cases over time.

You can also study how events are distributed across resources by plotting events against the resource column.

ax = sns.scatterplot(x=events['datetime'], y=events['resource'], hue=events['action'])
Dotted chart of events with respect to resources over time.

To investigate weekly trends, you first need to convert the time stamps into days of the week.

## Get day of week 
events['weekday'] = events['datetime'].apply(lambda x: x.weekday())
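Note that weekday() returns a number from 0 (Monday) to 6 (Sunday). If you prefer readable axis labels, pandas also offers dt.day_name(); a small sketch with made-up dates:

```python
import pandas as pd

# dt.weekday gives 0 (Monday) .. 6 (Sunday); dt.day_name() gives the
# corresponding readable labels for a categorical axis.
ts = pd.Series(pd.to_datetime(['2020-07-27', '2020-07-29', '2020-08-01']))
weekday_nums = ts.dt.weekday
weekday_names = ts.dt.day_name()
print(list(zip(weekday_nums, weekday_names)))
```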

This time you should use a strip plot with jitter rather than a scatter plot, because the x axis (day of the week) is categorical.

## Strip plot
ax = sns.stripplot(x=events['weekday'], y=patientnums, hue=events['action'], jitter=0.2)
Dotted chart of cases over days of the week

Filtering events

One final thing you might want to look at is which events are shared by all processes and which are not, since in process mining it is the non-shared, differentiating events that we are interested in.

## Create a table giving the number of cases in which each event is present.
patient_events = pd.crosstab(events['patient'], events['action'])
## Visualise in a heatmap
sns.heatmap(patient_events, cmap="YlGnBu")
## Calculate the number of unique event counts
## This should be 1 for events which are shared by all patients.
nunique = patient_events.apply(pd.Series.nunique)
## Identify the events which are shared by all
shared_actions = nunique[nunique==1].index
actions_to_keep = nunique[nunique>1].index
print('The following actions are common to all cases: {}'.format(', '.join(shared_actions)))
print('The following actions are the ones that we wish to keep (not common to all cases): {}'.format(', '.join(actions_to_keep)))
Heatmap showing which events were shared (and not shared) between cases (patients)

In this example, First consult, Blood test, Second consult, Physical test and Final consult are the events shared by all patients, while Medicine, Surgery and X-ray scan are the differentiating events.
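Having identified the differentiating actions, the filtering step itself is a simple isin selection on the action column. A self-contained sketch with a made-up toy log:

```python
import pandas as pd

# Toy log (made-up): 'First consult' occurs for every patient, while
# 'Surgery' occurs only for some, so only 'Surgery' differentiates cases.
events = pd.DataFrame({
    'patient': ['p1', 'p1', 'p2'],
    'action':  ['First consult', 'Surgery', 'First consult'],
})

# Same logic as above: actions whose per-patient counts vary are kept.
patient_events = pd.crosstab(events['patient'], events['action'])
nunique = patient_events.apply(pd.Series.nunique)
actions_to_keep = nunique[nunique > 1].index

# Keep only the rows whose action differentiates between cases.
filtered = events[events['action'].isin(actions_to_keep)]
print(filtered['action'].tolist())  # ['Surgery']
```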

Coming up next

In the next article we will start using the pm4py library to discover process models.

