Uncovering the Truth with Plotly
Data Visualization in Python
Plotly is a data visualization company founded in 2012 by Alex Johnson, Jack Parmer, Chris Parmer, and Matthew Sundquist. Its vision was a platform for interactive, web-based visualizations that could be easily shared and embedded in web applications. Plotly later expanded its offerings, adding support for Python, R, and MATLAB through its APIs and providing businesses, organizations, and individuals with a range of visualization tools and services. Today, Plotly is one of the most popular visualization tools available. Its Python library, sometimes called “plotly.py,” is an open-source library for creating interactive, web-based visualizations that can be displayed in Jupyter notebooks or exported as standalone HTML files.
In this article, I will use the plotly.graph_objects module, usually imported as ‘go.’ Plotly graph objects are instances of Python classes that represent the parts of a visualization, such as scatter plots, bar charts, and histograms. Graph objects provide a high degree of control over the visualization and are ideal for creating complex, custom visualizations. We will use Plotly to study a compelling case from ‘The Art of Statistics’ by David Spiegelhalter.
Harold Shipman, a family doctor based in Manchester, became Britain’s most prolific convicted murderer by administering a massive opiate overdose to at least 215 of his mostly elderly patients between 1975 and 1998. Shipman was caught after he forged the will of one of his victims and her daughter became suspicious. Further investigation revealed that he had been retrospectively changing patient records to make his victims appear sicker than they really were. Shipman was an early adopter of technology but was not tech-savvy enough to realize that his changes were time-stamped, which eventually led to his downfall. Fifteen of his patients who had not been cremated were exhumed, and lethal levels of diamorphine were found in their bodies. In 1999, Shipman was tried for fifteen murders but chose not to offer any defense; he was subsequently found guilty and jailed for life. A public inquiry was set up to determine whether Shipman had committed any other crimes and whether he could have been caught earlier. Several statisticians, including Spiegelhalter, gave evidence at the inquiry, which concluded that Shipman had murdered at least 215 of his patients.
We are going to use two datasets hosted on GitHub, loaded directly from the URLs in the code below:
import pandas as pd
import numpy as np
import plotly.graph_objects as go # graph objects
import plotly.offline as pyo # to save the visualization in HTML format
url = "https://raw.githubusercontent.com/dspiegel29/ArtofStatistics/master/00-1-age-and-year-of-deathofharold-shipmans-victims/00-1-shipman-confirmed-victims-x.csv"
# Records of all victim details
shipman_victims = pd.read_csv(url)
# Hour of death for Shipman's patients compared with patients of other local family doctors
shipman_times = pd.read_csv('https://raw.githubusercontent.com/dspiegel29/ArtofStatistics/master/00-2-shipman-times/00-2-shipman-times-x.csv')
After loading both datasets, the ‘shipman_victims’ dataset has 215 records and ‘shipman_times’ has 24 records.
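A quick sanity check after loading can confirm the record counts and column types. The sketch below uses a tiny stand-in frame with made-up rows so that it runs without a download. The ‘dd-Mon-yy’ date format is an assumption for illustration and should be checked against the real file.

```python
import pandas as pd

# Stand-in rows with made-up values; in the article these frames come from
# pd.read_csv(...). The 'dd-Mon-yy' date format is an assumption.
sample = pd.DataFrame({
    'DateofDeath': ['01-Jan-80', '15-Jun-95', '30-Nov-97'],
    'Age': [70, 82, 65],
})

# Number of records and columns, as (rows, columns).
n_rows, n_cols = sample.shape

# Before conversion the date column is plain text, not a datetime dtype.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(sample['DateofDeath'])

# An explicit format avoids ambiguous day/month guessing.
sample['DateofDeath'] = pd.to_datetime(sample['DateofDeath'], format='%d-%b-%y')
is_datetime_after = pd.api.types.is_datetime64_any_dtype(sample['DateofDeath'])

years = sample['DateofDeath'].dt.year.tolist()
print(n_rows, n_cols, is_datetime_before, is_datetime_after, years)
```

On the real data, the same `shape` check should report 215 rows for ‘shipman_victims’ and 24 for ‘shipman_times’.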
Scatter Plot to visualize ‘Date of Death’ and ‘Age’:
# 'DateofDeath' is loaded as an object (string) column, so convert it to datetime:
shipman_victims['DateofDeath'] = pd.to_datetime(shipman_victims['DateofDeath'])
# 'data' is always passed as a list in graph objects:
# go.Scatter - scatter plots:
data = [go.Scatter(x=shipman_victims['DateofDeath'],
                   y=shipman_victims['Age'],
                   mode='markers',
                   marker=dict(size=12,
                               color='rgb(51,51,153)',
                               line={'width': .2}
                               )
                   )
        ]
# go.Layout is a class that represents the layout of a plot;
# you can use it to customize the plot's visual appearance:
layout = go.Layout(title='Age vs. Date of Death',  # graph title
                   xaxis=dict(title='Date of Death'),  # x-axis label
                   yaxis=dict(title='Age'),  # y-axis label
                   hovermode='closest'  # hover shows the single point nearest the cursor
                   )
# go.Figure creates a new Figure object for the visualization:
fig = go.Figure(data=data, layout=layout)
fig.show()  # display the plot in a Jupyter notebook
# or you can choose to save the plot locally:
pyo.plot(fig, filename='ScatterPlot.html')  # open as a stand-alone HTML page
Once you run the above code, hovering over a data point shows its date of death and age, and you can zoom in on any area of the plot. Note that ‘plotly.offline’ lets you save the plot on your system in HTML format. The plot shows a gap of about three years between the first recorded victim and the next, and not all of the victims were elderly: a few recorded victims were aged 40–50. The data points are also more tightly clustered from Dec 1993 to Jun 1998, reflecting the increased number of victims in that period. If we zoom in on the plot, we get something similar to the below visualization:
We can see that the recorded victims’ ages range from 43 to 90 years.
Nested Bar Graph to visualize ‘Place of Death’ and ‘Age’:
# Age brackets as (legend name, lower bound exclusive, upper bound inclusive):
brackets = [('Age less than or equal to 49 years', 0, 49),
            ('Age in 50-59 bracket', 49, 59),
            ('Age in 60-69 bracket', 59, 69),
            ('Age in 70-79 bracket', 69, 79),
            ('Age in 80-89 bracket', 79, 89),
            ('Age in 90-99 bracket', 89, 99)]
# Creating one trace per age bracket. x and y must come from the same
# filtered rows, so we count the victims in each bracket by place of death:
data = []
for name, low, high in brackets:
    in_bracket = shipman_victims[(shipman_victims['Age'] > low) & (shipman_victims['Age'] <= high)]
    counts = in_bracket['PlaceofDeath'].value_counts()
    data.append(go.Bar(x=counts.index, y=counts.values, name=name))
# go.Layout to customize the visual appearance:
layout = go.Layout(title='Place of Death by Age Bracket',
                   xaxis=dict(title='Place of Death'),
                   yaxis=dict(title='Number of Victims')
                   )
# go.Figure for visualization:
fig = go.Figure(data=data, layout=layout)
# pyo.plot to save the HTML file locally:
pyo.plot(fig, filename='NestedBarGraph.html')
From the above graph, we observe that most victims died in their own homes.
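The same observation can be checked numerically with a cross-tabulation. This is a sketch on made-up rows: the column names follow the article, but the ages and place labels are placeholders.

```python
import pandas as pd

# Made-up rows standing in for the victims data.
sample = pd.DataFrame({
    'Age': [45, 67, 72, 81, 83, 90],
    'PlaceofDeath': ['Own home', 'Own home', 'Practice',
                     'Own home', 'Nursing home', 'Own home'],
})

# Right-inclusive bins: (0, 49], (49, 59], ..., (89, 99].
bins = [0, 49, 59, 69, 79, 89, 99]
labels = ['<=49', '50-59', '60-69', '70-79', '80-89', '90-99']
sample['AgeBracket'] = pd.cut(sample['Age'], bins=bins, labels=labels)

# Rows are places, columns are age brackets, cells are victim counts.
counts = pd.crosstab(sample['PlaceofDeath'], sample['AgeBracket'])

# Total per place; idxmax gives the most common place of death.
totals = sample['PlaceofDeath'].value_counts()
most_common_place = totals.idxmax()
print(counts)
print(most_common_place)
```

`pd.cut` with its default right-inclusive bins matches the "greater than 49 and at most 59" style of bracket used for the bar traces.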
Bubble Chart to visualize the victims for whose deaths Shipman was ‘Convicted’:
# 'Decision' holds text labels, so encode it as integer codes that a colorscale can map:
decision = shipman_victims['Decision'].astype('category')
# data for bubble chart
data = [go.Scatter(x=shipman_victims['DateofDeath'],  # x-axis
                   y=shipman_victims['Age'],  # y-axis
                   text=shipman_victims['gender2'],  # hover text with the victim's gender
                   mode='markers',
                   marker=dict(
                       size=shipman_victims['Age'] / 5,  # marker size scaled by age
                       color=decision.cat.codes,  # color of markers based on the 'Decision' column
                       colorscale='Viridis',  # choose a colorscale
                       colorbar=dict(title='Decision',  # add a colorbar with title
                                     tickvals=list(range(len(decision.cat.categories))),
                                     ticktext=list(decision.cat.categories))
                   )
                   )
        ]
# layout
layout = go.Layout(
    title='Bubble Chart for Age vs. Year of Death',
    xaxis=dict(title='Year of Death'),  # x-axis label
    yaxis=dict(title='Age'),  # y-axis label
    hovermode='closest'
)
# figure
fig = go.Figure(data=data, layout=layout)
# pyo.plot to save the HTML file locally:
pyo.plot(fig, filename='Bubblechart.html')
From this bubble chart, we can see that all 15 victims for whose deaths he was convicted were women. The youngest of these victims was 49 years old. The decision for the remaining victims was ‘Unlawful Killing.’
Histogram to compare ‘Age’ with ‘Gender’:
# Adding two histograms to the 'data' list, one for 'Female' and one for 'Male':
data = [go.Histogram(
            x=shipman_victims[shipman_victims['gender'] == 0]['Age'],
            opacity=0.75,
            name='Female'
        ),
        go.Histogram(
            x=shipman_victims[shipman_victims['gender'] == 1]['Age'],
            opacity=0.75,
            name='Male'
        )]
layout = go.Layout(
    barmode='overlay',
    title="Victim's Age Comparison by Gender"
)
fig = go.Figure(data=data, layout=layout)
# pyo.plot to save the HTML file locally:
pyo.plot(fig, filename='Histogram.html')
From the above plot, we can conclude that women aged 80–84 were targeted the most. Overall, we can see that there were more female victims than male victims.
Let’s now visualize the second dataset, ‘shipman_times.’ It records the hour of death for patients of Harold Shipman and of other local family doctors. To compare the two, we can plot line charts.
# create the traces
shipman_trace = go.Scatter(x=shipman_times["Hour"], y=shipman_times["Shipman"],
                           mode="lines+markers", name="Shipman")
comp_trace = go.Scatter(x=shipman_times["Hour"], y=shipman_times["Comparison"],
                        mode="lines", name="Comparison GPs", line=dict(dash="dash"))
layout = go.Layout(title="Deaths by Hour of Day",
                   xaxis=dict(title="Hour of Day"),
                   yaxis=dict(title="% of Deaths", range=[0, 16]))
fig = go.Figure(data=[shipman_trace, comp_trace], layout=layout)
# update the legend font size and position
fig.update_layout(legend=dict(font=dict(size=16), orientation="h",
                              yanchor="bottom", y=1.02, xanchor="right", x=1))
# add annotations
fig.add_annotation(x=12, y=14, text="Shipman", font=dict(size=14, color="blue"), showarrow=False)
fig.add_annotation(x=4, y=7, text="Comparison GPs", font=dict(size=14, color="red"), showarrow=False)
fig.show()
# pyo.plot to save the HTML file locally:
pyo.plot(fig, filename='Linechart.html')
From the above plot, we can see a striking difference between the hour of death of Shipman’s patients and that of other local family doctors’ patients. Most of his victims’ deaths were recorded in the afternoon.
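The peak hours can also be read off numerically instead of by eye. The sketch below uses made-up percentages: the column names follow the article, but the numbers are placeholders and not the real ‘shipman_times’ values.

```python
import pandas as pd

# Made-up percentages standing in for the 'shipman_times' data.
times = pd.DataFrame({
    'Hour': [0, 6, 12, 14, 18],
    'Shipman': [2.0, 3.0, 9.0, 14.0, 4.0],
    'Comparison': [4.0, 5.0, 4.5, 4.0, 4.5],
})

# Hour at which each series peaks.
shipman_peak_hour = times.loc[times['Shipman'].idxmax(), 'Hour']
comparison_peak_hour = times.loc[times['Comparison'].idxmax(), 'Hour']

# Hour with the largest gap between the two series, in percentage points.
gap = (times['Shipman'] - times['Comparison']).abs()
largest_gap_hour = times.loc[gap.idxmax(), 'Hour']
print(shipman_peak_hour, comparison_peak_hour, largest_gap_hour)
```

Run against the real frame, the same three lines would give the hour at which Shipman's series peaks and the hour where it departs most from the comparison group.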
In conclusion, we used Plotly to explore two interesting datasets. It is a powerful tool for creating interactive and visually appealing data visualizations. With its features, users can gain deeper insights into their data and communicate them more effectively. Whether you’re a data scientist, journalist, or business analyst, Plotly can help you tell compelling data stories that engage and inform your audience.
References:
‘The Art of Statistics’ by David Spiegelhalter
https://plotly.com/python/graph-objects/