Python 101: Back to Basics — Part 3

Creating visualisations using Plotly

Wayne Berry
Tech News & Articles
Jul 25, 2023 · 8 min read


Preface

Following on from part 2 of my back to basics series, we’re finally going to create some visualisations. For those who, like me, suffer from data viz OCD and don’t mind a cool-looking graph or two: your time has come!

Preparation

As with parts 1 & 2, I’ll be using Jupyter Notebook. If you are just joining this series, please start at part 1 and work your way through, or at least grab the full code from my GitHub repository linked at the end of this tutorial, as you’ll need to run it all to progress to parts 2 & 3.

Open up your previous notebook with part 1 & 2 code in it and let’s get started.

Step 1 — Histogram

Our first port of call, and something I like to do whenever I’m analysing data, is understanding its distribution. For this example we’re going to look at the offences per 100k capita across the three years of data we hold.

# Histogram
import plotly.express as px

fig = px.histogram(agg_yr, x="offence per 100k capita/mth",
                   color="year",        # colour of histogram bars set by year
                   barmode='overlay')   # overlay one year on top of another

fig.show()

For this graph we’ve overlaid each year on top of the others and set the colour to be determined by the year field.
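
If the overlay is hard to read, barmode accepts other values too. Here’s the same histogram with the years side by side instead of overlaid (a minor variation, not in the original notebook):

# Same histogram, but with the yearly bars grouped side by side
import plotly.express as px

fig = px.histogram(agg_yr, x="offence per 100k capita/mth",
                   color="year",
                   barmode='group')   # try 'group', 'overlay' or 'relative'
fig.show()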

As expected from this type of data, we’re seeing a positively (right) skewed distribution. However, we’re also seeing two peaks: one around 500 and another in the 800–900 range. This is interesting and will no doubt tell a story about the nature of the populations the more we dig into this data.

This tutorial isn’t about analysing the data, so I will minimise the narrative around that aspect from here on, only pointing out some key points of interest.

Step 2 — Scatter plot

Creating a scatter plot should show us a good representation of our offences vs population. Let’s have a look.

# Scatterplot
import plotly.express as px

fig = px.scatter(agg_yr, x="offence per 100k capita/mth", y="population",
                 color="year",
                 size="population",
                 # trendline="ols",
                 hover_data=['lga'])
fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_xaxes(gridcolor='aquamarine')
fig.update_yaxes(gridcolor='aquamarine')
fig.show()

For this graph I’ve used offences per 100k capita as my X axis and population as my Y axis. In addition, I’ve set the colour by the year field, and the size of each bubble is determined by the population. Feel free to play around with these settings to see how they change the representation of the data.

There are some interesting observations, the main one being that the lower the population, the higher the offence rate appears to be.
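
To put a rough line through that relationship, you can uncomment the trendline argument from the scatter code above. A minimal sketch, assuming the statsmodels package is installed (Plotly uses it to fit the OLS line):

# Scatterplot with an OLS trendline fitted per year
import plotly.express as px

fig = px.scatter(agg_yr, x="offence per 100k capita/mth", y="population",
                 color="year",
                 trendline="ols",      # one fitted line per colour (year)
                 hover_data=['lga'])
fig.show()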

Step 3 — Sunburst Plot

Let’s see if we can create a pie graph to show the level of crime in each LGA. We’ll use Plotly’s sunburst feature for this.

# Sunburst
import plotly.express as px

fig = px.sunburst(agg, path=['lga'], values='offence per 100k capita/mth',
                  color='offence per 100k capita/mth', hover_data=['population'],
                  color_continuous_scale='RdBu_r')
fig.show()

It’s a fairly messy graph, and given the number of LGAs we’d expect it to be. However, it gives some insights. You can hover over each LGA to highlight its stats.
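
If you want the hover text to read a little more nicely, you can tweak the hover template on the sunburst we just created. A small optional sketch (the formatting below is my own choice, not part of the original notebook):

# Tidy up the hover text: %{label} is the LGA, %{value} the per capita figure,
# and customdata[0] holds the population passed in via hover_data
fig.update_traces(
    hovertemplate="%{label}<br>"
                  "Offences per 100k capita/mth: %{value:.1f}<br>"
                  "Population: %{customdata[0]:,}"
)
fig.show()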

Let’s add year into the mix and see what comes out.

# Sunburst by year
import plotly.express as px

fig = px.sunburst(agg_yr, path=['year', 'lga'], values='offence per 100k capita/mth',
                  color='offence per 100k capita/mth', hover_data=['offence per 100k capita/mth'],
                  color_continuous_scale='RdBu_r')
fig.show()

Very much the same story for each year. There aren’t a lot of insights here.

Step 4 — Bar & Line Graph

We got a few insights from the pie graphs but they weren’t great. Let’s try a bar graph on the same cut of data.

# Bar & line graph on a secondary y axis
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(specs=[[{"secondary_y": True}]])

# Population as a line on the secondary y axis
fig.add_trace(
    go.Scatter(x=agg['lga'], y=agg['population'], name="Population", mode="lines"),
    secondary_y=True
)

# Offences per 100k capita as bars on the primary y axis
fig.add_trace(
    go.Bar(x=agg['lga'], y=agg['offence per 100k capita/mth'], name="100k capita"),
    secondary_y=False
)

# Set axis titles
fig.update_xaxes(title_text="NSW LGA")
fig.update_yaxes(title_text="100k Capita", secondary_y=False)
fig.update_yaxes(title_text="Population", secondary_y=True)

fig.update_layout(
    template='simple_white',
    paper_bgcolor='#F9F9FA',
    plot_bgcolor='#F9F9FA',
    height=550,
    margin=dict(t=20, b=20, l=60, r=40),
)

fig.show()

A couple of things to note on this graph: I set the population as a line on a secondary Y axis, the offences per 100k capita are on the main Y axis, and LGA is on the X axis.

In my opinion this gives us a better view than the pie graphs. We could break this up by each year if we wanted to, as sketched below.
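
A minimal sketch of that per-year breakdown, using Plotly Express facets on the yearly aggregate (assuming agg_yr from part 2 is still in memory):

# One panel of bars per year, faceted with Plotly Express
import plotly.express as px

fig = px.bar(agg_yr, x='lga', y='offence per 100k capita/mth',
             facet_row='year',     # one row of panels per year
             height=900)
fig.update_layout(template='simple_white')
fig.show()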

Step 5 — Dig Deeper

Our graphs above were a good start, but they don’t really tell us a whole bunch beyond high-level insights around crime ratios to population. We need to dig down into the crimes themselves and start looking at how LGAs compare against each other.

If we dig deeper we naturally increase our data points, and our graphs quickly become unreadable. Our first step will be to reduce the number of data points. I’ll achieve this by creating a new field labelled “Normalised category”, which maps the offence categories into a handful of common categories (see the sketch below for the idea).
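
Conceptually, the mapping is just a two-column lookup from each raw offence category to a broader normalised one. A tiny, purely hypothetical sketch of the shape (the real category names and mappings come from the spreadsheet below):

# Hypothetical illustration only -- the real mappings live in the Excel
# spreadsheet read in below
import pandas as pd

example_cat = pd.DataFrame({
    'offence_category':    ['Offence A', 'Offence B', 'Offence C'],   # placeholder names
    'Normalised category': ['Property crime', 'Property crime', 'Violent crime'],
})
example_cat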

In my GitHub link below you’ll find an Excel spreadsheet with the mappings. Save it somewhere accessible and read it in.

cat = pd.read_excel('/Users/wayneberry/Documents/Crime categories.xlsx', sheet_name='NSW')
cat.head(5)

Success? Let’s clean it up; there are a few columns we don’t need.

cat = cat.drop(columns=['Offence Group', 'Subcategory', 'Weighting'], errors='ignore')
cat.head(5)

Now that we’ve got our new category column, we need to merge it into our existing data. The first step will be to create a new aggregated dataframe: we’ll aggregate by LGA, offence category and year, and calculate our per capita fields.

# Create aggregated table by LGA, offence category & year
agg_off = df.groupby(
    ['lga', 'offence_category', 'year']
).agg(
    {
        'count': 'sum',              # total offences
        'population': 'max',         # population for the LGA
        'date': pd.Series.nunique,   # number of distinct months
    }
).reset_index()  # reset the index to return a dataframe

# Calculate per capita fields
agg_off['offence per 100k capita'] = (agg_off['count'] / agg_off['population']) * 100000
agg_off['offence per 100k capita/mth'] = (agg_off['count'] / agg_off['population'] * 100000) / agg_off['date']

agg_off.head()

Let’s now start our merge. First step is to check that we have no missing mappings between the main data and our new normalised category dataframe.

# Offence categories in the mapping spreadsheet but not in the main data
catcomp = cat.merge(agg_off, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']
catcomp = catcomp['offence_category'].drop_duplicates()
catcomp

# Offence categories in the main data but not in the mapping spreadsheet
catcomp = cat.merge(agg_off, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'right_only']
catcomp = catcomp['offence_category'].drop_duplicates()
catcomp

Hopefully you get the same (empty) results as I do. If you see offence categories in the output of the 2nd query (the right_only check), you will need to update the mapping spreadsheet to map the new offence to a normalised category. Alternatively, if you see an output from the 1st query (the left_only check), it means that offence is in the mapping spreadsheet but not in the main data; simply remove it from the mapping spreadsheet. Then re-run the cells from the one that reads in the mapping spreadsheet onwards.
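
If you prefer, the same check can be expressed as a plain set difference on the category column (a small optional sketch, not part of the original notebook):

# Categories in the main data with no mapping (should print an empty set)
print(set(agg_off['offence_category']) - set(cat['offence_category']))

# Categories in the mapping spreadsheet that never appear in the main data (should also be empty)
print(set(cat['offence_category']) - set(agg_off['offence_category']))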

Once your outputs return empty, let’s try the merge.

cat_agg = cat.merge(agg_off, on='offence_category', how='left')
cat_agg

Success. We have the Normalised category field in our new aggregated dataframe.
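
If you want to be doubly sure, a quick sanity check is to count the missing values left after the merge (optional):

# Should show zero missing values in every column if the mapping checks above passed
print(cat_agg.isna().sum())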

Let’s now create a scatter plot with the new data.

# Scatter plot by normalised category
import plotly.express as px

fig = px.scatter(cat_agg, x="offence per 100k capita/mth", y="population",
                 color="Normalised category",
                 size="offence per 100k capita/mth",
                 # trendline="ols",
                 hover_data=['lga'],
                 width=1400, height=600)
fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_xaxes(gridcolor='aquamarine')
fig.update_yaxes(gridcolor='aquamarine')
fig.show()

This gives us a good view of the new categories against population and per capita. Let’s now do something a little unusual and swap our X axis to the Normalised category field and our Y axis to LGA. We’ll keep the bubble size set by the per capita field.

# Scatter plot: normalised category vs LGA
import plotly.express as px

fig = px.scatter(cat_agg, x="Normalised category", y="lga",
                 color="Normalised category",
                 size="offence per 100k capita/mth",
                 # trendline="ols",
                 hover_data=['lga'],
                 width=1400, height=800)
fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_xaxes(gridcolor='aquamarine')
fig.update_yaxes(gridcolor='aquamarine')
fig.show()

This gives us an interesting view. It’s not something I would normally do with a scatter plot, but it gives us a decent view of our three dimensions: offence type, LGA and per capita. Keep in mind the axes are truncated to fit the physical size of the graph, so not all LGAs are visible.
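
One way around the truncation is to plot only a subset, for example the LGAs with the highest average per capita rates. A minimal sketch of that filtering step (the cut-off of 20 is an arbitrary choice of mine):

# Keep only the 20 LGAs with the highest average per capita rate, then re-plot
import plotly.express as px

top_lgas = (cat_agg.groupby('lga')['offence per 100k capita/mth']
            .mean()
            .nlargest(20)
            .index)

fig = px.scatter(cat_agg[cat_agg['lga'].isin(top_lgas)],
                 x="Normalised category", y="lga",
                 color="Normalised category",
                 size="offence per 100k capita/mth",
                 width=1400, height=800)
fig.show()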

Of course, it would be a lot nicer (and cooler) to plot this on a choropleth or other geospatial map, but that’s outside the scope of this tutorial.

Conclusion

That concludes the three-part back to basics series.

In Part 1 we covered data wrangling, cleansing and storage in a MySQL database.

In Part 2 we undertook deeper cleansing and wrangling, merging datasets and performing per capita calculations to normalise the data and prepare it for visualisation.

In Part 3 we created a handful of graphs to visualise our data, using different graph types and changing the axis fields to demonstrate how the visualisation changes.

This is a starting point to get you familiar with these various aspects. Please play around with the graphs and try the other graph types available in Plotly (a box plot is sketched below as one example). If you’re feeling more adventurous, why not try some of the other graphing packages, such as Seaborn.
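
As one example of another Plotly graph type, here’s a minimal box plot of the per capita distribution by year, built on the same yearly aggregate (assuming agg_yr is still in memory):

# Box plot of offences per 100k capita, one box per year
import plotly.express as px

fig = px.box(agg_yr, x="year", y="offence per 100k capita/mth",
             color="year",
             hover_data=['lga'])
fig.show()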

The complete notebook and Excel mapping file are available for download from my GitHub repository.

If this was of interest to you, please follow me for more articles and tutorials.
