Python 101: Back to Basics — Part 2

Population Data and Per Capita Calculations

Wayne Berry
Tech News & Articles
10 min read · Jun 28, 2023


Preface

Following on from part 1 of my Back to Basics series, we’re working our way towards the fun stuff! But before we get to the fun visualisations, we need to do a bit more wrangling of the data.

Preparation

As with part 1, I’ll be using Jupyter Notebook. If you are just joining this series, please start with part 1 and work your way through, or at least grab the full code at the end of that tutorial, as you’ll need to run it all before moving on to part 2.

Open up your notebook from part 1 and let’s get started.

Step 1 — Load Data From MySQL

If you chose not to load the part 1 data into MySQL, you’ll need to re-run all of the part 1 code to recreate your dataframes, and you can ignore this next step.

For those who stored the data from part 1 in MySQL, we’re going to extract that data into a dataframe. Before we do, let’s reload our Python packages and then move on to loading the data.

import pandas as pd
from datetime import datetime
from numpy import nan

# load data from MySQL

# import the module
from sqlalchemy import create_engine

# create the SQL query
myQuery = '''SELECT *
FROM crimes
'''

# create the SQLAlchemy engine (substitute your own host and password)
engine = create_engine("mysql+pymysql://{user}:{pw}@{host}/{db}"
                       .format(user="crimes",
                               pw="enter your password",
                               host="enter your MySQL host",
                               db="crimes"))

df = pd.read_sql_query(myQuery, engine)

df.head()

Hopefully you’ve had success and your dataframe output looks like the one above. If we were working in the real world, we’d add some error handling to the database connection script to deal with failed connections, but that’s beyond the scope of this tutorial.
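If you’d like a flavour of what that could look like, below is a minimal sketch (not production grade, and just one way you might structure it) that wraps the connection and query in a try/except using SQLAlchemy’s base exception class.

# minimal sketch only: wrap the connection and query in a try/except
from sqlalchemy import create_engine
from sqlalchemy.exc import SQLAlchemyError

try:
    engine = create_engine("mysql+pymysql://{user}:{pw}@{host}/{db}"
                           .format(user="crimes",
                                   pw="enter your password",
                                   host="enter your MySQL host",
                                   db="crimes"))
    df = pd.read_sql_query(myQuery, engine)
except SQLAlchemyError as e:
    # in a real job you'd log this and retry or fail gracefully rather than just print
    print("Database connection or query failed:", e)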

Step 2 — Population Data

To give us a common measure to analyse the crime data against, we’re going to import population data for each LGA (Local Government Area) and then create a measure showing offences per 100,000 capita.

# load the Excel file into a pandas dataframe
pop = pd.read_excel('https://australiademo.opendatasoft.com/api/explore/v2.1/catalog/datasets/nsw-population-projection-2031-lga-level0/exports/xlsx?lang=en&timezone=Australia%2FBrisbane&use_labels=true')

# print out the first few rows of data
pop.head()

You’ll see that in the source we’ve used for this example, we have the actual 2011 population and a projection for every five years after that. For this example we’ll pick the projection nearest to our data (the 2021 projection).

It is important to note that if we were doing this analysis in the real world, we would endeavour to find the actual population for each year to match our crimes dataset. If that were unavailable, we’d at the very least extrapolate a projection for each year using the average population growth per year.

By using only a single year, as we are here, we won’t see a complete representation of longitudinal changes in crime between years. We also won’t be able to do any analysis around population migration trends between LGAs and the associated impacts on crime (this would be particularly interesting during the COVID years, as we know there was significant migration between LGAs).
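As an aside, here is a minimal sketch of the kind of yearly extrapolation mentioned above. It assumes the download also contains a 'Projection 2016' column (check pop.columns for the real labels) and simply interpolates linearly between two projection points.

# illustrative sketch only: 'Projection 2016' is an assumed column name
# average annual growth between the 2016 and 2021 projection points
annual_growth = (pop['Projection 2021'] - pop['Projection 2016']) / 5

# linear estimate for an in-between year, e.g. 2019 (three years on from 2016)
pop['Estimate 2019'] = pop['Projection 2016'] + annual_growth * 3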

# keep only the columns we need
pop = pop[['LGA', 'Projection 2021']]

# duplicate LGA into a lowercase column name to match the crime dataframe, then drop the original
pop['lga'] = pop['LGA']
pop = pop.drop(columns=['LGA'], errors='ignore')

pop.head()

Before we attempt to merge the population data with our crime dataset, I’m going to do some checks on the merge columns to ensure we don’t have any mismatches. We’ll be merging on the LGA column, and there’s a good chance the two datasets have differences such as spelling mistakes, different naming conventions, or LGAs that have changed since the population dataset was created.

We’ll be doing the pandas equivalent of a SQL left join, so let’s do a test merge and return any instances where we didn’t find a match in the right table (population).

# test merge: keep any rows that only exist in the left table (crime)
popcomp = df.merge(pop, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']

popcomp = popcomp['lga'].drop_duplicates()
popcomp

We have a number of mismatches. However, before we spend time investigating, I’m going to do the same in reverse and check the right table for mismatches.

The reason for this decision is that the left table (crime) is created from a source that is updated regularly, while the population table was created a long time ago. There’s a good chance LGAs have changed, so let’s fix mismatches in the right table first; that may also address some of the mismatches in the left table.

# test merge: keep any rows that only exist in the right table (population)
popcomp_right = df.merge(pop, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'right_only']

popcomp_right = popcomp_right['lga'].drop_duplicates()
popcomp_right

We have a long list of mismatches returned (many more than displayed above).

Unfortunately the next step is a bit tedious. We’re going to have to check each of these LGAs to see if the spelling is correct, the name is correct, or if the LGA has been decommissioned and merged into another LGA.

For everyone following this, you’re in luck, I’ve done the hard work for you!

The next choice you have to make is whether to hard-code the changes using the replace method, or to create an external lookup table that you maintain separately and merge in as required.
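If you did go down the external lookup route, a minimal sketch might look like the following. The lga_lookup.csv file and its old_lga/new_lga columns are hypothetical; it would be a file you create and maintain yourself.

# hypothetical lookup file maintained separately, with columns old_lga and new_lga
lookup = pd.read_csv('lga_lookup.csv')

# build a dict of old name -> new name and apply it to the population data
lga_map = dict(zip(lookup['old_lga'], lookup['new_lga']))
pop['lga'] = pop['lga'].replace(lga_map)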

For this exercise I’m going to hard-code the changes into the code base.

# replace legacy LGA names with their current names
pop['lga'] = pop['lga'].replace(['Greater Taree',
'Murray',
'Tumbarumba',
'Bankstown',
'Leichhardt',
'Marrickville',
'Armidale Dumaresq',
'Deniliquin',
'Dubbo',
'Kogarah',
'Nambucca',
'Palerang',
'Snowy River',
'Warrumbungle',
'Wellington',
'Great Lakes',
'Urana',
'Wakool',
'Young',
'Canterbury',
'Holroyd',
'Pittwater',
'Rockdale',
'Gosford',
'Wyong',
'Boorowa',
'Ashfield',
'Auburn',
'Botany Bay',
'Warringah',
'Corowa',
'Conargo',
'Glen Innes Severn Shire',
'Gloucester',
'Guyra',
'Harden',
'Tumut',
'Upper Hunter',
'Unincorporated',
'Hurstville',
'Manly',
'Bombala',
'Cooma-Monaro',
'Cootamundra',
'Greater Hume',
'Gundagai',
'Jerilderie',
'Queanbeyan City'
],
['Mid-Coast',
'Murray River',
'Snowy Valleys',
'Canterbury-Bankstown',
'Inner West',
'Inner West',
'Armidale Regional',
'Edward River',
'Dubbo Regional',
'Georges River',
'Nambucca Valley',
'Queanbeyan-Palerang Regional',
'Snowy Monaro Regional',
'Warrumbungle Shire',
'Dubbo Regional',
'Mid-Coast',
'Federation',
'Murray River',
'Hilltops',
'Canterbury-Bankstown',
'Cumberland',
'Northern Beaches',
'Bayside',
'Central Coast',
'Central Coast',
'Hilltops',
'Inner West',
'Cumberland',
'Bayside',
'Northern Beaches',
'Federation',
'Edward River',
'Glen Innes Severn',
'Mid-Coast',
'Armidale Regional',
'Hilltops',
'Snowy Valleys',
'Upper Hunter Shire',
'Unincorporated Far West',
'Georges River',
'Northern Beaches',
'Snowy Monaro Regional',
'Snowy Monaro Regional',
'Cootamundra-Gundagai',
'Greater Hume Shire',
'Cootamundra-Gundagai',
'Murrumbidgee',
'Queanbeyan-Palerang Regional'])

pop  # display the output

I can see the top entry, which used to be Greater Taree, is now Mid-Coast, so it appears to have worked. Also note that we have 153 rows; we’ll need to refer to this row count shortly.

As good practice you should check the output for each change to ensure we didn’t mix up our replace values and end up renaming something to the wrong value. I’m going to skip that step here for expedience.
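If you did want to run that check, one quick (and admittedly partial) sketch is to confirm that none of the legacy names survive in the dataframe. The old_names list below just reuses the first list we passed to replace, truncated here for brevity.

# sanity check sketch: confirm none of the legacy names remain
old_names = ['Greater Taree', 'Murray', 'Tumbarumba', 'Bankstown']  # ...plus the rest of the list above
leftovers = pop[pop['lga'].isin(old_names)]
print(leftovers)  # should print an empty dataframe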

Left join duplicate rows

With the changes we made to the population dataset, several legacy LGAs were merged into a single new LGA, resulting in duplicate LGAs in the dataset.

If we try to merge our population and crime datasets now, we’ll end up with many duplicate records, aka a Cartesian product. This is the same behaviour we would see from a SQL left outer join when duplicate entries exist in the right table.
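To see why, here’s a tiny toy example with made-up values, showing how a duplicate key in the right table multiplies rows after a left merge.

# toy example with made-up values: two population rows share the same lga key
left = pd.DataFrame({'lga': ['Inner West'], 'count': [10]})
right = pd.DataFrame({'lga': ['Inner West', 'Inner West'], 'Projection 2021': [90000, 85000]})

# the single crime row comes back twice, once per matching population row
print(left.merge(right, on='lga', how='left'))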

To counter this issue, we’ll use groupby and sum on the population dataset to roll the duplicate rows up into a single row.

pop = pop.groupby(['lga'], as_index=False)['Projection 2021'].sum()

We can see we now have 129 rows (down from 153 earlier), so the groupby has worked.

We’ll now re-test our left and right table mismatches to see if we have any residual issues.

# Test the left table
popcomp = df.merge(pop, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']

popcomp = popcomp['lga'].drop_duplicates()
popcomp

We can see that with the corrections we made to the right table, we also fixed all but two mismatches in the left table. These two mismatches occurred because those LGAs aren’t in the population data at all. We’ll have to manually source the populations and add them to the population dataframe.

According to abs.gov.au, the population of Lord Howe Island in 2021 was 445. A search of bocsar.nsw.gov.au found the NSW prison population in December 2021 was 12,168. Let’s insert these values into our population dataset.

# insert the new rows into the dataframe using pd.concat()
new_row1 = {'lga': 'Lord Howe Island', 'Projection 2021': 445}
new_row2 = {'lga': 'In Custody', 'Projection 2021': 12168}

pop = pd.concat([pop, pd.DataFrame([new_row1, new_row2])], ignore_index=True)
print(pop)

Let’s move forward and re-test our right table mismatches.

# Test the right table
popcomp_right = df.merge(pop, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'right_only']

popcomp_right = popcomp_right['lga'].drop_duplicates()
popcomp_right

Success! We have no mismatches in our right table.

Before we attempt to merge the two datasets, we should take note of the number of rows in the crime dataset and also take a sum of the count field to use as a checkpoint, ensuring we don’t lose or gain rows or values.

It is good practice to take regular checkpoints when doing substantial manipulation of your data, such as joining tables. It is very easy to inadvertently gain or lose rows, or to have fields such as a count field change without noticing.

# sum of the count field
Total = df['count'].sum()
print(Total)

# count of rows
df.shape[0]

Let’s now try to merge the two datasets. We’ll be performing the equivalent of a left join on the LGA column, with the crime data (df) as our left table and the population (pop) table as our right table.

df = df.merge(pop, on='lga', how='left')
df

Success, we have a merge! Checking the row count, I can see it still matches the checkpoint we took earlier. We’ll check the sum of the count column just to be sure nothing has changed there.

Total = df['count'].sum()
print(Total)

Good news: the count remains unchanged! Let’s do a final check for null values just to be sure we’ve fixed all of the issues.

df[df.isnull().any(axis=1)]

The last step is to rename the Projection 2021 column to population; then our merged dataset will be ready for the per capita calculations, in preparation for creating some visualisations.

# copy the projection into a new population column
df['population'] = df['Projection 2021']

# drop the original column
df = df.drop(columns=['Projection 2021'], errors='ignore')

df.head()

Per Capita Calculations

In order to have a common denominator between LGAs and accurately compare the incidence of offences, we’ll use the per capita measure, which is a standard way of expressing population-based statistics. We’ll also adjust it to per 100,000 capita to provide a more meaningful measure that displays better on our visualisations.

If we didn’t use the per 100,000 metric and stuck with the straight per capita calculation, we’d likely end up with very small numbers (decimal fractions), which don’t present well in a visualisation.
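For example (using made-up figures), an LGA with 350 offences and a population of 50,000 works out to 350 / 50,000 = 0.007 offences per capita, versus a far more readable 700 offences per 100,000 people.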

# Per 100,000 capita

df['offence per 100k capita'] = (df['count']/df['population'])*100000
df.head()

We didn’t have any errors returned, such as divide-by-zero errors, but let’s check for any null values as a matter of due diligence.

If this were a production job, I’d apply error handling to all of my code and calculations to handle instances such as divide-by-zero errors.

Error handling is outside the scope of this tutorial.
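That said, a minimal sketch of guarding the calculation against a zero or missing population might look something like this (using numpy’s where, with NaN returned for any row we can’t calculate).

import numpy as np

# sketch only: return NaN where the population is zero or missing
df['offence per 100k capita'] = np.where(df['population'] > 0,
                                         df['count'] / df['population'] * 100000,
                                         np.nan)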

df[df.isna().any(axis=1)]

No null values! We’re now ready to move on to part 3, which is where we delve into the fun stuff of creating visualisations.

Conclusion

We’ve undertaken a lot of data wrangling in part 1 and part 2. I’ve broken it down into bite-size components that hopefully give readers new to Python and/or new to data wrangling insight into not just the coding aspect, but also the types of thought processes, checks and balances you need to adopt to prepare a new dataset for meaningful end use cases.

As you become more familiar with Python and the processes around wrangling data, you’ll be able to streamline your coding to expedite much of what was done here.

But also keep in mind that in the data wrangling and engineering space, it’s sometimes better to keep code simple and break it up into a number of modularised steps, which makes debugging quicker and easier, rather than writing complex nested operations that can be difficult to debug when they fail.

Please take a second to subscribe and follow my profile to receive notification of the release of part 3.
