Python 101: Back to Basics Part 1

Data Extraction, Manipulation, Cleansing and Storage

Wayne Berry
Tech News & Articles
13 min read · Jun 13, 2023


Photo courtesy of GagoDesign under Shutterstock licensing

Preface

I normally write opinion pieces on the data industry and on AI tech like ChatGPT, but I'm going to split my time for a while to write some Python 101 tutorials on the fundamentals of data extraction, analysis and visualisation using Python.

I realise this isn't ground-breaking stuff, and it's not showing some highfalutin function in Python that'll blow your socks off. However, in order to use the fancy features available in Python, I think it's important not to bypass the fundamentals and to be able to do basic data engineering to prepare raw data for those fancier activities.

While I might be rehashing some examples in my tutorials that have been written about many times before, I know from my own experience that when looking for a tutorial, I like to find a plethora of examples so I can find one that best fits my specific use case.

In this tutorial I'm going to give a relatively straightforward example of extracting data from an Excel file, cleansing it to ensure it's ready for visualisations, and uploading it into a MySQL database for permanent retention.

In Part 2, I will move onto creating some visualisations using Plotly and performing some basic statistical analysis of the data.

This example is derived from a real world example of a broader project I own that gathers crime intelligence and displays it in a React frontend.

Preparation

I'll be using Jupyter Notebook and Python 3.11. I have Anaconda installed and use conda and pip commands to install packages and change environments. I will be doing everything through the terminal and Jupyter Notebook.

If you aren't familiar with installing Anaconda and/or using pip or conda commands, please Google any number of tutorials that will walk you through it.

The Python packages used for this example are:

  1. pandas
  2. numpy
  3. SQLAlchemy
  4. PyMySQL
  5. openpyxl

Step 1 — Create a new environment and install packages

Using environments with Python is a simple way to keep a project working by fixing the package versions installed at the time of creating the project. For those familiar with JS frameworks, it's the equivalent of your package.json and lock file.

Open up your terminal and let’s get going.

  1. Create your environment
% conda create --name crimes python=3.11

Follow the prompts and press Y when requested. This will create an environment named “crimes” and install Python 3.11 for that environment.

2. Activate the environment

% conda activate crimes

If you get any errors like "conda command not recognised", you may have to manually set the PATH for conda. Please Google the solution; there are plenty of examples out there.

Next we want to install the packages.

3. Install the python packages

% pip install pandas
% pip install numpy
% pip install sqlalchemy
% pip install Jupyter
% pip install pymysql
% pip install openpyxl
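
If you prefer, pip also accepts all of the packages in a single command:

% pip install pandas numpy sqlalchemy jupyter pymysql openpyxl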

If you ever run into any issues installing packages using pip install, try using conda install instead; this will often get around the issue. If you still run into issues, you will need to research it on Google and work through a resolution.

It isn't strictly necessary to install numpy, as it's installed as a dependency of pandas, but we'll do it regardless to ensure it's installed correctly.

Step 2 — Create your Jupyter Notebook

  1. Find the information around your Anaconda installation
% conda info

You should see something like below. The envs directories entry shows the file path where your environment has been created.

To list all of your environments

% conda env list

Note the * indicates which environment is currently active.

2. Change directories to your new environment
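
For example, on a typical macOS or Linux Anaconda install (your path may differ; use the envs directories value shown by conda info):

% cd ~/anaconda3/envs/crimes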

3. Open Jupyter Notebook

% jupyter notebook

This will open a Jupyter notebook session in your browser. Your screen should look something like below.

On the right-hand side, find a drop-down named "New" and select Python 3 (ipykernel). This will open a new tab with an empty notebook.

Save as and name the notebook crimes.

Step 3 — Start coding

  1. Let’s load up our packages in the notebook:
import pandas as pd
from datetime import datetime
from numpy import nan

2. Run the code

Make sure your cursor is in the cell and hit the Run button (or press Shift+Enter). This should load the packages.

If you get any errors, it's likely one of two issues:

1. You haven't installed a package. Go back to the terminal and use pip install to install the missing package. Look out for any errors that may have stopped the install.

2. Your environment isn't activated and/or you didn't open Jupyter from the crimes directory. Close Jupyter in your browser and stop the terminal instance (Ctrl+C). Re-enter the commands below in the terminal:

% conda activate crimes
% cd /filepath of your crimes environment/
% jupyter notebook

Once Jupyter has re-opened, open up the crimes notebook you saved previously and try running the code to load the packages again.

Success? Let’s keep going!

3. Source Data

For this example, as you may have guessed from the name of our notebook and environment, we're going to be looking at crime data, specifically offence data from NSW, Australia, down at the Local Government Area (LGA) level. An LGA is roughly equivalent to a council, county, parish etc.

This data is readily available online in Excel format and can be downloaded from here: Recorded Criminal Incident by month — by LGA

I can already hear the experts out there wondering why we're downloading when we could just read the file straight from the web location. I hear you, and that's exactly what we'll do in a minute.

Once downloaded, feel free to open it up in Excel and have a look. It's a big dataset, captured monthly and going back to 1995. Take some time to get familiar with the data in there.

One thing to take note of is that the months are spread across columns. This will create an issue for us further down the track when trying to create visualisations of the data. We will address this in our notebook later.

Moving back to our notebook, let's read the data directly from the source using the pandas read_excel function. Enter the below into a new cell in your notebook and hit Run.

df = pd.read_excel('https://www.bocsar.nsw.gov.au/Documents/Datasets/RCI_offencebymonth.xlsm', sheet_name='Data')

Note: you may encounter a warning when reading the file. This doesn't affect our output.

Once the code has run, we’ll take a look at the output to ensure it read ok.

The pandas function df.head() lets us check a small sample. You can extend it further and nominate the number of rows you want to sample using df.head(n), where n is the number of rows to display.

df.head(5) #return 5 rows

If your screen looks like the above, well done! You've successfully loaded the data into a pandas data frame and displayed the top 5 rows.

4. Data Cleansing

The next thing we need to do is prepare the data for use. Let’s have a look at the type of data we hold in each column by using the dtypes property.

df.dtypes

I can see we have some object columns and the months are int64. We also have 339 columns, and our output has been truncated.

You may recall the issue I raised earlier around having the months in columns. Our data is essentially in pivot table format and spread both horizontally and vertically.

Thinking ahead to how I want the data to be structured for visualisations and for loading into a database, I will need to un-pivot the data, transpose the months into a single column called Date, keep the 3 columns LGA, Offence category and Subcategory, and add a new Count column.

Sounds tricky! Thankfully Pandas has a function that lets us do it in a single line of code.

The pandas melt function lets us reshape data, including un-pivoting it like we need here.
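
If you haven't used melt before, here's a tiny self-contained sketch (not part of our notebook) showing what un-pivoting looks like:

import pandas as pd

toy = pd.DataFrame({
    'LGA': ['Albury', 'Bega Valley'],
    'Jan 2021': [10, 4],
    'Feb 2021': [12, 6],
})

# un-pivot the month columns into a Date/Count pair of columns
toy.melt(id_vars=['LGA'], var_name='Date', value_name='Count')
#            LGA      Date  Count
# 0       Albury  Jan 2021     10
# 1  Bega Valley  Jan 2021      4
# 2       Albury  Feb 2021     12
# 3  Bega Valley  Feb 2021      6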

df=df.melt(id_vars=['LGA','Offence category','Subcategory'], var_name='Date', value_name='Count')

Let’s have a look at the output again.

df.head(5) #return 5 rows

Success! We have a clean looking dataset with only 5 columns and the date stacked vertically instead of horizontally!

Let’s have a look at the attributes of the data again.

df.dtypes

This looks a lot cleaner! We’ve got 3 objects, the Date column is in datetime64 format and Count is int64. All as we’d expect for the given data.

(The object dtype holds text or mixed numeric and non-numeric values.)

The next step is to tidy up column names and add a column for the state where the offence data comes from.

df['state']='nsw'
df['offence_category']=df['Offence category']
df['sub_category']=df['Subcategory']
df['count']=df['Count']
df['date']=df['Date']
df['lga']=df['LGA']

We now have the new columns and the old columns, let’s go ahead and drop the old.

df = df.drop(columns=['Offence category', 'Subcategory', 'Date', 'Count', 'LGA'], errors='ignore')

Success! We have a new column called state, and the column names are all lowercase and tidy.
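
As an aside, the same tidy-up can be done in a single step with pandas' rename method, rather than copying and dropping columns as we did above. A minimal sketch (either approach works here):

# equivalent one-step rename of the original columns
df = df.rename(columns={
    'Offence category': 'offence_category',
    'Subcategory': 'sub_category',
    'Date': 'date',
    'Count': 'count',
    'LGA': 'lga',
})
df['state'] = 'nsw'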

You may wonder why we've added the state column given it will all be "nsw" in this data. That's true; however, we may want to append another state's data in the future. Let's prepare for that now while we're in the mode of cleaning up our dataset.

It is good practice to think of possible future changes you may need to make to a dataset and apply the change early on, especially if it’s being loaded into a database for storage. You want to avoid having to rollback and reload an entire dataset or apply complex fixes in a database later down the track.

There's of course a fine line where you can spend too much time and resources trying to crystal-ball and cater for every possible future change that might be needed. This is where good planning and data modelling practice becomes important.

The next step will be to have a look at our data and see if there’s any dreaded null values in there that may trip us up down the track.

To achieve this we’ll use the Pandas function isna() (you can also use isnull() to achieve the same result).

df[df.isna().any(axis=1)]

We have a lot of null values (572k rows), but thankfully only in a single column. We have a few choices here, and what we choose depends on the outcome we want (a short sketch of each option follows the list):

  1. Remove all null values. This would delete the entire row that contains a null value. We would use dropna() for this.
  2. Replace all null values with a 0. This is beneficial if the column is an integer, such as our count column. We would use fillna(0) to achieve this.
  3. Replace all null values with some other value or, if we wish, the value of another column. We would again use the fillna() function, passing in the new value or column.
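
A minimal sketch of what each option looks like in code (all commented out; we'll only run option 3 in the next step):

# Option 1: drop every row that contains a null value
# df = df.dropna()

# Option 2: replace nulls with 0 (sensible for an integer column like count)
# df['count'] = df['count'].fillna(0)

# Option 3: fill nulls in one column with the values of another column
# df['sub_category'] = df['sub_category'].fillna(df['offence_category'])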

For our use case, given the nature of the column, we'll take option 3 and replace the null values with the value of its parent column, offence_category.

df['sub_category'] = df['sub_category'].fillna(df['offence_category'])

Note that we assign the filled values back to the sub_category column. This commits the change permanently to the df data frame.
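
Let's display the data frame again to check the result:

df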

As a quick visual check, I can see row 2728990 now has the same sub_category as its parent offence_category. When I look back at my null value snapshot from earlier, I can see it used to be a null value.

Note I didn't use df.head() to display the output this time; I used df on its own. This returns the whole data frame, truncated to display the top and bottom 5 records.

Let’s now re-test for null values just to confirm we got them all.

df[df.isnull().any(axis=1)]

Great news, we have no null values!

Our next stage of cleansing is looking at the count column. I can see there are entries with a count of 0. There's not much value in keeping these, as we're looking to visualise where offences were committed, not where they weren't. To save database space, we'll remove them.

df=df[df['count'] != 0]

The next part of the cleansing process will be to filter the date. I don't need data in my database all the way back to 1995, and good data retention and storage practice is to minimise your footprint by keeping what you need and only what you need. We'll filter back to 1 January 2021 and retain a couple of years of data.

df=df[(df['date'] >= '2021-01-01')]
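
If you want to confirm the remaining date range explicitly, a quick check (an optional extra, not part of the original notebook) is:

df['date'].min(), df['date'].max()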

Using df again, we can see our first 5 records start with the date 1 January 2021 and our last 5 records are 1 December 2022 (the latest date in the dataset).

For our final piece of cleansing, I noticed earlier that there's some text with * appended. Let's remove this to tidy up the text so it doesn't come through on our visualisations in Part 2.

df.replace({r'\*':''}, regex=True, inplace=True)

Note the \ in the condition (and the raw string prefix r''); * is a special character in regular expressions, and omitting the \ will return an error.

That’s it for our data cleansing, we’re ready to start using the data.

Step 4 — Uploading into a MySQL database

As this is a shortened example derived from a real production job I run, I'll include an example of loading the finished data into a MySQL database for long-term storage.

It isn't strictly necessary for what we're doing in this series, but for those who may want to know how to load data into MySQL, it's a useful example.

In Part 2, I'll also show you how to retrieve the data from the database and use it in Plotly.

You will need to install MySQL onto your local machine or have access to it remotely. I won't provide a tutorial on how to install MySQL or how to create the database, but Google returns plenty of examples if needed.

In MySQL, you'll want to create a user named crimes, a database named crimes and a table named crimes. Create columns to fit the data with the following attributes: id (INT, auto increment), state (VARCHAR(3)), offence_category (VARCHAR(255)), sub_category (VARCHAR(255)), date (DATE), count (INT), lga (VARCHAR(255)).
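
If you prefer to create the table from Python rather than the MySQL client, a minimal sketch using SQLAlchemy could look like the below. It assumes the crimes user and database already exist; the host and password here are placeholders you'll need to swap for your own.

from sqlalchemy import create_engine, text

ddl = """
CREATE TABLE IF NOT EXISTS crimes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    state VARCHAR(3),
    offence_category VARCHAR(255),
    sub_category VARCHAR(255),
    date DATE,
    count INT,
    lga VARCHAR(255)
)
"""

# placeholder connection details: adjust user, password and host to suit
engine = create_engine("mysql+pymysql://crimes:your_password@localhost/crimes")
with engine.begin() as conn:
    conn.execute(text(ddl))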

We'll be using the SQLAlchemy package and the PyMySQL DB driver for this example. There are several others out there, and I encourage you to explore them to become familiar with them.

Make sure you enter your DB host details (it might be localhost or 127.0.0.1 if you're working on a local machine) and your MySQL user password.

# import the module
from sqlalchemy import create_engine

# create the sqlalchemy engine (fill in your own MySQL host and password)
engine = create_engine(
    "mysql+pymysql://{user}:{pw}@{host}/{db}".format(
        user="crimes",
        pw="enter your password",
        host="enter the mysql host ip",
        db="crimes",
    )
)

# batch load the data frame into the crimes table in chunks
df.to_sql('crimes', con=engine, if_exists='append', index=False, chunksize=1000)

SQLAlchemy allows us to batch load the data into the DB in chunks, rather than loading it row by row (which can take a long time). There are pros and cons to each method, and you should research which is best for your specific circumstances.

We'll break the batch load up into chunks of 1000 records. As we have an auto-generated id column in the DB, I've set index=False for this example. This stops the pandas data frame index from being loaded as a column.

I've also set if_exists='append'. As the name suggests, if there's data already in the table, it will append this new data under the existing data. Other options include 'replace' (overwrites the existing data) and 'fail' (stops the load and returns an error message).

If all goes well, you'll see an output with the total rows loaded. Hopefully it matches the total rows in your data frame. You can check it against your data frame using len().

len(df.index)
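
If you'd like to double-check directly against the database as well, a quick count query (a sketch, reusing the engine created above) looks like this:

from sqlalchemy import text

# count the rows now sitting in the MySQL crimes table
with engine.connect() as conn:
    print(conn.execute(text("SELECT COUNT(*) FROM crimes")).scalar())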

Success! We had 90249 rows uploaded into MySQL and 90249 rows in the data frame.

Your row counts may differ from this example, as we're using a live third-party data source that is updated regularly.

Conclusion

In this tutorial we've successfully learnt how to set up a Jupyter notebook, read an Excel file into a pandas data frame, manipulate and cleanse the data to prepare it for future use, and upload the data into a MySQL database for permanent retention.

In Part 2 we will use Jupyter Notebook to retrieve the data from MySQL, undertake some simple statistical analysis of the data and create various visualisations of the data using Plotly.

If you would like to be notified when Part 2 and/or other tutorials and articles are released, please follow me and hit the subscribe button to get email notifications.

Below is the code for the entire notebook for reference.


# In[154]:


import pandas as pd
from datetime import datetime
from numpy import nan


# In[155]:


df = pd.read_excel('https://www.bocsar.nsw.gov.au/Documents/Datasets/RCI_offencebymonth.xlsm', sheet_name='Data')



# In[156]:


df.head(5) #return 5 rows


# In[157]:


df.dtypes


# In[158]:


df=df.melt(id_vars=['LGA','Offence category','Subcategory'], var_name='Date', value_name='Count')


# In[159]:


df.head(5) #return 5 rows


# In[160]:


df.dtypes


# In[161]:


df['state']='nsw'
df['offence_category']=df['Offence category']
df['sub_category']=df['Subcategory']
df['count']=df['Count']
df['date']=df['Date']
df['lga']=df['LGA']


# In[162]:


df.head(5) #return 5 rows


# In[163]:


df = df.drop(columns=['Offence category', 'Subcategory', 'Date', 'Count', 'LGA'], errors='ignore')


# In[164]:


df.head(5) #return 5 rows


# In[165]:


df[df.isnull().any(axis=1)]


# In[166]:


df['sub_category'] = df['sub_category'].fillna(df['offence_category'])


# In[167]:


df


# In[168]:


df[df.isnull().any(axis=1)]


# In[169]:


df=df[df['count'] != 0]


# In[170]:


df.head(5) #return 5 rows


# In[171]:


df=df[(df['date'] >= '2021-01-01')]


# In[172]:


df


# In[173]:


df.replace({r'\*':''}, regex=True, inplace=True)


# In[174]:


df[df['sub_category'].str.contains("Murder", na = False)]


# In[175]:


df


# In[176]:


# import the module
from sqlalchemy import create_engine

# create sqlalchemy engine
engine = create_engine(
    "mysql+pymysql://{user}:{pw}@xxxxxxxxxx/{db}".format(
        user="crimes",
        pw="xxxxxxx",
        db="crimes",
    )
)


# In[177]:


df.to_sql('crimes', con = engine, if_exists = 'append', index=False, chunksize = 1000)


# In[179]:


len(df.index)


# In[ ]:


