Practical Data Science in Python

How to do practical data science research?

Putting together principles of data representation and basic charting techniques for a research project

Thomas Wu
Geek Culture

--

image from seaborn.pydata.org

In the previous post I discussed some GitHub tutorials. In this post, I will talk about a data science topic: how to do data visualisation in Python for your dataset. So, let's get started.

A traditional data science research project has 3 steps: 1. define the dataset and your sources, 2. define the research question you would like to explore, and 3. do the coding and produce the result, optionally with a data representation to justify your conclusion.

1. Datasets: State the region and the domain category that your datasets are about.

In my example here, I will explore data about Hong Kong. The domain category is real estate, so I can choose from many datasets. The datasets chosen here are (1) Mortgage Loans Outstanding and (2) Property Price Indices.

2. Research Question: Formulate a statement about the domain category and region that you identified.

The research question is defined to be: How have the residential mortgage loans outstanding and property price indices changed over the past twenty years?

To be more objective, we should provide source links to publicly accessible datasets. These could be links to files such as CSV or Excel files, or links to websites that present data in tabular form, such as Wikipedia pages. Here are the links:

Link 1 (Private Domestic — Prices indices by Class): https://www.rvd.gov.hk/doc/en/statistics/his_data_4.xls

Link 2 (Residential mortgage survey results): https://www.hkma.gov.hk/media/eng/doc/market-data-and-statistics/monthly-statistical-bulletin/T0307.xlsx

3. Coding: From here, all is set except to get your hands dirty with some coding. We will mainly use Python and libraries like pandas, matplotlib and NumPy. The coding process involves 3 parts: preparation, data processing, and planning the data representation.

(i) Preparation: Take a look at the datasets to get an idea of: a. what the data look like, b. any missing data or outliers, and c. any data cleansing that needs to be done.

Use Python pandas library to read excel data
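Before any filtering, it helps to inspect what pandas sees. A minimal sketch of this preparation pass, using a small stand-in frame since the real worksheet layout may change between releases:

```python
import pandas as pd
import numpy as np

# Stand-in for a raw worksheet: a few monthly rows with one missing value
raw = pd.DataFrame({
    "Year": [2019, 2019, 2020],
    "Month": [11, 12, 1],
    "Amount": [1400.0, np.nan, 1450.0],  # HK$ millions
})

print(raw.head())        # a. what the data look like
print(raw.isna().sum())  # b. missing values per column
print(raw.describe())    # summary statistics help spot outliers

# c. one possible cleansing step: drop rows with no Amount
clean = raw.dropna(subset=["Amount"])
```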

(ii) Data processing: It involves firstly reading the data into a variable such as a pandas DataFrame.

Let's take the Link 2 data as an example. We can filter out the header, footer and irrelevant rows and columns, and store only the relevant data in a DataFrame using some built-in options of pandas:

import pandas as pd

df1sh1 = pd.read_excel(r'./T0307.xlsx', "T3.7", usecols=[0,1,3], skiprows=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,17,30,43,56,69], skipfooter=10)
df1sh2 = pd.read_excel(r'./T0307.xlsx', "T3.7 (old)", usecols=[0,1,3], skiprows=62, skipfooter=4)

Rename the columns in the DataFrame:

df1sh1.rename(columns={'Unnamed: 0':'Year', 'Unnamed: 1':'Month', '(百萬港元)':'Amount'}, inplace=True)  # '(百萬港元)' means '(HK$ million)'

Concatenate the two DataFrames (originating from the two source Excel worksheets):

df1 = pd.concat([df1sh2, df1sh1])
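On toy data, the effect of `pd.concat` is easy to see; note that passing `ignore_index=True` (or calling `reset_index` afterwards) renumbers the rows so the combined frame has a clean 0..n-1 index:

```python
import pandas as pd

# Toy stand-ins for the "old" and "new" worksheets
old = pd.DataFrame({"Year": [2000, 2001], "Amount": [10.0, 20.0]})
new = pd.DataFrame({"Year": [2002], "Amount": [30.0]})

# Stack rows; ignore_index renumbers 0..n-1 instead of repeating indices
combined = pd.concat([old, new], ignore_index=True)
print(combined)
```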

Secondly, do the transformations, groupings, etc.
Grouping time-series data (e.g. monthly to yearly):

df1 = df1.groupby('Year').agg({'Amount': 'sum'}).reset_index()
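On toy monthly data, the aggregation behaves like this (made-up numbers, just to show the monthly-to-yearly roll-up):

```python
import pandas as pd

monthly = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Month": [11, 12, 1],
    "Amount": [100.0, 150.0, 120.0],
})

# Sum the monthly amounts within each year; reset_index turns
# the 'Year' group keys back into an ordinary column
yearly = monthly.groupby("Year").agg({"Amount": "sum"}).reset_index()
print(yearly)
```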

Changing the metric units (e.g. from millions to billions)

df1['Amount'] = df1['Amount'] / 1000  # in billions

We should then apply the same data processing to the other dataset from Link 1, and I leave that to you as an exercise.

(iii) Think of how to represent the data. As data scientists we should strive to show the inter-relationships and find insights in the dataset. I recommend Alberto Cairo's work on the principles of truthfully representing data. Pay attention to graphic lies and misleading visuals.

Use the Visualisation Wheel tool to plan your visuals

The basic tool for plotting in Python is Matplotlib, and its reference website is remarkable for finding the resources you need. There are three main layers in the matplotlib architecture. From top to bottom, they are the Scripting layer (matplotlib.pyplot module), the Artist layer (matplotlib.artist module), and the Backend layer (matplotlib.backend_bases module). We will mainly use the top-level scripting layer to do the basic plotting:
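The layering can be seen in a few lines: `plt.bar` (scripting layer) creates `Rectangle` artists on an `Axes`, which we can then manipulate directly (Artist layer). A sketch with made-up numbers:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

years = [2018, 2019, 2020]
amounts = [1200, 1300, 1450]

bars = plt.bar(years, amounts)       # scripting layer: pyplot state machine
ax = plt.gca()                       # drop down to the Axes object
print(type(bars[0]).__name__)        # each bar is a Rectangle artist
ax.spines["top"].set_visible(False)  # customise an Artist directly
```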

Plotting a bar chart, and setting some ticks and labels on axis:

bars = plt.bar(year, outstandings, align='center', linewidth=0, width=0.5, color='black')
plt.xticks(year)
plt.xlabel('Year')
plt.ylabel('Total Loans Outstanding (in $ Billions)', color='green')

We will sometimes code on the middle Artist layer to do some customisations, like rotating the labels by 45 degrees:

plt.setp(ax1.get_xticklabels(), rotation=45)

and setting some axes to be invisible:

ax1.spines['top'].set_visible(False)
ax1.spines['left'].set_visible(False)

Finally, we can put together our own work on the proposed research question:
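As a sketch of the final figure (with made-up placeholder numbers standing in for the processed HKMA and RVD series), a twin-axis plot puts loans outstanding and the price index on one timeline:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

years = [2000, 2005, 2010, 2015, 2020]
loans = [530, 580, 700, 1100, 1500]  # placeholder: loans outstanding, $B
index = [90, 92, 150, 280, 380]      # placeholder: property price index

fig, ax1 = plt.subplots()
ax1.bar(years, loans, width=2, color="black")
ax1.set_xlabel("Year")
ax1.set_ylabel("Total Loans Outstanding (in $ Billions)", color="green")

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(years, index, color="red")
ax2.set_ylabel("Property Price Index", color="red")

ax1.spines["top"].set_visible(False)
fig.savefig("research_question.png")
```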

In summary, this post discussed a general approach to creating a data representation for a data science research project. I hope you learned something, and thanks for supporting my articles. If I have time later I am going to publish more on other data science topics, like other basic chart types (heatmaps, boxplots), machine learning, and more.

--

Thomas Wu

An IT Architect. I write stories about software development, DevOps and data science