TGIFHacks #98 — Getting into Data Science with Pandas

Wilson Thurman Teng · NTUOSS · Aug 31, 2019

Artwork by @hyde-brendan

This workshop features a hands-on approach to learning the basics of Pandas, Matplotlib, Seaborn & Scikit-learn. Familiarity with Python syntax or programming data types (e.g. integer, float, etc.) would be useful but not required to complete this workshop.

For a hands-on walk-through of the workshop with step-by-step explanations of the code, a recording of the workshop is available (skip to 17:39 if you have already successfully set up Colab and uploaded the 3 required data files):

For the full code discussed in this workshop, please head over to this Github repository.

For errors, typos or suggestions, please do not hesitate to post an issue! Pull requests are very welcome, thank you!

Disclaimer: This document is only meant to serve as a reference for the attendees of the workshop. For full, comprehensive documentation of the Python libraries discussed, please check the official documentation linked above.

1. Introduction

This workshop is primarily about getting yourselves started with Data Analysis. We will begin by covering some background on why pandas is a good fit, then move on to how to get it running on Google Colab, and finally cover some of the basic commands you’ll be using when working on Data Analysis.

Since this is a basic workshop, we’ll walk you through the setup phase for Colab.

1.1 Google Colaboratory

Colab is a free Jupyter notebook environment that requires no setup and runs entirely (writing, running, & sharing code) on the Cloud. For those of you familiar with Jupyter Notebooks, you can think of Colab as a Jupyter Notebook stored in Google Drive. A typical notebook document is composed of cells, each of which can contain code, text or images.

1.2 Setting up Google Colab

Instructions for first-timers using Google Colab on Google Drive:

  1. Create a gmail account (if you do not have one already).
  2. Download the .zip file from this link.
  3. Extract the .zip file, find HDB_ResalePrices.ipynb.
  4. Create a new folder TGIF Pandas Basics on your Google Drive.
  5. Upload HDB_ResalePrices.ipynb inside this new folder.
  6. Click on HDB_ResalePrices.ipynb.
  7. Select “Open with”, select “Connect more apps”.
  8. Search for “colab”, press “Connect”.
  9. Select “Open with Google Colaboratory”.
  10. Drag the 3 files from Data (in your extracted .zip file) into Google Colab.

Instructions for those who have used Google Colab on Google Drive before:

  1. Download the .zip file from this link.
  2. Extract the .zip file, find HDB_ResalePrices.ipynb.
  3. Create a new folder TGIF Pandas Basics on your Google Drive.
  4. Upload HDB_ResalePrices.ipynb inside this new folder.
  5. Right-click on HDB_ResalePrices.ipynb, mouse-over "Open with" and select "Google Colaboratory".

  6. Drag the 3 files from Data (in your extracted .zip file) into Google Colab.

1.3 Data Science Process

Data Analysis typically involves the following steps:

  1. Problem statement/shaping

A Data Science Project should be business oriented. Problem statements/shaping should be done with the intent of achieving results. Can you relate the problem to data? Would data help you in practice?

2. Data Collection

Collect relevant data that matches the problem.

3. Data Cleaning/Processing

Some common errors in data to look out for are missing/corrupted values and datatype mismatches.

4. EDA(Exploratory Data Analysis)/Visualization

Find patterns in your data and compute vital statistics (e.g. mean, median, standard deviation, etc.).

5. Data Modelling

Split your data into train and test sets. Create a machine learning model from your training set for prediction.

6. Data Evaluation

Evaluate your model using a suitable metric (e.g. mean squared error, F1 score, etc.) and try to improve that score.

2. HDB Resale Prices Example

Dataset taken from Data.gov.sg

Now that we’ve discussed the Data Science process, let’s dive into our example. As some of you may have guessed, we are going to predict the price of HDB resale flats!

Problem Statement: Can we predict the resale price of a house given the floor area of the house (in m²)?

As such, we are using the floor_area_sqm feature to predict resale_price.

2.1 Import Libraries

2.1.1 For Data Processing/Cleaning

Pandas is typically used for data manipulation & analysis and is highly optimized, with critical code paths written in Cython or C.

Numpy is used to manipulate large, multi-dimensional arrays and matrices, which is very useful for processing data for Data modelling.

2.1.2 For Visualization

Matplotlib is a multi-platform data visualization library built on NumPy arrays.

Seaborn is based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

2.1.3 For Machine Learning

Scikit-learn provides a range of supervised and unsupervised machine learning algorithms.
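As a rough sketch, an import cell along the following lines covers the libraries used in the rest of this workshop (the notebook's actual imports and aliases may differ slightly):

```python
# Data processing / cleaning
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
```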

Extra: Keras is a high-level API typically used for rapid prototyping on small data sets.

Extra: Tensorflow is a framework that provides both high- and low-level APIs and is typically used on large data sets because of its high performance.

Extra: PyTorch is a lower-level API focused on direct work with array expressions. It has gained immense interest recently, becoming a preferred solution for academic research and for applications of deep learning that require optimizing custom expressions.

2.2 Reading data into Dataframe format

Before we can do any data science work, we must convert the data we want to analyse into a DataFrame object. In this case, our data is in the .csv format.
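A minimal sketch: assuming the three uploaded files are named as below (the actual filenames in the workshop .zip may differ), each file can be read into its own DataFrame with pd.read_csv().

```python
# Hypothetical filenames; replace them with the names of the 3 .csv files
# you dragged into Colab.
df1 = pd.read_csv('resale_prices_part1.csv')
df2 = pd.read_csv('resale_prices_part2.csv')
df3 = pd.read_csv('resale_prices_part3.csv')

df1.head()  # preview the first 5 rows
```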

2.3 Data Cleaning/Processing

Now we are ready to clean our data!

2.3.1 Concatenating separate data frames

Sometimes, you may get your data separated into multiple files even though they share the same format. This is especially common since Excel's limit is 1,048,576 rows and datasets are typically very large, especially with deep learning becoming more popular.
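A minimal sketch using the three DataFrames read in earlier (hypothetical names): pd.concat() stacks them into a single DataFrame.

```python
# Stack the three DataFrames vertically; ignore_index=True gives the
# combined DataFrame a fresh, continuous row index.
df = pd.concat([df1, df2, df3], ignore_index=True)
print(df.shape)
```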

2.3.2 Remove features unnecessary for your problem

After exploring your data, you may want to remove features that do not contribute to your problem, and are hence unnecessary. Removing features from your dataframe helps to prevent cluttering and allows you to save memory.
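A sketch of how this looks with .drop(); the column names here are placeholders, not necessarily the ones dropped in the workshop notebook.

```python
# Drop columns that are irrelevant to the problem statement
# (placeholder names; substitute the features you want to remove).
df = df.drop(columns=['block', 'street_name'])
```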

2.3.3 Remove NaN values from Dataframe

Within Pandas, a missing value is denoted by NaN. You can either choose to remove rows which contain NaN datapoints or replace these values, for example with the result of a function of the other numerical features. This will be highly dependent on the context of your problem. For the purpose of this workshop, we will remove the NaN values.
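A minimal sketch of both steps, inspecting the missing values and then dropping the affected rows:

```python
# Count NaN values per column, then drop every row that contains one.
print(df.isnull().sum())
df = df.dropna()
```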

2.3.4 Sorting Dataframe

For ease of analysis, you may want to sort your dataframe according to a few features. For the example below, the order of sorting will be the following: ['feature1', 'feature2', 'feature3', 'feature4']
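A sketch with .sort_values(), using the placeholder feature names from the list above; earlier columns in the list take precedence when sorting.

```python
# Sort by multiple features at once, in the order listed.
df = df.sort_values(by=['feature1', 'feature2', 'feature3', 'feature4'])
```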

2.3.5 Changing dtype of features to better suit context of Dataset

Before we continue, this is an overview of the common datatypes you will encounter in Pandas. To find out more, check out the official Pandas Documentation.

We will first have to find out the datatype of each feature in the dataframe. This can be done using the .info() method or the .dtypes attribute.

Next, specify the feature and the datatype you want to convert it to. If no errors are raised, the target datatype is compatible and the conversion is successful.

Use the .info() method or the .dtypes attribute to check that the datatype conversion was successful.
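A sketch of the full cycle (inspect, convert, re-check), with assumed column names standing in for whichever features you are converting:

```python
# 1. Inspect the current datatypes.
print(df.dtypes)

# 2. Convert selected columns (assumed names) to more suitable datatypes.
df['month'] = df['month'].astype('category')
df['floor_area_sqm'] = df['floor_area_sqm'].astype('float64')

# 3. Confirm the conversion succeeded.
print(df.dtypes)
```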

2.3.6 Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. Put simply, feature engineering turns your inputs into features the algorithm can understand and helps with your Data Modelling.
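As a minimal sketch (the engineered feature in the notebook may differ), a new column can be derived directly from existing ones, for example price per square metre:

```python
# Hypothetical engineered feature: price per square metre,
# computed from two existing numeric columns.
df['price_per_sqm'] = df['resale_price'] / df['floor_area_sqm']
```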

The code above will create a new feature column.

Extra: using a function to create an engineered feature

Use the .apply() method to create your engineered feature.
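A sketch of the same idea with .apply(); the size_category function and its threshold are assumptions for illustration.

```python
# Apply a custom function element-wise to one column to create a new feature.
def size_category(sqm):
    return 'large' if sqm >= 100 else 'small'

df['size_category'] = df['floor_area_sqm'].apply(size_category)
```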

2.4 Visualization

Data visualization helps to encode data into visual objects (Lines, Points, Bars, etc). This helps you to understand your data better as well as communicate this information more efficiently to stakeholders.

Uni-variate Analysis refers to the analysis of only 1 variable at a time.

2.4.1(a) Boxplot

The orient argument gives you the flexibility to change the orientation of the Boxplot. In this example, the "h" (horizontal) orientation is specified. Try changing it to "v" (vertical)!
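A minimal sketch of a horizontal boxplot of resale_price:

```python
# Horizontal boxplot of resale prices; swap orient to "v" for a vertical plot.
sns.boxplot(x=df['resale_price'], orient='h')
plt.show()
```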

Understanding a Boxplot

2.4.1(b) Distplot

This will draw a histogram and fit a kernel density estimate (KDE).
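A minimal sketch (note that sns.distplot is deprecated in newer seaborn releases, where sns.histplot(..., kde=True) is the equivalent):

```python
# Histogram of resale prices with a fitted KDE curve overlaid.
sns.distplot(df['resale_price'])
plt.show()
```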

Understanding KDE

2.4.1(c) Violinplot

A Violin Plot is used to visualize the distribution of the data and its probability density. It is the combination of a Boxplot and a KDE Plot that is rotated and placed on each side, to show the distribution shape of the data.
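A minimal sketch of a horizontal violinplot of resale_price:

```python
# Violinplot combining a boxplot with a mirrored KDE of the distribution.
sns.violinplot(x=df['resale_price'], orient='h')
plt.show()
```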

Understanding a Violinplot

2.4.1 Extra: Grouping graphs together

You may want to group your graphs together for easier analysis.
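A sketch of a 2-rows-by-3-columns grid of the plots above (which plot goes in which cell is an assumption here):

```python
# f is the overall figure ("canvas"); axes is a 2x3 array of subplot axes.
f, axes = plt.subplots(2, 3, figsize=(18, 8))

sns.boxplot(x=df['resale_price'], orient='h', ax=axes[0][0])
sns.distplot(df['resale_price'], ax=axes[0][1])
sns.violinplot(x=df['resale_price'], orient='h', ax=axes[0][2])

sns.boxplot(x=df['floor_area_sqm'], orient='h', ax=axes[1][0])
sns.distplot(df['floor_area_sqm'], ax=axes[1][1])
sns.violinplot(x=df['floor_area_sqm'], orient='h', ax=axes[1][2])

plt.show()
```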

The plt.subplots method returns a figure object, f, and a 2D array of axes objects, axes.

  • f is the "canvas" you will be "painting" your graphs onto.
  • axes is a 2D array whose shape depends on the rows & columns arguments you specified. In the example above, axes = [[subplot_1, subplot_2, subplot_3], [subplot_4, subplot_5, subplot_6]]

2.4.2 Using Matplotlib/Seaborn for Multi-variate Analysis

It can also be useful to visualize a bivariate distribution of two variables. Multi-variate Analysis refers to the analysis of multiple variables at the same time.

The easiest way to do this in seaborn is to use the .jointplot() function. The jointplot plots the bivariate scatterplot together with the marginal histograms.

2.4.2(a) Jointplot
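A minimal sketch plotting the predictor against the response from our problem statement:

```python
# Bivariate scatterplot of floor area vs. resale price,
# with marginal histograms along each axis.
sns.jointplot(x='floor_area_sqm', y='resale_price', data=df)
plt.show()
```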

2.4.2(b) Heatmap

We can also use a heatmap to plot the correlations between 2 variables.
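A minimal sketch, showing the correlation between floor_area_sqm and resale_price as an annotated heatmap:

```python
# Correlation matrix of the two variables, rendered as a heatmap.
sns.heatmap(df[['floor_area_sqm', 'resale_price']].corr(), annot=True, cmap='coolwarm')
plt.show()
```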

2.4.3 Analyzing Categories

2.4.3(a) Catplot

After converting some of our features to the categorical datatype (as we have learnt above in 2.3.5), we can also do some analysis on them!
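A sketch assuming a categorical column such as flat_type (a hypothetical name here; substitute whichever feature you converted in 2.3.5):

```python
# Boxplots of resale price per category of the assumed flat_type column.
sns.catplot(x='flat_type', y='resale_price', kind='box', data=df, height=5, aspect=2)
plt.show()
```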

2.5 Data Modelling (Machine Learning)

2.5.1 What is Linear Regression

Linear Regression is all about trying to find the best-fit line for your datapoints. The intuition is to minimize the “Sum of Squared Errors” (SSE), which is the sum of the squared differences between the actual and predicted values.

The goal is to eventually generalize the dataset and predict the dependent variable with reasonable accuracy given the independent variable.

2.5.2 Splitting our dataset into Train and Test sets

Before we begin Data Modelling, we have to split our dataset into train and test sets so we have a measure to evaluate our model’s performance.
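A minimal sketch of the split, assuming floor_area_sqm as the predictor and resale_price as the response:

```python
# Extract predictor and response as NumPy arrays; scikit-learn expects
# X to be a 2D array, so it is reshaped into a single feature column.
X = df['floor_area_sqm'].values.reshape(-1, 1)
y = df['resale_price'].values

# Hold out 20% of the data as the test set
# (random_state fixed here only so the split is reproducible).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the resulting shapes.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```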

Explanation of code

  1. We first extract X and y as NumPy arrays and reshape X into a single feature column (scikit-learn expects X to be a 2D array).
  2. Next, we split our dataset into X_train, X_test, y_train and y_test with 20% of our dataset as the test set.
  3. Lastly, we print out the shape of X_train, X_test, y_train and y_test.

2.5.3 Visual Representation of the Linear Regression Model
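A sketch of the fitting and plotting steps described below:

```python
# Create and fit the linear regression model on the training set.
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# x- and y-coordinates of the best-fit line over the training data.
regline_x = X_train
regline_y = linreg.predict(X_train)

# Training datapoints plus the fitted regression line.
plt.scatter(X_train, y_train, alpha=0.3)
plt.plot(regline_x, regline_y, color='red', linewidth=3)
plt.xlabel('floor_area_sqm')
plt.ylabel('resale_price')
plt.show()
```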

Explanation of code

  1. We first create a LinearRegression Object linreg.
  2. Then, we fit our linreg object with our training set data, X_train and y_train.
  3. Afterwards, we save the x-coordinates and y-coordinates of the linreg line into regline_x and regline_y respectively.
  4. Lastly, we plot all the datapoints of the train dataset as well as the linreg best-fit line.

2.5.4 Prediction of Response based on the Predictor
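A sketch of the prediction and plotting steps described below:

```python
# Predict resale prices for the held-out test set.
resale_price_pred = linreg.predict(X_test)

# Actual test datapoints vs. the predictions from the best-fit line.
plt.scatter(X_test, y_test, alpha=0.3)
plt.plot(X_test, resale_price_pred, color='red', linewidth=3)
plt.xlabel('floor_area_sqm')
plt.ylabel('resale_price')
plt.show()
```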

Explanation of code

  1. We first save the predicted y-coordinates from X_test into resale_price_pred.
  2. Next, we plot all the datapoints in the test set.
  3. Lastly, we plot the resale_price that we have predicted from X_test using our best-fit line.

3. Further Learning

3.1 Extension of today’s workshop:

Other interesting questions

  • What are the best/worst times of the year to sell your house? By how much do prices change?
  • How does the travel time from different locations to the central area affect HDB resale prices?

Interesting Datasets to consider

3.2 Setting up your Data Analytics Environment!

3.2.1 Anaconda Distribution

Anaconda is one of several Python distributions. Python on its own is not going to be useful unless an IDE is installed. This is where Anaconda comes into the picture.

The following are some of the applications/libraries available by default that I personally find useful:

  • Jupyter Notebook (Data Science Environment)
  • JupyterLab (Simply put, Jupyter Notebook on the web)
  • Rstudio (For those who are less comfortable with programming)
  • Visual Studio Code (One of the most powerful/easy-to-use code editors out there! Props to Microsoft! Or use VSCodium if you are worried about telemetry data collection)
  • Tons of Datascience Libraries pre-installed!

3.2.2 Step-by-Step Setup guide — Guide by Datacamp

Setup on Windows
Setup on Mac

3.3 Get Rewarded??

Kaggle is where you can join Data Science competitions either alone or as a team and earn prize money, although I would highly recommend checking out the “Getting Started” category first.

Originally published at http://github.com.
