Pandas Library in a Nutshell — Intro To Machine Learning #3

Hi, this is the third article on our journey through machine learning algorithms. You can find Part 1 and Part 2 here (depending on your background, they might not be required for reading this article). In this post I’m going to present pandas, a very useful Python library for data analysis, and use it for a quick exploration of the Titanic dataset available on Kaggle.

Here is the GitHub repo for the code used here.

Agenda:

  • Why data analysis?
  • Getting to Know Your Data
  • Data Preparation and Transformations
  • Introduction to pandas library
  • Exploration on Titanic Dataset
  • Decision Tree
  • Model Evaluation
  • Create your Submission File

Why Data Analysis?

(1). Getting to Know Your Data

Data analysis helps you make sense of your data. Most of the time, datasets don’t come ready to use with machine learning algorithms, and you will find data with issues to work on, such as:

  • Illegal values
  • Misspellings
  • Missing values (believe me, this is one of the most common)
  • Outliers

Of course there are many others, but here we are going to go through an exploration of the Titanic dataset and see what we find.

(2). Data Preparation and Transformations

Data scientists spend a comparatively large amount of time in the data preparation phase of a project. Whether you call it data wrangling, data munging, or data janitor work, the Times article estimates that 50%–80% of a data scientist’s time is spent on data preparation. We agree.

The data transformation process typically consists of multiple steps, where at each step we try to solve the problems mentioned above. In data transformation we convert a set of data values from the original format into the destination format (e.g. removing illegal values, imputing missing values, applying a deterministic mathematical function to each point in a dataset, etc.).

Pandas Library

pandas is the Python Data Analysis Library: an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.

Primary object types:

  • DataFrame: rows and columns (like a spreadsheet)
  • Series: a single column
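
As a minimal sketch of the two object types (the column names and values here are made up):

```python
import pandas as pd

# A DataFrame is a table of rows and columns, built here from a dict
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [22, 35, 58],
})

# Selecting a single column gives back a Series
ages = df["Age"]

print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series
```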

Exploration on Titanic Dataset

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning of 15 April 1912, after colliding with an iceberg during her maiden voyage from Southampton to New York City. Of the 2,224 passengers and crew aboard, more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history.

GOAL: Our goal here is to go through the Titanic dataset, perform a simple analysis, and learn to predict whether a passenger survived or not.

(1). Prepare The Tools

Here we import some of the needed libraries. %matplotlib inline enables us to make plots inside the notebook (I hope you are already using Jupyter). Seaborn is great for visualization; I’m not going into the details of the plots, but you can find them inside the .ipynb file in my GitHub repo.
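
The setup cell looks roughly like this (the exact imports in the notebook may differ slightly):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# %matplotlib inline  # notebook magic: renders plots inside the notebook
```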

(2). Get To know the Data

When using pandas, a function you will use a lot is .read_csv(), which allows you to load files of comma-separated values (CSV files).

So here I’m loading the train.csv file, which contains the training examples, and the test.csv file, which contains the data we are going to use to make predictions. Note that in the train file we have the features (age, fare, embarked, sex, …) and the label Survived (1 if the person survived and 0 otherwise), while the test file doesn’t contain the Survived label (you have to use your trained model to predict it).

Load The Titanic Train and Test Datasets
Display the first 5 rows
Display the last 5 rows

.head() is used to display the first five rows of the DataFrame and .tail() the last five rows. Five rows is the default, but you can set a specific number inside the parentheses.
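
In the notebook these steps use pd.read_csv('train.csv') and pd.read_csv('test.csv'); here is the same pattern as a self-contained sketch on a tiny inline CSV (the rows are invented, not real Titanic data):

```python
import pandas as pd
from io import StringIO

csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked
1,0,3,male,22.0,7.25,S
2,1,1,female,38.0,71.28,C
3,1,3,female,26.0,7.92,S
4,1,1,female,35.0,53.10,S
5,0,3,male,35.0,8.05,S
6,0,3,male,,8.46,Q
"""

train = pd.read_csv(StringIO(csv_text))  # the same call works with "train.csv"

print(train.head())   # first five rows (the default)
print(train.tail(2))  # last two rows — pass a number to override the default
```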

Shape of both datasets
Generate various summary statistics, excluding NaN values.
Process summary of a DataFrame
  • Missing Values
train
test
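
Sketched on a toy frame (values invented), the inspection calls behind those steps:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

print(train.shape)           # (rows, columns) -> (4, 3)
print(train.describe())      # summary statistics, excluding NaN values
train.info()                 # concise summary: dtypes and non-null counts
print(train.isnull().sum())  # number of missing values per column
```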
  • Imputing Missing Values

We use .fillna() to impute missing values in a selected column; here, for the Embarked column, we are replacing NaN with the most common value in the column, which is “S”.
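
A sketch of that imputation on toy data (in the real dataset “S” is likewise the most frequent port):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# .mode() reads off the most common value instead of hard-coding "S"
most_common = train["Embarked"].mode()[0]
train["Embarked"] = train["Embarked"].fillna(most_common)

print(train["Embarked"].isnull().sum())  # 0 — no missing values left
```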

Filling missing values with the median of “Fare” column
Filling missing values with the mean of “Age” column

There are several techniques to deal with missing values; a simple one is to use the mean, median, or mode.
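
The same .fillna() pattern with the median and the mean, on invented numbers:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Age": [22.0, np.nan, 26.0, 35.0],
})

train["Fare"] = train["Fare"].fillna(train["Fare"].median())  # median is robust to outliers
train["Age"] = train["Age"].fillna(train["Age"].mean())

print(train.isnull().sum().sum())  # 0 — every NaN has been filled
```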

  • Transformations

In the DataFrames we got, there are some columns with categorical values (e.g. Sex has the values male and female), but the ML model only understands numbers, which is why we need the transformations above. I replaced male with 0 and female with 1; the same transformation is also required for the Embarked variable.
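
A sketch of that encoding; the 0/1 mapping for Sex is the one described above, while the numeric codes for Embarked are just an example choice:

```python
import pandas as pd

train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Replace the category strings with numbers the model can consume
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(train["Sex"].tolist())       # [0, 1, 1, 0]
print(train["Embarked"].tolist())  # [0, 1, 2, 0]
```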

(3). Preparing the Model

In the X matrix we removed the Survived column, as it is the target variable (y) we want to learn to predict.
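
As a self-contained sketch (toy rows instead of the full training DataFrame), the split and the decision tree fit look like this:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Sex":      [0, 1, 1, 1, 0, 0],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, 27.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

X = train.drop("Survived", axis=1)  # feature matrix: everything but the label
y = train["Survived"]               # target vector

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

print(model.score(X, y))  # accuracy on the data the tree was trained on
```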

(4). Model Evaluation

We got a score of 78%.

In the next article I’m going to go through the different ways to evaluate model performance.

Try-out experiment: in scikit-learn it’s pretty awesome and easy to try out different algorithms. Try changing the decision tree classifier to a Random Forest Classifier (an ensemble model that combines multiple decision trees in a random way) and see how it changes the score; later in this series we are going to go deeper into ensemble methods.
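
The swap itself is a one-line change; a sketch with toy data (n_estimators picked arbitrarily):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Sex":      [0, 1, 1, 1, 0, 0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

X = train.drop("Survived", axis=1)
y = train["Survived"]

# Same fit/score interface as the decision tree — only the class changes
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.score(X, y))
```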

(5). Make Predictions

At this point our test DataFrame still contains PassengerId; we remove it before predicting because the trained model doesn’t use PassengerId, as it’s not useful for prediction.

Create Ready File to Submit on Kaggle

Whenever you want to create a DataFrame, you can just pass a dict of key–value pairs. For this file we want to keep the PassengerId column from the test dataset, as Kaggle will use the passenger id to match the predictions you made, evaluate your performance, and assign a public leaderboard score.
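
A sketch of building that file; the predictions list is a placeholder standing in for model.predict(...), and the two column names are what the Titanic competition expects:

```python
import pandas as pd

test = pd.DataFrame({"PassengerId": [892, 893, 894]})  # toy ids
predictions = [0, 1, 0]  # placeholder for model.predict(X_test)

# A dict of key-value pairs becomes the columns of the DataFrame
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)  # ready to upload to Kaggle
```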

Find the Code Here.

For this article we are done. pandas is a great library and there is still more you can learn; I just showed you some of its common use cases. If you want to challenge yourself, join the Kaggle community and make your first submission to Titanic (a Getting Started competition).

Resources:

Further Readings

Video

Next:

In the next article we are going to talk about Logistic Regression and The different ways to evaluate model performance.

Let me know what you think about this, If you enjoyed the writings then please use the ❤ heart below to recommend this article so that others can see it.

Happy learning.