Hi, this is the third article in our journey through Machine Learning Algorithms. You can find Part 1 and Part 2 here (depending on your background, they might not be required reading for this article). In this post I’m going to introduce Pandas, a very useful Python library for data analysis, and use it for a quick exploration of the Titanic dataset available on Kaggle.
Here is the GitHub repo for the code used here.
Agenda:
- Why data analysis?
- Getting to Know Your Data
- Data Preparation and Transformations
- Introduction to pandas library
- Exploration on Titanic Dataset
- Decision Tree
- Model Evaluation
- Create your Submission File
Why Data Analysis?
(1). Getting to Know Your Data
Data analysis helps you make sense of your data. Most of the time datasets don’t come ready to use with machine learning algorithms, and you will find data with issues to work on, such as:
- Illegal values
- Misspellings
- Missing values (believe me, this is one of the most common)
- Outliers
Of course there are many others, but we are going to go through an exploration of the Titanic dataset and see what we find there.
(2). Data Preparation and Transformations
The data transformation process typically consists of multiple steps, where each step tries to solve one of the problems mentioned above. In a data transformation we convert a set of data values from the original format into the destination format (e.g. removing illegal values, imputing missing values, applying a deterministic mathematical function to each point in a dataset, etc.).
Pandas Library
Pandas is the Python Data Analysis Library: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.
Primary object types:
- DataFrame: rows and columns (like a spreadsheet)
- Series: a single column
Exploration on Titanic Dataset
RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning of 15 April 1912, after colliding with an iceberg during her maiden voyage from Southampton to New York City. Of the 2,224 passengers and crew aboard, more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history.
GOAL: Our goal here is to go through the Titanic dataset, perform some simple analysis, and learn to predict whether a passenger survived or not.
(1). Prepare The Tools
In the above code we import some of the libraries we need. %matplotlib inline enables us to make plots inside the notebook (I hope you are already using Jupyter), and Seaborn is great for visualization. I’m not going into the details of the plots, but you can find them inside the .ipynb file in my GitHub repo.
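A minimal sketch of what that setup cell looks like (the exact imports in the notebook may differ slightly):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # basic plotting
import seaborn as sns            # nicer statistical plots on top of matplotlib

# In a Jupyter notebook you would also run the magic below so plots render
# inline; it is notebook-only syntax, not valid in a plain Python script:
# %matplotlib inline
```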
(2). Get To know the Data
When using pandas, a function you will use a lot is .read_csv(), which allows you to load files of comma-separated values (e.g. a .csv file).
So here I’m loading the train.csv file, which contains the training examples, and the test.csv file, which contains the data we are going to make predictions on. Note that the train file has both the features (age, fare, embarked, sex, ...) and the label Survived (1 if the person survived and 0 otherwise), while the test file doesn’t contain the Survived label (you have to use your trained model to predict it).
.head() is used to display the first five rows of the DataFrame and .tail() the last five. Five rows is the default, but you can pass a specific number inside the parentheses.
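A small sketch of loading and peeking at the data. The inline CSV here is a hypothetical miniature of Kaggle’s train.csv, just so the example is self-contained; on Kaggle you would simply call pd.read_csv("train.csv") and pd.read_csv("test.csv"):

```python
import io
import pandas as pd

# Hypothetical miniature of train.csv (real file has more rows and columns).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked
1,0,3,male,22,7.25,S
2,1,1,female,38,71.28,C
3,1,3,female,26,7.92,S
4,1,1,female,35,53.1,S
5,0,3,male,35,8.05,S
6,0,3,male,,8.46,Q
"""
train = pd.read_csv(io.StringIO(csv_text))  # on Kaggle: pd.read_csv("train.csv")

print(train.head())   # first five rows (the default)
print(train.tail(2))  # last two rows; any number works
```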
- Missing Values
- Imputing Missing Values
.fillna() is used to impute missing values in a selected column; here, for the Embarked column, we replace NaN with the most common value in that column, which is “S”.
There are several techniques to deal with missing values; a simple one is to use the mean/median/mode.
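The counting-and-filling step can be sketched like this on a toy column (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column, with two missing values.
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

print(df["Embarked"].isnull().sum())    # count the missing values: 2

most_common = df["Embarked"].mode()[0]  # "S" is the most frequent value
df["Embarked"] = df["Embarked"].fillna(most_common)
```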
- Transformations
In the DataFrames we have, some columns contain categorical values (e.g. Sex has the values male and female), but the ML model only understands numbers; that’s why we need the transformations above. I replaced male with 0 and female with 1, and the same kind of transformation is also required for the Embarked variable.
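One way to write that encoding with pandas, sketched on toy data (the particular number assigned to each category is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                   "Embarked": ["S", "C", "Q", "S"]})

# Map each category string to a number so the model can consume it.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
```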
(3). Preparing the Model
In the X matrix we removed the Survived column, as it is the target variable (y) we want to learn to predict.
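The split into features and target can be sketched as follows, on a hypothetical frame standing in for the cleaned training data:

```python
import pandas as pd

# Hypothetical numeric training frame after the transformations above.
train = pd.DataFrame({"Pclass": [3, 1, 3, 1],
                      "Sex":    [0, 1, 1, 1],
                      "Fare":   [7.25, 71.28, 7.92, 53.1],
                      "Survived": [0, 1, 1, 1]})

X = train.drop("Survived", axis=1)  # feature matrix: everything but the label
y = train["Survived"]               # target vector we want to predict
```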
(4). Model Evaluation
In the next article I’m going to go through the different ways to evaluate model performance.
Try-out experiment: in scikit-learn it’s pretty easy to try out different algorithms. Try changing the DecisionTreeClassifier to a RandomForestClassifier (an ensemble model that combines multiple decision trees in a random way) and see how it changes the score; later in this series we are going to cover ensemble methods.
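Swapping classifiers really is a one-line change, as this sketch shows. The features here are a tiny made-up stand-in for the prepared Titanic data, so the scores are just illustrative:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for the prepared features and labels.
X = pd.DataFrame({"Pclass": [3, 1, 3, 1, 2, 3],
                  "Sex":    [0, 1, 1, 1, 0, 0]})
y = [0, 1, 1, 1, 0, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Training accuracy for each model (on real data, evaluate on held-out data).
print(tree.score(X, y), forest.score(X, y))
```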
(5). Make Predictions
By this point our test DataFrame still contains PassengerId, so we remove it: the trained model never saw PassengerId, and it’s not a useful feature anyway.
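A sketch of that prediction step, with hypothetical miniature frames standing in for the cleaned train and test data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical cleaned frames (real ones come from train.csv / test.csv).
train = pd.DataFrame({"Pclass": [3, 1, 3, 1], "Sex": [0, 1, 1, 1],
                      "Survived": [0, 1, 1, 1]})
test = pd.DataFrame({"PassengerId": [892, 893, 894],
                     "Pclass": [3, 1, 2], "Sex": [0, 1, 0]})

model = DecisionTreeClassifier(random_state=0)
model.fit(train.drop("Survived", axis=1), train["Survived"])

# Drop PassengerId before predicting: the model was not trained on it.
X_test = test.drop("PassengerId", axis=1)
predictions = model.predict(X_test)
```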
Create Ready File to Submit on Kaggle
Whenever you want to create a DataFrame you can just pass a dict of key-value pairs. For this file we want to keep the PassengerId column from the test dataset, as Kaggle will use the passenger id to match with the predictions you made, evaluate your performance, and assign a public leaderboard score.
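Building and saving the submission file can be sketched like this (the ids and predictions below are hypothetical placeholders for the real test PassengerId column and your model’s output):

```python
import pandas as pd

# Hypothetical PassengerId values and predictions from the previous step.
test_ids = [892, 893, 894]
predictions = [0, 1, 0]

# A DataFrame from a dict: keys become column names, values become columns.
submission = pd.DataFrame({"PassengerId": test_ids,
                           "Survived": predictions})

# index=False keeps pandas from writing an extra index column Kaggle rejects.
submission.to_csv("submission.csv", index=False)
```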
Find the Code Here.
That’s it for this article. Pandas is a great library and there is still much more you can learn; I just showed you some of its common use cases. If you want to challenge yourself, join the Kaggle community and make your first submission to Titanic (a Getting Started competition).
Resources:
Further Readings
- Installation instructions and documentation
- read_csv and read_table documentation
- Getting Started with Pandas — Predicting SAT Scores for New York City Schools
- New York Times article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”
- CHAPTER 5: Getting Started with pandas on Python for Data Analysis Book
Next:
In the next article we are going to talk about Logistic Regression and the different ways to evaluate model performance.
Let me know what you think about this. If you enjoyed the article, please use the ❤ heart below to recommend it so that others can see it.
Happy learning.