Day 28 of 100DaysofML
Kaggle Titanic dataset, Part 2. In this blog I'll continue cleaning my dataset, then pass it through a few models to see how each one performs, and pick the model with the best accuracy at the end.
One thing we noticed from yesterday's analysis is that two columns, Cabin and Ticket, have a lot of missing or uninformative values. So I decided to drop these two columns from both the training and the testing dataset.
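If you want to quantify the missing values first, a quick check does the trick (a small sketch, assuming train_data and test_data are the DataFrames we loaded in Part 1):
#count missing values per column, most-missing first
print(train_data.isnull().sum().sort_values(ascending=False))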
train_data = train_data.drop(['Cabin'], axis = 1)
test_data = test_data.drop(['Cabin'], axis = 1)
train_data = train_data.drop(['Ticket'], axis = 1)
test_data = test_data.drop(['Ticket'], axis = 1)
Now, the training data looks a little like this:
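If you are following along in a notebook, the same preview is just:
#preview the first few rows of the training data
train_data.head()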
From the dataset, I noticed that the Embarked column had a few missing values, so I decided to quickly replace them using .fillna() in pandas.
#replacing the missing values in the Embarked feature with S
train_data = train_data.fillna({"Embarked": "S"})
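The reason for "S" specifically is that Southampton is by far the most common port of embarkation in the training data; you can verify that with a quick count:
#check how often each Embarked value occurs
print(train_data['Embarked'].value_counts())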
The next few steps might seem a little overwhelming, but if you follow them one at a time you should get a rough idea of what I am doing.
It is easier to apply the same transformations to both datasets in one loop rather than repeating them for the training and testing data separately, so we put the two DataFrames in a list and iterate over them together.
#put both datasets in a list so we can loop over them together
combine = [train_data, test_data]
#extract a title (Mr, Mrs, Miss, ...) from each Name in the train and test datasets
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
#tabulate how the extracted titles break down by sex
pd.crosstab(train_data['Title'], train_data['Sex'])
The output we obtain after this step is shown below:
Here we can see that the Title column takes a large number of distinct values, so it makes sense to reduce the number of categories by grouping the rarely occurring titles together. The following segment of code does exactly that.
#replace various titles with more common names
#note: 'Lady' is left out of the 'Rare' list so the 'Royal' mapping below can catch it
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace(['Countess', 'Lady', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
#check the average survival rate per title
train_data[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
The output obtained after this data cleaning stage is shown below:
The data finally looks a little cleaner and is ready for the next stage of processing. The next bit of code replaces the remaining missing values in the dataset; this could be done in a number of ways, but I chose to use the most commonly occurring value, i.e. the mode.
I faced a few difficulties modifying the values of the Age column, so I decided to drop that column temporarily. The Name column has also served its purpose now that the titles are extracted, so both can go. Check out the code below:
#drop the name feature since it contains no more useful information.
train_data = train_data.drop(['Name'], axis = 1)
test_data = test_data.drop(['Name'], axis = 1)
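The Age drop itself isn't shown in the snippet above, but it follows exactly the same pattern as the other drops:
#temporarily drop the Age feature as well
train_data = train_data.drop(['Age'], axis = 1)
test_data = test_data.drop(['Age'], axis = 1)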
There is a concept called one-hot encoding (OHE), which I have mentioned in my previous blogs. You could use OHE here, or you could create a map and set the values based on a lookup in that hashmap or dictionary. We do this for several columns, namely the Sex, Embarked and Fare values.
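For reference, a one-hot encoded version of the same idea could look like this; it is only a minimal sketch using pandas' built-in get_dummies, and I went with the simpler mapping approach below instead:
#one-hot encode Sex and Embarked: one binary 0/1 column per category
#(assigned to new variables so the mapping code below still sees the originals)
train_data_ohe = pd.get_dummies(train_data, columns=['Sex', 'Embarked'])
test_data_ohe = pd.get_dummies(test_data, columns=['Sex', 'Embarked'])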
#map each Sex value to a numerical value
sex_mapping = {"male": 0, "female": 1}
train_data['Sex'] = train_data['Sex'].map(sex_mapping)
test_data['Sex'] = test_data['Sex'].map(sex_mapping)
#map each Embarked value to a numerical value
embarked_mapping = {"S": 1, "C": 2, "Q": 3}
train_data['Embarked'] = train_data['Embarked'].map(embarked_mapping)
test_data['Embarked'] = test_data['Embarked'].map(embarked_mapping)
#fill the missing Fare in the test set with the mean fare of its Pclass
for x in range(len(test_data["Fare"])):
    if pd.isnull(test_data["Fare"][x]):
        pclass = test_data["Pclass"][x] #Pclass = 3
        test_data.loc[x, "Fare"] = round(train_data[train_data["Pclass"] == pclass]["Fare"].mean(), 4)
#map Fare values into groups of numerical values (quartile bands)
train_data['FareBand'] = pd.qcut(train_data['Fare'], 4, labels = [1, 2, 3, 4])
test_data['FareBand'] = pd.qcut(test_data['Fare'], 4, labels = [1, 2, 3, 4])
#drop the raw Fare values
train_data = train_data.drop(['Fare'], axis = 1)
test_data = test_data.drop(['Fare'], axis = 1)
Now when we check our dataset, the text values have been converted into numerical ones, and the dataset is finally ready to be processed by a model. I'm going to stop here, because there is already a lot of code and syntax to digest. I will talk about the model implementations in tomorrow's blog.
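If you want to double-check that nothing non-numeric or missing is left behind, a quick inspection helps (a small sketch):
#inspect the remaining column dtypes and any leftover missing values
print(train_data.dtypes)
print(train_data.isnull().sum())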
That’s it for today. Keep Learning.
Cheers.