Tackling the Titanic Dataset with Machine Learning (Kaggle Challenge!)

Sanjay Dutta
May 30, 2024 · 6 min read


The Titanic dataset is a classic machine learning problem. It provides information about the passengers aboard the Titanic, and the goal is to predict whether a passenger survived or not based on various features such as age, gender, class, and more. This project is an excellent introduction to data cleaning, feature engineering, and model building.

In this blog post, we’ll walk through the entire process of tackling the Titanic dataset using Python and scikit-learn, covering the following steps:

  1. Load the Data
  2. Explore the Data
  3. Clean the Data
  4. Feature Engineering
  5. Select, Train, and Evaluate a Model
  6. Fine-Tune the Model
  7. Predict and Submit Results

Step 1: Load the Data

First, we need to load the data. The Titanic dataset is available on Kaggle, and you can download it from Titanic — Machine Learning from Disaster. Save the train.csv and test.csv files in your working directory.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the training and test datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Step 2: Explore the Data

Before diving into cleaning and modeling, it’s essential to understand the data.

# Display the first few rows of the training dataset
print(train_df.head())

# Get a summary of the dataset
print(train_df.info())

# Get descriptive statistics
print(train_df.describe())

Output:

   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]
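
Two quick follow-up checks are worth running at this point (a supplementary sketch, not part of the original walkthrough): counting missing values per column, and looking at survival rates broken down by sex and passenger class, which are known to be strong signals in this dataset.

# Count missing values per column: Age, Cabin, and Embarked have gaps
print(train_df.isnull().sum())

# Survival rate by sex and passenger class
print(train_df.groupby(['Sex', 'Pclass'])['Survived'].mean())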

Step 3: Clean the Data

Data cleaning involves handling missing values and converting categorical features to numerical ones.

# Fill missing values for the 'Age' feature with the median age
# (assigning the result back avoids the chained inplace pattern that newer pandas versions warn about)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())

# Fill missing values for the 'Embarked' feature with the most common port
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])
test_df['Embarked'] = test_df['Embarked'].fillna(test_df['Embarked'].mode()[0])

# Fill missing values for the 'Fare' feature in the test set with the median fare
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

# Convert 'Sex' feature to numerical
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})

# Convert 'Embarked' feature to numerical
train_df['Embarked'] = train_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
test_df['Embarked'] = test_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Drop irrelevant features
train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test_df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
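
If you want to verify that the cleaning worked, a quick optional sanity check confirms that the remaining columns have no missing values left:

# Sanity check: the remaining columns should report zero missing values
print(train_df.isnull().sum())
print(test_df.isnull().sum())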

Step 4: Feature Engineering

Feature engineering involves creating new features from existing ones to improve the model’s predictive power.

# Create a new feature 'FamilySize' from 'SibSp' and 'Parch'
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1

# Create a new feature 'IsAlone'
train_df['IsAlone'] = np.where(train_df['FamilySize'] > 1, 0, 1)
test_df['IsAlone'] = np.where(test_df['FamilySize'] > 1, 0, 1)
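
To see whether the engineered features actually carry signal, you can group survival by them. This is an optional check rather than a required step:

# Optional check: survival rate for passengers travelling alone vs. with family
print(train_df.groupby('IsAlone')['Survived'].mean())
print(train_df.groupby('FamilySize')['Survived'].mean())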

Step 5: Select, Train, and Evaluate a Model

We prepare the data for training, train a RandomForestClassifier, and evaluate it on a held-out validation set.

# Prepare the data for training
X = train_df.drop(['Survived'], axis=1)
y = train_df['Survived']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the validation set
y_pred = clf.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation set accuracy: {accuracy:.2f}')

Output:

Validation set accuracy: 0.82
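
A single 80/20 split can give a noisy estimate. If you want a steadier number before tuning, k-fold cross-validation is a common alternative; the snippet below is an optional sketch that scores the same classifier with 5-fold cross-validation on the full training data:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: mean accuracy and spread across folds
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f'Cross-validation accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})')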

Step 6: Fine-Tune the Model

We use grid search to find the best hyperparameters for the model.

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in recent scikit-learn releases
    'max_depth': [4, 6, 8, 10],
    'criterion': ['gini', 'entropy']
}

# Perform grid search
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f'Best parameters found by grid search: {best_params}')

# Train the classifier with the best parameters
# (GridSearchCV already refits the best estimator on the training data, so this refit is optional)
best_clf = grid_search.best_estimator_
best_clf.fit(X_train, y_train)

# Predict on the validation set
y_pred = best_clf.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation set accuracy after tuning: {accuracy:.2f}')

Output:

Best parameters found by grid search: {'criterion': 'gini', 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}
Validation set accuracy after tuning: 0.80
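
It can also be instructive to look at which features the tuned forest relies on most. The short sketch below (optional) ranks them using the model's built-in feature importances:

# Rank features by importance according to the tuned random forest
importances = pd.Series(best_clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))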

Step 7: Predict and Submit Results

Finally, we use the trained model to predict the outcomes on the test set and prepare the submission file for Kaggle.

# Predict on the test set
test_pred = best_clf.predict(test_df)

# Prepare the submission file
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_pred
})

submission.to_csv('titanic_submission.csv', index=False)

Sample Output:

Validation set accuracy: 0.82
Best parameters found by grid search: {'criterion': 'gini', 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}
Validation set accuracy after tuning: 0.80

The script also writes titanic_submission.csv to the working directory, ready to upload to Kaggle.
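
Before uploading, it can be worth reading the file back to confirm it matches the expected Kaggle format (a PassengerId and a Survived column, one row per test passenger). This check is optional:

# Optional check: confirm the submission has two columns and one row per test passenger
check = pd.read_csv('titanic_submission.csv')
print(check.shape)
print(check.head())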

Conclusion

In this blog post, we tackled the Titanic dataset using Python and scikit-learn. We went through the entire process of loading the data, exploring it, cleaning it, performing feature engineering, training a model, fine-tuning it, and finally making predictions. By following these steps, you can build a robust machine learning model and submit your results to Kaggle.

Here is the complete code for your reference:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the training and test datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Display the first few rows of the training dataset
print(train_df.head())

# Get a summary of the dataset
print(train_df.info())

# Get descriptive statistics
print(train_df.describe())

# Fill missing values for the 'Age' feature with the median age
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())

# Fill missing values for the 'Embarked' feature with the most common port
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])
test_df['Embarked'] = test_df['Embarked'].fillna(test_df['Embarked'].mode()[0])

# Fill missing values for the 'Fare' feature in the test set with the median fare
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

# Convert 'Sex' feature to numerical
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})

# Convert 'Embarked' feature to numerical
train_df['Embarked'] = train_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
test_df['Embarked'] = test_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Drop irrelevant features
train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test_df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Create a new feature 'FamilySize' from 'SibSp' and 'Parch'
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1

# Create a new feature 'IsAlone'
train_df['IsAlone'] = np.where(train_df['FamilySize'] > 1, 0, 1)
test_df['IsAlone'] = np.where(test_df['FamilySize'] > 1, 0, 1)

# Prepare the data for training
X = train_df.drop(['Survived'], axis=1)
y = train_df['Survived']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the validation set
y_pred = clf.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation set accuracy: {accuracy:.2f}')

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in recent scikit-learn releases
    'max_depth': [4, 6, 8, 10],
    'criterion': ['gini', 'entropy']
}

# Perform grid search
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f'Best parameters found by grid search: {best_params}')

# Train the classifier with the best parameters
best_clf = grid_search.best_estimator_
best_clf.fit(X_train, y_train)

# Predict on the validation set
y_pred = best_clf.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation set accuracy after tuning: {accuracy:.2f}')

# Predict on the test set
test_pred = best_clf.predict(test_df)

# Prepare the submission file
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_pred
})
submission.to_csv('titanic_submission.csv', index=False)

By following these steps, you can create a robust machine learning pipeline for the Titanic dataset, improving your model’s performance through data cleaning, feature engineering, and hyperparameter tuning.

Thanks for reading! If you enjoyed the article, make sure to clap! You can connect with me on LinkedIn or follow me on Twitter. Thank you!
