Tackling the Titanic Dataset with Machine Learning (Kaggle Challenge!)

Sanjay Dutta
May 30, 2024 · 6 min read


The Titanic dataset is a classic machine learning problem. It provides information about the passengers aboard the Titanic, and the goal is to predict whether a passenger survived or not based on various features such as age, gender, class, and more. This project is an excellent introduction to data cleaning, feature engineering, and model building.

In this blog post, we’ll walk through the entire process of tackling the Titanic dataset using Python and scikit-learn, covering the following steps:

  1. Load the Data
  2. Explore the Data
  3. Clean the Data
  4. Feature Engineering
  5. Select, Train, and Evaluate a Model
  6. Fine-Tune the Model
  7. Predict and Submit Results

Step 1: Load the Data

First, we need to load the data. The Titanic dataset is available on Kaggle, and you can download it from Titanic — Machine Learning from Disaster. Save the train.csv and test.csv files in your working directory.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the training and test datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Step 2: Explore the Data

Before diving into cleaning and modeling, it’s essential to understand the data.

# Display the first few rows of the training dataset
print(train_df.head())

# Get a summary of the dataset
print(train_df.info())

# Get descriptive statistics
print(train_df.describe())

Output:

   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]
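
Two quick follow-up checks are worth running at this point (a supplementary sketch, not part of the original walkthrough): counting missing values per column, and looking at survival rates broken down by sex and passenger class, which are known to be strong signals in this dataset.

# Count missing values per column: Age, Cabin, and Embarked have gaps
print(train_df.isnull().sum())

# Survival rate by sex and passenger class
print(train_df.groupby(['Sex', 'Pclass'])['Survived'].mean())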

Step 3: Clean the Data

Data cleaning involves handling missing values and converting categorical features to numerical ones.

# Fill missing values for the 'Age' feature with the median age
# (assigning the result back avoids the chained inplace pattern that newer pandas versions warn about)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())

# Fill missing values for the 'Embarked' feature with the most common port
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])
test_df['Embarked'] = test_df['Embarked'].fillna(test_df['Embarked'].mode()[0])

# Fill missing values for the 'Fare' feature in the test set with the median fare
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

# Convert 'Sex' feature to numerical
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})

# Convert 'Embarked' feature to numerical
train_df['Embarked'] = train_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
test_df['Embarked'] = test_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Drop irrelevant features
train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test_df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
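
If you want to verify that the cleaning worked, a quick optional sanity check confirms that the remaining columns have no missing values left:

# Sanity check: the remaining columns should report zero missing values
print(train_df.isnull().sum())
print(test_df.isnull().sum())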

Step 4: Feature Engineering

Feature engineering involves creating new features from existing ones to improve the model’s predictive power.

# Create a new feature 'FamilySize' from 'SibSp' and 'Parch'
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1

# Create a new feature 'IsAlone'
train_df['IsAlone'] = np.where(train_df['FamilySize'] > 1, 0, 1)
test_df['IsAlone'] = np.where(test_df['FamilySize'] > 1, 0, 1)
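
To see whether the engineered features actually carry signal, you can group survival by them. This is an optional check rather than a required step:

# Optional check: survival rate for passengers travelling alone vs. with family
print(train_df.groupby('IsAlone')['Survived'].mean())
print(train_df.groupby('FamilySize')['Survived'].mean())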

Step 5: Select, Train, and Evaluate a Model

We prepare the data for training, train a RandomForestClassifier, and evaluate it on a held-out validation set.

# Prepare the data for training
X = train_df.drop(['Survived'], axis=1)
y = train_df['Survived']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the validation set
y_pred = clf.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation set accuracy: {accuracy:.2f}')

Output:

Validation set accuracy: 0.82
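
A single 80/20 split can give a noisy estimate. If you want a steadier number before tuning, k-fold cross-validation is a common alternative; the snippet below is an optional sketch that scores the same classifier with 5-fold cross-validation on the full training data:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: mean accuracy and spread across folds
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f'Cross-validation accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})')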

Step 6: Fine-Tune the Model

We use grid search to find the best hyperparameters for the model.

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in recent scikit-learn releases
    'max_depth': [4, 6, 8, 10],
    'criterion': ['gini', 'entropy']
}

# Perform grid search
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f'Best parameters found by grid search: {best_params}')

# Train the classifier with the best parameters
# (GridSearchCV already refits the best estimator on the training data, so this refit is optional)
best_clf = grid_search.best_estimator_
best_clf.fit(X_train, y_train)

# Predict on the validation set
y_pred = best_clf.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation set accuracy after tuning: {accuracy:.2f}')

Output:

Best parameters found by grid search: {'criterion': 'gini', 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}
Validation set accuracy after tuning: 0.80
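
It can also be instructive to look at which features the tuned forest relies on most. The short sketch below (optional) ranks them using the model's built-in feature importances:

# Rank features by importance according to the tuned random forest
importances = pd.Series(best_clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))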

Step 7: Predict and Submit Results

Finally, we use the trained model to predict the outcomes on the test set and prepare the submission file for Kaggle.

# Predict on the test set
test_pred = best_clf.predict(test_df)

# Prepare the submission file
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_pred
})

submission.to_csv('titanic_submission.csv', index=False)

Sample Output:

Validation set accuracy: 0.82
Best parameters found by grid search: {'criterion': 'gini', 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}
Validation set accuracy after tuning: 0.80

The script also writes titanic_submission.csv to the working directory, ready to upload to Kaggle.
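
Before uploading, it can be worth reading the file back to confirm it matches the expected Kaggle format (a PassengerId and a Survived column, one row per test passenger). This check is optional:

# Optional check: confirm the submission has two columns and one row per test passenger
check = pd.read_csv('titanic_submission.csv')
print(check.shape)
print(check.head())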

Conclusion

In this blog post, we tackled the Titanic dataset using Python and scikit-learn. We went through the entire process of loading the data, exploring it, cleaning it, performing feature engineering, training a model, fine-tuning it, and finally making predictions. By following these steps, you can build a robust machine learning model and submit your results to Kaggle.

Here is the complete code for your reference:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the training and test datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Display the first few rows of the training dataset
print(train_df.head())

# Get a summary of the dataset
print(train_df.info())

# Get descriptive statistics
print(train_df.describe())

# Fill missing values for the 'Age' feature with the median age
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())

# Fill missing values for the 'Embarked' feature with the most common port
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])
test_df['Embarked'] = test_df['Embarked'].fillna(test_df['Embarked'].mode()[0])

# Fill missing values for the 'Fare' feature in the test set with the median fare
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

# Convert 'Sex' feature to numerical
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})

# Convert 'Embarked' feature to numerical
train_df['Embarked'] = train_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
test_df['Embarked'] = test_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Drop irrelevant features
train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test_df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Create a new feature 'FamilySize' from 'SibSp' and 'Parch'
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1

# Create a new feature 'IsAlone'
train_df['IsAlone'] = np.where(train_df['FamilySize'] > 1, 0, 1)
test_df['IsAlone'] = np.where(test_df['FamilySize'] > 1, 0, 1)

# Prepare the data for training
X = train_df.drop(['Survived'], axis=1)
y = train_df['Survived']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the validation set
y_pred = clf.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation set accuracy: {accuracy:.2f}')

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in recent scikit-learn releases
    'max_depth': [4, 6, 8, 10],
    'criterion': ['gini', 'entropy']
}

# Perform grid search
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f'Best parameters found by grid search: {best_params}')

# Train the classifier with the best parameters
best_clf = grid_search.best_estimator_
best_clf.fit(X_train, y_train)

# Predict on the validation set
y_pred = best_clf.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Validation set accuracy after tuning: {accuracy:.2f}')

# Predict on the test set
test_pred = best_clf.predict(test_df)

# Prepare the submission file
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_pred
})
submission.to_csv('titanic_submission.csv', index=False)

By following these steps, you can create a robust machine learning pipeline for the Titanic dataset, improving your model’s performance through data cleaning, feature engineering, and hyperparameter tuning.

Thanks for reading! If you enjoyed the article, make sure to clap! You can connect with me on LinkedIn or follow me on Twitter. Thank you!
