Titanic Survival Prediction (Kaggle Challenge)

Abhisangh Singh Arora
7 min read · Mar 11, 2023


https://www.kaggle.com/competitions/titanic/overview

Introduction:

The sinking of the Titanic is one of the most infamous disasters in maritime history, claiming the lives of more than 1,500 passengers and crew. The tragedy has inspired many books, movies, and documentaries. The Titanic ML competition on Kaggle gives data enthusiasts an opportunity to dive into machine learning and explore the factors that determined which passengers survived that fateful night.

The Challenge:

The challenge is simple. Participants are required to use machine learning techniques to develop a predictive model that can identify which passengers on board the Titanic were more likely to survive. The data provided includes information such as the name, age, gender, socio-economic class, and other relevant details of the passengers.

Data Exploration:

The first step in any data analysis task is to explore the data. The Titanic dataset provided on Kaggle contains information on 891 passengers. We can use tools like Python and Pandas to load the data and perform exploratory analysis.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_data = pd.read_csv('./data/train.csv')
test_data = pd.read_csv('./data/test.csv')
print(train_data.shape)
print(test_data.shape)
print(train_data.head())
print(test_data.head())

women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

We can use various visualizations to gain insights into the data. For instance, we can plot the number of passengers who survived versus those who did not. We can also plot the survival rate against different variables such as age, gender, and socio-economic class. These visualizations can help us identify any patterns or trends in the data.
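
For example, a few quick matplotlib plots along these lines would do the job (a minimal sketch using the standard Kaggle column names; the choice of plots and styling is my own):

import matplotlib.pyplot as plt

# Count of survivors vs. non-survivors
train_data['Survived'].value_counts().sort_index().plot(kind='bar')
plt.xticks([0, 1], ['Did not survive', 'Survived'], rotation=0)
plt.ylabel('Number of passengers')
plt.title('Survival Counts')
plt.show()

# Survival rate by sex and by passenger class
train_data.groupby('Sex')['Survived'].mean().plot(kind='bar')
plt.ylabel('Survival rate')
plt.title('Survival Rate by Sex')
plt.show()

train_data.groupby('Pclass')['Survived'].mean().plot(kind='bar')
plt.ylabel('Survival rate')
plt.title('Survival Rate by Passenger Class')
plt.show()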

Feature Engineering:

Feature engineering is a crucial step in any machine-learning project. It involves selecting and transforming the relevant features in the dataset to improve the performance of the model. In the Titanic dataset, we can create new features such as family size, title, and deck from the existing data. We can also perform imputation to fill in any missing values in the data.
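
As a rough sketch (illustrative only, not part of the submission code below, and assuming the standard Kaggle column names such as 'SibSp', 'Parch', 'Name', and 'Cabin'), these features could be built like this:

# Derive a few extra features on both the train and test DataFrames
for df in (train_data, test_data):
    # Family size = siblings/spouses + parents/children + the passenger themselves
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    # Title extracted from the name, e.g. "Braund, Mr. Owen Harris" -> "Mr"
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    # Deck is the first letter of the cabin number; missing cabins become 'U' (unknown)
    df['Deck'] = df['Cabin'].str[0].fillna('U')
    # Simple median imputation for missing ages
    df['Age'] = df['Age'].fillna(df['Age'].median())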



from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Model Selection:

Once we have prepared the data, we can proceed to select an appropriate machine-learning algorithm. There are several algorithms that we can use for this task, such as logistic regression, decision trees, and random forests. We can also use ensemble techniques such as boosting and bagging to improve the performance of the model.
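
As an illustration (not the code used for the submission above), a few candidate models could be compared with cross-validation on the same features; the specific models and settings here are my own choice:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X and y come from the feature block above
candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=1),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f} (+/- {scores.std():.3f})')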

Model Evaluation:

After training the model, we need to evaluate its performance on a validation dataset. We can use metrics such as accuracy, precision, and recall to measure the performance of the model. We can also use techniques such as cross-validation to obtain a more accurate estimate of the model’s performance.
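
A minimal sketch of such an evaluation, reusing the X and y defined earlier (the 80/20 split and random seed here are arbitrary choices for illustration):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hold out part of the training data as a validation set
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1)

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
clf.fit(X_tr, y_tr)
preds = clf.predict(X_va)

print('Accuracy :', accuracy_score(y_va, preds))
print('Precision:', precision_score(y_va, preds))
print('Recall   :', recall_score(y_va, preds))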

My Contribution:

Colab link: https://colab.research.google.com/drive/1P31o9EZgJLChp44hYAeNM5PsZGoky0fz?usp=sharing

1. Preprocessing the Data:

The first step is to preprocess the data. We fill in the missing values in the ‘Age’ and ‘Fare’ columns with the median value of each column, and the missing values in the ‘Embarked’ column with the most common value. We then encode the ‘Sex’ column using LabelEncoder and convert the ‘Embarked’ column into one-hot encoded features using get_dummies.

from sklearn.preprocessing import LabelEncoder

# Preprocess the data (train_df and test_df are the train/test DataFrames read with pd.read_csv)
train_df['Age'].fillna(train_df['Age'].median(), inplace=True)
test_df['Age'].fillna(test_df['Age'].median(), inplace=True)
test_df['Fare'].fillna(test_df['Fare'].median(), inplace=True)

train_df['Embarked'].fillna(train_df['Embarked'].mode()[0], inplace=True)
test_df['Embarked'].fillna(test_df['Embarked'].mode()[0], inplace=True)

encoder = LabelEncoder()
train_df['Sex'] = encoder.fit_transform(train_df['Sex'])
test_df['Sex'] = encoder.transform(test_df['Sex'])

train_df = pd.get_dummies(train_df, columns=['Embarked'])
test_df = pd.get_dummies(test_df, columns=['Embarked'])

2. Splitting the Data:

The next step is to split the data into training and validation sets using the train_test_split function from scikit-learn. We use 85% of the data for training and 15% for validation.

# Split the data into train and validation sets
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    train_df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1),
    train_df['Survived'],
    test_size=0.15,
    random_state=0
)

3. Training the Random Forest Classifier:

The next step is to train the Random Forest classifier. We use up to 130 trees and a random state of 0, retraining the model for each number of trees from 1 to 130 and tracking the accuracy scores for both the training and validation sets.

# Train a Random Forest classifier
from sklearn.metrics import accuracy_score

n_estimators = 130
rf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)

train_accuracy_scores = []
val_accuracy_scores = []

# Retrain the model for each number of trees and track accuracy scores
for n_trees in range(1, n_estimators + 1):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X_train, y_train)
    train_predictions = rf.predict(X_train)
    train_accuracy = accuracy_score(y_train, train_predictions)
    val_predictions = rf.predict(X_val)
    val_accuracy = accuracy_score(y_val, val_predictions)
    train_accuracy_scores.append(train_accuracy)
    val_accuracy_scores.append(val_accuracy)

print(f'Validation Accuracy Score: {val_accuracy:.4f}')
print(f'Training Accuracy Score: {train_accuracy:.4f}')

4. Making Predictions:

The next step is to make predictions on the test data using the trained model. We drop the unnecessary columns (‘PassengerId’, ‘Name’, ‘Ticket’, ‘Cabin’) from the test data and predict the survival outcomes for each passenger.

# Make predictions on the test data
test_predictions = rf.predict(test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1))

5. Saving Predictions:

The next step is to save the predictions to a CSV file. We create a new data frame with the passenger ID and survival outcome and save it to a CSV file.

# Save the predictions to a CSV file
output_df = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': test_predictions})
output_df.to_csv('/content/drive/MyDrive/titanic/submission.csv', index=False)

6. Visualizing Model Performance:

Finally, we visualize the performance of the Random Forest Classifier using various plots such as the accuracy scores for different numbers of trees, feature importances, and the confusion matrix.

Results:

https://www.kaggle.com/code/abhisangharora/notebooka0fc270f1e

Validation Accuracy Score: 0.8582

Training Accuracy Score: 0.9789

# Plot the accuracy scores for different numbers of trees
import matplotlib.pyplot as plt

plt.plot(range(1, n_estimators + 1), train_accuracy_scores, label='Training Accuracy')
plt.plot(range(1, n_estimators + 1), val_accuracy_scores, label='Validation Accuracy')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy Score')
plt.title('Random Forest Classifier Performance')
plt.legend()
plt.show()

# Plot the feature importances
feature_importances = rf.feature_importances_
sorted_idx = feature_importances.argsort()

plt.barh(range(X_train.shape[1]), feature_importances[sorted_idx])
plt.yticks(range(X_train.shape[1]), X_train.columns[sorted_idx])
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Random Forest Classifier Feature Importances')
plt.show()

# Plot the confusion matrix
from sklearn.metrics import confusion_matrix

val_confusion_matrix = confusion_matrix(y_val, val_predictions)
plt.imshow(val_confusion_matrix, cmap='binary', interpolation='none')
plt.colorbar()
plt.xticks([0, 1], ['Not Survived', 'Survived'])
plt.yticks([0, 1], ['Not Survived', 'Survived'])
plt.xlabel('Predicted Class')
plt.ylabel('True Class')
plt.title('Validation Confusion Matrix')
plt.show()

7. Hyperparameters:

Random Forest Classifier has several hyperparameters that can be tuned to improve its performance. Some of the important hyperparameters used in this code are:

n_estimators: This hyperparameter controls the number of decision trees in the forest. Increasing the number of trees generally improves accuracy but also increases the computation time. In this case, it is set to 130.

test_size: In the code, the test_size parameter of the train_test_split function is set to 0.15, meaning 15% of the data is held out for validation and the remaining 85% is used for training. Splitting the data this way is a standard technique for estimating how well a model generalizes to new, unseen data.

random_state: This hyperparameter sets the seed for the random number generator. Fixing the seed ensures that the results are reproducible. It is set to 0 in my case.
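
Beyond setting these values by hand, they could also be tuned automatically. A minimal sketch with GridSearchCV (the parameter grid below is my own illustrative choice, not the one used in the notebook):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 130, 200],
    'max_depth': [None, 5, 10],
}

# Search over the grid with 5-fold cross-validation on the training split
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)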

Conclusion:

In this blog, we learned how to use a Random Forest classifier on the Titanic dataset using Python. We discussed the various steps involved in the process and looked at some of the important hyperparameters used in the code. Random Forest is a powerful algorithm that can be applied to a wide range of classification problems, and by understanding how it works and tuning its hyperparameters, we can improve the accuracy and stability of our predictions.

References:

https://stackoverflow.com/questions/49147774/what-is-random-state-in-sklearn-model-selection-train-test-split-example
