From Data Cleaning to Random Forests: Building a Titanic Survival Predictor

Shivam Maurya
4 min read · Feb 14, 2023

Data analysis is the foundation of any successful machine learning model. In this article, we walk through Python code that performs exploratory data analysis (EDA) on the Titanic dataset and then trains a simple classifier on it.

The Titanic dataset is a popular dataset used for predictive modeling. It contains information about the passengers on the Titanic, including their age, sex, ticket class, fare, and survival status. Our code uses pandas to load the dataset into a DataFrame and then performs several operations on it.

You can download the dataset from here.

To begin with, the code uses the read_csv function of pandas to load the dataset into a DataFrame.

Screenshot by Author
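As a rough text version of the loading step, here is a minimal sketch. The inline CSV is a made-up stand-in with Titanic-style columns so the snippet runs on its own; with the real file you would pass its path instead (the name "train.csv" is an assumption).

```python
import io

import pandas as pd

# Made-up stand-in for the Titanic CSV, for illustration only.
# With the downloaded file: df = pd.read_csv("train.csv")  (assumed filename)
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
4,0,3,male,35,8.05
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.head())  # first rows; the empty Age cell is read as NaN
```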

The info method is then used to display the column names, the number of non-null values, and the data type of each column. This is followed by the describe method, which displays summary statistics for each numeric column: the count, mean, standard deviation, minimum, quartiles, and maximum.

Screenshot by Author
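The two calls might look like the sketch below. The small DataFrame is invented for illustration and only borrows the Titanic column names.

```python
import pandas as pd

# Made-up rows with Titanic-style columns, for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

df.info()                # column names, non-null counts, dtypes
summary = df.describe()  # numeric columns only by default
print(summary)
```

Note that the text column "Sex" is absent from the describe output; passing include="all" would add it.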

The columns attribute is used to display the column names, and the dtypes attribute displays the data types of each column. The isnull method is then used to check for missing values in the DataFrame.

Screenshot by Author
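A minimal sketch of these inspections, again on invented sample rows. Summing the boolean result of isnull per column is one common way to read it and is an addition here, not necessarily what the original code did.

```python
import pandas as pd

# Made-up rows with Titanic-style columns, for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

print(df.columns.tolist())  # column names
print(df.dtypes)            # data type of each column

# isnull() returns a True/False DataFrame; summing counts missing values per column.
missing = df.isnull().sum()
print(missing)
```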

The any method is used with axis=1 to check for any rows with missing values, and the resulting rows are displayed using indexing.
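The row filter described above can be sketched as follows, using invented sample rows. The variable name rows_with_missing is hypothetical.

```python
import pandas as pd

# Made-up rows with Titanic-style columns, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# any(axis=1) collapses each row to True if at least one of its cells is missing.
has_missing = df.isnull().any(axis=1)

# Boolean indexing keeps only the flagged rows.
rows_with_missing = df[has_missing]
print(rows_with_missing)
```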

Next, the value_counts method is used to count the number of survivors and non-survivors in the dataset. The counts are stored in a dictionary whose keys are relabeled "dead" and "alive" for readability. The replace method is then used to replace the numeric values in the "Survived" column with their corresponding labels.

Screenshot by Author
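One way this relabeling could be written is sketched below. The data is invented, and the assumption that 0 means "dead" and 1 means "alive" follows the usual encoding of the Titanic "Survived" column.

```python
import pandas as pd

# Made-up survival flags, for illustration only (0 = died, 1 = survived).
df = pd.DataFrame({"Survived": [0, 1, 1, 0, 1]})

labels = {0: "dead", 1: "alive"}

# value_counts returns a Series; convert it to a dict and relabel the keys.
counts = df["Survived"].value_counts().to_dict()
counts_labeled = {labels[key]: value for key, value in counts.items()}
print(counts_labeled)

# Swap the numeric codes in the column for the readable labels.
df["Survived"] = df["Survived"].replace(labels)
print(df)
```

After this step the column holds strings, so any later numeric comparison or model training needs the original 0/1 codes or a mapping back to them.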

The mean method is used to calculate the mean age of the passengers, and the interpolate method is used to fill in missing age values using linear interpolation. The resulting DataFrame is then displayed using the head method.

Screenshot by Author
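A small sketch of the age handling, on invented values. Linear interpolation fills each gap from the neighboring rows, so the result depends on row order, and a missing value in the very first row would stay missing.

```python
import pandas as pd

# Made-up ages with gaps, for illustration only.
df = pd.DataFrame({"Age": [22.0, None, 38.0, None, None, 50.0]})

# mean() skips missing values by default.
mean_age = df["Age"].mean()
print(mean_age)

# interpolate() defaults to method="linear": gaps are filled evenly
# between the nearest known values above and below.
df["Age"] = df["Age"].interpolate()
print(df.head())
```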

The code also uses seaborn and matplotlib to create a bar plot that shows the number of male and female survivors and non-survivors. The value_counts method is used to count the passengers in each group, and the resulting counts are used to create a DataFrame. The barplot function of seaborn is then used to create the plot.

Screenshot by Author
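A plot along these lines could be produced as in the sketch below. The rows are invented, the "count" column name and the choice of hue are assumptions, and the Agg backend is set only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up passengers, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": ["dead", "alive", "alive", "dead", "alive", "dead"],
})

# Count each (Sex, Survived) combination and turn the result into a DataFrame.
counts = df[["Sex", "Survived"]].value_counts().reset_index(name="count")
print(counts)

# One bar per sex, split by survival status.
ax = sns.barplot(data=counts, x="Sex", y="count", hue="Survived")
ax.set_ylabel("Passengers")
plt.tight_layout()
# In an interactive session, plt.show() would display the figure.
```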

The code also displays the average, minimum, and maximum age of the passengers, along with the average, minimum, and maximum fare.

Screenshot by Author
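These summary figures can be computed in one call, as in this sketch on invented rows. Using agg with a list of function names is one option among several; calling mean, min, and max separately gives the same numbers.

```python
import pandas as pd

# Made-up rows with Titanic-style columns, for illustration only.
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# One row per statistic, one column per variable; missing ages are skipped.
stats = df[["Age", "Fare"]].agg(["mean", "min", "max"])
print(stats)
```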

The code then calculates the number of passengers who survived and were over the age of 50.

Screenshot by Author
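A count like this is typically done with a combined boolean mask. The sketch below uses invented rows and assumes "Survived" still holds 0/1 codes; if the column was relabeled earlier, the comparison would be against "alive" instead.

```python
import pandas as pd

# Made-up rows, for illustration only (1 = survived).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 58.0, None, 64.0, 51.0],
})

# Both conditions must hold; a missing age compares as False and is excluded.
mask = (df["Survived"] == 1) & (df["Age"] > 50)
survivors_over_50 = df[mask]

n_survivors_over_50 = len(survivors_over_50)
print(n_survivors_over_50)
```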

The drop method is used to remove unnecessary columns from the DataFrame, and the fillna method is used to fill in any missing values with the mean value of the column. The map method is used to convert the "Sex" column to numeric values.

Screenshot by Author
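The cleanup could look like the following sketch. The article does not say which columns were dropped or which codes were used for "Sex", so dropping "Name" and "Ticket" and mapping male to 0 and female to 1 are assumptions, and the rows are invented.

```python
import pandas as pd

# Made-up rows with Titanic-style columns, for illustration only.
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 36.0],
})

# Assumed choice of columns to drop: free-text fields a model cannot use directly.
df = df.drop(columns=["Name", "Ticket"])

# Replace missing ages with the mean of the known ages.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Assumed encoding: male -> 0, female -> 1.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
print(df)
```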

The code then uses the train_test_split function from scikit-learn to split the data into training and testing sets. A RandomForestClassifier is trained on the training set, and the fitted model is used to make predictions on the testing set. The accuracy_score function is then used to calculate the accuracy of those predictions, and the resulting score is displayed.

Screenshot by Author
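The modeling step can be sketched end to end as below. The data is synthetic and generated on the spot, so the printed accuracy says nothing about the real Titanic dataset; the feature list, the 80/20 split, the 100 trees, and the random seeds are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic features (not real data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 100, n),
})
# Synthetic label loosely tied to Sex, with 20% noise, so there is signal to learn.
y = ((X["Sex"] == 1) ^ (rng.random(n) < 0.2)).astype(int)

# Hold out 20% of the rows for testing (assumed split size).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the forest on the training rows only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score predictions against the held-out labels.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```

With the real dataset, X would be the cleaned feature columns and y the numeric "Survived" column.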

In conclusion, this code walks through a basic EDA of the Titanic dataset and demonstrates the use of pandas, seaborn, and scikit-learn to perform data analysis and machine learning tasks. It can be used as a starting point for anyone interested in exploring the Titanic dataset or performing EDA on other datasets.

You can download the code from here

Shivam Maurya

Data Scientist | Founder & CEO | Student | Mentor | Programmer | Python Professional Coder