From Data Cleaning to Random Forests: Building a Titanic Survival Predictor
Exploratory data analysis (EDA) is the foundation of most successful machine learning models. In this article, we walk through Python code that performs EDA on the Titanic dataset and then trains a random forest classifier on the cleaned data.
The Titanic dataset is a popular dataset used for predictive modeling. It contains information about the passengers on the Titanic, including their age, sex, ticket class, fare, and survival status. Our code uses pandas to load the dataset into a DataFrame and then performs several operations on it.
You can download the dataset from here.
To begin with, the code uses the pandas read_csv function to load the dataset into a DataFrame.
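As a minimal sketch of this loading step: the snippet below reads a few made-up rows from an in-memory string so that it runs without the downloaded file. The column names follow the public Kaggle Titanic layout, which is an assumption about the file used here, and the values are placeholders rather than real passenger records.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: a few made-up rows in the
# column layout of the public Kaggle Titanic CSV (an assumption).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,Fare
1,0,3,Passenger A,male,22,7.25
2,1,1,Passenger B,female,38,71.28
3,1,3,Passenger C,female,,7.92
4,0,2,Passenger D,male,54,26.00
"""

# read_csv accepts a path, a URL, or a file-like object. With the real
# dataset this would be a path to the downloaded CSV file instead.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
```

The empty Age field in the third row is read as NaN, which is how missing values show up in the later steps.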
The info method is then used to display the column names, the number of non-null values, and the data type of each column. This is followed by the describe method, which displays summary statistics for each numeric column, including the count, mean, standard deviation, minimum, quartiles, and maximum.
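A short sketch of these two calls, assuming a small hand-made DataFrame in place of the real data (the column names and values are illustrative only):

```python
import pandas as pd

# Made-up sample standing in for the Titanic DataFrame (an assumption).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, float("nan"), 54.0],
    "Fare": [7.25, 71.28, 7.92, 26.00],
})

# info() prints column names, non-null counts, and dtypes.
df.info()

# describe() summarizes numeric columns only by default, so "Sex" is skipped.
summary = df.describe()
print(summary)
```

Note that the count row of describe excludes missing values, so a count lower than the number of rows is an early hint that a column has gaps.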
The columns attribute is used to display the column names, and the dtypes attribute displays the data type of each column. The isnull method is then used to check for missing values in the DataFrame.
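These inspections might look like the following sketch, again on a made-up sample. Summing the boolean mask per column is an addition for illustration, not necessarily what the original code does.

```python
import pandas as pd

# Made-up sample standing in for the Titanic DataFrame (an assumption).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, float("nan"), 54.0],
})

print(df.columns.tolist())  # column names as a plain list
print(df.dtypes)            # one dtype per column

# isnull() returns a DataFrame of booleans with the same shape as df.
missing_mask = df.isnull()

# Summing the mask counts the missing values in each column.
missing_per_column = missing_mask.sum()
print(missing_per_column)
```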
The any method is used with axis=1 to flag every row that contains at least one missing value, and those rows are displayed using boolean indexing.
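A minimal sketch of this row filter, assuming a small made-up DataFrame with gaps in two different columns:

```python
import pandas as pd

# Made-up sample with gaps in two different columns (an assumption).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, float("nan"), 26.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, float("nan")],
})

# any(axis=1) reduces across the columns of each row, giving one boolean
# per row: True when at least one value in that row is missing.
has_missing = df.isnull().any(axis=1)

# Boolean indexing keeps only the flagged rows.
rows_with_missing = df[has_missing]
print(rows_with_missing)
```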
Next, the value_counts method is used to count the number of survivors and non-survivors in the dataset, and the resulting counts are relabeled "dead" and "alive" for readability. The replace method is then used to replace the numeric values in the "Survived" column with their corresponding labels.
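One way this step could be written is sketched below. The sample values are made up, and converting the counts to a dictionary before relabeling is an assumption about how the original code handles them.

```python
import pandas as pd

# Made-up sample; 0 means the passenger died, 1 means they survived.
df = pd.DataFrame({"Survived": [0, 1, 1, 0, 0]})

labels = {0: "dead", 1: "alive"}

# value_counts() returns a Series indexed by the distinct values.
counts = df["Survived"].value_counts().to_dict()

# Swap the numeric keys for readable labels.
labelled_counts = {labels[key]: value for key, value in counts.items()}
print(labelled_counts)

# replace() applies the same mapping to the column itself.
df["Survived"] = df["Survived"].replace(labels)
print(df["Survived"].tolist())
```

Replacing the numeric codes makes tables and plots easier to read, but a model needs the numeric form, so the labels have to be mapped back (or the original column kept) before training.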
The mean method is used to calculate the mean age of the passengers, and the interpolate method is used to fill in missing age values using linear interpolation. The first rows of the resulting DataFrame are then displayed using the head method.
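The sketch below illustrates the mean and the interpolation on a made-up Age column. It relies on the pandas defaults (mean skips NaN, interpolate is linear), which is assumed to match the original code.

```python
import pandas as pd

# Made-up ages with gaps (an assumption, not real passenger data).
df = pd.DataFrame({"Age": [22.0, float("nan"), 30.0, float("nan"), float("nan"), 54.0]})

# mean() ignores NaN by default, so only the three known ages are averaged.
mean_age = df["Age"].mean()
print(round(mean_age, 2))

# interpolate() defaults to linear interpolation: each gap is filled by
# spacing values evenly between the nearest known neighbours in row order.
df["Age"] = df["Age"].interpolate()

print(df.head())
```

Because linear interpolation depends on row order, the filled ages reflect whichever passengers happen to be adjacent in the file rather than anything about the passenger. That is worth keeping in mind when interpreting the result.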
The code also uses seaborn and matplotlib to create a bar plot that shows the number of male and female survivors and non-survivors. The value_counts method is used to count the number of male and female passengers, and the resulting data is used to create a DataFrame. The barplot function of seaborn is then used to create the plot.
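A possible shape for the plotting step is sketched here. The sample rows are made up, the "Count" column name is hypothetical, and the grouping by both Sex and Survived is an assumption about how the original code builds its plotting DataFrame.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; omit this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample with the relabeled "Survived" column (an assumption).
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": ["dead", "alive", "alive", "dead", "alive", "dead"],
})

# Count each (Sex, Survived) combination and flatten the result into a
# tidy DataFrame with one row per group.
counts = df[["Sex", "Survived"]].value_counts().reset_index(name="Count")

# One bar per sex, split by survival status.
ax = sns.barplot(data=counts, x="Sex", y="Count", hue="Survived")
ax.set_title("Survival by sex")
plt.tight_layout()
```

In an interactive session, calling plt.show() after this displays the figure.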
The code also displays the average age of the passengers, as well as the minimum and maximum ages. It also displays the average, minimum, and maximum fares.
The code then calculates the number of passengers who survived and were over the age of 50.
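These summary figures and the over-50 count can be sketched as follows. The sample is made up, and the snippet assumes the "Survived" column still holds the numeric codes 0 and 1 at this point.

```python
import pandas as pd

# Made-up sample (an assumption); Survived is 1 for survivors.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, 58.0, 54.0, 62.0],
    "Fare": [7.25, 71.28, 26.55, 51.86, 80.00],
})

# agg() computes several statistics in one call.
age_stats = df["Age"].agg(["mean", "min", "max"])
fare_stats = df["Fare"].agg(["mean", "min", "max"])
print(age_stats)
print(fare_stats)

# Combine two conditions with & (each wrapped in parentheses) to select
# passengers who survived and were older than 50.
older_survivors = df[(df["Survived"] == 1) & (df["Age"] > 50)]
n_older_survivors = len(older_survivors)
print(n_older_survivors)
```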
The drop method is used to remove unnecessary columns from the DataFrame, and the fillna method is used to fill in any missing values with the mean value of the column. The map method is used to convert the "Sex" column to numeric values.
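A minimal sketch of this preparation step follows. Which columns count as unnecessary and which number each sex maps to are not stated above, so the choices here (dropping PassengerId and Name, male as 0 and female as 1) are assumptions, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample in a Kaggle-style layout (an assumption).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, float("nan"), 26.0, 36.0],
    "Survived": [0, 1, 1, 0],
})

# Drop identifier-like columns that carry no predictive signal
# (the exact columns dropped are an assumption).
df = df.drop(columns=["PassengerId", "Name"])

# Fill missing ages with the mean of the known ages.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Encode the text column as integers (the 0/1 assignment is an assumption).
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

print(df)
```

Values not present in the mapping dictionary become NaN after map, so it is worth checking the column for new missing values afterwards.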
The code then uses the train_test_split function from scikit-learn to split the data into training and testing sets. A RandomForestClassifier is trained on the training set, and the resulting model is used to make predictions on the testing set. The accuracy_score function is used to calculate the accuracy of the model, and the resulting score is displayed.
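The modeling step can be sketched as below. To stay runnable without the CSV, it trains on synthetic data whose label is loosely tied to the Sex column, so the printed accuracy says nothing about the real Titanic data. The feature set, the 80/20 split, the number of trees, and the random seeds are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic features (an assumption).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 80, size=n).round(),
    "Fare": rng.uniform(5, 100, size=n).round(2),
})

# Synthetic label: follows Sex, with roughly 10% of labels flipped as noise.
noise = rng.random(n) < 0.1
y = ((X["Sex"] == 1) ^ noise).astype(int).rename("Survived")

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the forest on the training split only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score predictions against the held-out labels.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.3f}")
```

Fixing random_state in both the split and the model makes the score reproducible between runs, which helps when comparing preprocessing choices.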
In conclusion, this code provides a comprehensive EDA of the Titanic dataset and demonstrates the use of pandas, seaborn, and scikit-learn to perform data analysis and machine learning tasks. This code can be used as a starting point for anyone interested in exploring the Titanic dataset or performing EDA on other datasets.
You can download the code from here