Analysis & Passenger Satisfaction Prediction for a US Airline: Classification Method
Hello everyone! This time I want to share my analysis of a data set that I found on Kaggle. In my last story, I explained how to use a regression method to predict residential prices in Washington, DC. This time the data set is a survey of passenger satisfaction for a US airline. Generally, a classification method is used when the output is a category. In this case, based on the survey, the question is whether customers feel “satisfied” or “neutral or dissatisfied”. For more detail, let’s see what’s in this data set!
Explanation About Data
This data set has 64940 rows and 24 columns. I will clean the data by filling the NaN values in the “Arrival Delay in Minutes” column. I could probably use all features for my prediction, but I will decide which features to use based on the heat map in the Exploratory Data Analysis part below.
RangeIndex: 64940 entries, 0 to 64939
Data columns (total 24 columns):
id 64940 non-null int64
satisfaction_v2 64940 non-null object
Gender 64940 non-null object
Customer Type 64940 non-null object
Age 64940 non-null int64
Type of Travel 64940 non-null object
Class 64940 non-null object
Flight Distance 64940 non-null int64
Seat comfort 64940 non-null int64
Departure/Arrival time convenient 64940 non-null int64
Food and drink 64940 non-null int64
Gate location 64940 non-null int64
Inflight wifi service 64940 non-null int64
Inflight entertainment 64940 non-null int64
Online support 64940 non-null int64
Ease of Online booking 64940 non-null int64
On-board service 64940 non-null int64
Leg room service 64940 non-null int64
Baggage handling 64940 non-null int64
Checkin service 64940 non-null int64
Cleanliness 64940 non-null int64
Online boarding 64940 non-null int64
Departure Delay in Minutes 64940 non-null int64
Arrival Delay in Minutes 64755 non-null float64
dtypes: float64(1), int64(18), object(5)
memory usage: 11.9+ MB
Here is an explanation of each column in the raw data set.
- Satisfaction : Airline satisfaction level (satisfied, neutral or dissatisfied).
- Age : The actual age of the passengers.
- Gender : Gender of the passengers (Female, Male).
- Type of Travel : Purpose of the flight of the passengers (Personal Travel, Business Travel).
- Class : Travel class in the plane of the passengers (Business, Eco, Eco Plus).
- Customer Type : The customer type (Loyal customer, disloyal customer).
- Flight Distance : The flight distance of this journey.
- Inflight Wifi Service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5).
- Ease of Online Booking : Satisfaction level of online booking.
- Inflight Service : Satisfaction level of inflight service.
- Online Boarding : Satisfaction level of online boarding.
- Inflight Entertainment: Satisfaction level of inflight entertainment.
- Food and drink: Satisfaction level of Food and drink.
- Seat comfort: Satisfaction level of Seat comfort.
- On-board service: Satisfaction level of On-board service.
- Leg room service: Satisfaction level of Leg room service.
- Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient.
- Baggage handling: Satisfaction level of baggage handling.
- Gate location: Satisfaction level of Gate location.
- Cleanliness: Satisfaction level of Cleanliness.
- Check-in service: Satisfaction level of Check-in service.
- Departure Delay in Minutes: Delay at departure, in minutes.
- Arrival Delay in Minutes: Delay at arrival, in minutes.
First, I check for NaN values. The only column in this data set that has them is “Arrival Delay in Minutes” (185 of its 64940 values are missing). I fill these NaN values using Departure Delay in Minutes, grouped by travel class.
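One way to read that step is sketched below with pandas: a missing arrival delay is filled with the average departure delay of passengers in the same travel class. The tiny DataFrame is made up for illustration, and using the class mean (rather than the median, or the row's own departure delay) is an assumption, not necessarily what the original notebook does.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; the real data set has 64940 rows.
df = pd.DataFrame({
    "Class": ["Business", "Business", "Eco", "Eco", "Eco Plus", "Eco Plus"],
    "Departure Delay in Minutes": [10, 30, 0, 20, 5, 15],
    "Arrival Delay in Minutes": [12.0, np.nan, 0.0, np.nan, np.nan, 14.0],
})

# Count missing values per column.
print(df.isna().sum())

# Assumption: the fill value is the mean departure delay of each travel class.
class_mean_delay = df.groupby("Class")["Departure Delay in Minutes"].transform("mean")
df["Arrival Delay in Minutes"] = df["Arrival Delay in Minutes"].fillna(class_mean_delay)

print(df)
```

Because `transform("mean")` returns one value per row, `fillna` can align it with the original column and only the missing entries are touched.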
Exploratory Data Analysis
After data preprocessing, I have a clean data set that is ready for analysis and for prediction with the features and target. First, I want to present the distribution of the passengers.
Customer Count by Class & Age
The travel class of the passengers is divided into three classes: Business, Economy Plus and Economy. Ages in this data range from 7 to 85 years. Travel class is dominated by business class (47%) and economy class (45%); the rest (about 8%) are economy plus.
From all the graphs we can conclude that each class has its own market share based on the age of the passengers. We could offer promotions, discount packages or perhaps free tickets based on flights already taken, targeted at each class and its own market share. This can be shared and discussed with the marketing department.
What travel class is used by the loyal customer?
Customer type is divided into two types: loyal customers and disloyal customers. In the data, loyal customers dominate (82%), while disloyal customers make up 18%.
Count Type of Travel by Class
Type of travel means the purpose of the passengers' flight. It is divided into two types: business travel and personal travel. In the data, business travel dominates (69%), while personal travel makes up 31%.
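Percentages like the ones in this section can be computed with a few lines of pandas. The sketch below uses a made-up sample, so its numbers do not match the real shares; the column names follow the data description above.

```python
import pandas as pd

# Made-up sample; the category labels follow the column descriptions above.
df = pd.DataFrame({
    "Class": ["Business", "Business", "Eco", "Eco", "Eco Plus", "Business", "Eco", "Business"],
    "Customer Type": ["Loyal customer"] * 6 + ["disloyal customer"] * 2,
    "Type of Travel": ["Business Travel", "Business Travel", "Personal Travel",
                       "Personal Travel", "Personal Travel", "Business Travel",
                       "Business Travel", "Business Travel"],
})

# Share of each category in percent.
class_share = df["Class"].value_counts(normalize=True) * 100
customer_share = df["Customer Type"].value_counts(normalize=True) * 100

# Number of passengers per travel purpose and travel class.
travel_by_class = pd.crosstab(df["Type of Travel"], df["Class"])

print(class_share.round(1), customer_share.round(1), travel_by_class, sep="\n\n")
```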
Correlation for each Column
To better present the correlation between columns, I used a heatmap: “heatmap is a graphical representation of data that uses a system of color-coding to represent different values.” (optimizely.com). The darker the color (dark violet in this graph), the smaller the correlation magnitude. The brighter the color (light cream in this graph), the larger the correlation magnitude.
From this heatmap, I want to use the features whose correlation with the target, passenger satisfaction, is greater than 0.1.
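The selection rule can also be applied directly to the correlation matrix instead of reading it off the plot. The sketch below runs on synthetic stand-in data, and the target column name `satisfied` is hypothetical. Thresholding on the absolute value is an assumption, so that strongly negative correlations are kept too.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: one rating built from the target, one generated independently.
satisfied = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "satisfied": satisfied,
    "Seat comfort": np.clip(satisfied * 2 + rng.integers(1, 4, size=n), 0, 5),
    "Gate location": rng.integers(1, 6, size=n),
})

# Correlation of every column with the target, with the target itself dropped.
corr_with_target = df.corr()["satisfied"].drop("satisfied")

# Assumption: compare the absolute correlation with the 0.1 threshold.
selected = corr_with_target[corr_with_target.abs() > 0.1].index.tolist()

print(corr_with_target.round(2))
print(selected)
```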
As the last step, to build a model on this data set, I will try several classification models, namely Logistic Regression, Decision Tree Classifier, Random Forest Classifier and Gradient Boosting Classifier. The steps for modeling are:
- Find the best algorithm for predicting passenger satisfaction.
- Predict with the best model and its tuned hyperparameters.
- Evaluate with recall and precision.
Target and Features
These are the target (y) and features (X) for the modeling. I chose the features from the heatmap that I made in the exploratory data analysis.
- Target is satisfaction of passengers.
- Features are Gender, Customer Type, Age, Type of Travel, Class, Seat comfort, Food and drink, Inflight wifi service, Inflight entertainment, Online support, Ease of Online booking, On-board service, Leg room service, Baggage handling, Checkin service, Cleanliness, Online boarding.
In general, feature engineering here means converting categorical values (in the categorical columns) into numerical values.
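As a minimal sketch of that conversion, the example below turns the target into 0/1 and one-hot encodes two categorical columns with `pd.get_dummies`. The three sample rows are made up, and one-hot encoding is an assumption: the original code may well use label encoding or a manual mapping instead.

```python
import pandas as pd

# Made-up sample rows using column names from the data set.
df = pd.DataFrame({
    "satisfaction_v2": ["satisfied", "neutral or dissatisfied", "satisfied"],
    "Gender": ["Female", "Male", "Female"],
    "Class": ["Business", "Eco", "Eco Plus"],
    "Age": [35, 52, 24],
})

# Binary target: 1 for "satisfied", 0 for "neutral or dissatisfied".
y = (df["satisfaction_v2"] == "satisfied").astype(int)

# One-hot encode the categorical feature columns; numeric columns pass through unchanged.
X = pd.get_dummies(
    df.drop(columns="satisfaction_v2"),
    columns=["Gender", "Class"],
    dtype=int,
)

print(X.columns.tolist())
```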
Find Best Algorithm
After feature engineering, the data is ready to be used for modeling. I will compare four algorithms to find which one is best at predicting this data.
Of all the algorithms I used, the Random Forest Classifier has the best recall and precision. I compared the scores on the test data and the training data to detect whether the model is underfitting or overfitting. The result is very good: we have a good score on both. Before I explain the hyperparameters for this algorithm, I will briefly discuss recall and precision.
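A comparison like this can be sketched with scikit-learn as a loop over the four models. The data below is synthetic (`make_classification`), and the split size, `n_estimators` and `random_state` values are assumptions for illustration, so the scores say nothing about the airline data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded survey features and the binary target.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the training split and score it on the held-out test split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "recall": recall_score(y_test, pred),
        "precision": precision_score(y_test, pred),
    }
    print(f"{name}: recall={results[name]['recall']:.2f}, "
          f"precision={results[name]['precision']:.2f}")
```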
- True positives: data points labeled as positive that are actually positive.
- False positives: data points labeled as positive that are actually negative.
- True negatives: data points labeled as negative that are actually negative.
- False negatives: data points labeled as negative that are actually positive.
Recall and Precision Metrics
- Recall: ability of a classification model to identify all relevant instances.
- Precision: ability of a classification model to return only relevant instances.
- F1 score: single metric that combines recall and precision using the harmonic mean.
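To make these definitions concrete, the short example below counts the four outcomes by hand on ten made-up predictions and derives precision, recall and F1 from them. Treating “satisfied” as the positive class is an assumption for illustration.

```python
# Made-up labels: 1 = "satisfied" (positive), 0 = "neutral or dissatisfied" (negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted positive, actually positive
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, actually negative
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted negative, actually negative
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted negative, actually positive

precision = tp / (tp + fp)  # how many predicted positives are correct
recall = tp / (tp + fn)     # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

scikit-learn's `precision_score`, `recall_score` and `f1_score` compute the same quantities from the label lists directly.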
So, we can conclude that the algorithm correctly predicts this data with a score of 93% for detecting “neutral or dissatisfied” and 96% for detecting “satisfied”. Precision and recall are extremely important for classification. While precision refers to the percentage of our results which are relevant, recall refers to the percentage of total relevant results correctly classified by our algorithm. Unfortunately, sometimes we can’t maximize both at once. In such cases a trade-off is needed, and we have to decide whether to maximize recall or precision.
Find Best Hyperparameter
Next, I will try to find the best hyperparameters for the Random Forest Classifier (the best algorithm for this modeling) using GridSearchCV. The resulting hyperparameters for the model (RandomForestClassifier) are max_depth : 25, min_samples_leaf : 0.000001, min_samples_split : 0.000001.
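A grid search of this kind might look like the sketch below. The candidate values in `param_grid`, the `f1` scoring, the 3-fold cross-validation and the synthetic data are all assumptions: the article only reports the winning values, not the grid that was searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real search would run on the encoded survey features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Hypothetical grid that simply contains the values reported in the article.
param_grid = {
    "max_depth": [5, 25],
    "min_samples_leaf": [0.000001, 0.01],   # floats are fractions of the sample count
    "min_samples_split": [0.000001, 0.01],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    scoring="f1",  # assumption: the article does not say which score was optimized
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

When `min_samples_leaf` and `min_samples_split` are given as floats, scikit-learn interprets them as fractions of the number of samples, so a value as small as 0.000001 effectively means the minimum allowed (1 sample per leaf, 2 samples per split).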
After finding the best hyperparameters, the next step is to pass them to the model to optimize the prediction of passenger satisfaction. So, let’s build the model with the best hyperparameters.
In conclusion, the model performs very well on this data, with precision, recall and f1-score of almost 100% compared with the actual data. Some parameters are not included in this model, so you may find an even better score with other parameters.
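A minimal sketch of that final step is shown below: a RandomForestClassifier is created with the reported hyperparameters, trained, and evaluated on both the training and test split. The data is synthetic and the split settings are assumptions, so the printed scores will differ from the results on the airline data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded survey features and the binary target.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters reported by the grid search in this article.
model = RandomForestClassifier(
    max_depth=25,
    min_samples_leaf=0.000001,
    min_samples_split=0.000001,
    random_state=0,
)
model.fit(X_train, y_train)

# A large gap between the train and test scores would point to overfitting.
train_f1 = f1_score(y_train, model.predict(X_train))
test_f1 = f1_score(y_test, model.predict(X_test))
print(f"train f1={train_f1:.2f}, test f1={test_f1:.2f}")

# Per-class precision, recall and f1-score on the test split.
print(classification_report(
    y_test,
    model.predict(X_test),
    target_names=["neutral or dissatisfied", "satisfied"],
))
```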
That is all from me. I hope you can take some insight from this data. There are still many mistakes and shortcomings in every model that I build. For more detail about this data, the code, and more visualizations, you can visit my GitHub at https://github.com/Anugrahn. Feel free to ask, and let’s start a discussion!
Thank you, I hope you enjoyed it. See you in the next story. Have a nice day! :)
- Passenger Satisfaction data set from kaggle.com
- Original documentation from scikit-learn
- Definitions of precision, recall and f1-score from https://towardsdatascience.com/