MachineHack Airline Price Hackathon
This article is a simple guide to preprocessing the data and building a model to predict airline ticket prices for the hackathon.
Before we can fit a model to the data, we need to preprocess it.
Taking a quick glance at the xlsx file containing our training data, we encounter the following features:
A simple call to train.head() lets us examine the training data in our working environment.
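As a minimal sketch of that first look, the snippet below builds a tiny stand-in for the training sheet and previews it. The two rows, their values and the exact column formats are invented for illustration; only the column names come from this article.

```python
import pandas as pd

# Hypothetical stand-in for the training sheet: all values are invented.
train = pd.DataFrame(
    {
        "Date_of_Journey": ["24/03/2019", "01/05/2019"],
        "Source": ["Banglore", "Kolkata"],
        "Destination": ["New Delhi", "Banglore"],
        "Route": ["BLR -> DEL", "CCU -> IXR -> BBI -> BLR"],
        "Dep_Time": ["22:20", "05:50"],
        "Arrival_Time": ["01:10 22 Mar", "13:15"],
        "Duration": ["2h 50m", "7h 25m"],
        "Total_Stops": ["non-stop", "2 stops"],
        "Additional_info": ["No info", "No info"],
        "Price": [3897, 7662],
    },
    index=pd.Index(["Indigo", "Air India"], name="Airline"),
)

preview = train.head()  # first five rows (here, both rows)
print(preview)
print(train.shape)      # (rows, columns)
```

With the real data, train would come from reading the xlsx file (for example with pd.read_excel) instead of being typed in by hand.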
Preprocessing
Before we take any action on our data, we need to analyse the importance of our various features and the tasks we need to accomplish:
- The price has a correlation with the month in which the flight is scheduled. Hence we can extract the month from Date_of_Journey
- Since we are given the number of stops (Total_Stops), the Route column is redundant and can be dropped
- The price also has a correlation with the departure time (Dep_Time). To capture this relation, we will split the time into two halves (‘Morning’ and ‘Evening’)
- Since we are given the Duration of the journey, we can drop Arrival_Time from our dataframe
- Duration is given in “h m” form, which needs to be transformed into the total duration in minutes
- Total_Stops is stored as a string, so we need to extract its numerical value
Before we start transforming the features according to these requirements, we need to accomplish three tasks:
- Handle NaNs
- Separate the feature to be predicted ( Price ) from the training data
- Concatenate training and test data
Handling NaNs
As we can see, one entry in Route and one entry in Total_Stops are missing in our training set. Let us analyze this better by finding the rows that contain the NaNs.
We find that both missing values belong to a single row, so it is easiest to drop this row completely.
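A minimal sketch of this check, on an invented three-row frame in which one row lacks both Route and Total_Stops (the values are made up, not taken from the contest data):

```python
import pandas as pd

# Invented sample: the second row is missing both Route and Total_Stops.
train = pd.DataFrame(
    {
        "Route": ["BLR -> DEL", None, "DEL -> BOM -> COK"],
        "Total_Stops": ["non-stop", None, "1 stop"],
        "Price": [3897, 7480, 13882],
    }
)

# Count missing values per column.
print(train.isnull().sum())

# Show every row that contains at least one NaN.
rows_with_nan = train[train.isnull().any(axis=1)]
print(rows_with_nan)

# Both NaNs sit in the same row, so dropping that row removes them all.
train = train.dropna()
```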
Separate Price from training data
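One way to do this, sketched on an invented two-row frame: keep Price in its own Series and drop it from the feature dataframe. The variable name price is our own choice for illustration.

```python
import pandas as pd

# Invented sample rows.
train = pd.DataFrame(
    {
        "Source": ["Banglore", "Kolkata"],
        "Total_Stops": ["non-stop", "2 stops"],
        "Price": [3897, 7662],
    }
)

# Keep the target in its own Series, then remove it from the features.
price = train["Price"]
train = train.drop(columns=["Price"])
```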
Concatenating train and test
The reasoning behind concatenating our train and test dataframes is that any preprocessing or change we make on the train dataframe needs to be applied to the test dataframe as well. Otherwise we cannot use the ML model to predict on the test dataframe.
E.g. let's assume that our training set has ‘Indigo’ as one of the airlines but our test data has no such instance. In this scenario, if we apply the create_dummy function to the separate dataframes, our test dataframe will lack the Indigo categorical feature and there will be a mismatch in the number of features between the train and test dataframes (hence we cannot apply the ML model fitted on the train dataframe to our test dataframe).
Now let's finally manipulate our features.
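The mismatch can be reproduced on a toy example. The sketch below assumes the dummy encoding is done with pandas' get_dummies (our assumption; the article only names a create_dummy function) and uses invented airline values.

```python
import pandas as pd

# Toy data: 'Indigo' appears in train but never in test.
train = pd.DataFrame({"Airline": ["Indigo", "Air India", "Indigo"]})
test = pd.DataFrame({"Airline": ["Air India", "Air India"]})

# Encoding separately gives a different number of columns.
separate_train = pd.get_dummies(train)  # Airline_Air India, Airline_Indigo
separate_test = pd.get_dummies(test)    # Airline_Air India only

# Encoding the concatenated frame keeps the columns aligned.
n_train = len(train)  # remember where train ends so we can split later
combined = pd.concat([train, test], ignore_index=True)
combined_dummies = pd.get_dummies(combined)
train_enc = combined_dummies.iloc[:n_train]
test_enc = combined_dummies.iloc[n_train:]

print(separate_train.shape[1], separate_test.shape[1])
print(list(train_enc.columns) == list(test_enc.columns))
```

Remembering the number of training rows before concatenating is what lets us split the combined frame back into train and test afterwards.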
Extracting Month from Date_of_Journey
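A minimal sketch of the extraction, assuming the dates are stored as day/month/year strings (the exact format is an assumption, and the sample dates are invented):

```python
import pandas as pd

# Invented sample dates in an assumed day/month/year format.
df = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "01/05/2019", "12/06/2019"]})

# Parse the strings and keep only the month number in the same column.
parsed = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
df["Date_of_Journey"] = parsed.dt.month

print(df["Date_of_Journey"].tolist())
```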
Resetting Index
The airline company, which is an important feature for us, is actually stored as the index of our dataframe. Hence we need to reset the index in order to turn it into a column.
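A small sketch of the reset, on an invented frame whose index holds the airline names:

```python
import pandas as pd

# Invented sample where the airline sits in the index rather than in a column.
df = pd.DataFrame(
    {"Source": ["Banglore", "Kolkata"]},
    index=pd.Index(["Indigo", "Air India"], name="Airline"),
)

# reset_index moves the index into a regular column and restores 0..n-1.
df = df.reset_index()
print(df.columns.tolist())
```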
Handling Departure time and dropping Arrival time
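A possible sketch of this step. It assumes Dep_Time is an "HH:MM" string and uses noon as the boundary between the two halves; both the format and the cutoff are our assumptions, and the sample rows are invented.

```python
import pandas as pd

# Invented sample; Dep_Time is assumed to be an "HH:MM" string.
df = pd.DataFrame(
    {
        "Dep_Time": ["05:50", "22:20", "12:00"],
        "Arrival_Time": ["13:15", "01:10 22 Mar", "14:30"],
    }
)

# Take the hour part and map it to one of two halves (assumed cutoff: noon).
hours = df["Dep_Time"].str.split(":").str[0].astype(int)
df["Dep_Time"] = hours.map(lambda h: "Morning" if h < 12 else "Evening")

# Arrival_Time is implied by departure time plus Duration, so drop it.
df = df.drop(columns=["Arrival_Time"])
print(df)
```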
Handling Duration
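The conversion to total minutes can be sketched as below. The helper duration_to_minutes is a hypothetical name of our own, and the sample strings (including entries with only hours or only minutes) are invented.

```python
import re

import pandas as pd


def duration_to_minutes(text):
    """Convert strings such as '2h 50m', '19h' or '5m' to total minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


# Invented sample durations.
df = pd.DataFrame({"Duration": ["2h 50m", "19h", "5m"]})
df["Duration"] = df["Duration"].apply(duration_to_minutes)
print(df["Duration"].tolist())
```

Searching for the hour and minute parts separately keeps the helper working when one of the two is absent.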
Handling Total Stops
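A minimal sketch of extracting the number, assuming the labels look like "non-stop", "1 stop" and "2 stops" (the exact labels are an assumption, and stops_to_int is a hypothetical helper name):

```python
import pandas as pd


def stops_to_int(text):
    """Map 'non-stop' to 0 and 'N stop(s)' to N."""
    if text == "non-stop":
        return 0
    return int(text.split()[0])


# Invented sample labels.
df = pd.DataFrame({"Total_Stops": ["non-stop", "1 stop", "2 stops"]})
df["Total_Stops"] = df["Total_Stops"].apply(stops_to_int)
print(df["Total_Stops"].tolist())
```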
Handling typo error in Additional_info
As seen, the absence of information is represented both as ‘No Info’ and as ‘No info’. We can easily handle this issue by replacing one spelling with the other.
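A short sketch of the replacement on invented sample values. Which spelling is kept is arbitrary; here we keep ‘No info’.

```python
import pandas as pd

# Invented sample with both spellings present.
df = pd.DataFrame(
    {"Additional_info": ["No info", "No Info", "In-flight meal not included"]}
)

# Collapse the two spellings into one category.
df["Additional_info"] = df["Additional_info"].replace("No Info", "No info")
print(df["Additional_info"].value_counts())
```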
Dropping Route
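Dropping the column is a one-liner; the frame below is an invented stand-in:

```python
import pandas as pd

# Invented sample: Route carries no information beyond Total_Stops here.
df = pd.DataFrame(
    {"Route": ["BLR -> DEL", "DEL -> BOM -> COK"], "Total_Stops": [0, 1]}
)

df = df.drop(columns=["Route"])
print(df.columns.tolist())
```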
Creating Dummy Variables
After handling all the irregularities in our data, we can now finally create dummy variables.
It's important to note here that we want dummy variables for:
- Airline
- Date_of_Journey
- Source
- Destination
- Dep_Time
- Additional_info
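The encoding of the columns listed above can be sketched as follows, assuming pandas' get_dummies is used (our assumption) and with two invented, already preprocessed rows:

```python
import pandas as pd

# Invented rows in their already preprocessed form.
df = pd.DataFrame(
    {
        "Airline": ["Indigo", "Air India"],
        "Date_of_Journey": [3, 5],  # month numbers
        "Source": ["Banglore", "Kolkata"],
        "Destination": ["New Delhi", "Banglore"],
        "Dep_Time": ["Evening", "Morning"],
        "Additional_info": ["No info", "No info"],
        "Duration": [170, 445],
        "Total_Stops": [0, 2],
    }
)

categorical = [
    "Airline",
    "Date_of_Journey",
    "Source",
    "Destination",
    "Dep_Time",
    "Additional_info",
]

# Naming the columns explicitly also encodes the numeric month as a category.
df = pd.get_dummies(df, columns=categorical)
print(df.columns.tolist())
```

Passing the column list explicitly matters for Date_of_Journey: it now holds month numbers, which would otherwise be left as a single numeric column.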
Converting to int type and Separating test and train
Now that we have dummy features for our categorical columns, we can finally convert the columns of our dataframe to int.
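A minimal sketch of both steps on an invented combined frame. n_train stands for the number of training rows recorded before concatenation; the name and the values are hypothetical.

```python
import pandas as pd

# Invented combined frame: the first n_train rows came from the training data.
n_train = 2
combined = pd.DataFrame(
    {
        "Duration": [170, 445, 1140],
        "Total_Stops": [0, 2, 1],
        "Airline_Air India": [False, True, False],
        "Airline_Indigo": [True, False, True],
    }
)

# Cast every column (including boolean dummies) to integers.
combined = combined.astype(int)

# Split back into the original train and test parts.
train = combined.iloc[:n_train]
test = combined.iloc[n_train:].reset_index(drop=True)
print(train.shape, test.shape)
```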
ML MODEL
Hurrah! We have finally completed all our preprocessing and can now move on to building our model.
We achieve a score of 78 (according to the contest's evaluation criteria) and a score of 90.8 on the submitted predictions (according to the leaderboard data).
THANK YOU FOR READING THIS ARTICLE