MachineHack Airline Price Hackathon

This article is a simple guide to preprocessing the data and building a model to predict airline prices for the hackathon.

Before we can fit a model to the data, we need to preprocess it.

Taking a quick glance at the xlsx file containing our training data, we encounter the following features:

A simple function, train.head(), lets us examine the training data in the working environment.
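As a minimal sketch of that call, the snippet below builds a tiny stand-in dataframe rather than reading the real file, since the article does not give the file path. The column names come from this article; the row values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the training data (values are invented).
# The real data would be read with pd.read_excel(...) from the xlsx file.
train = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways"],
    "Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019"],
    "Total_Stops": ["non-stop", "2 stops", "2 stops"],
    "Price": [3897, 7662, 13882],
})

# head(n) returns the first n rows; with no argument it returns five.
preview = train.head(2)
print(preview)
```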


Before we take any action on our data, we need to analyse the importance of our various features and the tasks we need to accomplish.

  • The price is correlated with the month in which the flight is scheduled, so we can extract the month from Date_of_Journey.
  • Since we are given the number of stops (Total_Stops), the Route column becomes redundant and can be dropped.
  • The price is also correlated with the departure time (Dep_Time). To capture this relation, we will split the time into two halves (‘Morning’ and ‘Evening’).
  • Since we are given the duration of the journey, we can drop Arrival_Time from our dataframe.
  • Duration is given in “h m” form and needs to be converted into the total number of minutes.
  • Total_Stops is stored as a string, so we need to extract its numerical value.

Before we start transforming the features, we need to accomplish three tasks:

  • Handle NaNs
  • Separate the feature to be predicted ( Price ) from the training data
  • Concatenate training and test data

Handling NaNs

As we can see, one entry in Route and one entry in Total_Stops are missing in our training set. Let us look closer by finding the rows that contain these NaNs.

We find that both missing values actually belong to a single row, so it is easiest to drop this row completely.
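The NaN check and the row drop can be sketched as follows. The dataframe is a made-up stand-in with one incomplete row; only the column names are taken from the article.

```python
import numpy as np
import pandas as pd

# Toy training data with one row missing both Route and Total_Stops.
train = pd.DataFrame({
    "Route": ["BLR-DEL", np.nan, "DEL-COK"],
    "Total_Stops": ["non-stop", np.nan, "1 stop"],
    "Price": [3897, 7480, 6218],
})

# Number of NaNs in each column.
nan_counts = train.isnull().sum()

# Rows that contain at least one NaN.
nan_rows = train[train.isnull().any(axis=1)]

# Drop every row that has a NaN in any column.
train = train.dropna()
```

Since both gaps sit in the same row, a plain dropna() removes exactly one row here.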

Separate Price from training data
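One way to do this, shown on made-up rows, is to copy the Price column into its own variable and then drop it from the features. The name y_train is an assumption made for this sketch.

```python
import pandas as pd

# Toy training data (values are invented).
train = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Price": [3897, 7662],
})

# Keep the target separately, then remove it from the feature dataframe.
y_train = train["Price"]
train = train.drop(columns=["Price"])
```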

Concatenating train and test

The reason for concatenating our train and test dataframes is that any preprocessing or change we make to the train dataframe must also be applied to the test dataframe. Otherwise we cannot use the ML model to predict on the test dataframe.

E.g. let's assume that our training set has ‘Indigo’ as one of the airlines but our test data has no such instance. If we apply the create_dummy function to the two dataframes separately, our test dataframe will lack the Indigo categorical feature, and the number of features in the train and test dataframes will not match (hence we cannot apply the ML model fitted on the train dataframe to our test dataframe).
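The mismatch can be reproduced on a toy example. This sketch uses pandas' get_dummies as a stand-in for the create_dummy function mentioned above (an assumption, since the article does not show that function), and keeps Airline as a regular column for simplicity.

```python
import pandas as pd

# Made-up data: 'IndiGo' appears only in the training set.
train = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"]})
test = pd.DataFrame({"Airline": ["Air India", "Air India"]})

# Encoding separately: test never sees 'IndiGo', so it gets fewer columns.
separate_train = pd.get_dummies(train)
separate_test = pd.get_dummies(test)

# Encoding after concatenation: both parts share the same columns.
n_train = len(train)  # remember where the training rows end
combined = pd.concat([train, test])
encoded = pd.get_dummies(combined)
enc_train = encoded.iloc[:n_train]
enc_test = encoded.iloc[n_train:]
```

Recording the number of training rows before concatenating makes it easy to split the two parts again once preprocessing is done.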

Now let's finally manipulate our features.

Extracting Month from Date_of_Journey
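A minimal sketch of the month extraction, assuming the dates are strings in day/month/year form (the article does not state the format). The column is overwritten with the month number so that it can later be treated as a categorical feature.

```python
import pandas as pd

# Made-up dates in an assumed dd/mm/yyyy format.
df = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019"]})

# Parse the strings, then keep only the month number (1-12).
dates = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
df["Date_of_Journey"] = dates.dt.month
```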

Resetting Index

The airline company, which is an important feature for us, is actually stored as the index of our dataframe, so we need to reset the index to turn it into a column.
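In pandas this is a single reset_index() call. The sketch below assumes the index is named Airline; if the index had no name, the new column would be called "index" and would need renaming.

```python
import pandas as pd

# Toy dataframe with the airline stored as the (named) index.
df = pd.DataFrame(
    {"Source": ["Delhi", "Kolkata"]},
    index=pd.Index(["IndiGo", "Air India"], name="Airline"),
)

# Move the index into a regular column and replace it with 0..n-1.
df = df.reset_index()
```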

Handling Departure time and dropping Arrival time
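A possible version of this step is sketched below. It assumes that Dep_Time holds "HH:MM" strings and that noon is the boundary between ‘Morning’ and ‘Evening’; neither detail is given in the article.

```python
import numpy as np
import pandas as pd

# Made-up times in an assumed 24-hour "HH:MM" format.
df = pd.DataFrame({
    "Dep_Time": ["05:50", "22:20", "12:00"],
    "Arrival_Time": ["13:15", "01:10", "14:30"],
})

# Take the hour part and bucket it: before 12 is Morning, otherwise Evening.
hours = df["Dep_Time"].str.split(":").str[0].astype(int)
df["Dep_Time"] = np.where(hours < 12, "Morning", "Evening")

# Arrival_Time is not needed because Duration is available.
df = df.drop(columns=["Arrival_Time"])
```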

Handling Duration
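The conversion to total minutes could look like this. The helper duration_to_minutes is a hypothetical name, and the sketch assumes values such as "2h 50m", "19h" or "45m", where either part may be absent.

```python
import pandas as pd

def duration_to_minutes(text: str) -> int:
    """Convert a string like '2h 50m' into a total number of minutes."""
    total = 0
    for part in text.split():
        if part.endswith("h"):
            total += int(part[:-1]) * 60  # hours to minutes
        elif part.endswith("m"):
            total += int(part[:-1])
    return total

# Made-up durations covering the three assumed shapes.
df = pd.DataFrame({"Duration": ["2h 50m", "19h", "45m"]})
df["Duration"] = df["Duration"].apply(duration_to_minutes)
```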

Handling Total Stops
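A sketch of the extraction, assuming the column contains strings such as "non-stop", "1 stop" and "2 stops" (the exact labels are not shown in the article). The helper stops_to_int is a hypothetical name.

```python
import pandas as pd

def stops_to_int(text: str) -> int:
    """Map 'non-stop' to 0 and 'N stop(s)' to N."""
    if text == "non-stop":
        return 0
    return int(text.split()[0])

# Made-up values in the assumed format.
df = pd.DataFrame({"Total_Stops": ["non-stop", "1 stop", "2 stops"]})
df["Total_Stops"] = df["Total_Stops"].apply(stops_to_int)
```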

Handling the typo in Additional_info

As seen, the "no info" value is represented both as ‘No Info’ and as ‘No info’. We can easily handle this by replacing one spelling with the other.
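The replacement can be sketched with Series.replace. Which spelling to keep is an arbitrary choice; this example keeps ‘No info’, and the other row value is made up.

```python
import pandas as pd

# Toy column containing both spellings.
df = pd.DataFrame({
    "Additional_info": ["No info", "No Info", "In-flight meal not included"],
})

# Map the variant spelling onto the one we keep.
df["Additional_info"] = df["Additional_info"].replace("No Info", "No info")
```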

Dropping Route

Creating Dummy Variables

After handling all the irregularities in our data, we can finally create dummy variables.

It's important to note here that we want dummy variables for:

  • Airline
  • Date_of_Journey
  • Source
  • Destination
  • Dep_Time
  • Additional_info
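A minimal sketch of this step using pandas' get_dummies (assumed here; the article does not show the exact call). Passing the column list explicitly matters because Date_of_Journey now holds month numbers, which would otherwise be left as a numeric column. The rows are made up.

```python
import pandas as pd

# Toy dataframe after the earlier transformations.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Date_of_Journey": [3, 5],
    "Source": ["Delhi", "Kolkata"],
    "Destination": ["Cochin", "Delhi"],
    "Dep_Time": ["Morning", "Evening"],
    "Additional_info": ["No info", "No info"],
    "Duration": [170, 1140],
})

categorical = [
    "Airline", "Date_of_Journey", "Source",
    "Destination", "Dep_Time", "Additional_info",
]

# One indicator column per category; the other columns pass through unchanged.
df = pd.get_dummies(df, columns=categorical)
```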

Converting to int type and Separating test and train

Now that we have dummy features for our categorical columns, we can finally convert the columns of our dataframe to int.
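The type conversion and the split back into train and test can be sketched as below. The names n_train, X_train and X_test are assumptions made for this example, and the split relies on the training rows having been placed first when the dataframes were concatenated.

```python
import pandas as pd

# Toy combined dataframe: the first two rows are train, the last one is test.
n_train = 2  # number of training rows, recorded before concatenation
combined = pd.DataFrame({
    "Duration": [170, 1140, 45],
    "Airline_IndiGo": [True, False, True],
})

# Convert every column (including boolean dummies) to integers.
combined = combined.astype(int)

# Split by position back into the original two parts.
X_train = combined.iloc[:n_train]
X_test = combined.iloc[n_train:]
```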


Hurrah! We have finally completed all our preprocessing and can now move on to model building.

We achieve a score of 78 (according to the contest's evaluation criteria) and a score of 90.8 on the submitted predictions (according to the leaderboard).