MachineHack Airline Price Hackathon

jaswinder singh
Mar 9, 2019 · 5 min read

This article is a simple guide to preprocessing the data and building a model to predict airline prices for the hackathon.

Before we can fit a model to the data, we need to preprocess it.

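Taking a quick glance at the xlsx file containing our training data, we encounter features such as Airline, Date_of_Journey, Source, Destination, Route, Dep_Time, Arrival_Time, Duration, Total_Stops and Additional_info, along with the target, Price.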

A simple function, train.head(), helps us examine our training data in the working environment.
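
As a minimal sketch of this step, the snippet below builds a tiny stand-in for the training sheet and inspects it with head(). The rows, the date format and the file name in the comment are assumptions made for illustration, not values taken from the contest data.

```python
import pandas as pd

# Hypothetical rows standing in for the training sheet. In practice the
# data would be loaded from the xlsx file, e.g. with
# pd.read_excel("Data_Train.xlsx") (file name assumed).
train = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways"],
    "Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019"],
    "Dep_Time": ["22:20", "05:50", "09:25"],
    "Duration": ["2h 50m", "7h 25m", "19h"],
    "Total_Stops": ["non-stop", "2 stops", "2 stops"],
    "Price": [3897, 7662, 13882],
})

# head() returns the first rows (five by default) for a quick look.
print(train.head())
```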

PreProcessing

Before we take any action on our data, we need to analyse the importance of our various features and the tasks we need to accomplish:

  • The price is correlated with the month in which the flight is scheduled. Hence we can extract the month from Date_of_Journey.
  • Since we are given the number of stops (Total_Stops), the Route column becomes redundant and can be dropped.
  • The price is also correlated with the departure time (Dep_Time). To capture this relation, we will split the time into two halves (‘Morning’ and ‘Evening’).
  • Since we are given the Duration of the journey, we can drop Arrival_Time from our dataframe.
  • Duration is given in “h m” form and needs to be transformed into the total duration in minutes.
  • Total_Stops is represented as a string, so we need to extract its numerical value.

Before we start transforming the features according to these requirements, we need to accomplish three tasks:

  • Handle NaNs
  • Separate the feature to be predicted ( Price ) from the training data
  • Concatenate training and test data

Handling NaNs

Checking for missing values shows that one entry in Route and one entry in Total_Stops are missing in our training set. Let us analyze this further by finding the rows that contain these NaNs.
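
One way to run this check, sketched on a small hypothetical dataframe (the rows below are made up), is to count the NaNs per column and then select every row that contains at least one of them:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one is missing both Route and Total_Stops.
train = pd.DataFrame({
    "Route": ["BLR -> DEL", np.nan, "DEL -> COK"],
    "Total_Stops": ["non-stop", np.nan, "1 stop"],
    "Price": [3897, 7480, 13882],
})

# Number of missing values in each column.
missing_per_column = train.isnull().sum()
print(missing_per_column)

# Rows that contain at least one NaN.
rows_with_nan = train[train.isnull().any(axis=1)]
print(rows_with_nan)
```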

We find that both missing values actually belong to a single row. It is easiest to drop this row completely.
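
A minimal sketch of dropping that row with dropna(), again on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the second one contains NaNs.
train = pd.DataFrame({
    "Route": ["BLR -> DEL", np.nan, "DEL -> COK"],
    "Total_Stops": ["non-stop", np.nan, "1 stop"],
    "Price": [3897, 7480, 13882],
})

# dropna() removes every row that has at least one missing value.
train = train.dropna()
print(len(train))
```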

Separate Price from training data
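
As a sketch, the target can be stored in its own variable and then removed from the feature dataframe. The name y_train and the rows below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical training rows.
train = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Total_Stops": ["non-stop", "2 stops"],
    "Price": [3897, 7662],
})

# Keep the target separately, then drop it from the features.
y_train = train["Price"]
train = train.drop(columns=["Price"])
```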

Concatenating train and test

The reasoning behind concatenating our train and test dataframes is that any preprocessing or change we make to the train dataframe must also be applied to the test dataframe; otherwise we cannot use the ML model to predict on the test dataframe.

E.g., let us assume that our training set has ‘Indigo’ as one of the airlines but our test data has no such instance. In this scenario, if we apply the create_dummy function to the separate dataframes, our test dataframe will lack the Indigo categorical feature and the number of features in the train and test dataframes will not match (hence we cannot apply the ML model fitted on the train dataframe to our test dataframe).
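
The mismatch can be reproduced on a toy example. The sketch below uses pandas' get_dummies as a stand-in for the dummy-creation step; the two small dataframes and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'Indigo' appears only in the training set.
train = pd.DataFrame({"Airline": ["Indigo", "SpiceJet", "Indigo"]})
test = pd.DataFrame({"Airline": ["SpiceJet", "SpiceJet"]})

# Encoding separately gives different sets of columns.
separate_train = pd.get_dummies(train)
separate_test = pd.get_dummies(test)
print(list(separate_train.columns))
print(list(separate_test.columns))

# Encoding after concatenation gives both parts the same columns.
n_train = len(train)
combined = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(combined)
train_encoded = encoded.iloc[:n_train]
test_encoded = encoded.iloc[n_train:]
```

Remembering the number of training rows (n_train here) is what lets us split the combined dataframe back into its train and test parts later.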

Now let's finally manipulate our features.

Extracting Month from Date_of_Journey
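
A minimal sketch of the extraction, assuming the dates are stored as day/month/year strings (the format is an assumption, so adjust fmt if the file differs). The helper name journey_month is hypothetical.

```python
from datetime import datetime


def journey_month(date_text, fmt="%d/%m/%Y"):
    """Return the month number (1-12) of a date string in the assumed format."""
    return datetime.strptime(date_text, fmt).month


print(journey_month("24/03/2019"))
```

On a dataframe, the same helper could be applied to the whole column with Series.map, so that Date_of_Journey ends up holding only the month.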

Resetting Index

The airline company, which is an important feature for us, is actually stored as the index of our dataframe; hence we need to reset the index in order to turn it into a column.
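
A sketch of this step with reset_index(), on a hypothetical dataframe whose index is named Airline:

```python
import pandas as pd

# Hypothetical dataframe with the airline stored as the index.
df = pd.DataFrame(
    {"Source": ["Banglore", "Kolkata"], "Total_Stops": ["non-stop", "2 stops"]},
    index=pd.Index(["IndiGo", "Air India"], name="Airline"),
)

# reset_index() moves the index into a regular column and replaces it with
# 0, 1, 2, ... If the index had no name, the new column would be called
# "index" and would need renaming.
df = df.reset_index()
print(df)
```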

Handling Departure time and dropping Arrival time
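
One possible way to split the departure time into two halves is sketched below. It assumes Dep_Time is an "HH:MM" string on a 24-hour clock and uses noon as the cutoff; both are assumptions, and the helper name part_of_day is hypothetical.

```python
def part_of_day(dep_time):
    """Map an 'HH:MM' string to 'Morning' (before 12:00) or 'Evening'."""
    hour = int(dep_time.split(":")[0])
    return "Morning" if hour < 12 else "Evening"


print(part_of_day("05:50"))
print(part_of_day("22:20"))
```

The helper could be applied to the Dep_Time column with Series.map, after which Arrival_Time can be removed with DataFrame.drop.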

Handling Duration
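
A sketch of converting the "h m" form into total minutes. It assumes values look like "2h 50m", "19h" or "45m"; the helper name duration_to_minutes is hypothetical.

```python
import re


def duration_to_minutes(text):
    """Convert strings such as '2h 50m', '19h' or '45m' to total minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


print(duration_to_minutes("2h 50m"))
```

Using a regular expression for each part means a missing hour or minute component simply counts as zero.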

Handling Total Stops
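
A sketch of extracting the number of stops, assuming the strings look like "non-stop", "1 stop" or "2 stops" (the exact spellings are an assumption). The helper name stops_to_int is hypothetical.

```python
def stops_to_int(text):
    """Map 'non-stop' to 0 and strings like '2 stops' to their leading number."""
    text = text.strip().lower()
    if text == "non-stop":
        return 0
    return int(text.split()[0])


print(stops_to_int("2 stops"))
```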

Handling the typo in Additional_info

In Additional_info, the absence of information is represented both as ‘No Info’ and as ‘No info’. We can easily handle this issue by replacing one string with the other.
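
A minimal sketch of this fix with Series.replace, on hypothetical values:

```python
import pandas as pd

# Hypothetical values showing the two spellings.
df = pd.DataFrame(
    {"Additional_info": ["No info", "No Info", "In-flight meal not included"]}
)

# Replace the variant spelling so both map to a single category.
df["Additional_info"] = df["Additional_info"].replace("No Info", "No info")
print(df["Additional_info"].unique())
```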

Dropping Route

Creating Dummy Variables

After handling all the irregularities in our data, we can now finally create dummy variables.

It's important to note that we want dummy variables for:

  • Airline
  • Date_of_Journey
  • Source
  • Destination
  • Dep_Time
  • Additional_info
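
The columns listed above can be encoded in one call, sketched here with pandas' get_dummies on a few made-up rows that stand in for the cleaned, combined dataframe:

```python
import pandas as pd

# Hypothetical rows standing in for the combined, already-cleaned dataframe.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Date_of_Journey": [3, 5, 3],  # month numbers extracted earlier
    "Source": ["Banglore", "Kolkata", "Delhi"],
    "Destination": ["New Delhi", "Banglore", "Cochin"],
    "Dep_Time": ["Evening", "Morning", "Morning"],
    "Additional_info": ["No info", "No info", "In-flight meal not included"],
    "Duration": [170, 445, 1140],
    "Total_Stops": [0, 2, 2],
})

categorical = [
    "Airline", "Date_of_Journey", "Source",
    "Destination", "Dep_Time", "Additional_info",
]

# Passing columns= encodes exactly these columns (even the numeric month)
# and leaves Duration and Total_Stops untouched.
encoded = pd.get_dummies(df, columns=categorical)
print(list(encoded.columns))
```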

Converting to int type and Separating test and train

Now that we have dummy features for our categorical columns, we can finally convert the features of our dataframe to int.
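
A sketch of the conversion and the split back into train and test. The rows and n_train (the number of rows that came from the training file) are hypothetical.

```python
import pandas as pd

# Hypothetical combined dataframe: the first three rows are training rows.
combined = pd.DataFrame({
    "Duration": [170, 445, 1140, 325],
    "Total_Stops": [0, 2, 2, 1],
    "Airline": ["IndiGo", "Air India", "IndiGo", "SpiceJet"],
})
n_train = 3

# Create the dummies, then cast every column to int.
encoded = pd.get_dummies(combined, columns=["Airline"]).astype(int)

# Split by position, using the remembered number of training rows.
X_train = encoded.iloc[:n_train]
X_test = encoded.iloc[n_train:]
```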

ML MODEL

Hurrah! We have finally completed all our preprocessing and can now move on to building our model.
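
As one possible sketch of this stage (not necessarily the model behind the scores reported in this article), the snippet below fits a scikit-learn random forest regressor on synthetic data and checks it on a held-out split. The synthetic features, the hyperparameters and the score of the form 1 - RMSLE are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and prices.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)  # dummy-like columns
X[:, 0] = rng.integers(60, 1500, size=200)           # duration in minutes
y = 3000 + 4 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 100, size=200)

# Hold out a quarter of the rows to estimate the score before submitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)
pred = model.predict(X_val)

# Assumed metric: 1 - RMSLE (root mean squared logarithmic error).
rmsle = np.sqrt(mean_squared_log_error(y_val, pred))
score = 1 - rmsle
print(round(score, 3))
```

Holding out part of the training data gives a rough estimate of the score before making a submission.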

We achieve a score of 78 (according to the contest's evaluation criteria) and a score of 90.8 on the submitted predictions (according to the leaderboard data).

THANK YOU FOR READING THIS ARTICLE
