Flight Fare Prediction — Time Series ML Project

Skillcate AI
10 min read · Aug 28, 2022

Classic Time Series Problem — with Flask Deployment

Fare prediction is a classic time series forecasting problem: we find trends in past observations to make future predictions. Many popular flight booking websites today, including Google Flights, showcase intelligent insights on flight fare trends to help users decide the right time to book a flight ticket.

Well, in this tutorial, we are going to do something similar by building a Flight Fare Prediction App that takes travel details as input (departure date, arrival date, departure city, arrival city, stoppages, airline carrier) to predict the flight ticket price. With this, users get a better idea of what their upcoming travel would cost.

Watch the video tutorial instead

If you are more of a video person, go ahead and watch this tutorial on YouTube.

FYI, we launch new machine learning projects every week. So, make sure to subscribe to our channel to get access to all of our free ML courses. All project-related files are kept on Google Drive & GitHub.

Our Dataset

Our flight fare prediction dataset has 10,682 observations with booking details such as airline, date of journey, source, destination, route, departure time, arrival time, duration, stoppages, additional info (as applicable), and lastly the price, which is our target variable.

All our features are object variables (basically, strings), so we will perform feature engineering to convert them to numeric representations.

Our Plan-of-action

  • First up, we load our dataset & perform a series of feature engineering operations to convert our features to numeric representations
  • Then, we do feature selection, using sklearn feature importances and a VIF multicollinearity check, to finalize the features for model training
  • Finally, we train a Random Forest Regressor model for flight fare prediction, and
  • Finish up with Flask deployment to run our app in a live environment

This is our high-level project flow. So, at stage 1, we build our fare prediction machine learning model, and then, at stage 2, we build a web app and deploy it in a live environment.

Now, let’s start off with model building.

Model Building — Code Walkthrough

Now that our solution approach is clear, let’s get straight into the machine learning part of our project. All project-related files are kept on Google Drive & GitHub. The Jupyter notebook b1_fare_prediction_model.ipynb is the one we are referring to here.

Setting up working environment

Before we do anything else, let’s set up our environment and install the required dependencies.

And then, we mount Google Drive to access our project folder (if you are using Google Colab). Otherwise, use the commented code to set the working directory.
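
Here’s a minimal sketch of what that setup cell could look like (the package list and paths are assumptions; adjust them to your own setup):

```python
# install the core dependencies (package list is an assumption; add/remove as needed)
!pip install pandas numpy scikit-learn statsmodels matplotlib seaborn openpyxl flask

# if on Google Colab: mount Google Drive to reach the project folder
from google.colab import drive
drive.mount('/content/drive')

# otherwise, set your working directory locally (path is a placeholder)
# import os
# os.chdir("/path/to/your/project/folder")
```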

Load Dataset

This next section is on loading our dataset. So, let’s do it.

For minor details, like the role of set_option, I have provided explanatory comments throughout the code. So, do refer to these comments for an in-depth understanding of what’s going on.

Towards the end, we print the information on our dataset, including the data types, with dataset.info(). All our features are of the object data type, meaning they hold string values. Our target variable Price is a float.
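
As a sketch, the loading cell could look like this (the file name Data_Train.xlsx and the Excel format are assumptions; point it at wherever your copy of the dataset lives):

```python
import pandas as pd

# show all columns when printing the dataframe, instead of truncating them
pd.set_option("display.max_columns", None)

# file name and Excel format are assumptions; adjust the path to your folder
dataset = pd.read_excel("Data_Train.xlsx")

print(dataset.head())  # peek at the first few rows
dataset.info()         # all features are object; Price is numeric
```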

Missing Value Check

Now, let’s check for missing values.

We have a few missing values, which we go ahead and drop with the dropna function. Finally, we validate that the changes went through.
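
In code, that check-and-drop is just a few lines:

```python
print(dataset.isnull().sum())   # count missing values per column
dataset.dropna(inplace=True)    # drop the handful of incomplete rows
print(dataset.isnull().sum())   # re-check: every column should show zero
```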

You should see all zeros this time.

Feature Engineering — Continuous Variables

Now we come to the all-important Feature Engineering part, where we perform Exploratory Data Analysis on our dataset. In its current form, a computer cannot make sense of these features, so we need to figure out how to convert them to numeric representations.

First up, we pick the date and time object variables (Date of Journey, Departure Time, Arrival Time and Duration) to derive numeric features from, using the pandas to_datetime method. So, let’s get started.

For Date of Journey, we use the pandas to_datetime method to convert the object to a datetime data type, and then use the dt.day and dt.month attributes to extract the journey day and journey month into two new columns in our dataframe. Now, you might wonder why we are not extracting the year value here. It’s because our entire dataset is from the year 2019, so it would add no value. Once we have the day and month in the new columns journey_day and journey_month, we can drop the original Date of Journey column, as it is redundant now.
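
A minimal sketch of that step (the column name and the %d/%m/%Y date format follow this dataset; adjust if yours differ):

```python
# convert the Date_of_Journey strings to datetime, then pull out day and month
dataset["journey_day"] = pd.to_datetime(
    dataset["Date_of_Journey"], format="%d/%m/%Y").dt.day
dataset["journey_month"] = pd.to_datetime(
    dataset["Date_of_Journey"], format="%d/%m/%Y").dt.month

# the original column is redundant now, so drop it
dataset.drop("Date_of_Journey", axis=1, inplace=True)
```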

Then we do exactly the same thing for Departure Time and Arrival Time: we extract the hour and minute values into new columns in our dataframe and drop the original features. So, let’s execute them all.
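
Sketched out, it’s the same pattern twice (column names are per this dataset):

```python
# departure time: extract hour and minute, then drop the original column
dataset["dep_hour"] = pd.to_datetime(dataset["Dep_Time"]).dt.hour
dataset["dep_min"] = pd.to_datetime(dataset["Dep_Time"]).dt.minute
dataset.drop("Dep_Time", axis=1, inplace=True)

# arrival time: same treatment
dataset["arrival_hour"] = pd.to_datetime(dataset["Arrival_Time"]).dt.hour
dataset["arrival_min"] = pd.to_datetime(dataset["Arrival_Time"]).dt.minute
dataset.drop("Arrival_Time", axis=1, inplace=True)
```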

Now, let’s talk about Duration. This feature is represented in the format xh ym. However, some values have only hours or only minutes, missing the other half. To fix this, we first standardise the Duration format: we loop through Duration and check whether splitting the value gives two parts (len of the split == 2). If not, we check whether the minutes part or the hours part is missing, and add " 0m" as a suffix or "0h " as a prefix, respectively. Finally, we extract duration_hours and duration_mins from the now-standardised duration values and add them to the dataframe.
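
Here’s one way to write that standardise-then-split logic (a sketch; it assumes every value contains "h", "m", or both):

```python
# standardise every Duration value to the "xh ym" format
duration = list(dataset["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2:          # one half is missing
        if "h" in duration[i]:
            duration[i] = duration[i] + " 0m"  # minutes missing: add suffix
        else:
            duration[i] = "0h " + duration[i]  # hours missing: add prefix

# extract the numeric hour and minute parts into new columns
dataset["duration_hours"] = [int(d.split("h")[0]) for d in duration]
dataset["duration_mins"] = [int(d.split("m")[0].split()[-1]) for d in duration]
dataset.drop("Duration", axis=1, inplace=True)  # original column no longer needed
```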

This is how our dataset looks now, with new features.

Alright, now we have successfully derived numeric features from our object variables.

Feature Engineering — Categorical Variables

Next up, we have Categorical Variables Airline, Source, Destination, Route, Total_Stops and Additional_info.

Categorical variables can further be of two kinds:

  • Nominal data, where categories do not have an order, e.g., Airline. We cannot arrange airlines in any ranking order. For nominal data, we use OneHotEncoding
  • Ordinal data, where categories have an order, e.g., Total_Stops. As a user, I always prefer 0 stops the most, then 1, then 2, and so on. For ordinal data, we use LabelEncoding

With this understanding, let’s get started with Feature Engineering on the nominal features: Airline, Source & Destination. On these, we shall perform OneHotEncoding.

Let’s talk about Airline first. If we check the value_counts, we get a bunch of carriers, of which many have low value counts, like Truejet, Vistara Premium Economy, Jet Airways Business, etc. Take a look.

So, we are going to define a new category, called Other, where we put all the carriers whose value count is in double digits or less.

For this, we first define a new dataframe Airline and assign the dataset’s Airline column to it. Then, we loop through the Airline values, replacing the low-count carrier names with Other. And finally, we perform OneHotEncoding on the Airline dataframe using the pandas get_dummies method.
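
Here’s a sketch of that grouping plus encoding (the count threshold of 99 is my reading of “double digits or less”, and drop_first is a common choice to avoid a redundant dummy column):

```python
# work on a copy holding just the Airline column
Airline = dataset[["Airline"]].copy()

# find carriers whose value count is in double digits or less
counts = Airline["Airline"].value_counts()
rare_carriers = counts[counts <= 99].index

# replace the low-count carrier names with a single "Other" category
Airline["Airline"] = Airline["Airline"].replace(list(rare_carriers), "Other")

# one-hot encode with pandas get_dummies
Airline = pd.get_dummies(Airline, drop_first=True)
```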

And with this, we have our Airline dataframe transformed to OneHotEncoding representation.

Now, let’s move to the next feature, which is Source. Here are the value counts, which look fairly distributed. So, we may directly perform OneHotEncoding here. For this, we first define a new dataframe Source, and then call the pandas get_dummies method on it.
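
In code:

```python
# Source is well distributed, so we one-hot encode it directly
Source = dataset[["Source"]].copy()
Source = pd.get_dummies(Source, drop_first=True)
```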

And with this, we have our Source data frame transformed to OneHotEncoding representation.

Next up, we have Destination. Here are the value counts. If you go through these values, you will find a minor issue: we have two labels, Delhi and New Delhi, which are basically the same city.

So, we need to merge them into one. For this, we apply a similar logic to what we applied for the Airline feature. Basically, we define a new dataframe Destination, holding the single Destination column from the main dataframe, viz., dataset. Then, we loop through Destination and replace all New Delhi values with Delhi. And finally, we perform OneHotEncoding.
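
A sketch of that merge-then-encode step:

```python
# single-column dataframe for Destination
Destination = dataset[["Destination"]].copy()

# merge the duplicate labels: New Delhi and Delhi are the same city
Destination["Destination"] = Destination["Destination"].replace("New Delhi", "Delhi")

# one-hot encode
Destination = pd.get_dummies(Destination, drop_first=True)
```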

Now moving on, we are actually going to drop the next couple of features, Route and Additional_Info (see the one-liner right after this list):

  • Route is a redundant feature, as Total_Stops already captures similar information on stoppages
  • For Additional_Info, about 80% of the observations say no_info, so it doesn’t add much value
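
Dropping both is a one-liner:

```python
# Route duplicates Total_Stops, and Additional_Info is mostly "no_info"
dataset.drop(["Route", "Additional_Info"], axis=1, inplace=True)
```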

Now, Total_Stops is ordinal data, which follows an order: zero stops is better than 1 stop, which is better than 2 stops, and so on. So, for this, we perform LabelEncoding. To do this, let’s first check the value counts. Then, as part of LabelEncoding, we assign keys from 0 to 4 to the Total_Stops categories and replace the original values with these keys.
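
Here’s a sketch of that mapping (the category labels like "non-stop" and "1 stop" are what this dataset uses; check your value_counts output for the exact strings):

```python
print(dataset["Total_Stops"].value_counts())   # inspect the exact labels first

# label-encode the ordinal categories: 0 stops is best, 4 is worst
dataset["Total_Stops"] = dataset["Total_Stops"].replace({
    "non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4
})
```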

Well, with this we have completed the never-ending Feature Engineering part. As a final step, let’s concatenate the new dataframes we created (Airline, Source, Destination) with the main dataset, calling the result data_train. Separately, we also drop the original Airline, Source and Destination columns from data_train.
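
And the final assembly:

```python
# stitch the encoded dataframes onto the main dataset
data_train = pd.concat([dataset, Airline, Source, Destination], axis=1)

# the original categorical columns are encoded now, so drop them
data_train.drop(["Airline", "Source", "Destination"], axis=1, inplace=True)

print(data_train.shape)
```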

And this is the final shape: (10000, 26)

Feature Selection

Now we have 25 features to train our flight fare prediction model, excluding the target variable Price from the 26 columns. However, not all of these 25 features would be equally valuable for training. In fact, a few of the features could be highly correlated among themselves, a condition we call multicollinearity.

Now, to check for the relative importance of features and for multicollinearity, we use sklearn’s feature_importances_ and the Variance Inflation Factor (VIF), respectively.

To proceed further, let’s first define our X and y. X is all the columns barring Price, and y is the target variable Price. Then, we compute feature importance coefficients for all 25 input features and plot them on a graph.
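
A sketch of that step; here I use an ExtraTreesRegressor as one common way to get feature_importances_ (any tree-based estimator works, and random_state is an assumption for reproducibility):

```python
from sklearn.ensemble import ExtraTreesRegressor
import matplotlib.pyplot as plt

# X: every column except the target; y: the target Price
X = data_train.drop("Price", axis=1)
y = data_train["Price"]

# fit a tree ensemble purely to read off its feature importances
selector = ExtraTreesRegressor(random_state=42)
selector.fit(X, y)

# plot importances in ascending order
pd.Series(selector.feature_importances_, index=X.columns) \
    .sort_values().plot(kind="barh", figsize=(10, 8))
plt.show()
```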

So, here we have all our features listed in the ascending order of their importance in predicting Price.

Now, let’s also check for multicollinearity before we make a decision on feature selection. For this, we have written a function calc_vif, which we then call on our X.
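
A typical calc_vif looks like this (a sketch built on statsmodels’ variance_inflation_factor):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # compute the VIF of each column against all the others
    vif = pd.DataFrame()
    vif["variable"] = X.columns
    # cast to float so statsmodels can handle the one-hot (bool) columns
    values = X.values.astype(float)
    vif["VIF"] = [variance_inflation_factor(values, i)
                  for i in range(X.shape[1])]
    return vif

calc_vif(X)
```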

As a rule of thumb, a VIF value > 10 is generally a sign of multicollinearity. Here, we are getting high VIF values for journey_month and source_delhi. And if we look at the feature importance of these two variables, source_delhi has the lower importance. So we go ahead and drop source_delhi from our features, by declaring our X once again.

Now, if we compute the VIF values again, we may still find some values over 10. On the other hand, the feature importance of those variables is high too. So it’s a trade-off: dropping them reduces multicollinearity (and the associated risk of overfitting) at the cost of some model performance. I leave it to you to try further permutations in selecting your final feature set.

Model Training

Now, let’s proceed to model training. To start off, we split our dataset 80:20 into training and test sets. Then, we train our Random Forest Regressor model on the training set.
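
A minimal sketch of the split and fit (random_state is an assumption, kept fixed for reproducibility):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit the Random Forest Regressor on the training set
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train)
```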

Guys, congratulations to you for making it to this point. Do give yourself a pat on the back for training your model, all by yourself.

In this part, we also check model performance. We are getting a training R2 score of 95% and a test set R2 score of 83%, which is decent. Then we plot a scatter plot, which shows a good linear relationship between the actual test set prices and the predicted prices. For our regression model, we also compute the error metrics: Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.

Our normalized RMSE is 0.06. Its value lies on a scale of 0 to 1, and a value approaching 0 is considered good. So, in our case, the model is actually doing a fair job.
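
Here’s a sketch of those performance checks; note that normalising RMSE by the target’s range is one common convention (normalising by the mean is another):

```python
import numpy as np
from sklearn import metrics

y_pred = reg.predict(X_test)

print("Train R2:", reg.score(X_train, y_train))
print("Test R2 :", reg.score(X_test, y_test))

print("MAE :", metrics.mean_absolute_error(y_test, y_pred))
print("MSE :", metrics.mean_squared_error(y_test, y_pred))
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)

# normalized RMSE: RMSE scaled by the spread of the target
print("nRMSE:", rmse / (y_test.max() - y_test.min()))
```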

And finally, we save our model as a pkl file back to our project folder for deployment. With this, we have our Flight Fare Prediction Model ready for deployment. ✌️
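
For reference, the save itself is just a couple of lines with pickle (the file name is a placeholder; match it to whatever your app.py loads):

```python
import pickle

# serialise the trained model for the Flask app to load later
with open("flight_fare_model.pkl", "wb") as f:
    pickle.dump(reg, f)
```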

Model Deployment

Alright, now we are all set to deploy our Flight Fare Prediction App. Our GitHub repo for this Flight Fare Prediction Project has all the project-related files we have discussed till now, plus the Web App (a.k.a., app.py) & requirements.txt: the app takes the travel details through a form and returns the predicted fare. The instructions for running the Project App are documented in the readme of the project repo.
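
For orientation, here is a minimal sketch of what such an app.py can look like; the route names, template name and feature handling below are illustrative assumptions, not the repo’s exact code:

```python
import pickle
from flask import Flask, request, render_template

app = Flask(__name__)

# load the model trained in the notebook (file name is a placeholder)
model = pickle.load(open("flight_fare_model.pkl", "rb"))

@app.route("/")
def home():
    # home.html is an assumed template holding the travel-details form
    return render_template("home.html")

@app.route("/predict", methods=["POST"])
def predict():
    # in the real app, the form inputs are converted into the same
    # engineered features the model was trained on; simplified here
    features = [float(x) for x in request.form.values()]
    fare = model.predict([features])[0]
    return render_template("home.html",
                           prediction_text=f"Estimated fare: Rs. {fare:.0f}")

if __name__ == "__main__":
    app.run(debug=True)
```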

Conclusion & Future Scope

Guys, with this, we have come to the end of this tutorial. Do give a thumbs up to this write-up if you liked it. Do try it out on your end and share your experience of how it went in the comments section below.

As future scope, we may:

  • Perform hyperparameter tuning on the Random Forest Model, to boost performance (I shall be doing a tutorial on this soon),
  • Train our model on global flights data, to predict worldwide flight prices,
  • Further explore feature selection part, to minimise overfitting.

Brief about Skillcate

At Skillcate, I am on a mission to bring you application-based machine learning education. I launch new machine learning projects every week, so make sure to subscribe to my YouTube channel and hit that bell icon to get notified when new ML projects go live. To talk to me, schedule a free 1:1 session on my website skillcate.com.

Shall be back soon with a new ML project. Until then, happy learning 🤗!!
