Flight Price Prediction using different Machine Learning Algorithms.

Manoj Kumal
16 min read · Jul 8, 2023


Introduction

With the growth of the transportation industry and the limited time everyone has for travelling, the airplane has become an increasingly popular way to travel. With the internet available everywhere, people can easily check, book and buy air tickets at any time. Airlines frequently change flight prices to maximize profit and fill more seats. Depending on the demand for a specific route and travel date, the flight distance, the service category and other factors, an airline may increase or decrease the ticket price. Generally, customers want to buy tickets at the lowest possible price, but they are often unsure whether to buy now or wait, especially people who do not fly frequently. To overcome this problem, we use a machine learning model to predict flight prices accurately. This can be a very useful application, helping passengers make better purchase decisions and save money.

Table of Contents

  1. Data Gathering
  2. Exploratory Data Analysis
  3. Feature Engineering
  4. Feature Selection
  5. Model Creation
  6. Model Evaluation
  7. Conclusion

Data Gathering

Data is essential for any data science project. First of all, we have to collect data for our domain. This data collection exercise often requires a domain expert; with their help we can collect more accurate data. If we have a good dataset, the machine learning model can learn more accurately and make better predictions, so the data must match our problem statement. Model performance depends on how good the data we provide to the model is. For this project, I took the data from kaggle.com. The dataset looks like this:

import pandas as pd

train_data = pd.read_excel("Data_Train.xlsx")
pd.set_option('display.max_columns', None)
train_data.head()

Exploratory Data Analysis

Exploratory data analysis is a way of visualizing, summarizing and interpreting the information hidden in the rows and columns of the data. We need to understand the various aspects of the data and what lies inside it: how many numerical and categorical features there are, how they are represented, and so on. First, we will look at the columns of our dataset.

train_data.columns

We have 10 independent features and 1 dependent feature, i.e. Price, which is our target variable. The features are self-explanatory.

train_data.dtypes

We can see that there are only two datatypes: int64 and object.

  • int64: represents an integer feature in the dataset.
  • object: represents a feature containing strings (or mixed data types).

Most of the features are of type object, which we treat as categorical because they contain strings and mixed notation. The Additional_Info feature is an unnecessary column because almost 80% of its entries carry no information, so it cannot help us draw conclusions and we can remove it.

train_data.drop(['Additional_Info'], axis=1, inplace=True)
train_data.head()

In EDA, we use different statistical tools to visualize the data, which help us reveal hidden information and maximize insight into the dataset. We can find outliers and the relationships between the independent features and the dependent feature. Now let us look at the relation between Airline and Price using a cat plot.

import seaborn as sns
import matplotlib.pyplot as plt

# Airline vs Price
sns.catplot(y="Price", x="Airline",
            data=train_data.sort_values("Price", ascending=False),
            kind="boxen", height=6, aspect=3)
plt.show()

We can easily see the relation between Airline and Price. Jet Airways Business has the highest prices, so it has the strongest impact on price. Jet Airways has the next highest prices, so it also has a strong impact, though less than Jet Airways Business. Trujet has the lowest prices, so it has the least impact on price. Next, the relation between Price and Source using a catplot:

# Price vs Source
sns.catplot(y = "Price", x = "Source",data= train_data.sort_values(
"Price", ascending=False), kind = "boxen", height = 4, aspect = 3)
plt.show()

Flights departing from Bangalore have the highest prices, so that source has the most significant impact on price, while flights from Chennai have the lowest prices and the least significant impact.

Feature Engineering

These are the feature engineering techniques applied in this project:

  1. Handling Missing values
  2. Handling Outliers
  3. Log Transform
  4. Extracting Date
  5. One hot Encoding
  6. Feature scaling
  7. Splitting the data for train and test

Handling Missing values

Missing values are one of the most common problems we face when preparing data for machine learning. The reasons for missing values might be human error, interruptions in the data flow, privacy concerns, and so on. Whatever the reason, missing values affect the performance of machine learning models. The simplest solution is to drop the affected rows or columns, but that loses information and data; it is only reasonable when the percentage of missing values in a column is high (say 70%) compared to the whole dataset. We can also replace missing values with the mean, median or mode.

#looking for Null Values
train_data.isnull().sum()
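
If only a handful of rows contain missing values, dropping them is harmless; otherwise imputation is the safer option. A minimal sketch of both options discussed above (the imputation lines use illustrative column names and are not code from the original notebook):

# Option 1: drop the few rows that contain any missing value
train_data.dropna(inplace=True)

# Option 2: impute instead of dropping (illustrative; pick mean, median or mode per column)
# train_data['Price'] = train_data['Price'].fillna(train_data['Price'].median())
# train_data['Airline'] = train_data['Airline'].fillna(train_data['Airline'].mode()[0])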

Handling Outliers

We can detect outliers with a simple statistical method using percentiles, by assuming that a certain percentage of values from the top or bottom are outliers. A common mistake when using percentiles: the top 5% is not the values between 96 and 100; it means the values that fall outside the 95th percentile of the data. Another option for handling outliers is to cap them instead of dropping them. This keeps the data size and, in the end, may be better for the final model performance.
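
As a rough sketch of the capping idea (the 95th-percentile threshold here is only an illustrative assumption, not a value used in this project):

import numpy as np

# Cap Price values above the 95th percentile instead of dropping those rows
upper_limit = train_data['Price'].quantile(0.95)
train_data['Price'] = np.where(train_data['Price'] > upper_limit,
                               upper_limit, train_data['Price'])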

Log Transform

Logarithm transformation is one of the most commonly used mathematical transformations in feature engineering. It helps to handle skewed (left-skewed or right-skewed) data: after the transformation, the distribution becomes closer to a normal (Gaussian) distribution. It also decreases the effect of outliers by normalizing magnitude differences, making the model more robust.

import numpy as np
import scipy.stats as stats

def diagnostic_plot(train_data, variable):
    # function to plot a histogram and a Q-Q plot
    # side by side, for a given variable
    plt.figure(figsize=(16, 9))
    plt.subplot(1, 2, 1)
    train_data[variable].hist()

    plt.subplot(1, 2, 2)
    stats.probplot(train_data[variable], dist='norm', plot=plt)
    plt.show()

train_data['Log_Price'] = np.log(train_data['Price'] + 1)
diagnostic_plot(train_data, 'Log_Price')

Extracting Date

Date columns usually provide very useful information about the target. However, dates can appear in numerous formats, which makes them hard for algorithms to understand, even when simplified to a format like "24/03/2019". So we apply different kinds of preprocessing to dates:

  • Extracting the parts of the date into different columns: Year, month, day, etc.
  • Extracting the time period between the current date and columns in terms of years, months, days, etc.
  • Extracting some specific features from the date: Name of the weekday, Weekend or not, holiday or not, etc.

Date_of_Journey:

In the column 'Date_of_Journey', the date is given in dd/mm/yyyy format and the datatype is object. There are two ways to tackle this column: either convert it into a Timestamp or split it into day, month and year. Here, I am splitting the column.

train_data['Journey_day'] = pd.to_datetime(train_data['Date_of_Journey'], format = '%d/%m/%Y').dt.day
train_data['Journey_month'] = pd.to_datetime(train_data['Date_of_Journey'], format = '%d/%m/%Y').dt.month
train_data.drop(['Date_of_Journey'],axis=1,inplace=True)

Arrival_Time:

The column 'Arrival_Time' contains a combination of time and date, but we only need the time details, so we split the time into hours and minutes.

# Arrival time is when the plane pulls up to the gate.
# Similar to Date_of_Journey we can extract values from Arrival_Time
train_data['Arrival_hour'] = pd.to_datetime(train_data['Arrival_Time']).dt.hour
train_data['Arrival_minute'] = pd.to_datetime(train_data['Arrival_Time']).dt.minute
train_data.drop(['Arrival_Time'],axis=1,inplace=True)

Dep_Time:

As with 'Arrival_Time', we split this column into hour and minute.

# Departure time is when a plane leaves the gate.
# Similar to Date_of_Journey we can extract values from Dep_Time
train_data['Dep_hour'] = pd.to_datetime(train_data['Dep_Time']).dt.hour
train_data['Dep_min'] = pd.to_datetime(train_data['Dep_Time']).dt.minute

Dummy Variable Encoding

Our dataset may contain categorical features that a computer cannot work with directly, so we have to convert them into numerical features. Categorical features contain string data. There are two types of categorical features in our dataset: ordinal and nominal. Nominal data have no inherent ordering, while ordinal data have a priority ordering between the values. There are many ways to handle categorical features; in this case we use dummy variable encoding, which represents n categories with n-1 binary values. First, we convert the nominal features. There are three nominal categorical features in our dataset: Airline, Source and Destination.

## As Airline is Nominal Categorical data we will perform OneHotEncoding
Airline = train_data['Airline']
Airline = pd.get_dummies(Airline, drop_first=True)
Airline.head()

For each level of a categorical feature we create a new binary variable. The problem with one hot encoding is that it creates redundancy, because it adds an additional feature for every category in the original feature. Tree-based models do not perform well with one hot encoding when a feature has too many unique values, because they pick a subset of features while splitting the data; with many unique values, the chosen features will be mostly zero and will not produce significant splits. For that reason one hot encoding is often avoided when training tree-based models, while linear models do not suffer from this problem.

Source = train_data['Source']
Source = pd.get_dummies(Source, drop_first=True)
Source.head()
Destination = train_data['Destination']
Destination = pd.get_dummies(Destination, drop_first=True)
Destination.head()

Ordinal data have a priority ordering between the values, and an ordinal encoding maps each unique label to an integer value. This type of encoding is only appropriate when there is a known relationship between the categories. Total_Stops is an ordinal variable, and it is closely related to Route. The 'Route' column tells us which cities the flight passes through between source and destination; this is important because the route taken directly affects the price of the flight.

# As Total_Stops is an ordinal categorical feature, we map each label
# to its corresponding integer value (an ordinal encoding)

train_data.replace({'non-stop':0, '1 stop': 1, '2 stops':2, '3 stops': 3, '4 stops': 4}, inplace=True)
train_data.head()
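
Before modelling, the dummy-encoded frames have to be joined back to the main frame and the target separated out. A minimal sketch, assuming the frames created above; the exact list of columns to drop depends on which raw string columns are still present:

# Combine the numeric columns with the dummy-encoded Airline, Source and Destination
data_train = pd.concat([train_data, Airline, Source, Destination], axis=1)

# Drop the original categorical columns (and any other remaining raw string columns)
data_train.drop(['Airline', 'Source', 'Destination', 'Route'], axis=1, inplace=True)

# Separate the independent features (X) from the target (y)
X = data_train.drop(['Price'], axis=1)
y = data_train['Price']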

Feature Scaling

In most cases, the numerical features of a dataset do not share the same range. So we may need to scale our features. This step is not mandatory for many algorithms, such as tree-based algorithms, but it can still be nice to apply. However, algorithms based on distance calculations, such as k-NN or k-means, need scaled continuous features as model input.

  • Standardization: scales the values using the mean and standard deviation, so features with different standard deviations (and therefore different ranges) end up on a comparable scale; it also reduces the effect of outliers. In the formula below, the mean is shown as μ and the standard deviation as σ: z = (x-μ)/σ
  • Normalization: normalization (or min-max scaling) rescales all values into a fixed range between 0 and 1. This transformation does not change the shape of the distribution, but because the standard deviation decreases, the effect of outliers increases; it is therefore recommended to handle outliers before normalizing. Xnorm = (X-Xmin)/(Xmax-Xmin)

In this project a RobustScaler is applied to the train/test split created in the next section; it scales with the median and interquartile range, which makes it less sensitive to outliers.

from sklearn.preprocessing import RobustScaler
sc = RobustScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
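
If plain standardization or min-max normalization were preferred instead of the RobustScaler, the equivalent scikit-learn calls would be drop-in replacements. A minimal sketch:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: z = (x - mean) / std
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)

# Normalization: rescales every feature into the [0, 1] range
minmax_scaler = MinMaxScaler()
X_train_norm = minmax_scaler.fit_transform(X_train)
X_test_norm = minmax_scaler.transform(X_test)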

Splitting the data

The fundamental goal of an ML model is to make accurate predictions on future data instances beyond those used for training. Before using an ML model to make predictions, we need to evaluate its predictive performance. To estimate the quality of a model's predictions on data it has not seen, we split our data into training and testing sets. This allows us to evaluate the model on unseen data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.2, random_state = 42)

Feature Selection

Feature selection is the process of automatically selecting the features that contribute most to our model. Having irrelevant features in our data does not increase the accuracy of the model, so we want to remove features that have no significant effect. We can use the seaborn library to plot a heat map and study the correlation between features.

# Finds correlation between Independent and dependent attributes
plt.figure(figsize = (18,18))
sns.heatmap(train_data.corr(), annot = True, cmap = "RdYlGn")
plt.show()

We see that most of the features are not highly correlated with each other. Total_Stops and Price are correlated, and Total_Stops and Duration are also highly correlated. When two features are perfectly correlated, one of them adds no additional information, so removing either does not affect the accuracy of the model. Now we use an Extra Trees Regressor for feature selection; these are the top 10 important features of the dataset.

# Important feature using ExtraTreesRegressor

from sklearn.ensemble import ExtraTreesRegressor
selection = ExtraTreesRegressor()
selection.fit(X, y)
print(selection.feature_importances_)

feat_importances = pd.Series(selection.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

Total_Stops is one of the most important features, and it is highly correlated with Price.

Model Creation

Choosing the right algorithm is also an important phase of a data science project. Since our project is a regression problem, we can choose from a variety of regression algorithms. Some of the algorithms we have used are listed below:

  1. Linear regression
  2. K Nearest neighbors
  3. Random Forest
  4. Decision tree
  5. Xg boost regressor

Linear regression

Linear regression is a machine learning algorithm based on supervised learning. Regression models a target value based on independent variables, and it is mostly used for finding relationships between variables and for forecasting. Different regression models differ in the kind of relationship they assume between the dependent and independent variables. Linear regression predicts the dependent variable value (y) from given independent variables (x), so this technique finds a linear relationship between x (input) and y (output); hence the name linear regression.

As a typical illustration, X (input) could be work experience and Y (output) the salary of a person; the straight line through the data points is the best-fit line. The main goal of simple linear regression is to consider the given data points and find the best-fit line for the model.
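
A minimal sketch of fitting a linear regression on the prepared data, assuming the X_train/X_test split created earlier:

from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)        # learn the best-fit coefficients
y_pred = lr_model.predict(X_test)     # predict fares for unseen flights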

K Nearest neighbors

The KNN algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used for both classification and regression problems. KNN is a non-parametric learning algorithm because it makes no assumptions about the data. It uses feature similarity to predict the value of a new data point: it computes the distance between the new point and every training point (typically the Euclidean distance) and then selects the k closest instances. For classification, the most frequent label among those neighbours becomes the predicted class; for regression, as here, the neighbours' target values are averaged.

KNN takes little time to train, and its training time is essentially constant, but it takes more time at prediction, and the prediction time depends on the size of the dataset. The algorithm can be computationally expensive because we need to store all training examples and compute the distance to every one of them for a single prediction.
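
A minimal sketch with scikit-learn; the value of k is an illustrative choice, and the scaled features from the RobustScaler step are used because KNN is distance-based:

from sklearn.neighbors import KNeighborsRegressor

knn_model = KNeighborsRegressor(n_neighbors=5)   # k = 5 is only an example value
knn_model.fit(X_train_sc, y_train)               # scaled features keep distances comparable
y_pred = knn_model.predict(X_test_sc)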

Random Forest

Random forest is an ensemble learning method used for both classification and regression. It is one of the most used algorithms because of its high accuracy. The main idea of a random forest is to build multiple decision trees and merge them to get a more accurate and stable prediction. First, the algorithm selects random samples from the given dataset. Next, it constructs a decision tree for every sample and obtains a prediction from each tree. Then the predictions are combined, by voting for classification or averaging for regression, and the combined result becomes the final prediction.

Over-fitting, the main problem of a single decision tree, is reduced by the random forest: instead of one tree, it creates many trees and averages/combines their results. It also works well with large datasets, but its complexity is high, and more computational resources are required to train a random forest.
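
A minimal sketch; the number of trees is an illustrative choice rather than a tuned value:

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)        # each tree is trained on a bootstrap sample
y_pred = rf_model.predict(X_test)     # predictions are averaged across all trees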

Decision Tree

Decision trees are a non-parametric supervised learning method used for both classification and regression. A decision tree builds its model in the form of a tree structure and can handle both categorical and numerical data. It iteratively splits the data on the most dominant attribute: it calculates the entropy and information gain of each attribute, finds the most dominant one, and puts it at the root of the tree. Entropy and gain are then calculated again among the remaining attributes, and the procedure continues until a decision is reached for each branch.

A decision tree looks like a simple chain of if-else statements. It can handle non-linear relationships efficiently and has many benefits. However, its main problem is over-fitting, because by default the tree grows to full depth; it also does not work well with very large datasets.
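
A minimal sketch; capping max_depth is one simple way to limit the over-fitting mentioned above (the depth value is illustrative):

from sklearn.tree import DecisionTreeRegressor

dt_model = DecisionTreeRegressor(max_depth=8, random_state=42)  # depth cap curbs over-fitting
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_test)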

Xg boost regressor

XGBoost is a supervised machine learning algorithm whose name stands for Extreme Gradient Boosting. It is used for both classification and regression problems. It constructs binary trees and is a highly efficient algorithm, popular in competitions because of its performance. It uses boosting, which reduces the errors made by previous models by giving more weight to the samples they got wrong. It supports parallel processing and can run on both single machines and distributed systems such as Spark and Hadoop. Its working mechanism is similar to gradient boosting: the base learner reads the data and assigns equal weight to each observation; mis-predicted samples are then passed to the next base learner with higher weight, and this repeats until the ensemble predicts the output well.
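
A minimal sketch using the xgboost library; the hyperparameter values are illustrative, not tuned:

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)       # each new tree corrects the errors of the previous ones
y_pred = xgb_model.predict(X_test)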

Model Evaluation

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute values of the errors. In equation form:

MAE = (1/n) Σ |yᵢ - ŷᵢ|

Mean Squared Error (MSE) is the mean of the squared errors. In equation form:

MSE = (1/n) Σ (yᵢ - ŷᵢ)²

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

RMSE = √( (1/n) Σ (yᵢ - ŷᵢ)² )

Comparing these metrics:

  • MAE is the easiest to understand, because it is the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

R-Squared

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination (or the coefficient of multiple determination for multiple regression). R-squared comes with an inherent problem: additional input variables make R-squared stay the same or increase, due to how it is calculated mathematically. Therefore, even if the additional input variables show no relationship with the output variable, R-squared will increase.

Adjusted R-Squared

R-squared measures the proportion of the variation in the dependent variable (Y) explained by the independent variables (X) in a linear regression model. Adjusted R-squared adjusts this statistic for the number of independent variables in the model. R² shows how well the terms (data points) fit a curve or line; adjusted R² does the same but penalizes the number of terms. If you add more and more useless variables to a model, adjusted R-squared will decrease; if you add useful variables, it will increase.
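
A minimal sketch of computing both scores; the adjusted value uses the standard formula with the number of test samples and features, and is not code from the original project:

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
n, k = X_test.shape                                   # number of samples and features
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalizes useless extra features
print("R-squared:", r2, "Adjusted R-squared:", adjusted_r2)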

Evaluation Metrics for Linear Regression

from sklearn import metrics
print("MAE", metrics.mean_absolute_error(y_test,y_pred))
print("MSE", metrics.mean_squared_error(y_test,y_pred))
print("RMSE", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

The R-Squared Score is 62%

Evaluation metrics for KNeighborsRegressor

from sklearn import metrics
print("MAE", metrics.mean_absolute_error(y_test,y_pred))
print("MSE", metrics.mean_squared_error(y_test,y_pred))
print("RMSE", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

The R-Squared Score is 47%

Evaluation metrics for Random Forest regressor

from sklearn import metrics
print("MAE", metrics.mean_absolute_error(y_test,y_pred))
print("MSE", metrics.mean_squared_error(y_test,y_pred))
print("RMSE", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

The R-Squared Score is 81%

Evaluation metrics for Decision Tree regressor

from sklearn import metrics
print("MAE", metrics.mean_absolute_error(y_test,y_pred))
print("MSE", metrics.mean_squared_error(y_test,y_pred))
print("RMSE", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

The R-Squared Score is 74%

Evaluation metrics for Xgboost regressor

from sklearn import metrics
print("MAE", metrics.mean_absolute_error(y_test,y_pred))
print("MSE", metrics.mean_squared_error(y_test,y_pred))
print("RMSE", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

The R-Squared Score is 84%

Conclusion

In this article, a machine learning model is developed to predict airline fares. In this type of problem, feature engineering is the most crucial part. You can see how we handled the categorical and numerical data and how we built different ML models on the same dataset. We also checked the RMSE score of each model so that we can understand how it should perform on the test dataset. With the help of the above techniques, the proposed model is able to predict the flight fare with an R-squared score of 84.59%. However, there is still room for improvement. In the future, the model could predict flight fares more accurately if we had additional information such as seat location, when the ticket was booked, or special occasions on the departure date.

Links

The code for the project can be found Here

The original dataset can be found Here
