Flight Ticket Price Prediction with Machine Learning and EDA

Aishwarya Patnaik
5 min read · Aug 7, 2023


🌐✈️ Traveling on a budget? Explore the magic of Machine Learning! 🚀 Discover how we predict flight ticket prices using 13,354 records from Kaggle. 📊 With a powerful Random Forest Regressor model, savvy travelers can make informed decisions and snag the ultimate deals! 🎫✨

Buckle up and embark on budget-friendly adventures! 🌍🧳

Get the Kaggle dataset used here

“Charting Flights: Navigating Ticket Prices to New Horizons 🛩️”

1. Importing Libraries and Loading the Data

To begin our journey, we load the necessary libraries and board our dataset, which contains vital information such as the airline, source, destination, journey date, total stops, and ticket prices.

# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Load the flight fare dataset
Train_data = pd.read_excel("data/Data_Train.xlsx")
Test_data = pd.read_excel("data/Test_set.xlsx")
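# DataFrame.append was removed in pandas 2.0, so we combine the two sets with pd.concat
flight_df = pd.concat([Train_data, Test_data], ignore_index=True, sort=False)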

2. Feature Engineering: Preparing for Takeoff

Smooth travel requires preparation, so we extract the journey day and month, turn total stops into numbers, and convert flight duration into minutes. Additionally, we encode the categorical columns with LabelEncoder.

Extracting Journey Date and Month

# Extract day and month from Date_of_Journey feature
flight_df["Journey_date"] = flight_df["Date_of_Journey"].str.split("/").str[0].astype(int)
flight_df["Journey_month"] = flight_df["Date_of_Journey"].str.split("/").str[1].astype(int)

# Now Date_of_Journey column is no longer required, so we can drop it.
flight_df = flight_df.drop(["Date_of_Journey"], axis=1)
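The same day and month can also be read from a parsed datetime instead of splitting strings. This is a minimal sketch, assuming the dates follow the day/month/year layout that the split above implies; the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the same day/month/year layout as Date_of_Journey
sample = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# Parse once, then read day and month from the datetime accessor
parsed = pd.to_datetime(sample["Date_of_Journey"], format="%d/%m/%Y")
sample["Journey_date"] = parsed.dt.day
sample["Journey_month"] = parsed.dt.month

print(sample)
```

Parsing also fails loudly on a malformed date, which a plain string split would let through.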

Handling Total Stops

# Total_Stops mixes text such as "non-stop" with stop counts, so we keep the leading token and convert it to a number.
flight_df["Total_Stops"] = flight_df["Total_Stops"].str.split(" ").str[0]
flight_df["Total_Stops"] = flight_df["Total_Stops"].replace("non-stop", "0")
flight_df["Total_Stops"] = pd.to_numeric(flight_df["Total_Stops"])

Extracting Duration in Minutes

# Extracting hours and minutes from Duration
# A regex copes with entries that have only hours or only minutes, which a positional split would mishandle
flight_df["Duration_hr"] = flight_df["Duration"].str.extract(r"(\d+)\s*h", expand=False)
flight_df["Duration_min"] = flight_df["Duration"].str.extract(r"(\d+)\s*m", expand=False)

# A missing part counts as zero; convert the extracted duration values to integers
flight_df["Duration_hr"] = flight_df["Duration_hr"].fillna("0").astype(int)
flight_df["Duration_min"] = flight_df["Duration_min"].fillna("0").astype(int)

# Convert the duration to a single column in minutes
flight_df["Duration"] = (flight_df["Duration_hr"] * 60) + flight_df["Duration_min"]
flight_df = flight_df.drop(['Duration_hr', 'Duration_min'], axis=1)
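To check how the conversion behaves on edge cases, such as durations with only hours or only minutes, here is a small self-contained sketch. The helper name `duration_to_minutes` and the sample strings are illustrative and not part of the original notebook.

```python
import re


def duration_to_minutes(text):
    """Convert strings like '2h 50m', '19h' or '5m' to total minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


# Hypothetical sample values covering the three shapes a duration can take
for raw in ["2h 50m", "19h", "5m"]:
    print(raw, "->", duration_to_minutes(raw))
```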

Handling Categorical Data

# Apply Label Encoder to handle categorical features
la = LabelEncoder()
for i in ["Airline", "Source", "Destination"]:
    flight_df[i] = la.fit_transform(flight_df[i])
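LabelEncoder assigns integers in sorted order of the labels, and a single encoder that is refit for each column only remembers the last column. If you want to map codes back to names later, one option is to keep one encoder per column. This is a sketch on a made-up frame; the `encoders` dictionary and the city names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the real columns
toy = pd.DataFrame({
    "Airline": ["Jet Airways", "Indigo", "Jet Airways"],
    "Source": ["Delhi", "Kolkata", "Delhi"],
})

# Keep a separate fitted encoder per column so codes can be decoded later
encoders = {}
for col in ["Airline", "Source"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
# Decode the integer codes back to the original airline names
print(encoders["Airline"].inverse_transform([0, 1]))
```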

3. Exploratory Data Analysis (EDA): Unlocking Insider Insights

Before takeoff, we unveil fascinating insights through EDA. We explore correlations between airlines and ticket prices, compare prices between weekdays and weekends, and analyze feature relationships through a heatmap.

a. Airlines Vs Flight Ticket Price

# Airline is label-encoded in flight_df, so we group the raw training data to keep airline names on the axis
airlines = Train_data.groupby('Airline').Price.max()
airlines_df = airlines.to_frame().sort_values('Price', ascending=False)[0:10]

plt.subplots(figsize=(8, 4))
sns.barplot(x=airlines_df.index, y=airlines_df["Price"], ec="black")
plt.title("Airline vs Maximum Flight Ticket Price")
plt.ylabel("Maximum Flight Ticket Price")
plt.xlabel("Airlines")
plt.xticks(rotation=90)
plt.show()

Insights

“Jet Airways Business” has the highest maximum ticket price of all airlines.

b. Price on Weekdays vs Weekends

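# Date_of_Journey was dropped from flight_df and Airline was encoded, so we start from the raw training data
days_df = Train_data[['Airline', 'Date_of_Journey', 'Price']].copy()
days_df['Date_of_Journey'] = pd.to_datetime(days_df['Date_of_Journey'], format='%d/%m/%Y')
days_df['Weekday'] = days_df['Date_of_Journey'].dt.day_name()
# Note: only Sunday is flagged as a weekend day here; extend the check to Saturday if needed
days_df['Weekend'] = days_df['Weekday'].apply(lambda day: 1 if day == 'Sunday' else 0)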

plt.subplots(figsize=(8, 4))
sns.barplot(data=days_df, x='Airline', y='Price', hue='Weekend')
plt.xlabel("Airline")
plt.xticks(rotation=90)
plt.ylabel("Price")
plt.title("Price on Weekdays Vs Price on Weekends")
plt.legend(title='Weekend')
plt.ylim(0, 65000)
plt.show()

Insights

• Ticket prices are higher on weekends than on weekdays.
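If you prefer a weekend flag that covers both Saturday and Sunday, the day-of-week number avoids comparing day names. This is a sketch on made-up dates, not the flag used for the plot in this section.

```python
import pandas as pd

# Hypothetical dates: 1 March 2019 was a Friday, followed by Saturday and Sunday
dates = pd.to_datetime(
    pd.Series(["01/03/2019", "02/03/2019", "03/03/2019"]), format="%d/%m/%Y"
)

# dt.dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are the weekend
weekend = (dates.dt.dayofweek >= 5).astype(int)

print(pd.DataFrame({"day": dates.dt.day_name(), "weekend": weekend}))
```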

c. Top 10 Most Preferred Airlines

# Count tickets per airline on the raw data, since Airline is label-encoded in flight_df
airline_names = pd.concat([Train_data["Airline"], Test_data["Airline"]], ignore_index=True)
top_airlines = airline_names.value_counts().index[:10]

plt.figure(figsize=(8, 4))
sns.countplot(x=airline_names, order=top_airlines, ec="black")
font_style = {'family': 'times new roman', 'size': 20, 'color': 'black'}
plt.title("Most preferred Airlines", fontdict=font_style)
plt.ylabel("Count", fontdict=font_style)
plt.xlabel("Airlines", fontdict=font_style)
plt.xticks(rotation=90)
plt.show()

Figure: Most preferred airlines

Insights

• The most preferred airline is “Jet Airways”.

• Of all the flight tickets sold, Jet Airways has the highest share, followed by Indigo.

d. Heatmap for Feature Correlation

plt.figure(figsize=(12, 8))
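# numeric_only=True skips any remaining text columns, which pandas 2.0+ would otherwise reject
sns.heatmap(flight_df.corr(numeric_only=True), annot=True, cmap='RdYlGn')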
plt.title("Heatmap showing Correlation between features")
plt.show()
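When the heatmap gets crowded, it can help to look at only the column of correlations with the target. Here is a minimal sketch on a made-up frame; the values are illustrative and chosen so that Price rises linearly with Duration.

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns
toy = pd.DataFrame({
    "Airline": ["A", "B", "C", "D"],
    "Duration": [60, 120, 180, 240],
    "Price": [3000, 5000, 7000, 9000],
})

# numeric_only=True leaves the text column out of the correlation matrix
corr = toy.corr(numeric_only=True)

# Rank features by their correlation with the target
print(corr["Price"].sort_values(ascending=False))
```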

4. Building the Prediction Model: Flying Smart

Our powerful Random Forest Regressor model becomes the co-pilot in predicting flight ticket prices accurately.

# Splitting the combined data back into the original train and test dataframes
train_df = flight_df.iloc[:len(Train_data)]
test_df = flight_df.iloc[len(Train_data):]

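# Splitting data into x and y
# Every column in x must be numeric at this point; encode or drop any remaining text columns first
x = train_df.drop(["Price"], axis=1)
y = train_df["Price"].values  # 1-D target, which is the shape scikit-learn expects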

# Splitting the dataset into train data and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Building a baseline Random Forest Regressor model with default settings
rf_regressor = RandomForestRegressor(random_state=0)
rf_regressor.fit(x_train, y_train)

# Predicting the values and scoring them with R²
pred = rf_regressor.predict(x_test)
print(r2_score(y_test, pred))
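R² alone does not tell you how far off a prediction is in currency terms. The `mean_absolute_error` and `mean_squared_error` functions imported at the top can fill that gap. This sketch uses made-up prices and predictions, not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true prices and predictions, for illustration only
y_true = np.array([3000, 5000, 7000])
y_hat = np.array([3500, 4500, 7000])

mae = mean_absolute_error(y_true, y_hat)           # average absolute error, in price units
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # penalizes large misses more heavily
r2 = r2_score(y_true, y_hat)                       # share of variance explained

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R2: {r2:.4f}")
```

To score the real model, pass `y_test` and `pred` instead of the toy arrays.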

5. Hyperparameter Tuning: Soaring to New Heights

To achieve peak performance, we embark on Hyperparameter Tuning for our Random Forest Regressor model. By optimizing parameters such as the number of estimators, maximum features, maximum depth, minimum samples split, and minimum samples leaf, we enhance the model’s accuracy and ensure a smoother flight through the sea of data.

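from sklearn.model_selection import RandomizedSearchCV
# max_features=1.0 uses every feature; it replaces 'auto', which newer scikit-learn releases no longer accept
random_search = {'n_estimators': [100, 120, 150, 180, 200, 220, 250],
                 'max_features': [1.0, 'sqrt'],
                 'max_depth': [5, 10, 15, 20],
                 'min_samples_split': [2, 5, 10, 15, 100],
                 'min_samples_leaf': [1, 2, 5, 10]}
rf_regressor = RandomForestRegressor()
rf_model = RandomizedSearchCV(estimator=rf_regressor, param_distributions=random_search,
                              cv=3, n_jobs=-1, verbose=2, random_state=0)
rf_model.fit(x_train, y_train)

# Best parameters found by the search, and the R² of the tuned model on the held-out split
print(rf_model.best_params_)
pred = rf_model.predict(x_test)
print(r2_score(y_test, pred))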
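RandomizedSearchCV does not try every combination in the grid. It samples `n_iter` of them (10 by default) and scores each with cross-validation. Here is a small sketch on synthetic data that shows the moving parts; the dataset, the reduced grid, and `n_iter=4` are assumptions chosen to keep it fast, not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the flight features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A deliberately small, hypothetical search space
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,        # number of sampled combinations
    cv=3,            # 3-fold cross-validation per combination
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)            # the best of the sampled combinations
print(round(search.best_score_, 3))   # its mean cross-validated R²
```

Raising `n_iter` explores more of the grid at the cost of a longer search.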

Conclusion: Landing with Confidence

Our data-driven journey into Flight Ticket Price Prediction with Machine Learning and EDA has reached its destination. Through feature engineering and hyperparameter tuning, we built a Random Forest Regressor and scored it with R² on held-out data, giving travelers a data-backed way to make budget-friendly decisions. As you continue to explore the world of data, remember that the possibilities are limitless.

From optimizing flight ticket prices to unlocking valuable insights, data analysis is a powerful tool for informed choices and positive impact.

Get the full code on my GitHub repo.

I’m glad you found this article helpful and informative! If you have any further questions or need assistance, feel free to reach out. Your feedback is valuable, and I appreciate your thumbs up for this article! Happy analyzing and safe travels! 🛫👍

Connect with me on LinkedIn!
