How Time Series Can Help to Predict the Best Travel Time

Ganang Rahmadi
tiket.com
Feb 28, 2023

In the flight industry, accurate forecasting is a critical component for airlines to optimize their operations and maximize their profitability. Airline companies need to make informed decisions about pricing, capacity, and other aspects of their business based on future demand. To make such decisions, they must rely on accurate forecasting of various factors, including flight bookings.

However, predicting flight bookings is not an easy task, and airlines face several challenges in accurately forecasting demand. One of the most significant factors is seasonality. For instance, airline companies experience higher demand during peak seasons, such as the summer holidays or the festive season, compared to other times of the year. Other factors that can impact flight bookings include competition, economic conditions, and external events such as natural disasters, weather conditions, or political unrest.

To overcome these challenges, airlines can use advanced analytics and machine learning algorithms to improve their predictions. These algorithms can analyze historical booking data, customer behavior, and other factors to identify patterns and make predictions about future demand.

Online travel agencies (OTAs) also rely heavily on accurate booking forecasting. OTAs use machine learning algorithms to analyze data from various sources, such as search histories, customer preferences, and other data points, to predict future demand for flights. By doing so, they can adjust their pricing accordingly and offer more competitive prices to attract customers and gain a larger market share. Here are some of the most common techniques to predict future demand for flights:

  1. Time Series Analysis: Time series analysis is a statistical technique that involves analyzing historical data to identify patterns and trends. This technique is widely used in the OTA industry to forecast flight bookings. Time series analysis can help identify seasonal trends, such as high demand during peak travel seasons, and predict future demand based on historical patterns.
  2. Regression Analysis: Regression analysis is a statistical technique that involves analyzing the relationship between two or more variables. In the OTA industry, regression analysis can be used to predict flight bookings based on factors such as ticket price, time of year, and destination (see the short sketch after this list).
  3. Machine Learning: Machine learning is a branch of artificial intelligence that involves training algorithms to learn from data and make predictions. Machine learning techniques, such as neural networks and random forests, can be used in the OTA industry to predict flight bookings based on historical data.
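
As a quick illustration of the regression idea from item 2, here is a minimal sketch with an entirely hypothetical bookings table; the column names and numbers are illustrative only, not real data:

import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical data: ticket price, month of year, and observed bookings
bookings_df = pd.DataFrame({
    'price':    [120, 95, 150, 110, 80, 140],
    'month':    [1, 2, 6, 7, 11, 12],
    'bookings': [310, 380, 520, 560, 290, 610],
})

# fit bookings as a function of price and month
reg = LinearRegression().fit(bookings_df[['price', 'month']], bookings_df['bookings'])
# predicted bookings for a $100 ticket in June
print(reg.predict(pd.DataFrame({'price': [100], 'month': [6]})))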

Python Examples:

Step 1: Import the libraries: pandas, NumPy, Matplotlib, and statsmodels (scikit-learn is imported later for model evaluation).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
%matplotlib inline
plt.style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')

Step 2: Load the data from the file or source you have; in this example, we are using a CSV file.

# assuming the CSV has a 'date' column and a 'Passengers' column; adjust to the actual names
data = pd.read_csv('flight_passengers_2015-2022.csv', parse_dates=['date'], index_col='date')
data.head()

Step 3: Data preprocessing is an essential step to clean the data and prepare it for modeling. It involves handling missing values, removing duplicates, and handling outliers.

# mean imputation for missing values
data = data.assign(imputation_mean=data['Passengers'].fillna(data['Passengers'].mean()))
# outlier detection with a box plot (whiskers at 1.5 * IQR)
fig, ax = plt.subplots(figsize=(12,6))
ax.boxplot(data['Passengers'], whis=1.5)
# descriptive statistics, including upper-tail percentiles
check = data['Passengers'].describe(percentiles=[0.25, 0.5, 0.75, 0.8, 0.95, 0.98])
# data distribution
data['Passengers'].hist(figsize=(12,6))

Step 4: Visualize the data to see whether it shows a trend, seasonality, or other patterns.

data.plot(figsize=(12,6))

Step 5: Feature Engineering. Extract relevant features from the data to train the model. In this example, we use only the date and passenger columns and decompose the series into trend, seasonal, and residual components.

decomposition = sm.tsa.seasonal_decompose(data['Passengers'], model='multiplicative')
figure = decomposition.plot()
plt.show()

Step 6: Split the data into training and testing. In this example, we will use an 80–20 split.

# 2015 to 2018: 48 months, roughly 80% of the data
train_len = 48
train = data[0:train_len]
# 2019: the remaining 20%
test = data[train_len:]
train.head()
test.head()

Step 7: Model Training. Train the model on the training data using several forecasting methodologies.

a. Naive method

A simple forecasting technique that assumes the future value of a time series will be the same as its most recent observed value; in other words, it assumes no change in the series over time.

# create a new data frame from the test set for this method
y_pred_naive = test.copy()
# the forecast for every test period is the last observed training value
y_pred_naive['naive_pred'] = train['Passengers'].iloc[train_len-1]

#plot the result
plt.figure(figsize=(12,4))
plt.plot(train['Passengers'], label='Train')
plt.plot(test['Passengers'], label='Test')
plt.plot(y_pred_naive['naive_pred'], label='Naive Method Prediction')
plt.legend()

b. Simple average method

The average of all training values is used as the forecast for every test period.

y_pred_avg = test.copy()
# the forecast for every test period is the mean of all training values
y_pred_avg['avg_method'] = train['Passengers'].mean()
plt.figure(figsize=(12,6))
plt.plot(train['Passengers'], label='Train')
plt.plot(test['Passengers'], label='Test')
plt.plot(y_pred_avg['avg_method'], label='Simple average method')
plt.legend()

c. Moving average method

The moving average method is a time series forecasting technique that uses a rolling window of past observations to predict future values. It involves calculating the average of the past ’n’ observations, where ’n’ is the window size, and using it as the forecast for the next period.

# take the rolling average over the last 6 months; for the test period,
# hold the last pre-test average flat as the forecast
y_pred_ma = data.copy()
ma_6 = 6
y_pred_ma['ma_pred'] = data['Passengers'].rolling(ma_6).mean()
y_pred_ma.loc[y_pred_ma.index[train_len:], 'ma_pred'] = y_pred_ma['ma_pred'].iloc[train_len-1]

#plot
plt.figure(figsize=(12,6))
plt.plot(train['Passengers'], label='Train')
plt.plot(test['Passengers'], label='Test')
plt.plot(y_pred_ma['ma_pred'], label='Simple moving average forecast')
plt.legend()

d. Holt-Winters Exponential Smoothing

Captures the level, trend, and seasonality of the series.

# model
from statsmodels.tsa.holtwinters import ExponentialSmoothing
y_pred_hwe = test.copy()
model = ExponentialSmoothing(np.asarray(train['Passengers']), seasonal_periods=12, trend='additive', seasonal='additive')
model_fit = model.fit(optimized=True)

# forecast the 12 test months
y_pred_hwe['hwa_pred'] = model_fit.forecast(12)

#plot
plt.figure(figsize=(12,6))
plt.plot(train['Passengers'], label='Train')
plt.plot(test['Passengers'], label='Test')
plt.plot(y_pred_hwe['hwa_pred'], label='Holt-Winters additive exponential smoothing forecast')
plt.legend()

Analysis

  1. It captures the trend, although the forecasted values sit a little below the actual level.
  2. It also captures the seasonality.
  3. The forecasted peaks are still lower than the actual peaks.

e. ARIMA

1. Stationary vs Non-stationary Time Series

We can see an increasing trend, so the mean is not constant. The variance is not constant either. The series is therefore non-stationary.

2. Augmented Dickey-Fuller (ADF) Test
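
The statistics below were produced with statsmodels' adfuller; a minimal sketch of the call (the ADF null hypothesis is that the series has a unit root, i.e., it is non-stationary):

from statsmodels.tsa.stattools import adfuller

adf_result = adfuller(data['Passengers'])
print('ADF stats: %f' % adf_result[0])
print('p-value: %f' % adf_result[1])
print('Critical value @ 0.05: %.2f' % adf_result[4]['5%'])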

ADF stats: 0.684980
p-value: 0.989533
Critical value @ 0.05: -2.91

p-value (0.99) > alpha (0.05).

Fail to reject the null hypothesis: the series is not stationary.

3. Converting Non-Stationary to Stationary pattern

A Box-Cox transformation makes the variance of a series constant.

from scipy.stats import boxcox
# Creating a new series with the Box-Cox transform (lmbda=0 is the log transform)
data_bc = pd.Series(boxcox(data['Passengers'], lmbda=0), index=data.index)
# Plotting the Time series after transformation
plt.figure(figsize=(12,6))
plt.plot(data_bc, label='Box Cox Transformation')
plt.legend()

We can see that the variance is more constant after the Box-Cox transformation, but the series still has an upward trend, so the mean is not yet constant. We need differencing to make the mean constant.

4. Differencing

Differencing is performed by subtracting the previous observation from the current observation, which removes a time series' trend. Seasonal differencing, which subtracts the observation one full cycle earlier, removes seasonality.

# First-order differencing of the Box-Cox series; the first value is NaN and is dropped
data_boxcox_diff = data_bc.diff().dropna()

# Plotting the time series after Box-Cox transformation and differencing
plt.figure(figsize=(12,6))
plt.plot(data_boxcox_diff, label='Box Cox Transformation and Differencing')
plt.legend()

We can see that there is no upward or downward trend after differencing the Box-Cox series. The mean is now constant at zero, and the variance is almost constant.

ADF stats: -12.531653
p-value: 0.000000
Critical value @ 0.05: -2.91

p-value (0.00) < alpha (0.05), and the ADF statistic is below the critical value. Reject the null hypothesis: the series is stationary.

5. Autocorrelation function (ACF)

The autocorrelation function captures both the direct and indirect relationships between a series and its lagged values.

from statsmodels.graphics.tsaplots import plot_acf
plt.figure(figsize=(12,6))
plot_acf(data_boxcox_diff, ax=plt.gca(), lags=24)

Even though the spikes are not all statistically significant, we can see a seasonal pattern recurring roughly every 3 months.

6. Partial autocorrelation function (PACF)

Captures only the direct correlation between a series and its lagged values.

from statsmodels.graphics.tsaplots import plot_pacf
plt.figure(figsize=(12,6))
plot_pacf(data_boxcox_diff, ax=plt.gca(), lags=24)

7. Building Simple Auto Regressive Model

An autoregressive (AR) model is a type of time series model that uses past observations of a variable to predict future values. In an AR model, the current value of a variable is regressed on its previous values, or lags, with the assumption that past values can explain or help predict future values.

# Splitting data_boxcox
train_data_boxcox = data_bc[:train_len]
test_data_boxcox = data_bc[train_len:]

# Taking train_len-1 because we have deleted the first observation
train_data_boxcox_diff = data_boxcox_diff[:train_len-1]
test_data_boxcox_diff = data_boxcox_diff[train_len-1:]

# the old statsmodels.tsa.arima_model module is deprecated; use the new path
from statsmodels.tsa.arima.model import ARIMA
# p = 1: regress on one lagged value
# d = 0: the series is already differenced (stationary)
# q = 0: no moving-average terms yet
model = ARIMA(train_data_boxcox_diff, order=(1,0,0))
model_fit = model.fit()

# forecast on the differenced Box-Cox scale, then undo both transforms
y_pred_ar = data_boxcox_diff.to_frame(name='boxcox_diff')
y_pred_ar['ar_forecast_boxcox_diff'] = model_fit.predict(start=data_boxcox_diff.index.min(), end=data_boxcox_diff.index.max())
# undo differencing: cumulative sum, then add back the first Box-Cox value
y_pred_ar['ar_forecast_boxcox'] = y_pred_ar['ar_forecast_boxcox_diff'].cumsum()
y_pred_ar['ar_forecast_boxcox'] = y_pred_ar['ar_forecast_boxcox'].add(data_bc.iloc[0])
# undo the log (Box-Cox with lmbda=0) transform
y_pred_ar['ar_forecast'] = np.exp(y_pred_ar['ar_forecast_boxcox'])

#plot
plt.figure(figsize=(12,6))
plt.plot(train['Passengers'], label='Train')
plt.plot(test['Passengers'], label='Test')
plt.plot(y_pred_ar['ar_forecast'][test.index.min():], label='Auto regression forecast (AR)')
plt.legend()

8. Building Moving Average Model (MA)

A moving average (MA) model is a type of time series model that forecasts future values from the past errors, or residuals, of the time series. The MA model assumes that the current value of the series depends on past forecast errors, calculated as the difference between the actual and predicted values at each time period.

model = ARIMA(train_data_boxcox_diff, order=(0,0,1))
model_fit = model.fit()
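
The fitted MA model can be turned into a forecast on the original scale with the same inverse-transform steps used for the AR model; a minimal sketch, where y_pred_ma_model is a new, hypothetical name chosen to avoid clashing with the earlier moving-average forecast:

y_pred_ma_model = data_boxcox_diff.to_frame(name='boxcox_diff')
y_pred_ma_model['ma_forecast_boxcox_diff'] = model_fit.predict(start=data_boxcox_diff.index.min(), end=data_boxcox_diff.index.max())
# undo differencing and the log transform, as before
y_pred_ma_model['ma_forecast_boxcox'] = y_pred_ma_model['ma_forecast_boxcox_diff'].cumsum().add(data_bc.iloc[0])
y_pred_ma_model['ma_forecast'] = np.exp(y_pred_ma_model['ma_forecast_boxcox'])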

Still, neither the AR model nor the MA model captures the seasonality.

9. ARIMA Model

model = ARIMA(train_data_boxcox, order=(1,1,1))
model_fit = model.fit()

We can see that ARIMA still does not capture the seasonality. The ARMA and ARIMA approaches actually give the same result here, because in the ARMA model we differenced manually, while ARIMA(1,1,1) performs the differencing automatically through d=1.

10. SARIMA model

ARIMA with a seasonal component added.

from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train_data_boxcox, order=(1,1,1), seasonal_order=(1,1,1,12))
model_fit = model.fit()
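
A sketch of producing the SARIMA forecast on the original passenger scale; since the model was fit on the (undifferenced) Box-Cox series, only the log transform needs inverting:

y_pred_sarima = test.copy()
y_pred_sarima['sarima_forecast_boxcox'] = model_fit.predict(start=test.index.min(), end=test.index.max())
y_pred_sarima['sarima_forecast'] = np.exp(y_pred_sarima['sarima_forecast_boxcox'])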

The SARIMA method has done reasonably well as it captures seasonality.

Step 8: Model Evaluation. Evaluate the model on the testing data and calculate the accuracy metrics.

RMSE (Root Mean Squared Error) and MAPE (Mean Absolute Percentage Error) are two commonly used metrics for measuring the accuracy of a predictive model. Here's what they mean and how to interpret them, with a sketch of how to compute them after the list:

  1. RMSE is a measure of the difference between the predicted values and the actual values. It is calculated as the square root of the average of the squared differences between the predicted and actual values. RMSE gives a measure of the typical error in the predictions made by a model.
  2. MAPE measures the percentage difference between the predicted and actual values. It is calculated as the average of the absolute percentage differences between the predicted and actual values. MAPE is often used in forecasting and is particularly useful when the scale of the data is not uniform.
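
A minimal sketch of computing both metrics for any of the forecasts above, here using the Holt-Winters predictions from Step 7 (mean_squared_error comes from scikit-learn; the MAPE helper is our own):

from sklearn.metrics import mean_squared_error

def mape(actual, predicted):
    # mean absolute percentage error, in percent
    return np.mean(np.abs((actual - predicted) / actual)) * 100

rmse = np.sqrt(mean_squared_error(test['Passengers'], y_pred_hwe['hwa_pred']))
print('RMSE: %.2f, MAPE: %.2f%%' % (rmse, mape(test['Passengers'], y_pred_hwe['hwa_pred'])))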

Comparing the RMSE and MAPE values across methods, the SARIMA method outperforms the others. Its performance is comparable to the other methods that also model seasonality, such as Holt-Winters additive or multiplicative smoothing.

To sum up, precise forecasting of flight bookings plays a vital role in optimizing the operations and maximizing profitability of airlines and travel agencies. The utilization of advanced analytics and machine learning algorithms has provided companies in the flight industry with the means to make better-informed decisions about pricing and capacity, ultimately resulting in enhanced customer satisfaction and financial outcomes.

References:

https://github.com/sahidul-shaikh/time-series-forecasting-airline-passenger-traffic/blob/main/airline-passenger-traffic.ipynb
