Daily Passengers and Forecasting

Adi Pradana Yuda Purnomo
7 min read · Jan 12, 2023

Jakarta has many interesting destinations to visit: malls, markets, culinary spots, museums, art, culture, friendly people, and much more.

This story describes one of its hidden gems: the beaches and island communities. Jakarta borders the Java Sea and has some beautiful islands to visit. More than 110 islands lie off Jakarta in the Java Sea, and together they form one island group, the Seribu Islands.

To visit one of the Seribu Islands, you can depart from Muara Angke Harbour or Ancol, so passenger traffic is recorded at those harbours. That makes it an interesting case to study. Two types of ship serve the Seribu Islands routes: traditional ships (maximum capacity 50 passengers) and modern ships (maximum capacity 75 passengers).

My previous article, Passengers Who Comes and Go in Seribu Island. Jakarta! | by Adi Pradana Yuda Purnomo | Dec, 2022 | Medium, concluded that the busiest harbour is Muara Angke. This article therefore forecasts passenger boardings at Muara Angke harbour.

A number of methods can be used to forecast future passenger boardings. Here are a few approaches to consider:

  1. Time series analysis: One approach is to use time series analysis techniques to forecast future passenger boardings. This involves building a statistical model that takes into account the trends and patterns in the historical data and uses this information to make predictions about the future. A variety of tools and techniques can be used, such as moving averages, exponential smoothing, or autoregressive integrated moving average (ARIMA) models.
  2. Machine learning: Another approach is to use machine learning techniques to forecast passenger boardings. This involves training a machine learning model on the historical data and using the model to make predictions about future passenger boardings. A variety of machine learning algorithms can be used to build the model, such as linear regression, decision trees, or support vector machines.
  3. Combining multiple models: It is also possible to combine multiple models to make the forecasts. For example, a time series model could forecast short-term passenger boarding trends while a machine learning model forecasts longer-term trends. Combining the predictions from multiple models may improve the accuracy of the forecasts.
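Two of the simplest time series techniques named above, the moving average and exponential smoothing, can be sketched in a few lines of plain Python. This is only an illustration: the function names, the window size, the smoothing factor `alpha`, and the boarding counts are all made up for the example and do not come from the Muara Angke data.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(history):
        raise ValueError("window must be between 1 and len(history)")
    recent = history[-window:]
    return sum(recent) / window


def exponential_smoothing_forecast(history, alpha):
    """Simple exponential smoothing: recent observations get more weight."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = history[0]
    for value in history[1:]:
        # Blend the new observation with the previous smoothed level.
        level = alpha * value + (1 - alpha) * level
    return level


# Synthetic daily boarding counts, invented for illustration only.
boardings = [240, 255, 250, 270, 265, 260, 280]

ma_forecast = moving_average_forecast(boardings, window=3)
ses_forecast = exponential_smoothing_forecast(boardings, alpha=0.5)
print(ma_forecast, ses_forecast)
```

Both functions return a single next-step forecast. A larger `window` or a smaller `alpha` reacts more slowly to recent changes.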

Of these approaches, this research uses time series analysis: it takes into account the trends and patterns in the historical data and uses them to make predictions about the future. Many time series forecasting methods are available in Python. Some common methods include:

  1. Autoregressive integrated moving average (ARIMA)
  2. Exponential smoothing
  3. The Holt-Winters method
  4. Support vector machine (SVM)
  5. Artificial neural networks (ANNs)
  6. Decision tree

Each of these methods has its own strengths and weaknesses and is better suited to certain types of data and forecasting tasks, so it is important to choose the appropriate method based on the characteristics of the data and the requirements of the forecasting task.

This research uses ARIMA to forecast the time series of passengers boarded at Muara Angke.

An ARIMA model is defined by the parameters p, d and q, so a non-seasonal model is written as ARIMA(p, d, q). ARIMA models can handle non-stationary time series data through differencing, a time series transformation technique. When seasonality is present, the Seasonal ARIMA (SARIMA) model can be used. Regression with ARIMA errors combines two statistical models, linear regression and ARIMA (or Seasonal ARIMA), into a single regression model for forecasting time series data.

The following schematic illustrates how Linear Regression, ARIMA and Seasonal ARIMA (SARIMA) models are combined to produce the Regression with ARIMA errors model:

Linear Regression in ARIMA and SARIMA
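In equation form, the combination can be written roughly as follows. The notation follows the convention used by Hyndman and Athanasopoulos; the symbols (the predictors x, the coefficients β, and the error series η) are introduced here for illustration and do not appear in the original schematic.

```latex
\begin{aligned}
y_t &= \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \eta_t, \\
\phi(B)\,(1 - B)^d\,\eta_t &= \theta(B)\,\varepsilon_t
\end{aligned}
```

Here B is the backshift operator, φ(B) is the autoregressive polynomial of order p, θ(B) is the moving average polynomial of order q, and ε_t is white noise. The first line is the linear regression; the second line says that its errors η_t follow an ARIMA(p, d, q) process instead of being independent.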

The source data is the same as in the previous article: Passengers Who Comes and Go in Seribu Island. Jakarta! | by Adi Pradana Yuda Purnomo | Dec, 2022 | Medium.

Let’s import the libraries we will use.

Importing the libraries in Python

A problem arose when applying ARIMA to the Seribu Islands shipping data: for this analysis, the series should contain no '0' values in any column, and the dates must be continuous with no gaps. The Seribu Islands data does contain 0 values (at the peak of the pandemic in 2020, Muara Angke was not accepting passengers). This research therefore uses only the data for December 2021, and the CSV file was edited manually.
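The two conditions described above, no zero values and no missing dates, can be checked before modelling. The sketch below uses only the standard library; the helper names and the four sample records are hypothetical and were invented to show one zero day and one gap.

```python
from datetime import date, timedelta


def find_zero_days(records):
    """Return the dates whose passenger count is exactly zero."""
    return [day for day, count in records if count == 0]


def find_date_gaps(dates):
    """Return every calendar day missing between the first and last date."""
    present = set(dates)
    current, end = min(dates), max(dates)
    missing = []
    while current <= end:
        if current not in present:
            missing.append(current)
        current += timedelta(days=1)
    return missing


# Synthetic (date, passengers boarded) records, for illustration only.
records = [
    (date(2021, 12, 1), 250),
    (date(2021, 12, 2), 0),    # a zero value
    (date(2021, 12, 3), 265),
    (date(2021, 12, 5), 270),  # 4 December is missing
]

zero_days = find_zero_days(records)
missing_days = find_date_gaps([day for day, _ in records])
print("zero days:", zero_days)
print("missing days:", missing_days)
```

If both lists come back empty, the series meets the two conditions stated above.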

Load the dataset from the CSV file, which uses a comma ‘,’ as the delimiter. The dataset is loaded into a pandas data frame.

Loading the dataset from the CSV file and showing a sample of the data
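A minimal sketch of this loading step is shown below. The original file is not reproduced here, so an in-memory string stands in for it; the column names `date` and `passengers_boarded` and the three rows are assumptions, not the real dataset.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; column names and values are assumptions.
csv_text = """date,passengers_boarded
2021-12-01,250
2021-12-02,262
2021-12-03,258
"""

# sep="," is the pandas default, written out to match the step above.
df = pd.read_csv(io.StringIO(csv_text), sep=",")

# Show the first rows as a quick sample of the data.
print(df.head())
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.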

Convert the date column to datetime.

The date column converted to datetime

Set the date column as the index.

Set the date column as the index

Create a time series object from the data.

Time series object created from the data
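The three preparation steps above (convert the date column, set it as the index, and build the time series) might look like the sketch below. The column names and values are the same hypothetical ones used for illustration, and setting an explicit daily frequency with `asfreq("D")` is an assumption about how the series was built.

```python
import pandas as pd

# Synthetic data frame standing in for the loaded CSV (values are made up).
df = pd.DataFrame({
    "date": ["2021-12-01", "2021-12-02", "2021-12-03", "2021-12-04"],
    "passengers_boarded": [250, 262, 258, 271],
})

# 1. Convert the date column from strings to datetime values.
df["date"] = pd.to_datetime(df["date"])

# 2. Use the date column as the index.
df = df.set_index("date")

# 3. Take the passenger column as a time series with a daily frequency.
series = df["passengers_boarded"].asfreq("D")
print(series)
```

Giving the index an explicit frequency lets forecasting libraries work out the dates of future steps on their own.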

Split the data into training and testing sets.

The data split into training and testing sets

The split data is assigned to the train and test variables.

Train and Test Variable
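For time series, the split is normally chronological: the earlier part trains the model and the later part tests it. The sketch below assumes an 80/20 split, which is a common choice and not necessarily the ratio used in this research; the values are synthetic.

```python
import pandas as pd

# Synthetic daily series, for illustration only.
dates = pd.date_range("2021-12-01", periods=10, freq="D")
series = pd.Series(
    [250, 262, 258, 271, 266, 259, 263, 270, 268, 261],
    index=dates,
)

# Keep the time order: the first 80% is for training, the rest for testing.
split_point = int(len(series) * 0.8)
train = series.iloc[:split_point]
test = series.iloc[split_point:]

print(len(train), "training rows,", len(test), "testing rows")
```

Shuffling the rows before splitting, as is usual for other machine learning tasks, would leak future information into the training set.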

Fit the ARIMA model to the training data.

Fitting the ARIMA model to the training data
The ARIMA model fitted to the training data

ARIMA, or AutoRegressive Integrated Moving Average, is a statistical model that can be used to analyze and forecast time series data. The model is denoted by the notation ARIMA(p, d, q), where p, d, and q are integers representing the model's hyperparameters.

The p parameter represents the number of autoregressive terms in the model. Autoregressive terms are lagged values of the time series data that are included in the model as predictor variables. A higher value of p means that the model will include more lagged values in the model, which can increase the model's ability to capture patterns in the data, but may also increase the risk of overfitting.

The d parameter represents the number of times that the time series data have been differenced to make the data stationary. Stationarity refers to the statistical properties of the data, such as the mean and variance, being constant over time. Non-stationary data can be difficult to model and forecast, so it is often necessary to difference the data to make it stationary before fitting an ARIMA model. A higher value of d means that the data will be differenced more times, which can help to stabilize the statistical properties of the data, but may also make the data more difficult to interpret.
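Differencing itself is a small operation: each value is replaced by its change from the previous value, and d says how many times this is repeated. A plain-Python sketch, with a made-up series and a hypothetical helper name:

```python
def difference(values, d=1):
    """Apply first-order differencing d times."""
    result = list(values)
    for _ in range(d):
        # Each pass replaces the series by consecutive changes.
        result = [current - previous
                  for previous, current in zip(result, result[1:])]
    return result


# A synthetic series with a steady upward trend (not stationary).
trend = [100, 110, 120, 130, 140]

first_diff = difference(trend, d=1)
second_diff = difference(trend, d=2)
print(first_diff)
print(second_diff)
```

One round of differencing removes the linear trend from this series and leaves a constant. Each round also shortens the series by one value.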

The q parameter represents the number of moving average terms in the model. Moving average terms are lagged forecast errors included in the model as predictor variables. A higher value of q means that the model will include more lagged forecast errors in the model, which can help to capture patterns in the data that are not captured by the autoregressive terms, but may also increase the risk of overfitting.

So, for example, an ARIMA model with the notation ARIMA(1, 1, 1) would have 1 autoregressive term, 1 difference, and 1 moving average term. This model would include lagged values of the time series data as predictor variables, and would have differenced the data once to make it stationary. It would also include lagged forecast errors as predictor variables.
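Written out as an equation, the ARIMA(1, 1, 1) example takes the following form. The symbols are standard textbook notation (as in Hyndman and Athanasopoulos) and are not taken from the original text.

```latex
\begin{aligned}
y'_t &= y_t - y_{t-1}, \\
y'_t &= c + \phi_1\, y'_{t-1} + \theta_1\, \varepsilon_{t-1} + \varepsilon_t
\end{aligned}
```

The first line is the single difference (d = 1). In the second line, the term with φ₁ is the one autoregressive term (p = 1), the term with θ₁ is the one moving average term (q = 1), c is a constant, and ε_t is the forecast error at time t.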

Make predictions of future values of the time series, covering 1 January 2022 to 31 January 2022.

Time series predictions set up

Print the prediction result.

The prediction showing up!

Explanation:

The forecast for 1 January 2022 to 31 January 2022 stays at the same value (260.362511) on every day, because the ARIMA model returned a single forecast level. It can be concluded that Muara Angke harbour can expect around 260 boarded passengers on each day of January 2022.
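A flat forecast like this is normal behaviour for some ARIMA specifications. As one possible illustration (an assumption, since the order used here is not stated), an ARIMA(0, 1, 1) model without a constant gives the same point forecast at every horizon, because all future errors are expected to be zero. The sketch below computes that by hand for a given, hypothetical `theta` and a made-up series.

```python
def arima_011_forecast(history, theta, steps):
    """Point forecasts of ARIMA(0,1,1) without a constant, for a given theta."""
    # Rebuild the one-step-ahead errors over the observed history:
    # prediction_t = y_{t-1} + theta * error_{t-1}
    error = 0.0
    for previous, current in zip(history, history[1:]):
        prediction = previous + theta * error
        error = current - prediction

    forecasts = []
    last_value, last_error = history[-1], error
    for _ in range(steps):
        forecast = last_value + theta * last_error
        forecasts.append(forecast)
        # Beyond the sample, the expected future error is zero.
        last_value, last_error = forecast, 0.0
    return forecasts


# Synthetic boarding counts and an assumed theta, for illustration only.
history = [250, 260, 255, 265]
forecasts = arima_011_forecast(history, theta=-0.5, steps=5)
print(forecasts)
```

Every forecast after the first simply repeats the previous one, so the whole forecast month collapses to a single level.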

References:

https://data.jakarta.go.id (accessed on 7 December 2022).

Atwan, T.A. (2022) Time series analysis with Python cookbook: Practical recipes for exploratory data analysis, data preparation, forecasting, and model evaluation. Birmingham: Packt Publishing.

Hyndman, R.J. and Athanasopoulos, G. (2021) Forecasting: Principles and practice. Melbourne: OTexts.
