Advanced Forecasting and Feature Engineering for Petroleum Industry Demand
Accurately predict fuel demand with time-series forecasting.
While forecasting demand with 100% accuracy is nearly impossible, applying intelligent forecasting algorithms to historical data can yield better predictions, enabling more informed decisions about scheduling, planning, and potential supply disruptions. In this article, we explore the methods and models used for fuel demand forecasting. We also examine a real-world example from the resources industry and show how its demand forecasting could be enhanced using these models.
Context
Slalom recently helped a resources industry giant build a planning tool for their midstream business, which enables their schedulers to efficiently match supply to demand and maximize cost effectiveness. In the fuel industry, midstream operations involve transporting fuel from the point of production (upstream) to the point of consumption (downstream), often across vast areas. The tool optimizes planning so that terminals are supplied with the right amount of fuel, at the right time and in the right location. This enables schedulers to make data-driven decisions that reduce costs and help the supply chain operate more effectively.
The client calculated its fuel demand forecast by averaging actual fuel offloads over a month for each fuel type and terminal location, then subdividing the monthly figure into days. Since demand forecasting is a critical element of a supply planning tool, we analyzed how the client was forecasting demand and discovered a gap between the forecasted trend and the actual offload of fuel.
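The baseline described above can be sketched in a few lines of pandas. This is only one reading of the client's method: it assumes the flat daily forecast is the mean of the recorded daily offloads in that month, and the column names `date`, `terminal`, and `volume` are hypothetical (only `item_id` appears in the client data discussed later). The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical offload records; values are illustrative, not client data.
offloads = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-01", "2021-07-02", "2021-07-03",
                            "2021-08-01", "2021-08-02"]),
    "terminal": [2068] * 5,
    "item_id": ["fuel_a"] * 5,
    "volume": [120.0, 80.0, 100.0, 60.0, 90.0],
})


def monthly_average_forecast(df: pd.DataFrame) -> pd.DataFrame:
    """Attach a flat daily forecast: the mean daily offload of each month,
    computed separately per terminal and fuel type (assumed interpretation)."""
    out = df.copy()
    out["month"] = out["date"].dt.to_period("M")
    out["forecast"] = (
        out.groupby(["terminal", "item_id", "month"])["volume"].transform("mean")
    )
    return out


baseline = monthly_average_forecast(offloads)
print(baseline[["date", "volume", "forecast"]])
```

Because every day in a month receives the same value, a forecast built this way cannot follow day-to-day swings, which is consistent with the gap we observed.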
This led us to perform an initial exploratory data analysis to understand this mismatch so we could improve the forecasting model. Slalom created a proof of concept where we analyzed the data, identified gaps, and eventually improved the client’s forecasting model.
Data analysis
Initial analysis showed incomplete or missing actual offload data for various terminals and fuel types. For our model refinement, we therefore picked the terminal with the highest number of recorded offloads. For this terminal, we plotted the gap between the actual and forecasted offload data. Because the forecasted value is the same for every day of a month, we applied a rolling sum with a 5-day window to the actual offloads, which smoothed the line and made the gap easier to see. The graph below shows the comparison:
This comparison helped us understand the actual demand data and assess whether the training data was ready for a forecasting algorithm. Our initial goal was to reduce the gap between the blue and orange lines in the figure above, and to provide insight-backed forecasts that reduce overhead, risk, and surplus.
The recording system started in 2021. Since then, a rough figure for how much fuel of each type was sold has been entered manually every day at each terminal. However, missing data is sometimes filled in with a guess based on the recorder's memory and experience.
We therefore had nearly a year's worth of data on which to do our modeling, detailing fuel demand by date, fuel type, and terminal. However, the data contained human error, so our analysis had to include data quality checks before we could identify patterns in the time series and predict demand for a particular fuel in the next order.
Checking data quality
Quality of the data is important to ensure an accurate forecasting algorithm. We first visualized, then pre-processed the data before passing it to the model.
Let’s have a closer look at the data.
As shown in Figure 2 below, demand for each fuel type (item_id) is not evenly distributed across terminals. Demand varies widely depending on the terminal and the fuel type.
As shown in Figure 3, for terminals 2050 and 2068, fuel types xxx476 and xxx188 have the most available data, so they are the best candidates for further analysis.
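A coverage check like the one behind this selection could look roughly as follows. This is a sketch under assumptions: the column names `terminal`, `date`, and `volume` are hypothetical, the fuel labels `fuel_a` and `fuel_b` are placeholders rather than real item IDs, and the values are invented.

```python
import pandas as pd

# Hypothetical records; NaN marks a day with no recorded offload.
records = pd.DataFrame({
    "terminal": [2050, 2050, 2050, 2068, 2068, 2068, 2068],
    "item_id": ["fuel_a", "fuel_a", "fuel_b", "fuel_a", "fuel_b", "fuel_b", "fuel_b"],
    "date": pd.to_datetime(["2021-07-01", "2021-07-02", "2021-07-01",
                            "2021-07-01", "2021-07-01", "2021-07-02",
                            "2021-07-03"]),
    "volume": [10.0, float("nan"), 5.0, 7.0, 12.0, 9.0, 11.0],
})

# Count the distinct days with a recorded volume per terminal and fuel type.
coverage = (
    records.dropna(subset=["volume"])
    .groupby(["terminal", "item_id"])["date"]
    .nunique()
    .sort_values(ascending=False)
)

# The best-covered combination is the natural candidate for modeling.
best_terminal, best_item = coverage.index[0]
print(coverage)
print(best_terminal, best_item)
```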
As shown in Figure 4, we took terminal 2068 and fuel type 100000188 as sample data. The data fluctuates greatly from day to day. Slight seasonal patterns are visible, though not clearly.
Identifying patterns in the data
To reduce the impact of errors in the data, we analyzed the sum of demand over N days rather than each day's demand. Summing dampens the effect of single-day recording errors and is a better fit for the business model. As shown in Figure 5, the series becomes smoother as N increases. We chose the 5-day rolling sum for further analysis and forecasting.
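As a minimal sketch of this step, the rolling sum can be computed with pandas `Series.rolling`. The daily values below are invented for illustration, and the helper name `rolling_demand` is ours.

```python
import pandas as pd

# Invented daily offload volumes for one terminal and fuel type.
daily = pd.Series(
    [10.0, 0.0, 25.0, 5.0, 10.0, 40.0, 0.0, 15.0, 20.0, 5.0],
    index=pd.date_range("2021-07-01", periods=10, freq="D"),
    name="volume",
)


def rolling_demand(series: pd.Series, window: int) -> pd.Series:
    """Sum of the last `window` days; NaN until a full window is available."""
    return series.rolling(window=window, min_periods=window).sum()


# Compare several window sizes side by side, as in the smoothing comparison.
smoothed = pd.DataFrame(
    {f"sum_{n}d": rolling_demand(daily, n) for n in (1, 3, 5)}
)
print(smoothed)
```

Requiring a full window (`min_periods=window`) means the first N-1 days have no value, which avoids comparing partial sums against full ones.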
Before modeling, it is important to determine whether the data is suitable for analysis. For instance, the data may be stationary or non-stationary, its autocorrelation may be high or low, or the data itself may be white noise [2]. Figure 6 shows this analysis, which helped us identify the best model for the data. The first- and second-order differencing in the figure below shows some level of stationarity and autocorrelation in the data.
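To illustrate what differencing and autocorrelation reveal, here is a small NumPy sketch on a synthetic series (a linear trend plus a weekly cycle plus noise). The series is a stand-in, not the client data, and the helper `autocorr` is our own. A formal check such as the augmented Dickey-Fuller test could be used as well.

```python
import numpy as np


def autocorr(x, lag: int) -> float:
    """Sample autocorrelation of x at a positive lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.dot(centered[:-lag], centered[lag:]) / np.dot(centered, centered))


rng = np.random.default_rng(0)
t = np.arange(200)
# Synthetic stand-in: trend + 7-day cycle + noise.
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, t.size)

first_diff = np.diff(series, n=1)   # removes the linear trend
second_diff = np.diff(series, n=2)  # differences the first difference again

for name, values in [("raw", series), ("1st diff", first_diff), ("2nd diff", second_diff)]:
    print(f"{name:9s} lag-1: {autocorr(values, 1):6.3f}  lag-7: {autocorr(values, 7):6.3f}")
```

On the raw series, the trend keeps the lag-1 autocorrelation close to 1, a sign of non-stationarity. After first differencing the trend is gone, and the strong lag-7 autocorrelation that remains points to a weekly seasonal component, which is the kind of structure a seasonal model can exploit.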
Predicting demand: selection of model
ARIMA and SARIMA are both common algorithms for time series forecasting [1]. As shown in Figure 7, SARIMA tracks the actual values better when there is a spike. The Mean Absolute Percentage Error (MAPE) of ARIMA and SARIMA is 0.1201 and 0.1279 respectively, so the two models are close on overall error, with ARIMA marginally lower. SARIMA, however, picked up the seasonal pattern seen earlier better than ARIMA. It still needs further tuning in the future.
A second scenario reflects the actual business use case. For instance, on Friday, operators order the fuel for the whole of the following week and only need to know the predicted demand for that week. In such a case, predicting demand for day N+1 from the preceding N days of data is more accurate than forecasting over a longer horizon. We did this using SARIMA, as shown in Figure 8. The result shows that the model catches most of the spikes, which makes it more useful in the actual business flow. The data should be updated every day, and the model should be retrained each time the data is updated, to get better results.
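The retrain-and-predict-one-day loop described above is often called walk-forward forecasting. The sketch below keeps the loop separate from the model so it can be checked with a trivial forecaster. The function names are ours, the SARIMA orders are illustrative rather than the PoC's tuned values, and the demand numbers are invented.

```python
from typing import Callable, List, Sequence


def walk_forward(
    series: Sequence[float],
    start: int,
    forecast_one_step: Callable[[Sequence[float]], float],
) -> List[float]:
    """For each day i >= start, refit on series[:i] and predict day i."""
    predictions = []
    for i in range(start, len(series)):
        history = series[:i]  # only data available before day i
        predictions.append(float(forecast_one_step(history)))
    return predictions


def sarima_one_step(history: Sequence[float]) -> float:
    """One-step SARIMA forecast, refit on the history seen so far.
    Orders are illustrative assumptions; needs a reasonably long history."""
    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    model = SARIMAX(
        np.asarray(history, dtype=float),
        order=(1, 1, 1),
        seasonal_order=(1, 0, 1, 7),
    )
    return float(model.fit(disp=False).forecast(steps=1)[0])


def last_value(history: Sequence[float]) -> float:
    """Naive stand-in forecaster: tomorrow equals today."""
    return history[-1]


demand = [50.0, 80.0, 80.0, 70.0, 85.0, 80.0, 60.0, 90.0]
preds = walk_forward(demand, start=5, forecast_one_step=last_value)
print(preds)  # [85.0, 80.0, 60.0]
```

Passing `sarima_one_step` instead of `last_value` gives the daily-retrained SARIMA variant. Refitting on every step is slow, so on a long series it may be worth refitting less often.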
Comparison with client’s current forecasting
If we add the SARIMA (2 weeks) predictions from Figure 7 and the SARIMA (1 day) predictions from Figure 8 to the plot in Figure 1, we see that a model built on the existing demand data tracks actual demand more closely than the legacy forecasting solution, as seen in Figure 9. This plot gives a closer view of July through September, where we predicted approximately 2 weeks of demand.
Wrapping up
This proof of concept started demonstrating value to the client immediately through significant improvements in how they forecast their fuel demand: for the first 3 weeks, the fuel demand forecasted by this algorithm was approximately 70% more accurate. Further refinement of the methods discussed in this article, with larger available datasets, will improve prediction accuracy even further. That work is currently ongoing with the Slalom team!
Acknowledgement
A huge thanks to Lin Feng for her work on this PoC, to Rayqinruiwang for his guidance, and to Wei Qu for his encouragement to write this piece and his feedback. I am also grateful to Yasneen Ashroff, Robert Sibo, Lam Truong, and Nick Jamieson for their support and feedback.
References
1. A Gentle Introduction to SARIMA for Time Series Forecasting in Python
https://machinelearningmastery.com/sarima-for-time-series-forecasting-in-python
2. An Overview of Autocorrelation, Seasonality and Stationarity in Time Series Data
3. Time Series Forecasting — A Complete Guide