Crude Brent Oil (from Analysis to Forecasting) — Python and PyTorch code — Part 1

M Alruqimi
8 min read · May 29, 2024


Part 1: Data collection and exploration

The complete code is available on GitHub.

Introduction

In this series of articles, we will focus on forecasting Brent crude oil prices. First, we will collect Brent oil prices along with a set of external factors that influence them, such as the US Dollar Index, gold prices, natural gas prices, and a sentiment score. We will show how to use simple Python code to merge these time series into a single dataset, clean the data, and then plot, analyze, and run statistical tests on it.

Next, we will build several deep learning models to forecast Brent oil prices. We will begin with simpler models using Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Ultimately, we will progress to advanced forecasting models by integrating state-of-the-art techniques, such as Generative Adversarial Networks (GANs) and stochastic differential equations.

What is a time series?

A time series is a sequence of data points collected or recorded at regular time intervals. It is used to analyze trends, patterns, and behaviors over time, often in fields like finance, economics, and meteorology. Examples include daily stock prices, electricity load and prices, monthly sales figures, or annual rainfall measurements.
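For example, a small daily price series can be represented in pandas as a Series indexed by dates (a minimal, illustrative sketch; the values are made up):

import pandas as pd

# Illustrative daily time series: five made-up closing prices indexed by date
dates = pd.date_range(start='2024-01-01', periods=5, freq='D')
prices = pd.Series([82.1, 83.4, 81.9, 84.2, 85.0], index=dates, name='Close')
print(prices)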

Crude Brent

Crude Brent oil is a major trading classification of sweet light crude oil that serves as a benchmark for international oil prices. It is sourced from the North Sea and is used to price two-thirds of the world’s internationally traded crude oil supplies.

Start coding…

Importing Necessary Packages

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np

Data collection and merging

The following code is designed to read specific columns from multiple sheets of an Excel file and merge them into a single dataset based on the ‘Date’ column. It starts by defining the columns to be read from each sheet, then loads the data from an Excel file named ‘Features.xlsx’ for each sheet, selecting only the specified columns and parsing dates. After loading the data, the code merges the individual datasets on the ‘Date’ column, ensuring that all relevant data is combined into one DataFrame. Finally, it prints the first three rows of the merged dataset to verify the result.
You can access the dataset via this GitHub repo, and the sentiment score via this GitHub repository.


# Define the columns you want to read from each sheet
brent_columns = ['Date', 'Close']
usd_columns = ['Date', 'Close']
sa_energ_columns = ['Date', 'Close']
sentiment_columns = ['Date', 'Csum_CrudeBERT_Plus_GT']
gold_future_columns = ['Date', 'Close']
natural_gas_columns = ['Date', 'Close']

# Load the data from the Excel file sheets, specifying only the required columns
DATAPATH_Features = 'Features.xlsx'
brent = pd.read_excel(open(DATAPATH_Features, 'rb'), sheet_name='Brent', engine='openpyxl', usecols=brent_columns, parse_dates=True)
usd = pd.read_excel(open(DATAPATH_Features, 'rb'), sheet_name='USD', engine='openpyxl', usecols=usd_columns, parse_dates=True)
TENT = pd.read_excel(open(DATAPATH_Features, 'rb'), sheet_name='SAUDI_ENERG', engine='openpyxl', usecols=sa_energ_columns, parse_dates=True)
SENTI = pd.read_excel(open(DATAPATH_Features, 'rb'), sheet_name='Sentiment', engine='openpyxl', usecols=sentiment_columns, parse_dates=True)
gold = pd.read_excel(open(DATAPATH_Features, 'rb'), sheet_name='GOLD_Future', engine='openpyxl', usecols=gold_future_columns, parse_dates=True)
gas = pd.read_excel(open(DATAPATH_Features, 'rb'), sheet_name='Natural Gas Future', engine='openpyxl', usecols=natural_gas_columns, parse_dates=True)

# Merge all the datasets into one DataFrame on the 'Date' column
merged_data = brent.merge(usd, on='Date', how='left', suffixes=('', '_usd'))
merged_data = merged_data.merge(TENT, on='Date', how='left', suffixes=('', '_tent'))
merged_data = merged_data.merge(SENTI, on='Date', how='left', suffixes=('', '_senti'))
merged_data = merged_data.merge(gold, on='Date', how='left', suffixes=('', '_gold'))
merged_data = merged_data.merge(gas, on='Date', how='left', suffixes=('', '_gas'))

print(merged_data.head(3))
merged_data.info()

This check verifies that the merge has not added or dropped any rows, so the original dates in the ‘Date’ column have not been altered during the merging process.


# Compare the row index of the merged DataFrame with the original Brent index
merged_dates = pd.DataFrame(merged_data.index.values)
print(merged_dates.tail())
original_dates = pd.DataFrame(brent.index.values)
print(original_dates.tail())

# True if the merge preserved the original rows
res1 = merged_dates[0].equals(original_dates[0])
print(res1)

You should see some outputs like this:

Since the sentiment score is available only for the period from 2012-01-01 to 2021-04-01, I need to filter the dataset to match this timeframe.

# Define the date range
start_date = '2012-01-01'
end_date = '2021-04-01'

# Convert the Date column to datetime if it is not already
merged_data['Date'] = pd.to_datetime(merged_data['Date'])

# Filter the dataframe based on the date range
filtered_data = merged_data[(merged_data['Date'] >= start_date) & (merged_data['Date'] <= end_date)]

# Verify the filtering
print(filtered_data.head(1))
print(filtered_data.tail(1))

# Optional: Display a summary to verify the filtering
print(filtered_data.info())


# Drop the first 6 rows so the dataset has exactly 2380 rows
# (.copy() avoids a SettingWithCopyWarning when filling values in place later)
filtered_data = filtered_data[6:].copy()
filtered_data.info()

Check and fill missing values

The following code fills in missing values in the filtered_data DataFrame using linear interpolation, updating the DataFrame in place.
Actually, you have several options for filling missing values (a short sketch of the alternatives appears right after the interpolation code below):
1. Interpolation: Estimates missing values based on existing data points, such as linear or spline interpolation.
2. Mean/Median/Mode Imputation: Replaces missing values with the mean, median, or mode of the column.
3. Forward/Backward Fill: Propagates the next or previous non-missing value forward or backward.
4. Constant Value: Fills missing values with a specified constant, such as zero or a specific number.
5. Drop Missing: Removes rows or columns that contain missing values.

# Fill NaN values with linear interpolation
filtered_data.interpolate(method='linear', inplace=True)

# Forward fill to handle leading NaNs
filtered_data.ffill(inplace=True)

# Backward fill to handle trailing NaNs
filtered_data.bfill(inplace=True)

# Verify that NaN values are filled
print(filtered_data.isna().sum()) # This will print the count of NaN values in each column, which should be 0 if all NaNs were filled correctly.
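For reference, the other strategies listed above could be applied like this (a minimal sketch on copies of the data, so the interpolated DataFrame is not overwritten; pick the one that suits your data):

# 2. Mean imputation for every numeric column
mean_filled = filtered_data.fillna(filtered_data.mean(numeric_only=True))

# 4. Fill missing values with a constant
constant_filled = filtered_data.fillna(0)

# 5. Drop rows that still contain missing values
dropped = filtered_data.dropna()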

filtered_data.head(3)
Dataset sample at this stage

You can always save your work to a CSV file:

filtered_data.to_csv("data.csv", encoding='utf-8', index=False)

Exploration and analysis

# Reload the saved data and make some plots
df = pd.read_csv('data.csv')


# Create a figure with subplots arranged in 2 rows and 2 columns
fig, axs = plt.subplots(2, 2, figsize=(10, 6))

# Plot the first histogram
axs[0, 0].hist(df['Close'], bins=20)
axs[0, 0].set_title('Brent')
axs[0, 0].spines['top'].set_visible(False)
axs[0, 0].spines['right'].set_visible(False)


# Plot the second histogram
axs[0, 1].hist(df['Csum_CrudeBERT_Plus_GT'], bins=20)
axs[0, 1].set_title('SENTI')
axs[0, 1].spines['top'].set_visible(False)
axs[0, 1].spines['right'].set_visible(False)

# Plot the first scatter plot
axs[1, 0].scatter(df['Close'], df['Close_usd'], s=32, alpha=0.8)
axs[1, 0].set_title('Brent vs USD')
axs[1, 0].set_xlabel('Brent')
axs[1, 0].set_ylabel('USD')
axs[1, 0].spines['top'].set_visible(False)
axs[1, 0].spines['right'].set_visible(False)

# Plot the second scatter plot
axs[1, 1].scatter(df['Close'], df['Csum_CrudeBERT_Plus_GT'], s=32, alpha=0.8)
axs[1, 1].set_title('Brent vs SENTI')
axs[1, 1].set_xlabel('Close')
axs[1, 1].set_ylabel('SENTI')
axs[1, 1].spines['top'].set_visible(False)
axs[1, 1].spines['right'].set_visible(False)

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

The previous code builds a few plots. You should see figures like these:

Data Normalisation

Scaling data is crucial for machine learning models because it ensures that all features contribute equally to the model’s performance and prevents features with larger ranges from dominating the learning process. Properly scaled data improves model convergence speed and accuracy, especially for algorithms sensitive to feature magnitudes, such as gradient descent-based methods and distance-based algorithms like k-nearest neighbors.

One common approach to scaling data is using the preprocessing module from the scikit-learn library. This module provides various scaling techniques, such as StandardScaler for standardizing features by removing the mean and scaling to unit variance, MinMaxScaler for scaling features to a specified range, and RobustScaler for scaling features using statistics that are robust to outliers. Using these tools, you can ensure your data is appropriately scaled for effective model training and analysis.


#Scaling the data
from sklearn import preprocessing

# Separate the 'Date' column from the rest of the data
date_column = df[['Date']]
data_columns = df.drop('Date', axis=1)

# Initialize the MinMaxScaler
scaler = preprocessing.MinMaxScaler()

# Fit and transform the data columns
scaled_data = scaler.fit_transform(data_columns)

# Convert the scaled data back to a dataframe
scaled_df = pd.DataFrame(scaled_data, columns=data_columns.columns)

# Concatenate the 'Date' column back to the scaled dataframe
scaled_df = pd.concat([date_column.reset_index(drop=True), scaled_df], axis=1)

# Display the scaled dataframe
print(scaled_df.head())
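If your model prefers zero-mean inputs, you can swap in StandardScaler (or RobustScaler when outliers are a concern) using exactly the same fit/transform pattern; a minimal sketch:

from sklearn.preprocessing import StandardScaler

# Same pattern as above, but standardising to zero mean and unit variance
std_scaler = StandardScaler()
standardised = std_scaler.fit_transform(data_columns)
standardised_df = pd.DataFrame(standardised, columns=data_columns.columns)
print(standardised_df.describe().round(2))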

Time series decomposition

Time series decomposition is a technique used to break down a time series into its constituent components to better understand the underlying patterns and structures. Typically, a time series can be decomposed into three main components: trend, seasonality, and residuals (or noise). The trend component captures the long-term progression or direction of the data, seasonality reflects repeating patterns or cycles over specific periods, and residuals represent the random variations or irregularities. By decomposing a time series, analysts can isolate these elements, making it easier to identify and model each component separately. This decomposition aids in more accurate forecasting and analysis, providing clearer insights into the behavior of the time series over time.

#To decompose the Brent price components into trend, seasonality, 
#and residuals, you can use the seasonal_decompose function from the statsmodels library.

from statsmodels.tsa.seasonal import seasonal_decompose

brent = df[['Date', 'Close']].copy()
brent['Date'] = pd.to_datetime(brent['Date'])
brent.set_index('Date', inplace=True)

# Decompose the Brent price data
decomposition = seasonal_decompose(brent['Close'], model='additive', period=7)

# Extract the components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# Plot the decomposed components
plt.figure(figsize=(10, 6))

plt.subplot(411)
plt.plot(brent['Close'], label='Original', color='blue')
plt.legend(loc='upper left')
plt.title('Original Brent Price')

plt.subplot(412)
plt.plot(trend, label='Trend', color='orange')
plt.legend(loc='upper left')
plt.title('Trend')

plt.subplot(413)
plt.plot(seasonal, label='Seasonal', color='green')
plt.legend(loc='upper left')
plt.title('Seasonal')

plt.subplot(414)
plt.plot(residual, label='Residual', color='red')
plt.legend(loc='upper left')
plt.title('Residual')

plt.tight_layout()
plt.show()

You should see output like this:
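As an alternative to the classical additive decomposition, statsmodels also provides STL (Seasonal-Trend decomposition using LOESS), which copes better with seasonality that changes over time; a minimal sketch reusing the same brent series and the assumed weekly period:

from statsmodels.tsa.seasonal import STL

# STL decomposition of the Brent series (period=7 assumes a weekly cycle, as above)
stl_result = STL(brent['Close'], period=7).fit()
stl_result.plot()
plt.show()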

Visualisation

We want to see Brent oil prices and the other time series together in a single line chart:


plt.figure(figsize=(10, 5), dpi=150)

scaled_df['Close'].plot(label='Price')
scaled_df['Close_usd'].plot(label='USD')
scaled_df['Csum_CrudeBERT_Plus_GT'].plot(label='SENT')

# Add title, axis label, and legend
plt.title('Brent Oil Price compared to other series')

# plt.ylabel('Prices & Volume')
plt.xlabel('Date')
plt.legend()
plt.show()

Correlation Test

The Spearman correlation test is a non-parametric measure used to assess the strength and direction of the monotonic relationship between two time series. Unlike the Pearson correlation, which assumes a linear relationship and normal distribution of data, the Spearman correlation evaluates how well the relationship between the two variables can be described using a monotonic function. This test ranks the data points and computes the correlation based on these ranks, making it robust to outliers and suitable for non-linear relationships. In time series analysis, the Spearman correlation test is valuable for identifying and analyzing relationships between time series that may not adhere to linear assumptions, thereby providing a more flexible tool for understanding complex dependencies and interactions in the data.

# Heatmap Spearman Correlation for the data 
df_scaled_no_dat = scaled_df.drop('Date', axis=1)

plt.figure(figsize=(10,8))
matrix = np.triu(df_scaled_no_dat.corr(method='spearman'))
heat_map = sns.heatmap(df_scaled_no_dat.corr(method='spearman'),annot=True,annot_kws={"size":14},cmap= 'YlGnBu',mask=matrix)
heat_map.set_yticklabels(heat_map.get_yticklabels(), rotation=60)
heat_map.set_xticklabels(heat_map.get_xticklabels(), rotation=60)
plt.tick_params(labelsize=12)
plt.title('Heatmap Spearman Correlation')
plt.style.use('fivethirtyeight')
df_scaled_no_dat.corr(method='spearman')
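If you also want a p-value for a specific pair of series, scipy's spearmanr returns both the correlation coefficient and its significance; a minimal sketch for Brent against the sentiment score:

from scipy.stats import spearmanr

# Spearman correlation between Brent and the sentiment score, with its p-value
rho, p_value = spearmanr(df_scaled_no_dat['Close'], df_scaled_no_dat['Csum_CrudeBERT_Plus_GT'])
print(f"Spearman rho: {rho:.3f}, p-value: {p_value:.4g}")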
