Oil Price Forecasting Using Conditional Generative Adversarial Networks (GANs) with Sentiment Analysis

M Alruqimi
5 min read · Jul 3, 2024


Part 2: Data preparation

Data Preparation

We used a real-world dataset for the experiments: daily Brent crude oil prices. The dataset consists of daily observations from January 3, 2012, to April 1, 2021, a period that includes the impact of the COVID-19 pandemic on the oil and stock markets.
To help the model capture volatility, we incorporated a daily cumulative sentiment score (SENT) into the dataset as an external feature. The SENT feature is obtained from the CrudeBERT model described in https://arxiv.org/abs/2305.06140.
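Since SENT is a daily cumulative score, a natural way to build it is to sum the per-headline sentiment scores for each day. Here is a minimal pandas sketch of that idea (the column names and values are hypothetical; the actual pipeline is described in the paper linked above):

import pandas as pd

# Hypothetical per-headline sentiment scores produced by CrudeBERT
headlines = pd.DataFrame({
    'date': ['2012-01-03', '2012-01-03', '2012-01-04'],
    'score': [0.8, -0.2, 0.5],
})

# Sum the scores per day to obtain the cumulative daily sentiment
daily_sent = headlines.groupby('date')['score'].sum().rename('SENT')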
The code and data are available on GitHub:

Let’s start coding:

First, I will import all the libraries that will be used in this series:

from collections import namedtuple
import torch
from torch import nn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from torch.utils.data import TensorDataset, DataLoader

# Use the GPU if available; the tensors created later live on this device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Scalers for data normalisation later
mm = MinMaxScaler()
ss = StandardScaler()

The main tasks in data preparation are: reading the CSV file, normalising the data, creating the sequences (sliding window), and splitting the data into training, validation, and test sets. Additionally, we build the PyTorch DataLoader.

For simplicity, I will define the necessary functions for data preparation separately. However, you can build your own data preparation class if you prefer.

1. Read the CSV file

The CSV file contains five columns: Date (the index column), Price (the Brent price), WTI Price, SENT (the sentiment score), and TENT (a Saudi energy index). I will load only the Brent prices (Price) and the SENT feature.

df = pd.read_csv('oil.csv', index_col='Date', parse_dates=True)

# I will load only two features (columns)
df = df[['Price', 'SENT']]
target_column = 'Price'
df.head(3)

train_size, valid_size, test_size = 2000, 260, 100

So our dataset is now a DataFrame indexed by Date with just two columns, Price and SENT.

2. Data Normalisation

def normalize__my_data_(X, y):
    # Standardise the features and scale the target to [0, 1]
    X_trans = ss.fit_transform(X)
    y_trans = mm.fit_transform(y.reshape(-1, 1))
    return X_trans, y_trans

Normalising data is a crucial step when preparing time series for deep learning models: it speeds up convergence, balances learning across features, and improves numerical stability and overall performance. I used StandardScaler to standardise the features so they have zero mean and unit standard deviation, and MinMaxScaler to scale the target variable to the range between 0 and 1.
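One practical consequence: because the target is scaled with mm, the model's predictions will come out in the [0, 1] range and must be mapped back to price units before evaluation. A minimal sketch, where y_pred stands for a hypothetical array of model predictions:

# Map scaled predictions back to price units
# (y_pred: hypothetical array of shape [n_samples, pred_len])
y_pred_prices = mm.inverse_transform(y_pred.reshape(-1, 1)).reshape(y_pred.shape)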

3. Creating sequences (lags)

def split_sequences(input_sequences, output_sequence, n_steps_in, n_steps_out):
    """
    Splits the input and output sequences into lagged input-output pairs.

    Parameters:
    - input_sequences: The input time series data (X_trans: Price and SENT).
    - output_sequence: The target time series data (y_trans, the target column).
    - n_steps_in: The number of time steps to use as input (lag window size).
    - n_steps_out: The number of time steps to predict (forecast horizon).

    Returns:
    - X: Array of input sequences (shape: [samples, n_steps_in, features]).
    - y: Array of output sequences (shape: [samples, n_steps_out]).
    """
    X, y = list(), list()  # instantiate X and y
    for i in range(len(input_sequences)):
        # find the end of the input and output windows
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out - 1
        # stop once the output window runs past the end of the dataset
        if out_end_ix > len(input_sequences):
            break
        # gather the input and output parts of the pattern
        seq_x = input_sequences[i:end_ix]
        seq_y = output_sequence[end_ix - 1:out_end_ix, -1]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)

The split_sequences function builds a sliding window that captures a fixed number of prior time steps (lags) to use as input features for forecasting models. Suppose you have the time series [10, 20, 30, 40, 50, 60].

If you choose a lag window of 3, the input-output pairs for a one-step-ahead model might look like this:
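X = [10, 20, 30] → y = [40]
X = [20, 30, 40] → y = [50]
X = [30, 40, 50] → y = [60]

(Strictly speaking, in split_sequences above the target slice starts at end_ix - 1, so the first step of each output window coincides with the last step of the input window; the pairs shown here illustrate the general one-step-ahead idea.)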

Now, I will build the data_prep function. It calls sub-functions to normalise the dataset and build the sequences, and then splits the result into training, validation, and test sets.

def split_train_test_pred(X_ss, y_mm, train_test_cutoff, vald_size, predict_size):
    # Training and validation sets are taken from the start of the data
    X_train = X_ss[:train_test_cutoff]
    X_valid = X_ss[train_test_cutoff:train_test_cutoff + vald_size]

    y_train = y_mm[:train_test_cutoff]
    y_valid = y_mm[train_test_cutoff:train_test_cutoff + vald_size]

    # The test set is the last predict_size samples
    X_test = X_ss[-predict_size:]
    y_test = y_mm[-predict_size:]

    data = {"X_train": X_train, "y_train": y_train,
            "X_valid": X_valid, "y_valid": y_valid,
            "X_test": X_test, "y_test": y_test}
    return data

def data_prep(df, target, seq_len, pred_len, train_size, valid_size, test_size):
    """
    Prepares the dataset for time series forecasting by normalising the data,
    building sequences, and splitting it into training, validation, and test sets.

    Parameters:
    - df: DataFrame containing the time series data.
    - target: The target column name to predict.
    - seq_len: The length of the input sequences (lag window size).
    - pred_len: The length of the output sequences (forecast horizon).
    - train_size: The number of samples to include in the training set.
    - valid_size: The number of samples to include in the validation set.
    - test_size: The number of samples to include in the test set.

    Returns:
    - data: A dictionary containing the training, validation, and test sets.
    """
    X, y = df, df[target].values

    # Normalisation
    X_trans, y_trans = normalize__my_data_(X, y)

    # Build the sequences
    X_ss, y_mm = split_sequences(X_trans, y_trans, seq_len, pred_len)

    # Split into train, validation, and test sets
    data = split_train_test_pred(X_ss, y_mm, train_size, valid_size, test_size)
    return data
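With everything defined, the dataset can now be prepared with a single call. The window sizes below are illustrative choices for this sketch, not values fixed by the article:

seq_len, pred_len = 10, 5  # illustrative lag window and forecast horizon
data = data_prep(df, target_column, seq_len, pred_len,
                 train_size, valid_size, test_size)
print(data['X_train'].shape)  # expected: (2000, 10, 2) -> samples, lags, features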

4. Create PyTorch tensors

x_train = torch.tensor(data['X_train'], device=device, dtype=torch.float32)
y_train = torch.tensor(data['y_train'], device=device, dtype=torch.float32)
x_val = torch.tensor(data['X_valid'], device=device, dtype=torch.float32)
y_val = data['y_valid']  # kept as a NumPy array for evaluation later
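Finally, since TensorDataset and DataLoader were imported above, here is a minimal sketch of how training batches could be built from these tensors (the batch size is an illustrative choice):

# Wrap the training tensors so they can be iterated in mini-batches
train_dataset = TensorDataset(x_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # each batch yields an (x, y) pair of windows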

Now, we are ready to build and train our model.
Go to Part 3: Build and train the model

Part 1: Introduction
Part 3: Build and train the model
Brent Oil price: exploratory data analysis (EDA)

The full code and dataset (GitHub)
