Crash Course in Time Series Forecasting: Analysing Cryptocurrency data

Harshit Mittal · Published in AI Skunks · 18 min read · Apr 10, 2023

Imagine you have a magic crystal ball that can tell you the future. Now, let’s say you want to use this crystal ball to predict the stock prices of your favourite company. You might be thinking, “how can a crystal ball predict stock prices?” Well, that’s where time-series forecasting comes in. Time-series forecasting is like having a crystal ball for data.

In simple terms, time-series forecasting is the art of predicting future values based on patterns and trends from past data. It’s like looking at the historical data of a stock price and using that information to predict what the stock price will be in the future.

Just like we might look for patterns in the stock market to predict future prices, time-series forecasting algorithms use mathematical models to identify trends, seasonality, and other patterns in data to make predictions. These models can be as simple as a moving average or as complex as a neural network.

There are many factors that can affect the accuracy of time-series forecasts. For example, unexpected events like a pandemic can disrupt patterns and trends in data, making it difficult to make accurate predictions. That’s why it’s important to constantly monitor and adjust the forecasts as new data becomes available.

Table of contents:

  • Why time-series forecasting techniques?
  • Let’s begin
  • Introducing the dataset
  • Handling missing data
  • Exploring time series data properties
  • Visualizing the Time-Series Data
  • Predictive modelling for time-series data
  • Conclusion
  • References

Why time-series forecasting techniques?

Perhaps you’re wondering about the distinction between time-series forecasting and algorithmic predictions utilizing machine learning. Although machine learning techniques like random forest, gradient boosting regressor, and time delay neural networks can be leveraged to extrapolate time-series data, there are other alternatives available that may be more suitable. As we’ll soon discover in this article, the ability to extrapolate patterns outside the training data domain is a crucial characteristic of a time-series algorithm, which most machine learning methods lack by default. Luckily, there are specialized time-series forecasting approaches that can fill this gap.

Let’s begin!

Let’s look at some crucial concepts which are needed for analyzing and forecasting time-series.

Stationarity:
The concept of stationarity is crucial in time-series forecasting. A time series is said to be stationary if its statistical properties like mean, variance, and autocorrelation do not change over time. Stationarity is important because many time-series forecasting models assume that the data is stationary. If the data is not stationary, it may need to be transformed before it can be used in these models.
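
As a quick illustration, one common way to check stationarity in Python is the Augmented Dickey-Fuller (ADF) test from statsmodels. The snippet below is a minimal sketch; close_prices is a hypothetical pandas Series of closing prices, not part of our dataset yet.

from statsmodels.tsa.stattools import adfuller

# ADF test: the null hypothesis is that the series is non-stationary
# (`close_prices` is a hypothetical pandas Series of closing prices)
result = adfuller(close_prices.dropna())
print(f"ADF statistic: {result[0]:.4f}")
print(f"p-value: {result[1]:.4f}")

# A p-value below 0.05 suggests the series is stationary; otherwise a
# transformation such as differencing (close_prices.diff()) is often applied.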

Autocorrelation:
Autocorrelation is the correlation between a time series and a delayed version of itself. It’s an important concept in time-series forecasting because it can help identify patterns and trends in the data. For example, if there is a high positive autocorrelation at a lag of 12 months, it might indicate a yearly pattern in the data.
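
For instance, pandas can measure the autocorrelation of a series at a chosen lag directly; the sketch below is only illustrative and assumes a hypothetical monthly pandas Series called monthly_values.

import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

# Autocorrelation at a 12-month lag (`monthly_values` is a hypothetical monthly Series)
lag_12_corr = monthly_values.autocorr(lag=12)
print(f"Autocorrelation at lag 12: {lag_12_corr:.3f}")

# Plot autocorrelation across all lags to spot periodic patterns
autocorrelation_plot(monthly_values)
plt.show()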

Seasonality:
Seasonality refers to patterns that repeat themselves at regular intervals, such as weekly, monthly, or yearly. Seasonality is important in time-series forecasting because it can help identify trends and make accurate predictions. For example, if a time series exhibits weekly seasonality, a forecasting model can be trained to predict the values for each day of the week.

Trend:
Trend refers to the long-term movement of a time series. It can be upward, downward, or flat. Trend is important in time-series forecasting because it can help identify the direction of future values. For example, if a time series has an upward trend, it’s likely that future values will also be higher.
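
A simple way to make the trend visible is to smooth the series with a rolling mean. This is only a sketch; close_prices is again a hypothetical time-indexed pandas Series.

import matplotlib.pyplot as plt

# Estimate the trend with a 30-period rolling mean (`close_prices` is hypothetical)
trend = close_prices.rolling(window=30).mean()
plt.plot(close_prices, alpha=0.5, label="Original series")
plt.plot(trend, color="red", label="30-period rolling mean (trend)")
plt.legend()
plt.show()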

Forecasting models:
There are several forecasting models that can be used for time-series analysis, including exponential smoothing, autoregressive integrated moving average (ARIMA), and machine learning models such as neural networks. Each model has its own strengths and weaknesses and may be more appropriate for certain types of data.
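
To make this concrete, here is a minimal sketch of fitting an ARIMA model with statsmodels and producing a short forecast; series is a hypothetical pandas Series and the (1, 1, 1) order is purely illustrative, not a recommended setting.

from statsmodels.tsa.arima.model import ARIMA

# Fit a simple ARIMA(p=1, d=1, q=1) model on a hypothetical Series `series`
arima_model = ARIMA(series, order=(1, 1, 1))
arima_fit = arima_model.fit()
print(arima_fit.summary())

# Forecast the next 10 time steps
print(arima_fit.forecast(steps=10))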

Error Metrics:
Error metrics are used to evaluate the performance of time-series forecasting models. These metrics include mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). Understanding these metrics is important because they help you determine which forecasting model is the most accurate for your data.
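
As an illustration, all four metrics can be computed in a few lines of numpy; y_true and y_pred below are small made-up arrays of actual and forecasted values.

import numpy as np

# Made-up actual vs. forecasted values, purely for illustration
y_true = np.array([100.0, 102.0, 101.0, 105.0])
y_pred = np.array([99.0, 103.0, 100.0, 107.0])

mae = np.mean(np.abs(y_true - y_pred))                    # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)                     # Mean Squared Error
rmse = np.sqrt(mse)                                       # Root Mean Squared Error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # Mean Absolute Percentage Error

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, MAPE={mape:.2f}%")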

Forecasting Horizon:
The forecasting horizon is the length of time into the future for which you want to make predictions. The forecasting horizon is important because it can affect the accuracy of your forecasts. Short-term forecasts are generally more accurate than long-term forecasts.

Introducing the dataset

To gain a deeper understanding of the concepts of time-series forecasting, we will apply essential principles to the “G-Research Crypto Forecasting” dataset (https://www.kaggle.com/competitions/g-research-crypto-forecasting/data), which is readily available on Kaggle. This will provide a practical and tangible example of the techniques involved.

Importing required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import time
from datetime import datetime
import xgboost as xgb
from xgboost.callback import EarlyStopping
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor

Dataset description:

This dataset is a valuable resource for anyone looking to predict the future returns of crypto-assets such as Bitcoin and Ethereum. It consists of historic trades and provides a range of features such as timestamp, asset ID, count of trades, opening, closing, high and low prices, volume, and VWAP. The primary goal is to predict the 15-minute residualized returns using time series forecasting techniques.

Files

  • train.csv - The training set

timestamp - A timestamp for the minute covered by the row.
Asset_ID - An ID code for the crypto-asset.
Count - The number of trades that took place this minute.
Open - The USD price at the beginning of the minute.
High - The highest USD price during the minute.
Low - The lowest USD price during the minute.
Close - The USD price at the end of the minute.
Volume - The number of cryptoasset units traded during the minute.
VWAP - The volume weighted average price for the minute.
Target - 15 minute residualized returns. See the competition's 'Prediction and Evaluation' description for details of how the target is calculated.

  • example_test.csv - An example of the data that will be delivered by the time series API.
  • example_sample_submission.csv - An example of the data that will be delivered by the time series API. The data is just copied from train.csv.
  • asset_details.csv - Provides the real name of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.
  • gresearch_crypto - An unoptimized version of the time series API files for offline work. You may need Python 3.7 and a Linux environment to run it without errors.
  • supplemental_train.csv - After the submission period is over this file's data will be replaced with cryptoasset prices from the entire submission period.
df = pd.read_csv("C:\\Users\\hmitt\\Downloads\\g-research-crypto-forecasting\\train.csv")
df
asset_details = pd.read_csv("C:\\Users\\hmitt\\Downloads\\g-research-crypto-forecasting\\asset_details.csv")
asset_details

Our dataset contains 14 different cryptocurrencies, which we will analyse moving forward.

Handling missing data

Handling and managing missing data in a time-series dataset is a critical step in ensuring accurate and reliable analysis. Here are some techniques that can be used to handle and manage missing data:

Imputation: This involves filling in the missing values with estimated values based on the patterns observed in the available data. There are several methods for imputation such as mean imputation, interpolation, and regression imputation.

Dropping missing values: In some cases, if the amount of missing data is relatively small, it may be reasonable to remove the rows with missing data. However, this approach should be used with caution as it can result in a loss of valuable information.

Using machine learning algorithms: Another approach to handling missing data is to use machine learning algorithms that can handle missing data, such as XGBoost and LightGBM.

Handling missing data in the model: Some models can handle missing data, such as the Kalman filter, which can estimate missing values in real-time.

In general, imputation is a common and effective method for handling missing data, but it can introduce bias and should be used with caution.
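
For reference, here is a minimal sketch of the imputation approaches mentioned above, applied to the Close column of a pandas DataFrame like our df; which method is appropriate depends on the data.

# Three simple imputation options for a numeric column (sketch only)
mean_filled = df["Close"].fillna(df["Close"].mean())      # mean imputation
interpolated = df["Close"].interpolate(method="linear")   # linear interpolation
forward_filled = df["Close"].ffill()                      # carry the last observation forward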

There are ~75k null values in our dataset, which is a very small portion of the total data (approximately 3%), so we can proceed to remove the rows containing null values.
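
A minimal sketch of checking and dropping the missing rows (assuming the training data is loaded as df, as above):

# Count missing values per column, then drop any row containing a null
print(df.isnull().sum())
print("Rows before dropping:", len(df))
df = df.dropna()
print("Rows after dropping:", len(df))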

Exploring time series data properties

Analyzing time-series data can be challenging due to its unique characteristics such as seasonality, trend, and autocorrelation. Exploratory data analysis (EDA) is a crucial first step in understanding time-series data and making informed decisions.

Examining Seasonality:

Seasonality refers to the repeating patterns in the data that occur over regular intervals. For example, retail sales data may have higher values during the holiday season. Examining seasonality in time-series data can help us understand the cyclical nature of the data and identify trends that depend on the time of year. Let's observe what a seasonal pattern looks like.

def identify_pattern(time_season):
    return np.where(time_season < 0.45,
                    np.cos(time_season * 2 * np.pi),
                    1 / np.exp(3 * time_season))

def identify_seasonality(time, period, phase, amplitude):
    time_season = ((time + phase) % period) / period
    return amplitude * identify_pattern(time_season)

time = np.arange(5 * 365 + 1)

def plot_data(time, series, format="-", start=0, end=None, label=None, color=None):
    plt.plot(time[start:end], series[start:end], format, color=color)
    plt.ylabel("Value")
    plt.xlabel("Time")
    if label:
        plt.legend(fontsize=20)
    plt.grid(True)

series = identify_seasonality(time, period=365, phase=0, amplitude=30)
plt.figure(figsize=(20, 6))
plt.title("#1 Seasonality Plot", fontdict={'fontsize': 22})
plot_data(time, series, color="red")
plt.show()

series = identify_seasonality(time, period=90, phase=25, amplitude=80)
plt.figure(figsize=(20, 6))
plt.title("#2 Seasonality Plot", fontdict={'fontsize': 22})
plot_data(time, series, color="red")
plt.show()

The two graphs above depict what seasonal data looks like. If we observe similar characteristics in our dataset, we can say that our data is seasonal.

Analyzing Autocorrelation:

Time-series data often exhibits autocorrelation due to the temporal dependency of the data. Analyzing autocorrelation can help identify patterns in the data, such as periodicity, and can be used to build models that capture the time-series behavior. Let's understand this with the help of an illustration.

def identify_1_autocorrelation(time, amplitude, seed=None):
    rand_var = np.random.RandomState(seed)
    rnd_array = rand_var.randn(len(time) + 50)
    a1 = 0.5
    a2 = -0.1
    rnd_array[:50] = 100
    for step in range(50, len(time) + 50):
        rnd_array[step] += a1 * rnd_array[step - 50]
        rnd_array[step] += a2 * rnd_array[step - 33]
    return rnd_array[50:] * amplitude

def identify_2_autocorrelation(time, amplitude, seed=None):
    rand_var = np.random.RandomState(seed)
    rnd_array = rand_var.randn(len(time) + 1)
    a1 = 0.8
    for step in range(1, len(time) + 1):
        rnd_array[step] += a1 * rnd_array[step - 1]
    return rnd_array[1:] * amplitude

amplitude = 1  # scale factor for the synthetic series (not defined in the original snippet)

my_series_auto = identify_1_autocorrelation(time, amplitude, seed=20)
plt.figure(figsize=(20, 6))
plt.title("#1 Autocorrelation Plot", fontdict={'fontsize': 22})
plot_data(time[:200], my_series_auto[:200], color="red")
plt.show()

my_series_auto = identify_2_autocorrelation(time, amplitude, seed=30)
plt.figure(figsize=(20, 6))
plt.title("#2 Autocorrelation Plot", fontdict={'fontsize': 22})
plot_data(time[:200], my_series_auto[:200], color="red")
plt.show()

This is how we can identify autocorrelation in a dataset. Please take note of the periodicity and repetitiveness in the pattern. If our dataset exhibits a similar pattern, we can state that the features are autocorrelated.

Identifying Correlation:

Correlation is often used to identify relationships between different time series and can help in predicting future trends. However, it’s important to note that correlation does not necessarily imply causation, and the relationship between two time series may be spurious.

There are several methods for measuring correlation in time-series data, including:

Pearson correlation coefficient: This measures the linear relationship between two time series. It assumes that the relationship between the two variables is linear and that the data is normally distributed.

Spearman rank correlation coefficient: This measures the relationship between two time series based on their ranks rather than their actual values. It’s less sensitive to outliers than Pearson correlation and can capture non-linear relationships.

Cross-correlation: This measures the similarity between two time series at different lags. It can help identify if there is a lagged relationship between two variables.
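
Before building the correlation heatmap on the real data below, here is a minimal sketch of the three measures applied to two hypothetical, index-aligned pandas Series, series_a and series_b.

# Pearson and Spearman correlation between two hypothetical price series
pearson_corr = series_a.corr(series_b, method="pearson")
spearman_corr = series_a.corr(series_b, method="spearman")

# Cross-correlation: correlate series_a with series_b shifted by `lag` steps
lag = 15
cross_corr = series_a.corr(series_b.shift(lag))

print(f"Pearson: {pearson_corr:.3f}, Spearman: {spearman_corr:.3f}, "
      f"cross-correlation at lag {lag}: {cross_corr:.3f}")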

asset_names_dict = {row["Asset_Name"]: row["Asset_ID"] for ind, row in asset_details.iterrows()}

names_of_assets = [
    'Bitcoin',
    'Ethereum',
    'Cardano',
    'Binance Coin',
    'Dogecoin',
    'Bitcoin Cash',
    'Litecoin',
    'Ethereum Classic',
    'Stellar',
    'TRON',
    'Monero',
    'EOS.IO',
    'IOTA',
    'Maker'
]

# auxiliary function, from a dd/mm/yyyy string to a unix timestamp
totimestamp = lambda s: np.int32(time.mktime(datetime.strptime(s, "%d/%m/%Y").timetuple()))

my_assets_combined = pd.DataFrame([])
for ind, coin in enumerate(names_of_assets):
    coin_cryptocurrency_df = df[df["Asset_ID"] == asset_names_dict[coin]].set_index("timestamp")
    # fill missing minutes by forward-filling
    coin_cryptocurrency_df = coin_cryptocurrency_df.reindex(
        range(coin_cryptocurrency_df.index[0], coin_cryptocurrency_df.index[-1] + 60, 60), method='pad')
    # restrict every asset to the same date range
    coin_cryptocurrency_df = coin_cryptocurrency_df.loc[totimestamp('01/07/2021'):totimestamp('21/09/2021')]
    close_values = coin_cryptocurrency_df["Close"].fillna(0)
    close_values.name = coin
    my_assets_combined = my_assets_combined.join(close_values, how="outer")

correlation_matrix = my_assets_combined.corr()
fig, ax = plt.subplots(figsize=(14, 14))
sns.heatmap(correlation_matrix, vmax=1., cmap='coolwarm', square=True)
plt.title("Cryptocurrency correlation map on actual price values", fontsize=18)
plt.show()

Inference: The correlation between various assets is evident, with even the lowest correlation being in the 0.4x range. Bitcoin and Ethereum, Cardano and Binance Coin, and Bitcoin Cash and Litecoin exhibit a strong correlation.

Visualizing the Time-Series Data

Time-series data can be plotted in various ways, such as line plots, scatter plots, and histograms. These plots can help identify the overall behavior of the data, such as whether it is stationary or non-stationary, and whether it has any outliers or gaps.

  • A candlestick graph is a type of financial chart used to represent the price movement of an asset. It displays four pieces of information: the opening price, the closing price, the highest price, and the lowest price for a specific time period. Each candlestick represents a single time period, such as a day or an hour, and the color of the candlestick can indicate whether the price increased (green or white candle) or decreased (red or black candle) during that period.
def cryptocurrency_dataframe(asset_id, data=df):
    asset_df = data[data["Asset_ID"] == asset_id].reset_index(drop=True)
    asset_df['timestamp'] = pd.to_datetime(asset_df['timestamp'], unit='s')
    asset_df = asset_df.set_index('timestamp')
    return asset_df

btc = cryptocurrency_dataframe(asset_id=1)
eth = cryptocurrency_dataframe(asset_id=6)
ltc = cryptocurrency_dataframe(asset_id=9)

def crypto_candlestick(df, title):
    crypto_candle = go.Figure(data=[go.Candlestick(x=df.index,
                                                   open=df['Open'],
                                                   high=df['High'],
                                                   low=df['Low'],
                                                   close=df['Close'])])
    crypto_candle.update_xaxes(title_text='Time',
                               rangeslider_visible=True)

    crypto_candle.update_layout(
        title={
            'text': '{:} Candlestick Chart'.format(title),
            'xanchor': 'center',
            'yanchor': 'top',
            'y': 0.90,
            'x': 0.5
        },
        template="plotly_white")

    crypto_candle.update_yaxes(title_text='Price in USD', ticksuffix='$')
    return crypto_candle

bitcoin_candle = crypto_candlestick(btc[-100:], title="Bitcoin (BTC)")
bitcoin_candle.show()
litecoin_candle = crypto_candlestick(ltc[-2500:], title="Litecoin (LTC)")
litecoin_candle.show()

Inference: From 22:40 to 22:50, there was a sequence of red candles for BTC indicating a downward trend, which could suggest overselling. As per the theoretical assumption, the price may increase after an overselling period. Overselling implies that the price is trading below the typical or average range, which is sometimes referred to as the “true value” by investors. Typically, buyers are drawn to the market during an oversold period.

  • An area plot displays the values of a variable over time, with the area between the x-axis and the plotted line filled with color; stacking several series gives a cumulative view. This type of graph is often used to show changes in a variable over time, such as sales or revenue. Area plots can be helpful for identifying patterns in data, such as seasonal trends or changes in growth rates.
def traded_volume(data, title, color):
    area_traded = px.area(data_frame=data,
                          markers=True,
                          x=data.index,
                          y="Volume")
    area_traded.update_traces(line_color=color)
    area_traded.update_yaxes(title_text='Volume traded per minute')
    area_traded.update_xaxes(
        title_text='Time',
        rangeslider_visible=True)
    area_traded.update_layout(showlegend=True,
                              title={
                                  'text': '{:} Volume Traded'.format(title),
                                  'xanchor': 'center',
                                  'yanchor': 'top',
                                  'y': 0.94,
                                  'x': 0.5,
                              },
                              template="plotly_white")
    return area_traded

traded_volume(eth[-50:], "Ethereum (ETH)", color="Red")

Inference: The area plot of Ethereum volume traded over the last 50 minutes of data does not reveal a clear trend. The graph fluctuates frequently, with the volume increasing and decreasing at different intervals.

Let’s take a look at the training data visualization to observe the changes in the coin prices over time.

cmap = sns.color_palette()

# plot close values as time series for all the assets
f = plt.figure(figsize=(15, 30))

for ind, crypto in enumerate(names_of_assets):
    crypto_df = df[df["Asset_ID"] == asset_names_dict[crypto]].set_index("timestamp")
    # fill missing values
    crypto_df = crypto_df.reindex(range(crypto_df.index[0], crypto_df.index[-1] + 60, 60), method='pad')
    ax = f.add_subplot(7, 2, ind + 1)
    plt.plot(crypto_df['Close'], color=cmap[ind % 10], label=crypto)
    plt.xlabel('Time')
    plt.ylabel(crypto)
    plt.legend()
    plt.title(crypto)

plt.tight_layout()
plt.show()

Inference: There appears to be a significant correlation among the various cryptocurrency assets, and it varies across different timeframes. The dataset contains bear-market data from 2018, bull-market data from 2021, and sideways data from 2019 and 2020, allowing us to model different market conditions. Moreover, different cryptocurrencies have varying timelines in the graph above. Therefore, we will narrow our focus to the data after July 1, 2021, and analyze the price movements of the various assets.

# create intervals (totimestamp, defined earlier, converts a dd/mm/yyyy string to a unix timestamp)
f = plt.figure(figsize=(15, 30))

for ind, crypto in enumerate(names_of_assets):
    crypto_df = df[df["Asset_ID"] == asset_names_dict[crypto]].set_index("timestamp")
    # fill missing values
    crypto_df = crypto_df.reindex(range(crypto_df.index[0], crypto_df.index[-1] + 60, 60), method='pad')
    crypto_df = crypto_df.loc[totimestamp('01/07/2021'):totimestamp('21/09/2021')]
    ax = f.add_subplot(7, 2, ind + 1)
    plt.plot(crypto_df['Close'], color=cmap[ind % 10], label=crypto)
    plt.xlabel('Time')
    plt.ylabel(crypto)
    plt.legend()
    plt.title(crypto)

plt.tight_layout()
plt.show()

Inference: The graphs clearly depict the correlation among various coins when plotted over the same time range. Similarity in the trends of different currencies reflects the trend of the cryptocurrency market as a whole.

Predictive modelling for time-series data

Time-series forecasting is a popular machine learning problem that involves predicting future values of a time-dependent variable based on its historical values. Several machine learning models are commonly used in this context. In the rest of this article, we will focus on two popular gradient-boosting models for time-series forecasting, XGBoost and LightGBM: we will look at the pros and cons of each and train both on our cryptocurrency dataset.

XGBoost (Extreme Gradient Boosting):

XGBoost is an ensemble model that combines multiple decision trees to make predictions. The model works by sequentially adding decision trees, each of which is trained to correct the errors made by the previous trees.

Pros:

  • XGBoost models can capture complex patterns in the time series data, including trend, seasonality, and cyclical behavior.
  • XGBoost models can handle both univariate and multivariate time series data.
  • XGBoost models are known for their accuracy and performance in a wide range of applications.

Cons:

  • XGBoost models can be computationally intensive and may require significant computational resources.
  • XGBoost models may not be as interpretable as other models, such as ARIMA.
  • XGBoost models may overfit the data if not properly regularized.

Example: XGBoost can be used for predicting energy consumption, as energy consumption often exhibits complex patterns that can be captured by XGBoost models.

Let’s understand in detail by training the XGBoost model on our dataset and using the same for cryptocurrency value prediction.

# Two new features
# Calculates the upper shadow of a financial asset.
def calculate_up_shadow(dataframe):
    return dataframe['High'] - np.maximum(dataframe['Close'], dataframe['Open'])

# Calculates the lower shadow of a financial asset.
def calculate_low_shadow(dataframe):
    return np.minimum(dataframe['Close'], dataframe['Open']) - dataframe['Low']

# A utility function to build features from the original dataframe. It works on single rows too, so we can reuse it.
# Extracts features from a pandas dataframe containing the data of a financial asset.
def get_features(dataframe):
    dataframe_feat = dataframe[['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP']].copy()
    dataframe_feat['calculate_up_shadow'] = calculate_up_shadow(dataframe_feat)
    dataframe_feat['calculate_low_shadow'] = calculate_low_shadow(dataframe_feat)
    return dataframe_feat

df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')
dataframe_train = df[df['datetime'] < '2021-06-13 00:00:00']
dataframe_test = df[df['datetime'] >= '2021-06-13 00:00:00']

print("Number of samples in dataframe_train: ", len(dataframe_train))
print("Number of samples in dataframe_test: ", len(dataframe_test))
# Calculates the logarithmic return of a series over a given period.
def log_return(series, periods=1):
    return np.log(series).diff(periods=periods)

# (calculate_up_shadow and calculate_low_shadow are reused from above)

def fill_nan_inf(dataframe):
    dataframe = dataframe.fillna(0)
    dataframe = dataframe.replace([np.inf, -np.inf], 0)

    return dataframe

# Features
my_features = ["Count", "Open", "High", "Low", "Close", "Volume", "VWAP"]

# Creates a feature matrix from a pandas dataframe containing the data of a financial asset.
def create_features(dataframe, label=False):
    # Build features
    up_shadow = calculate_up_shadow(dataframe)
    low_shadow = calculate_low_shadow(dataframe)
    five_minute_log = log_return(dataframe.VWAP, periods=5)
    absolute_one_minute_log = log_return(dataframe.VWAP, periods=1).abs()
    features = dataframe[my_features]

    # Concat all the features into one dataframe
    X = pd.concat([features, up_shadow, low_shadow, five_minute_log, absolute_one_minute_log], axis=1)

    # Rename feature columns
    X.columns = my_features + ["up_shadow", "low_shadow", "five_minute_log", "absolute_one_minute_log"]

    # Fill NaN and Inf
    X = fill_nan_inf(X)

    if label:
        y = dataframe.Target
        y = fill_nan_inf(y)
        return X, y

    return X

# Get single crypto trading data (Bitcoin, Asset_ID = 1) and index it by timestamp
train_bitcoin = dataframe_train[dataframe_train.Asset_ID == 1].set_index("timestamp")
test_bitcoin = dataframe_test[dataframe_test.Asset_ID == 1].set_index("timestamp")

# Fill missing minutes
train_bitcoin = train_bitcoin.reindex(range(train_bitcoin.index[0], train_bitcoin.index[-1] + 60, 60), method='pad')
test_bitcoin = test_bitcoin.reindex(range(test_bitcoin.index[0], test_bitcoin.index[-1] + 60, 60), method='pad')

# Create features
X_train, y_train = create_features(train_bitcoin, label=True)
X_test, y_test = create_features(test_bitcoin, label=True)

D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

xgb_params = {
    "learning_rate": 0.05,
    "subsample": 0.9,
    "max_depth": 11,
    "missing": -999,
    "colsample_bytree": 0.7,
    "random_state": 2023,
}

# Model training (the number of boosting rounds is set via num_boost_round)
watchlist = [(D_train, "train")]
model = xgb.train(params=xgb_params, dtrain=D_train, num_boost_round=80, evals=watchlist, verbose_eval=10)
predictions = model.predict(D_test)

# RMSE Computation
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("RMSE: %f" % rmse)

LightGBM:

LightGBM, short for Light Gradient Boosting Machine, is a gradient boosting framework developed by Microsoft. It is a type of decision tree ensemble model that is designed to handle large datasets and high-dimensional features. It uses a leaf-wise approach to grow the decision trees, which can result in better accuracy and faster training times than other gradient boosting algorithms.

Pros:

  • Efficient and fast training, especially on large datasets.
  • Can handle both numerical and categorical features.
  • Supports GPU acceleration for even faster training times.
  • Can handle missing values and outliers.
  • Can be used for both regression and classification tasks.

Cons:

  • Prone to overfitting if not properly tuned.
  • Requires careful hyperparameter tuning for optimal performance.
  • May not perform as well as other models on smaller datasets.
  • Can be sensitive to noise and outliers in the data.

Example: A common use case for LightGBM in time-series forecasting is predicting stock prices. For example, a financial institution might use LightGBM to forecast the price of a particular stock based on historical price data, as well as other relevant features such as news articles, economic indicators, and social media sentiment.

Let's understand in detail by training the LightGBM model on our dataset and using the same for cryptocurrency value prediction.

# Train an LGBMRegressor model with specified hyperparameters
def training_model(X, y):
    # Initialize the LightGBM model
    lgmb_model = LGBMRegressor(n_estimators=5000, learning_rate=0.1, num_leaves=500)
    # Fit the model on the input X and y data
    lgmb_model.fit(X, y)

    # Return the trained model
    return lgmb_model

# Function to extract features from input data
def features_extract(data):
    # Feature extraction for data passed as a DataFrame or a single row

    features_dataframe = data[['Count', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP']].copy()
    features_dataframe['Upper_Shadow'] = features_dataframe['High'] - np.maximum(features_dataframe['Close'], features_dataframe['Open'])
    features_dataframe['Lower_Shadow'] = np.minimum(features_dataframe['Close'], features_dataframe['Open']) - features_dataframe['Low']

    features_dataframe['high2low'] = features_dataframe['High'] / features_dataframe['Low']
    features_dataframe['volume2count'] = features_dataframe['Volume'] / (features_dataframe['Count'] + 1)

    return features_dataframe

# Function to get X and y data for a given asset ID
def get_data_for_asset(dataframe, asset_id):
    # Get X and y

    dataframe = dataframe[dataframe["Asset_ID"] == asset_id]
    df_proc = features_extract(dataframe)
    df_proc['y'] = dataframe['Target']
    df_proc = df_proc.dropna(how="any")
    y = df_proc["y"]
    X = df_proc.drop("y", axis=1)

    return X, y

%%time
models = {}
Xs = {}
ys = {}

for my_asset_id, asset_name in zip(asset_details['Asset_ID'], asset_details['Asset_Name']):
    print(f"Training model for {asset_name:<16} (ID={my_asset_id:<2})")
    X, y = get_data_for_asset(df, my_asset_id)
    lgmb_model = training_model(X, y)
    Xs[my_asset_id], ys[my_asset_id], models[my_asset_id] = X, y, lgmb_model

# Check the model and its ability to make a prediction
print("Check the model and its ability to make a prediction")

x = features_extract(df.iloc[1])
y_pred = models[0].predict([x])
y_pred[0]

Conclusion

In summary, time-series forecasting is a dynamic field that requires a combination of technical skills and creative problem solving. Here are some key takeaways to keep in mind:

  • When working with time-series data, it’s important to understand the underlying patterns and trends, such as seasonality and cyclical patterns.
  • ARIMA models are a great starting point for time-series forecasting, especially when working with stationary data. They are simple to implement and interpret, and can be a good baseline for more complex models.
  • XGBoost and LightGBM are powerful machine learning algorithms that are highly effective at handling large and complex datasets with non-linear relationships. They are great options when working with non-stationary data or when seeking higher accuracy in predictions.
  • It’s important to carefully consider missing data and how to handle it, as it can have a significant impact on the accuracy of your models.
  • It’s important to select the appropriate evaluation metrics for your specific use case, and to understand the strengths and limitations of each metric.

Overall, time-series forecasting is a fascinating field with endless possibilities. By leveraging the right models and techniques, we can gain insights and make informed decisions that can drive real-world impact.

References

  • G-Research Crypto Forecasting dataset, Kaggle: https://www.kaggle.com/competitions/g-research-crypto-forecasting/data
