Forecasting Crypto Portfolios Like a Quant

How to use SARIMAX models to forecast your portfolio

ShengKai Chen
The Power of AI
10 min read · Jan 9, 2023


Image source: PxHere

Interested in building this ML model in a well-prepared programming environment? Click here to build this model step-by-step with the CognitiveClass.ai Guided Project.

At least $1 billion of client funds went missing at the failed crypto firm FTX, and its FTX Token lost most of its value in November 2022. How would you prevent your portfolio from suffering massive losses in a black swan event?

This guided project will help you understand how to clean data and how big financial companies create popular indexes like the S&P 500 or the Nasdaq. Most importantly, it shows how to create your own portfolio index containing diverse cryptocurrencies, track its performance, and use machine learning to forecast the index's movement in the near future.

The purpose is to help beginners who know a bit about time series, but have a hard time processing real-world datasets, quickly fill that gap with this guided project. I hope everyone finds something useful here and enjoys it.

Loading Datasets

First, let us get each cryptocurrency's closing prices since 2010 from Coincodex, including Bitcoin, Binance Coin, Dogecoin, Ethereum, USD Coin, Tether, XRP, and the FTX Token.

Note: All prices in the data are based on the exchange rate between the target currency and USD.

import pandas as pd

coins = pd.read_csv("/cryptocurrency/coins.csv", sep=",", header=0, parse_dates=["Date"]) # parse the Date column as datetime
coins.set_index("Date", inplace=True) # make the date become the index
coins = coins.sort_index()

coins

Then we need to find some external factors that might affect the price of the cryptocurrencies. Let us get some common market indices, like the S&P 500, Nasdaq, gold, and silver, starting in 2018. We can also add popular economic indicators like the Daily Treasury Par Yield Curve Rates from the U.S. Department of the Treasury, or the CPI and PSR, but be aware that the CPI and PSR are published on a monthly basis.
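
If you do add monthly indicators such as the CPI, you will need to upsample them to a daily frequency before joining them with the daily price data. Here is a minimal sketch, assuming a hypothetical cpi.csv file with a monthly Date column and a CPI value column:

cpi = pd.read_csv("/cryptocurrency/cpi.csv", sep=",", header=0, parse_dates=["Date"]) # hypothetical monthly CPI file
cpi.set_index("Date", inplace=True)
cpi_daily = cpi.resample("D").ffill() # carry each month's value forward until the next release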

factors = pd.read_csv("/cryptocurrency/predictor_variables.csv", sep=",", header=0, parse_dates=["Date"]) # parse the Date column as datetime
factors.set_index("Date", inplace=True) # make the date become the index
factors = factors.sort_index()

factors

Data Cleaning

When we receive time series data, it is rarely in the required format. Most raw data is either poorly organized or contains many missing values or dates, which makes it impossible for us to train models on it. Therefore, knowing how to properly clean and prepare the data is one of the essential skills for a data scientist.

Cleaning Missing Values

First, let us remove the rows with null values from the crypto data.

coins = coins.dropna() # drop the rows that contain missing values

coins

Filling Missing Values

We cannot use the same method on the external factors since their data comes in different time sequences. Thus, we need to find and set the first date that matches all the cryptos. From the previous result, we know we can set our start date as 2019/08/01. To turn this data into useful information, we will assume that most of the missing days are weekends or holidays, and that the values remain the same as the day before. Based on this assumption, we can fill the missing dates with the previous day's values (a forward fill).

factors = factors["2019-08-01":]
factors = factors.reindex(pd.date_range("2019-08-01", "2022-11-15")).reset_index().rename(columns={"index": "Date"})
factors = factors.groupby(factors["Date"].dt.time).ffill() # fill the missing date values
factors.set_index("Date", inplace=True)

factors

Creating a Customized Index

In this project, we want to predict the value of the index at some time t. We will use a straightforward index method called the “Equal-Weighted Index,” which is just the average of the cryptos.

Another method is the “Capitalization-Weighted Index,” which lets you customize the weight of each crypto based on your portfolio. If you are eager to use this technique on your portfolio, you can find the detailed explanation on CognitiveClass.ai.
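
For reference, here is a minimal sketch of the idea, assuming a hypothetical market_caps dataframe that holds each coin's daily market capitalization on the same dates as coins:

total_cap = market_caps.sum(axis=1) # total market value of the basket on each day
cw_index = total_cap / total_cap.iloc[0] # rescale so the index starts at a base value of 1 on the first day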

Equal weighting is a proportional measuring method that gives the same importance to each crypto in a portfolio, index, or index fund. The smallest crypto is given the same statistical significance, or weight, as the largest crypto when evaluating the overall group's performance. The following equation helps us achieve that goal.

Equation of the Equal-Weighted Index: V = Σᵢ Wᵢ · Pᵢ, with Wᵢ = 1/N
  • Index Value (V): refers to the equal-weighted index.
  • Price (P): refers to the price of each crypto.
  • Weight (W): refers to the assigned weight; in an equal-weighted index, each weighting is 1/N, with N being the number of cryptos within the index.
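
For example, with the eight cryptos above, N = 8, so each weighting is 1/8 and the index value on a given day is simply the average of the eight closing prices. The code below computes this equal-weighted index from the cleaned coins data (the extra 1 simply shifts the index by a constant base value):
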
result = []

# calculate each coin's contribution to the index value (equal weight = 1/N)
for i in range(len(coins.columns)):
    coin = coins[coins.columns[i]] / len(coins.columns)
    result.append(coin)

# assign the index value to each date (shifted by a base value of 1)
ew_index = pd.DataFrame(1 + pd.concat(result, axis=1).sum(axis=1))
ew_index.columns = ["Index"]

ew_index.tail(5)

Here we can use the seaborn package to view the index's interquartile range for each month.

import seaborn as sns
import matplotlib.pyplot as plt

ts_fig, ts_ax = plt.subplots(figsize=(36, 9))
sns.boxplot(x=ew_index.index.strftime("%Y-%b"), y=ew_index.Index, ax=ts_ax)
ts_ax.set_xlabel("Month", labelpad=9, fontsize=15)
ts_ax.set_ylabel("Index Value", labelpad=9, fontsize=15)
ts_ax.set_xticklabels(ts_ax.get_xticklabels(), rotation=90)
ts_ax.set_title("Monthly Index", fontsize=21)
plt.show()

Data Preprocessing

Before we train on the index data, we have to normalize it and check the correlation between our index and the external factors. Let us first combine the index values and predictor variables into one dataframe.

data = factors.join(ew_index, how="left") # combine the predictor variables and the index values on the Date index

Normalizing Variables

Now we finally have clean data that is ready to use. Before we start training the model, one significant step is ensuring all the data is normalized. Since our target variable and predictor variables are measured on different scales, we must adjust the values to a notionally common scale. There are many ways to achieve this, but this project uses one of the simplest methods, Min-Max Normalization, which rescales each feature to the range [0, 1].
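
Concretely, min-max scaling transforms each value x of a feature into

x_scaled = (x - x_min) / (x_max - x_min)

where x_min and x_max are the minimum and maximum of that feature over the whole dataset, so every column ends up in the [0, 1] range.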

from sklearn.preprocessing import MinMaxScaler

data_nor = pd.DataFrame(MinMaxScaler().fit_transform(data)).assign(label=data.index) # normalize the data with min-max scaling and keep the dates
data_nor.columns = data.columns.to_list() + ["Date"]
data_nor.set_index("Date", inplace=True)

data_nor.tail(5)

Finding Correlation Between Variables

Another necessary step before we start training the model with this data is to check the correlation between our target variable and predictor variables. We will use the Pearson correlation coefficient to analyze the correlation and, based on the following figure, remove the weakly correlated variables (correlation coefficient between -0.2 and 0.2).

Pearson Correlation Coefficient Scale
data_nor.corr(method="pearson").style.background_gradient(cmap="coolwarm", axis=None).set_precision(2)
cor = data_nor.corr(method="pearson")
data_nor.drop(data_nor.columns[(cor.Index >= -0.2) & (cor.Index <= 0.2)], axis=1, inplace=True) # remove the weakly correlated variables (coefficient between -0.2 and 0.2)

data_nor.tail(5)

Time Series Forecasting

Training and Testing Sets

In machine learning, randomly splitting the data into train/test sets is normal because there is no dependence between observations. Yet, that is not the case for time series data. As we mentioned earlier, our forecasting models are based on autoregression, which means the time series value Y(t+h) is correlated with its historical value Y(t). Here, we want to use the values at the end of the dataset for testing and everything else for training.

Based on that, we have 1,203 records at daily intervals (almost 4 years); a good approach is to keep the first 1,143 records (about 3.1 years) for training and the last 60 records (2 months) for testing.

train_size = len(data_nor) - 60 # keep the first 1,143 records for training and the last 60 days for testing
X_train, y_train = pd.DataFrame(data_nor.iloc[:train_size, :-1]), pd.DataFrame(data_nor.iloc[:train_size, -1])
X_test, y_test = pd.DataFrame(data_nor.iloc[train_size:, :-1]), pd.DataFrame(data_nor.iloc[train_size:, -1])

How to Determine the Parameter d? Stationarity Detection

Stationarity is a property that describes the predictability of time series data. Strict stationarity means the entire probability distribution is invariant to time shifts, while weak stationarity only requires the mean and covariance to be invariant to time shifts. If a series is non-stationary, the value at moment t is highly dependent on its history. We will use the Augmented Dickey-Fuller (ADF) test to check whether the time series data and its differences are stationary.

from statsmodels.tsa.stattools import adfuller

ori_df = y_train.squeeze() # original time series (adfuller expects a 1-D series)
fir_df = ori_df.diff().dropna() # first difference time series
sec_df = ori_df.diff().diff().dropna() # second difference time series
stationary_test = None

for i in range(3):
    if i == 0:
        print("Original Time Series")
        stationary_test = adfuller(ori_df)
    elif i == 1:
        print("First Order Differencing")
        stationary_test = adfuller(fir_df)
    elif i == 2:
        print("Second Order Differencing")
        stationary_test = adfuller(sec_df)

    print("ADF Statistic: %f" % stationary_test[0])
    print("p-value: %f\n" % stationary_test[1])

Implementing SARIMAX Model

The SARIMAX model’s function is similar to the ARIMA model but adds two other elements: seasonality and external factors.

The key takeaway is that SARIMAX requires not only the p, d, and q arguments that ARIMA requires, but it also requires another set of P, D, and Q arguments for the seasonality aspect as well as an argument called “m.” It is the periodicity of the data’s seasonal cycle; in other words, it is the number of periods in each season.

When choosing an m value, try to get an idea of when the seasonal data cycles. If your data points are separated by a monthly basis and the seasonal cycle is a year, then set m to 12. Or, if the data points are separated by a daily basis and the seasonal cycle is a week, then set m to 7. Here is a quick reference you can use to choose m:
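
  • Hourly data with a daily cycle: m = 24
  • Daily data with a weekly cycle: m = 7
  • Weekly data with a yearly cycle: m = 52
  • Monthly data with a yearly cycle: m = 12
  • Quarterly data with a yearly cycle: m = 4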

What we need to do next is find the parameters that best suit our model. In this step, we will use one of the most powerful tools, called auto_arima, to help us find p, q, P, and Q. We don't need to find d through this tool because we already did the stationarity test in the previous section, which means d and D can be set to 1. Don't forget to pass X_train as the exogenous argument and set seasonal to True.

from pmdarima import auto_arima

sarimax_param = auto_arima(y_train, exogenous=X_train, m=7, seasonal=True,
                           start_p=0, d=1, start_q=0, max_p=3, max_q=1,
                           start_P=0, D=1, start_Q=0, max_P=3, max_Q=1,
                           trace=True)

Based on the result, we found that the best parameters for the SARIMAX model are p = 1, q = 0, P = 3, and Q = 0. Then we can put these parameters into the model and start training. In this stage, we will feed the training dataset into SARIMAX, passing the non-seasonal parameters we got from auto_arima to order and the seasonal parameters to seasonal_order.

from statsmodels.tsa.statespace.sarimax import SARIMAX

algorithm = SARIMAX(endog=y_train, exog=X_train, order=sarimax_param.get_params()["order"], seasonal_order=sarimax_param.get_params()["seasonal_order"])
model = algorithm.fit(disp=False)

Finally, we can use the trained model to predict the testing data. Here we need to set the start and end parameters to cover the range of the testing data and pass all the external factors to the exog parameter. Then we can check the error rate of the model to evaluate its performance.

from sklearn.metrics import mean_absolute_error, mean_squared_error

# forecast the data
forecast = model.get_prediction(start=len(y_train), end=len(y_train)+len(y_test)-1, exog=X_test, dynamic=True)
prediction = forecast.predicted_mean
ci = forecast.conf_int()

# check error rate
mae = mean_absolute_error(y_test, prediction)
mse = mean_squared_error(y_test, prediction) # squared=True (default) returns the MSE
rmse = mean_squared_error(y_test, prediction, squared=False) # squared=False returns the RMSE
print("The error rates of the SARIMAX forecasting are: \nMAE = %f \nMSE = %f \nRMSE = %f" %(mae, mse, rmse))

Let us compare the forecasting results with the reality in the plot.

plt.figure(figsize=(24, 9))
plt.plot(y_test.index, y_test, label="observation")
plt.plot(prediction.index, prediction, label="prediction")
plt.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], color="k", alpha=0.2)
plt.ylim([0.18, 0.3])
plt.title("SARIMAX Model Prediction", fontsize=21)
plt.legend()
plt.show()

Looking at the chart, we can tell the prediction is very close to the actual market movements and also catches the randomness in the forecast, except on November 11, when FTX suddenly filed for bankruptcy protection.

That’s it!

However, don't stop reading now! This hands-on project only introduces time series forecasting methods. Uncover how to build a Capitalization-Weighted Index like the S&P 500 or Nasdaq for your portfolio to track the crypto industry in this in-depth and free CognitiveClass.ai Guided Project. Let's upgrade your knowledge today.

Feel free to connect with me on LinkedIn as well!

WARNING: THIS GUIDED PROJECT IS NOT FINANCIAL ADVICE
The information contained on this Website and the resources available for download through this website are not intended as, and shall not be understood or construed as, financial advice. The information contained on this Website is not a substitute for financial advice from a professional who is aware of the facts and circumstances of your individual situation.


Shengkai is a data scientist at IBM with experience in analyzing data for retail stores. He is enrolled at the University of Toronto’s Faculty of Information.