Bitcoin and financial markets: analysing relationships using Python — Part 2

Anna Grigoryeva-Trier
7 min read · Mar 9, 2022


Crypto markets, while still being only a fraction of global financial markets, have demonstrated a truly expansive growth in the last couple of years. Investing in cryptocurrencies became somewhat mainstream among retail investors in many countries. A growing number of institutional investors also reach out to crypto, to diversify their portfolios and benefit from speculatively high returns.

In this context, crypto assets can be considered on equal footing with other financial asset classes when designing a portfolio. Here, I apply data science tools to analyse performance and uncover potential relationships between crypto and traditional financial assets. In Part 1, I look at returns, volatility and correlations. In Part 2, I fit some models to the data.

Normality matters (sometimes)

One of the assumptions of linear regression is normality. Here comes a common misconception: linear regression does not assume that the data is normally distributed, but that the residuals are. So normality does matter, but there is no need to transform the data if it is not normally distributed.

It is useful though to look at the data to detect possible outliers, as they can significantly affect the results. One simple way is to visualise the distributions.

Fig.1. Data distribution

It looks like the data is not normally distributed at all, and there are quite a few extreme values, or outliers.

There are different approaches to removing outliers from non-normally distributed data; one of them is simply to apply the method used for normally distributed data, i.e. to remove the observations that lie more than 3 standard deviations from the mean.

Fig.2. Data distribution after removing the outliers

Removing the outliers produces tighter graphs, though the distributions still do not look normal. This is confirmed by the skewness and kurtosis (a normal distribution has a kurtosis of 3 and a skewness of 0).

Most of the selected returns (except crude oil) have platykurtic distributions, i.e. short tails. Some of the returns are negatively skewed (like Hedge Funds and SP500), i.e. have a long tail on the negative side, while others are positively skewed (Bitcoin, Bonds, Oil).

Python Code

To remove the outliers, I use the zscore() function from the scipy.stats package.

Plotting the distributions and calculating skewness and kurtosis can be done with the corresponding pandas functions.
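A minimal sketch of this step, using synthetic data in place of the real returns series (the column names and numbers here are placeholders, not the article's data):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the returns data used in the article;
# the column names here are placeholders
rng = np.random.default_rng(42)
returns = pd.DataFrame({
    "Bitcoin": rng.normal(0, 0.04, 500),
    "SP500": rng.normal(0, 0.01, 500),
})

# Keep only rows where every column lies within 3 standard
# deviations of its column mean
z = np.abs(stats.zscore(returns))
clean = returns[(z < 3).all(axis=1)]

# Shape of the distributions; note that pandas kurtosis() returns
# *excess* kurtosis, which is 0 (not 3) for a normal distribution
print(clean.skew())
print(clean.kurtosis())

# Visualise the distributions (requires matplotlib)
# clean.hist(bins=50)
```

One detail worth flagging: pandas (and scipy's kurtosis() by default) report excess kurtosis, so the normal-distribution reference value in that convention is 0 rather than 3.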

Modelling

Here I fit different models to my data and see what I can uncover. Note that I am not building a predictive model: it cannot be used to forecast Bitcoin returns from previous data. The inputs to the model are simultaneous, so the resulting coefficients reflect co-movements of the variables. To build a predictive model, time lags would have to be introduced.

Linear regression

A simple linear model is like a ‘hammer’ in the data science toolbox. It should never be looked down on, though: it can be very helpful as a first step of data modelling, and it provides easily interpretable results. Linear models usually do not consume a lot of computing power and sometimes simply work best given all the constraints of the analysis.

I assign Bitcoin returns to the dependent variable, and all the other assets to the independent variables. Fitting a linear regression means looking for the coefficients that minimise the sum of squared residuals, i.e. the distances between the predicted and actual values of the dependent variable.

The sign and value of each coefficient point at the direction and strength of the connection between the dependent and independent variables. The intercept is the value of the dependent variable when all independent variables are equal to zero (a theoretical scenario). The P-value is the probability of observing a coefficient at least as extreme as the estimated one if the true coefficient were zero. If the P-value is small (usually less than 0.01 or 0.05), the null hypothesis can be rejected and the coefficient is considered significant.

There are several ways to evaluate how well a linear model approximates reality. One of the most common is the R-squared (R²). It measures the proportion of the variance of the dependent variable that can be explained by the independent variables included in the model. The higher the R², the better the model.

Another common measure is the RMSE (Root Mean Square Error), which is the standard deviation of the prediction errors (the distances between predicted and actual values of the dependent variable). The smaller the RMSE, the closer the predicted values are to the actual ones.

All coefficients in my simple linear model except the one for Ether are insignificant. Despite a fairly high R² (0.62), the simple linear regression fails to uncover any interesting connections between the variables.

Possible reasons include the absence of a true linear relationship between the variables, as well as possible multicollinearity (several independent variables being highly correlated) in the inputs.

Python Code

I assign the dependent and independent variables to the corresponding datasets and split them into train and test sets using the train_test_split() function from the sklearn.model_selection package.

I fit a LinearRegression() model from the sklearn.linear_model package to the train set and extract the coefficients and intercept.

The RMSE, as well as some other metrics, can be calculated using the corresponding functions from the sklearn.metrics package.
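The whole workflow can be sketched as follows; the data here is synthetic and the column names are placeholders standing in for the real asset returns:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in: a target driven mostly by one regressor
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 3)),
                 columns=["Ether", "SP500", "Oil"])
y = 0.8 * X["Ether"] + rng.normal(scale=0.3, size=400)  # "Bitcoin" proxy

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Fit the model and inspect coefficients and intercept
model = LinearRegression().fit(X_train, y_train)
print(dict(zip(X.columns, model.coef_)), model.intercept_)

# Evaluate on the held-out test set
pred = model.predict(X_test)
print("R²:", r2_score(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```

sklearn's LinearRegression does not report P-values; for a regression summary with significance tests, statsmodels' OLS is the usual choice.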

Lasso regression

One way to improve the simple linear regression is to remove independent variables one by one, trying to find the best combination. This can be a time-consuming process. A simpler alternative is Lasso regression.

Lasso regression is a linear regression whose objective function includes an L1 regularisation term. The penalty shrinks the coefficients towards zero, setting the smallest ones exactly to zero. As a result, Lasso regression is especially useful when a model has a large number of possibly insignificant variables and possible multicollinearity.

Lasso regression has an important hyperparameter, alpha: the weight of the regularisation term in the objective function. To achieve the best results, this hyperparameter requires tuning, for instance via grid search. In my model, the optimal alpha value is quite low: 0.00002.

Lasso emphasised three independent variables: Ether, Hedge Funds and Oil. Ether and Hedge Funds are well expected, while Oil is somewhat of a surprise here. Possible explanations include a spurious correlation, or a false positive driven by the very large oil price movements.

Otherwise, Lasso regression has not really improved the R² or RMSE compared to the simple linear model.

Python Code

The Python sklearn.linear_model package has a LassoCV() class that combines Lasso regression with a cross-validated search over candidate alpha values.
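A minimal sketch, again on synthetic placeholder data where only one regressor truly drives the target, to show how the L1 penalty zeroes out the rest:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Synthetic stand-in: only "Ether" actually drives the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 4)),
                 columns=["Ether", "HedgeFunds", "Oil", "Bonds"])
y = 0.8 * X["Ether"] + rng.normal(scale=0.3, size=400)

# LassoCV selects alpha by cross-validation over a grid of candidates
lasso = LassoCV(alphas=np.logspace(-5, 0, 50), cv=5).fit(X, y)
print("best alpha:", lasso.alpha_)
print(dict(zip(X.columns, lasso.coef_)))
```

With well-chosen alpha, the coefficients of the uninformative columns are driven to (or very near) zero while the informative one survives; this is the variable-selection behaviour the article relies on.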

Random Forest Regression

Random Forest Regression is a machine learning algorithm based on ensemble learning: it combines the predictions of many decision trees, each trained on a random subset of the data, to arrive at the best result. Unlike linear regression, it can also uncover non-linear relationships and is more appropriate for non-normally distributed data.

Fitting a Random Forest Regressor with 1000 ‘trees’ to my input data yields an R-squared of 0.93, which is a significant improvement over the linear models.

The contribution of each independent variable to the model can be measured with feature importance.

Fig.3. Feature importance

As in the linear models, Ether returns contributed the most of all assets. Among conventional financial assets, Hedge Funds and Tech show the highest importance.

Python Code

The Python package for ensemble machine learning models, sklearn.ensemble, contains a RandomForestRegressor() that can be fitted directly to the train data.

Feature importance can be plotted using the barh() function (which produces horizontal bar plots) from the matplotlib.pyplot package.
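Putting the pieces together, on synthetic placeholder data (the column names, coefficients and noise level are illustrative assumptions, not the article's dataset):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs in scripts
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the cleaned returns data
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 3)),
                 columns=["Ether", "HedgeFunds", "Tech"])
y = 0.8 * X["Ether"] + 0.2 * X["HedgeFunds"] + rng.normal(scale=0.2, size=400)

# Fit a forest of 1000 trees, as in the article
rf = RandomForestRegressor(n_estimators=1000, random_state=1).fit(X, y)
print("R²:", rf.score(X, y))

# Plot feature importances as a horizontal bar chart
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot.barh()
plt.xlabel("Feature importance")
plt.tight_layout()
# plt.savefig("feature_importance.png")
```

Note that rf.score() here is computed on the training data; for an honest estimate, the forest should be scored on a held-out test set, as in the linear-regression example above.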

Concluding remarks

Setting aside the regulatory and other risks related to cryptocurrency, the data points at a significant potential of crypto assets for diversifying a portfolio. In addition, big crypto assets like Bitcoin and Ether have demonstrated impressive performance in the last two years, even considering their high volatility.

In the future, as the crypto markets mature further, their connections with the traditional financial markets might strengthen significantly, and the correlations between the assets may increase as well. At the same time, even now the crypto landscape is not dominated by Bitcoin and Ether as much as before. New assets and projects are challenging the industry, and the crypto market no longer moves in complete synchronicity. This creates interesting new opportunities for diversification within the crypto universe, and digging into them will be an exciting task for data scientists and investors.
