Bitcoin Price Analysis — Part I: Classical Decomposition Tools
Hello there! This is a project that my classmate Nilakshi Mondal and I worked on for our 5th semester Time Series paper as part of our B.Sc. Statistics degree at Delhi University. The entire project will be explained in two posts. This one covers the tools used to understand trend, seasonality and cyclic nature. The next post will be on forecasting, where we used Simple Exponential Smoothing, Holt’s Linear Trend and ARIMA. We would love to hear about any mistakes we made or any improvements that could be suggested, since this is our first time working on a pure univariate time series dataset. The GitHub repository for this project can be found here.
The Dataset
We used the Cryptocompare API for obtaining the dataset. We won’t go into the details regarding the API, but it was fairly easy to use and the code can be found in the repository. The dataset we obtained contained hourly Bitcoin data from 2010–07 to 2019–09. We then truncated it, as we wanted to work only on the 2019 data.
We also split the data into train and test sets using a 0.90 train-test split ratio. The test data was kept hidden from us until forecasting.
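As a rough sketch of this step, assuming the API output was saved to a CSV named bitcoin_hourly.csv with a datetime column called time (the actual file and column names in the repository may differ), the truncation and chronological split could look like this:

```python
import pandas as pd

# Hypothetical file name: the repository actually pulls the data from the
# Cryptocompare API, so the column names there may differ.
df = pd.read_csv("bitcoin_hourly.csv", parse_dates=["time"], index_col="time")

# Keep only the 2019 observations
df = df.loc["2019"]

# 90/10 chronological split -- no shuffling, since this is a time series
split = int(len(df) * 0.90)
train, test = df.iloc[:split], df.iloc[split:]
```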
We will be concerned only with the column ‘open’ which tells us the ‘opening price’ of the bitcoin for every hour.
As we can clearly see from the data, it is not stationary: it has a clear trend (or rather two trends, one from January till July and another from July onwards), may or may not have seasonality, and has a lot of random fluctuation.
We shall start by analyzing the trend, move on to the seasonality and then the cyclic component.
Trend Analysis
The trend is indicative of the long-term movement of the data. Usually a general increasing or decreasing tendency can be found, which tells us whether the series will rise or fall in the long run. This helps in long-term planning and gives a rough estimate of future values.
For analyzing the trend of the data, the moving average method is usually preferred. We implemented a number of trend curves from scratch, and will fit all of them and compare them on the basis of their MSE, or Mean Squared Error.
We fitted the following curves:
- Straight Line
- Exponential
- Parabolic
- Second Degree Curve Fitted to Logarithms
- Logistic
- Moving Average
- Modified Exponential
- Gompertz
We have attached the resources one could consult to understand the curves and their fitting as hyperlinks above. They won’t be exactly the same as what we have coded: functions and models differ in their representation from book to book, so there will almost always be slight differences. Please message or comment if you want the exact form of the model and the theoretical solution used.
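As a minimal illustration (not the exact formulations we coded), a few of these curves can be fitted by least squares with numpy, where y is the training series of opening prices:

```python
import numpy as np

# y: training series of hourly opening prices; t: time index 0, 1, ..., n-1
y = train["open"].to_numpy()
t = np.arange(len(y))

# Straight line: y = a + b*t
straight = np.polyval(np.polyfit(t, y, 1), t)

# Parabolic (second degree): y = a + b*t + c*t^2
parabolic = np.polyval(np.polyfit(t, y, 2), t)

# Exponential: y = a * b**t, fitted as a straight line on log(y)
exponential = np.exp(np.polyval(np.polyfit(t, np.log(y), 1), t))

# Second degree curve fitted to logarithms: log(y) = a + b*t + c*t^2
log_parabolic = np.exp(np.polyval(np.polyfit(t, np.log(y), 2), t))
```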
We have the curves fitted to the data as below:
Upon visual examination the moving average fits best to the data. However, to determine the best trend we perform the following two steps:
- Determine an appropriate moving average extent/order, one that doesn’t overfit or underfit the data.
- Calculate MSE(Mean Squared Error) for all the trends to see which one fits best.
To determine the order of the moving average, we plot it against the opening price, with the order iterating over the range 51 to 1500 in intervals of 100. That is, we compute the moving average for extents 51, 151, 251, …, 1351, 1451 and then select the best one visually. We want one which neither underfits nor overfits the data.
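A sketch of this step, assuming the centered moving averages are computed with pandas’ rolling mean (which leaves NaNs at both ends of the series):

```python
import matplotlib.pyplot as plt

# Centered moving averages for orders 51, 151, ..., 1451
orders = range(51, 1500, 100)
moving_averages = {
    k: train["open"].rolling(window=k, center=True).mean() for k in orders
}

# Compare a few of them against the opening price
train["open"].plot(alpha=0.4, label="open")
for k in (51, 651, 1451):
    moving_averages[k].plot(label=f"{k}-MA")
plt.legend()
plt.show()
```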
On examining the graphs, we note the following:
- 51-MA to 351-MA more or less overfit the data and catch even very small fluctuations
- 451-MA to 751-MA are smoother and neither overfit nor underfit
- From the 851-MA onwards we lose quite a lot of values at the start and end of the series, and the trend line keeps getting straighter as we increase the order. From the 1051-MA onwards too many values are lost, so we won’t consider them.
Thus, just on the basis of visual examination, we see the 651-MA to be a really good fit to the data without overfitting it and without losing too many values.
Now, to choose between the moving average and the other trend curves, we could either examine them visually or select the one with the least MSE (Mean Squared Error). Visually, it is clear that the moving average fits the data best. We calculate the MSE only on the train dataset, and the results pick out the best-fitting trends.
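A rough sketch of the MSE comparison, reusing the fitted curves and moving averages from the sketches above (the moving-average trend is compared only where it is defined, which is a simplification):

```python
import numpy as np

def mse(actual, fitted):
    """Mean squared error, ignoring points where the trend is undefined."""
    actual, fitted = np.asarray(actual, float), np.asarray(fitted, float)
    mask = ~np.isnan(fitted)
    return np.mean((actual[mask] - fitted[mask]) ** 2)

trends = {
    "straight line": straight,
    "parabolic": parabolic,
    "exponential": exponential,
    "651-MA": moving_averages[651].to_numpy(),
}
for name, fitted in trends.items():
    print(name, round(mse(y, fitted), 2))
```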
Thus, we get the two best trends for our training data as:
- 651-MA trend curve
- Parabolic trend curve
Seasonality
We will be implementing some seasonality techniques to see if there is hourly seasonality in our data. Then we will be using seasonal_decompose from statsmodels to study seasonality further and decompose our data.
Below we have shown the graphs obtained on implementing each seasonality method. The x-axis has the hours. We are trying to study whether each hour shows some seasonality, i.e., there will be 24 distinct seasonal units. In other words, we want to know whether the values depend, to some extent, on the hour of the day: maybe the price increases in the morning and goes down by midnight, or maybe demand decreases around lunch, etc.
The y-axis has the seasonal indices, with 100 representing no seasonality at all, i.e., the overall average/center.
- Ratio to Moving Averages
- Simple Averages
- Ratio to Trend
We see from all of the above that hourly seasonality doesn’t exist; it is negligible. The seasonal indices deviate by at most about 0.25%, which is far too little (not even 1%).
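For illustration, a simplified ratio-to-moving-averages calculation of the hourly indices might look like the following. This is only an approximation of our implementation; in particular, a proper centered average for an even period would use a 2x24 moving average:

```python
# Ratio-to-moving-averages hourly seasonal indices (base = 100)
open_price = train["open"]

# Simple centered 24-hour moving average as a rough trend estimate
trend_24 = open_price.rolling(window=24, center=True).mean()

# Ratio of actual to trend, expressed as a percentage
ratios = open_price / trend_24 * 100

# Average the ratios for each hour of the day, then rescale so that the
# 24 indices average exactly to 100 (100 = no seasonality)
indices = ratios.groupby(ratios.index.hour).mean()
indices = indices * 100 / indices.mean()

print(indices.round(2))  # values hugging 100 => negligible hourly seasonality
```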
We now try to decompose the data using the seasonal_decompose function from statsmodels, which allows us to specify the type of model (additive/multiplicative) along with the frequency at which we want to compute seasonality. This function implements classical decomposition, though X11, SEATS or STL decomposition is usually preferred.
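A sketch of the call we used; note that recent statsmodels versions name the argument period rather than freq:

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Multiplicative classical decomposition with a weekly period of
# 24 * 7 hourly observations
result = seasonal_decompose(train["open"], model="multiplicative", period=24 * 7)
result.plot()
plt.show()

# The individual components are available as series
trend_component = result.trend
seasonal_component = result.seasonal
remainder = result.resid
```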
We make the following observations:
- There is no seasonality or cyclic component in our data. Thus, it is only the random component and the trend that make up our data. (We say no cyclic component because we haven’t used data extending beyond a year, so we can’t tell whether one exists. The cyclic component is discussed further below.)
- As we see below, the variance of the residual component keeps increasing with time and then starts decreasing from around July onwards. This is very similar to how the trend behaved.
- We also tried an additive model, and it gave almost the same results as the multiplicative model.
- Note the freq = 24 * 7 argument, which means that we treat each week as a season and thus assume all weeks share the same seasonal pattern. That gives 24 * 7 (hours * days) individual units, each assumed to have its own seasonal index.
- Note that the seasonal component increases/decreases by at most about 1%, which is negligible. Thus the data can be said to have no significant seasonality, just as the methods above showed.
Lastly, the remainder has a heteroskedastic nature: its variance increases with time and then decreases. This remainder is made up of the cyclic element along with the error/random component, so we must examine the cyclic nature of the data. But since we have data for under a year, we can’t truly explore cyclic variations, and any conclusions we draw now should be checked against more past data.
Cyclic Nature
As we aren’t using data extending beyond a year, we can’t really study the cyclic nature of the data. But we still implement the method, keeping in mind that we have hourly data for about 7 months.
We will be using a very crude but fairly common technique: residual analysis. This involves dividing the actual data by the trend values and the seasonal indices, and multiplying by 100**2 (one factor of 100 for each percentage-based division), to get the combined cyclic and residual component. We then apply a moving average to smooth out the random component, leaving the cyclic component.
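A rough sketch of this residual analysis, reusing the 651-hour moving-average trend and the hourly seasonal indices from the earlier sketches (our actual implementation may differ in details):

```python
import pandas as pd

open_price = train["open"]

# Seasonal index for every observation, looked up by hour of the day
hour = pd.Series(open_price.index.hour, index=open_price.index)
seasonal = hour.map(indices)  # `indices` from the seasonality sketch above

# Moving-average trend (651-hour window, as chosen earlier)
trend = open_price.rolling(window=651, center=True).mean()

# Cyclic-plus-irregular component: dividing by the trend gives a ratio
# (x100 for a percentage), and the seasonal index is itself on a base
# of 100, hence the 100**2 factor.
cyclic_irregular = open_price / (trend * seasonal) * 100 ** 2

# Smooth with a moving average to dampen the irregular component
cyclic = cyclic_irregular.rolling(window=24 * 7, center=True).mean()
```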
Here we see behaviour very similar to what we saw with the seasonal_decompose function. The fluctuations increase with time and across months. This matches our data, which has a lot of variation towards July and August.
This almost looks like it has some seasonality where there are two peaks followed by a huge fall. This cycle keeps repeating in our data.
Now, to explore this component a little more, consider the data starting from 2017 January till the end of our training dataset.
To calculate the cyclic component for this data we need:
- Moving average : Trend
- Ratio to moving average : Seasonality
- Residual analysis : Cyclic component
From the given graph, we concluded that it is difficult to say whether a cyclic component exists, as there is too much fluctuation in prices. There was a period of really high prices around Nov-Dec 2017 followed by a huge dip around Feb 2018; this is probably the main defining period. After it, we see fluctuations decreasing and then increasing a little around Oct-Dec 2018, where we see a major fall in prices. Fluctuations then reduced for some time but started increasing again with time.
There is a strong random component associated with our data, and its variance keeps changing. Bitcoin, and cryptocurrencies in general, went through a period of intense hype, so naturally there were periods of strong inflation of prices driven by market sentiment. As the hype fell and the bubble burst, so did the price of Bitcoin, which explains the huge peaks followed by dips.
Also, a lot of governments have banned trading in cryptocurrencies, which obviously affects market sentiment a great deal. There will probably come a time when these fluctuations settle around a price and then keep increasing or decreasing as cryptocurrencies gain more acceptance or rejection.
That’s it for this article. You can check out the code here. The second article, with all the forecasting, will be uploaded soon!