A comparison of different ways to analyse the time series of sunspots
When performing an analysis it is important to try a variety of models in order to determine which one affords the best accuracy. I therefore conducted a time series analysis using models from statsmodels, Facebook Prophet and sklearn, all of them libraries for Python, the programming language I used to carry out the analysis.
I decided to use a csv file of sunspot observations collected from 1749 until 1983 because it contains an extensive amount of data. When the sunspots are plotted it is also apparent that they occur in a cyclic fashion, which makes any trends easy to detect.
The first thing I did was load the libraries I would need to carry out the analysis. I then loaded and read the csv file, which I obtained from the following site:- https://machinelearningmastery.com/time-series-datasets-for-machine-learning/
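A minimal sketch of that setup is shown below, assuming the csv has been downloaded locally as monthly-sunspots.csv with "Month" and "Sunspots" columns (the layout of the dataset linked above):-

```python
# Sketch of the setup; the file name and column names are assumptions
# based on the dataset linked above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("monthly-sunspots.csv")
print(df.head())
```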
I then plotted the sunspot data on a graph, because it is always a good idea to visualise time series data:-
I created a column called “Timestamp” from the “Month” column and also set it as the index of the dataframe:-
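A sketch of that conversion, followed by the quick plot mentioned above, assuming the "Month" column holds "YYYY-MM" strings:-

```python
# Build a proper datetime column and use it as the index.
df["Timestamp"] = pd.to_datetime(df["Month"])
df = df.set_index("Timestamp")

# Plot the full series to see the cyclic behaviour.
df["Sunspots"].plot(figsize=(12, 4), title="Monthly sunspot counts")
plt.show()
```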
I created training and validation sets from the timestamped data. Rather than splitting the data by percentages, I split it by date:-
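A sketch of a date-based split; the cut-off date below is illustrative rather than the one used in the notebook:-

```python
# Everything up to the cut-off date trains the model; the rest validates it.
train = df.loc[:"1960-12-01"]
valid = df.loc["1961-01-01":]
print(len(train), "training rows,", len(valid), "validation rows")
```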
I then plotted the newly created training and validation sets on a graph:-
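Something along these lines produces the combined plot:-

```python
# Overlay the two partitions so the cut-off point is visible.
ax = train["Sunspots"].plot(figsize=(12, 4), label="train")
valid["Sunspots"].plot(ax=ax, label="validation")
ax.legend()
plt.show()
```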
Autoregression (AR)
I decided to use a variety of models from the statsmodels library, and autoregression (AR) was just one of them. Autoregression is a time series model that uses observations from previous time steps as input to a regression equation in order to predict the value at the next time step. It is a very simple idea that can produce accurate forecasts on a range of time series problems.
I used the AR model. There is a more recent replacement for it in statsmodels, AutoReg, but it could not be imported in Google Colab, the Jupyter notebook environment I use:-
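A minimal sketch of that step using the older AR interface (more recent statsmodels releases expose AutoReg in the same module instead):-

```python
# Fit the (now-deprecated) AR model on the training series and forecast
# over the validation period.
from statsmodels.tsa.ar_model import AR

ar_model = AR(train["Sunspots"])
ar_fit = ar_model.fit()
ar_pred = ar_fit.predict(start=len(train), end=len(train) + len(valid) - 1)
```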
After fitting the AR model I plotted the result on a graph. As can be seen in the illustration below, the predicted sunspot activity for the validation period started out in a cyclic fashion but eventually petered out into a straight line:-
The root mean squared error (RMSE) for AR was 61.57, which was lower than the errors of the other statsmodels models I experimented with:-
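The figure comes from a comparison along these lines, assuming the ar_pred forecast from the sketch above:-

```python
# RMSE between the AR forecast and the actual validation values.
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(valid["Sunspots"], ar_pred))
print(f"AR RMSE: {rmse:.2f}")
```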
Facebook Prophet
After experimenting with statsmodels, I decided to pursue another library: Facebook Prophet. Prophet is a procedure for forecasting time series data based on an additive model in which non-linear trends are fit with yearly, weekly and daily seasonality, plus holiday effects. At its core, the Prophet procedure is an additive regression model with four main components: a piecewise linear or logistic growth curve trend, a yearly seasonal component modelled using Fourier series, a weekly seasonal component modelled with dummy variables, and a user-provided list of important holidays.
The first thing I did was break the timestamp down into date, month and year in order to create a features_and_targets dataset:-
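A sketch of that breakdown; the column names mirror the ones referred to later, although the exact construction in the notebook may differ:-

```python
# Derive calendar features from the timestamp, keeping the original
# columns so the frame matches the column names used below.
features_and_targets = df.reset_index()
features_and_targets["Datetime"] = features_and_targets["Timestamp"]
features_and_targets["Year"] = features_and_targets["Timestamp"].dt.year
features_and_targets["Month_Num"] = features_and_targets["Timestamp"].dt.month
```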
I then had to split the train file to form training and validation sets for Prophet. Because this is a time series analysis, the split was made by date rather than at random:-
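A sketch of that date-based split, using hypothetical train_fp and valid_fp names for the two resulting frames:-

```python
# Split features_and_targets by date; the cut-off is illustrative.
cutoff = "1960-12-01"
train_fp = features_and_targets[features_and_targets["Datetime"] <= cutoff]
valid_fp = features_and_targets[features_and_targets["Datetime"] > cutoff]
```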
Facebook Prophet only wants to see two columns of data, so I had to drop several columns before I could proceed. The columns I dropped were ‘Month’, ‘Year’, ‘Month_Num’, and ‘Timestamp’:-
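The pruning step then looks something like this:-

```python
# Drop everything except the datetime and the target column.
train_fp = train_fp.drop(columns=["Month", "Year", "Month_Num", "Timestamp"])
valid_fp = valid_fp.drop(columns=["Month", "Year", "Month_Num", "Timestamp"])
```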
After preparing the dataset, I plotted a graph of both the training and validation datasets:-
Facebook Prophet expects the column names to be clearly defined as ‘ds’ and ‘y’. I therefore renamed the ‘Datetime’ and ‘Sunspots’ columns of the training dataset to ‘ds’ and ‘y’ respectively:-
Facebook Prophet follows the sklearn model API: an instance of the Prophet class is created and then its fit() and predict() methods are called. The model is fitted on the training dataset and predictions are made for the validation set:-
Once the model has been fitted and predictions made, a graph is created to show the predicted values for the validation dates:-
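A minimal sketch of the whole Prophet step, reusing the hypothetical train_fp and valid_fp frames from above; depending on the installed version the package is imported as fbprophet or prophet:-

```python
# Rename to the column names Prophet expects, then fit and predict.
from fbprophet import Prophet

train_fp = train_fp.rename(columns={"Datetime": "ds", "Sunspots": "y"})
valid_fp = valid_fp.rename(columns={"Datetime": "ds", "Sunspots": "y"})

m = Prophet()
m.fit(train_fp)

# Predict the validation dates and inspect the point forecasts.
forecast = m.predict(valid_fp[["ds"]])
print(forecast[["ds", "yhat"]].tail())
```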
A comparative graph is also created to overlay the predicted values on the actual values in the validation dataset.
The mean absolute error of FB Prophet is 49.65:-
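That figure comes from a comparison like the one below, assuming the forecast frame from the sketch above:-

```python
# MAE between Prophet's point forecast and the actual validation values.
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(valid_fp["y"], forecast["yhat"])
print(f"Prophet MAE: {mae:.2f}")
```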
Random Forest
Random Forest is widely used for classification and regression predictive modeling problems with structured datasets. According to the sklearn website, Random Forest is a meta estimator that fits a number of classifying decision trees on various subsamples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Having read that Random Forest can be used as a time series analysis tool, I decided to give this model a try.
I first had to define my X and y variables. Because the independent variable is based on time, the timestamp values had to be factorized and then reshaped:-
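A sketch of that preparation, reusing the features_and_targets frame from earlier:-

```python
# Factorize the timestamps into integer codes and reshape into the 2-D
# array shape sklearn expects for a single feature.
X = pd.factorize(features_and_targets["Datetime"])[0].reshape(-1, 1)
y = features_and_targets["Sunspots"].values
```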
I then split the train dataset for training and validation, deciding to use 15% of the dataset for validation purposes:-
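With sklearn's train_test_split the split can be expressed as follows; shuffle=False keeps the chronological order intact:-

```python
# Hold out the last 15% of observations for validation.
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, shuffle=False)
```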
With Random Forest I achieved 98.43% accuracy on the training set and 90.23% accuracy on the validation set. In addition, an RMSE of 14.32 was achieved:-
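A sketch of the Random Forest step; the hyperparameters are illustrative, and the "accuracy" values are taken to be the regressor's score(), which returns R²:-

```python
# Fit a Random Forest regressor on the factorized time feature.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# score() returns R^2 for a regressor, which is presumably what the
# accuracy percentages above refer to.
print("Training score:", rf.score(X_train, y_train))
print("Validation score:", rf.score(X_val, y_val))
print("Validation RMSE:",
      np.sqrt(mean_squared_error(y_val, rf.predict(X_val))))
```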
I also decided to graph the results I achieved in the modelling, fitting and prediction process to compare the predicted values against the actual values of the validation set:-
Conclusion
In conclusion, I believe Random Forest achieved the best error rate, followed by Facebook Prophet. Autoregression (AR) unfortunately performed the worst in this instance. It is therefore important to test every dataset on a variety of models to see which algorithm performs best.
The computer program for this blog post can be found in its entirety in my personal GitHub account, at the link below:- https://github.com/TracyRenee61/Misc-Predictions/blob/main/TS_Sunspots_Statmodels_%26_Prophet.ipynb