Time series modeling for forecasting returns on investment funds

Shafqaat Ahmad, PMP
Analytics Vidhya
Published in
12 min readAug 16, 2020

Time series analysis, discussed ARIMA, auto ARIMA, auto correlation (ACF), partial auto correlation (PACF), stationarity and differencing.

Source: Pixabay

Where is the data?

Most financial time series examples use built-in data sets. Sometimes these data do not attend your needs and you face a roadblock if you don’t know how to get the exact data that you need.

In this article I will demonstrate how to extract data directly from the internet. Some packages already did it, but if you need another related data, you will do it yourself. I will show you how to extract funds information directly from website. For the sake of demonstration, we have picked Brazilian funds that are located on CVM (Comissão de Valores Mobiliários) website, the governmental agency that regulates the financial sector in Brazil.

Probably every country has some similar institutions that store financial data and provide free access to the public, you can target them.

Downloading the data from website

To download the data from a website we could use the function getURL from the RCurlpackage. This package could be downloaded from the CRAN just running the install.package(“RCurl”) command in the console.

downloading the data, url http://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/

library(tidyverse) # a package to handling the messy data and organize it
library(RCurl) # The package to download a spreadsheet from a website
library(forecast) # This package performs time-series applications
library(PerformanceAnalytics) # A package for analyze financial/ time-series data
library(readr) # package for read_delim() function

#creating an object with the spreadsheet url url <- "http://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_202006.csv"

#downloading the data and storing it in an R object
text_data <- getURL(url, connecttimeout = 60)

#creating a data frame with the downloaded file. I use read_delim function to fit the delim pattern of the file. Take a look at it!
df <- read_delim(text_data, delim = ";")

#The first six lines of the data
head(df)
### A tibble: 6 x 8
## CNPJ_FUNDO DT_COMPTC VL_TOTAL VL_QUOTA VL_PATRIM_LIQ CAPTC_DIA RESG_DIA
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 00.017.02~ 2020-06-01 1123668. 27.5 1118314. 0 0
## 2 00.017.02~ 2020-06-02 1123797. 27.5 1118380. 0 0
## 3 00.017.02~ 2020-06-03 1123923. 27.5 1118445. 0 0
## 4 00.017.02~ 2020-06-04 1124052. 27.5 1118508. 0 0
## 5 00.017.02~ 2020-06-05 1123871. 27.5 1118574. 0 0
## 6 00.017.02~ 2020-06-08 1123999. 27.5 1118639. 0 0
## # ... with 1 more variable: NR_COTST <dbl>

Handling the messy

This data set contains a lot of information about all the funds registered on the CVM. First of all, we must choose one of them to apply our time-series analysis.

There is a lot of funds for the Brazilian market. To count how much it is, we must run the following code:

#get the unique identification code for each fund
x <- unique(df$CNPJ_FUNDO)

length(x) # Number of funds registered in Brazil.
##[1] 17897

I selected the Alaska Black FICFI Em Ações — Bdr Nível I with identification code (CNPJ) 12.987.743/0001–86 to perform the analysis.

Before we start, we need more observations to do a good analysis. To take a wide time window, we need to download more data from the CVM website. It is possible to do this by adding other months to the data.

For this we must take some steps:

First, we must generate a sequence of paths to looping and downloading the data. With the command below, we will take data from January 2018 to July 2020.

# With this command we generate a list of urls for the years of 2020, 2019, and 2018 respectively.

url20 <- c(paste0("http://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_", 202001:202007, ".csv"))
url19 <- c(paste0("http://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_", 201901:201912, ".csv"))
url18 <- c(paste0("http://dados.cvm.gov.br/dados/FI/DOC/INF_DIARIO/DADOS/inf_diario_fi_", 201801:201812, ".csv"))

After getting the paths, we have to looping trough this vector of paths and store the data into an object in the R environment. Remember that between all the 17897 funds, I select one of them, the Alaska Black FICFI Em Ações — Bdr Nível I

# creating a data frame object to fill of funds information
fundoscvm <- data.frame()

# Loop through the urls to download the data and select the chosen investment fund.
# This loop could take some time, depending on your internet connection.
for(i in c(url18,url19,url20)){
csv_data <- getURL(i, connecttimeout = 60)
fundos <- read_delim(csv_data, delim = ";")

fundoscvm <- rbind(fundoscvm, fundos)
rm(fundos)
}

Now we could take a look at the new data frame called fundoscvm. This is a huge data set with 10056135 lines.

Let’s now select our fund to be forecast.

alaska <- fundoscvm%>%
filter(CNPJ_FUNDO == "12.987.743/0001-86") # filtering for the `Alaska Black FICFI Em Ações - Bdr Nível I` investment fund.
# The first six observations of the time-series
head(alaska)
## # A tibble: 6 x 8
## CNPJ_FUNDO DT_COMPTC VL_TOTAL VL_QUOTA VL_PATRIM_LIQ CAPTC_DIA RESG_DIA
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12.987.74~ 2018-01-02 6.61e8 2.78 630817312. 1757349. 1235409.
## 2 12.987.74~ 2018-01-03 6.35e8 2.78 634300534. 5176109. 1066853.
## 3 12.987.74~ 2018-01-04 6.50e8 2.82 646573910. 3195796. 594827.
## 4 12.987.74~ 2018-01-05 6.50e8 2.81 647153217. 2768709. 236955.
## 5 12.987.74~ 2018-01-08 6.51e8 2.81 649795025. 2939978. 342208.
## 6 12.987.74~ 2018-01-09 6.37e8 2.78 646449045. 4474763. 27368.
## # ... with 1 more variable: NR_COTST <dbl>
# The las six observations...
tail(alaska)
## # A tibble: 6 x 8
## CNPJ_FUNDO DT_COMPTC VL_TOTAL VL_QUOTA VL_PATRIM_LIQ CAPTC_DIA RESG_DIA
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12.987.74~ 2020-07-24 1.89e9 2.46 1937895754. 969254. 786246.
## 2 12.987.74~ 2020-07-27 1.91e9 2.48 1902905141. 3124922. 2723497.
## 3 12.987.74~ 2020-07-28 1.94e9 2.53 1939132315. 458889. 0
## 4 12.987.74~ 2020-07-29 1.98e9 2.57 1971329582. 1602226. 998794.
## 5 12.987.74~ 2020-07-30 2.02e9 2.62 2016044671. 2494009. 2134989.
## 6 12.987.74~ 2020-07-31 1.90e9 2.47 1899346032. 806694. 1200673.
## # ... with 1 more variable: NR_COTST <dbl>

The CVM website presents a lot of information about the selected fund. We are interested only in the fund share value. This information is in the VL_QUOTAvariable. With the share value, we could calculate several financial indicators and perform its forecasting.

The data dimension is 649, 8. The period range is:

# period range:
range(alaska$DT_COMPTC)
## [1] "2018-01-02" "2020-07-31"

Let´s see the fund share price.

ggplot(alaska, aes( DT_COMPTC, VL_QUOTA))+ geom_line()+
scale_x_date(date_breaks = "3 month") +
theme(axis.text.x=element_text(angle=60, hjust=1))

COVID-19 brings a lot of trouble, doesn’t it?

A financial raw series probably contains some problems in the form of patterns that repeat overtime or according to the economic cycle.

A daily financial time series commonly has a 5-period seasonality because of the trading days of the week. Some should account for holidays and other days that the financial market does not work. I will omit these cases to simplify the analysis.

We can observe in the that the series seems to go upward during a period and down after that. This is a good example of a trend pattern observed in the data.

It is interesting to decompose the series to see “inside” the series and separate each of the effects, to capture the deterministic (trend and seasonality) and the random (remainder) parts of the financial time-series.

Decomposing the series

A time-series can be decomposed into three components: trend, seasonal, and the remainder (random).

There are two ways that we could do this: The additive form and the multiplicative form.

Let y_t be our time-series, T_t represents the trend, S_t is the seasonal component, and R_t is the remainder, or random component.
In the additive form, we suppose that the data structure is the sum of its components:

In the multiplicative form, we suppose that the data structure is a product of its components:

These structures are related. The log of a multiplicative structure is an additive structure of the (log) components:

Setting a seasonal component of frequency 5, we could account for the weekday effect of the fund returns. We use five because there is no data for the weekends. You also should account for holidays, but since it vary according to the country on analysis, I ignore this effect.

library(stats) # a package to perform and manipulate time-series objects
library(lubridate) # a package to work with date objects
library(TSstudio) # A package to handle time-series data

# getting the starting point of the time-series data
start_date <- min(alaska$DT_COMPTC)

# The R program does not know that we are working on time-series. So we must tell the software that our data is a time-series. We do this using the ts() function

## In the ts() function we insert first the vector that has to be handle as a time-series. Second, we tell what is the starting point of the time series. Third, we have to set the frequency of the time-series. In our case, it is 5.
share_value <- ts(alaska$VL_QUOTA, start =start_date, frequency = 5)

# the function decompose() performs the decomposition of the series. The default option is the additive form.
dec_sv <- decompose(share_value)

# Take a look!
plot(dec_sv)

Decomposition of additive time series.

The above figure shows the complete series observed, and its three components trend, seasonal, and random, decomposed in the additive form.

The below figure shows the complete series observed, and its three components trend, seasonal, and random, decomposed in the multiplicative form.

The two forms are very related. The seasonal components change slightly.

dec2_sv <- decompose(share_value, type = "multiplicative")

plot(dec2_sv)

In a time-series application, we are interested in the return of the fund. Another information from the related data set is useless to us.

The daily return (Rt) of a fund is obtained from the (log) first difference of the share value. In your data, his variables call VL_QUOTA. The logs are important because give to us a proxy of daily returns in percentage.

The algorithm for this is:

R1 <- log(alaska$VL_QUOTA)%>% # generate the log of the series
diff() # take the first difference

plot.ts(R1) # function to plot a time-series

The return (log-first difference) series takes the role of the remainder in the decomposition did before.

The random component of the decomposed data is seasonally free. We could do this by hand taking a 5-period difference:

R5 <- R1 %>% diff(lag = 5)
plot.ts(R5)

The two graphs are similar. But the second does not have a possible seasonal effect.

We can plot a graph to identify visually the seasonality of the series using the ggsubseriesplot function of the forecastpackage.

ggsubseriesplot(share_value)

Oh, it seems that our hypothesis for seasonal data was wrong! There is no visually pattern of seasonality in this time-series.

The Forecasting

Before forecasting, we have to verify if the data are random or auto-correlated. To verify autocorrelation in the data we can first use the autocorrelation function (ACF) and the partial autocorrelation function (PACF).

Autocorrelation Function and Partial Autocorrelation Function

If we use the raw data, the autocorrelation is evident. The blue line indicates that the significance limit of the lag’s autocorrelation. In the below ACF figure, the black vertical line overtaking the horizontal blue line means that the autocorrelation is significant at a 5% confidence level. The sequence of upward lines indicate positive autocorrelation for the tested series.

To know the order of the autocorrelation, we have to take a look on the Partial Autocorrelation Function (PACF).

library(TSstudio)

R <- alaska$VL_QUOTA #extract the fund share value from the data
Racf <- acf(R, lag.max = 12) # compute the ACF

Autocorrelation Function for the fund share value.

Rpacf <- pacf(R, lag.max = 12) # compute the PACF

The Partial Autocorrelation Function confirms the autocorrelation in the data for lags 1, 6,and 8.

At the most of time, in a financial analysis we are interested on the return of the fund share. To work with it, one can perform the forecasting analysis to the return rather than use the fund share price. I add to you that this procedure is a good way to handle with the non-stationarity of the fund share price.

So, we could perform the (partial) autocorrelation function of the return:

library(TSstudio)

Racf <- acf(R1, lag.max = 12) # compute the ACF

ACF for the fund share (log) Return.

Racf <- acf(R5, lag.max = 12) # compute the ACF

Above ACF figure suggests a negative autocorrelation pattern, but is impossible identify visually the exactly order, or if exists a additional moving average component in the data.

Rpacf <- pacf(R1, lag.max = 12) # compute the PACF

PACF for the fund share (log) Return.

Rpacf <- pacf(R5, lag.max = 12) # compute the PACF

autocorrelation function of the Return does not help us so much. Only that we can set is that exist a negative autucorrelation pattern.

One way to identify what is the type of components to be used ( autocorrelation or moving average). We could perform various models specifications and choose the model with better AIC or BIC criteria.

Forecasting with ARIMA models

One good option to perform a forecasting model is to use the family of ARIMA models. For who that is not familiar with theoretical stuff about ARIMA models or even Autocorrelation Functions, I suggest the Brooks books, “Introductory Econometrics for Finance”.

To model an ARIMA(p,d,q) specification we must to know the order of the autocorrelation component (p), the order of integration (d), and the order of the moving average process (q).

OK, but, how to do that?

The first strategy is to look to the autocorrelation function and the partial autocorrelation function. since we cant state just one pattern based in what we saw in the graphs, we have to test some model specifications and choose one that have the better AIC or BIC criteria.

We already know that the Returns are stationary, so we do not have to verify stationary conditions (You do it if you don’t trust me :).

# The ARIMA regression
arR <- Arima(R5, order = c(1,0,1)) # the auto.arima() function automatically choose the optimal lags to AR and MA components and perform tests of stationarity and seasonality

summary(arR) # now we can see the significant lags
## Series: R5
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 mean
## -0.8039 0.6604 0.0000
## s.e. 0.0534 0.0636 0.0016
##
## sigma^2 estimated as 0.001971: log likelihood=1091.75
## AIC=-2175.49 AICc=-2175.43 BIC=-2157.63
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 9.087081e-07 0.04429384 0.02592896 97.49316 192.4589 0.6722356
## ACF1
## Training set 0.005566003

If you want a fast analysis than perform several models and compare the AIC criteria, you could use the auto.arima() functions that automatically choose the orders of autocorrelation, integration and also test for season components in the data.

I use it to know that the ARIMA(1,0,1) is the best fit for my data!!

It is interesting to check the residuals to verify if the model accounts for all non random component of the data. We could do that with the checkresiduals function of the forecastpackage.

checkresiduals(arR)# checking the residuals
 ## 
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1) with non-zero mean
## Q* = 186.44, df = 7, p-value < 2.2e-16
##
## Model df: 3. Total lags used: 10

Look to the residuals ACF. The model seem to not account for all non-random components, probably by an conditional heteroskedasticity. Since it is not the focus of the post, you could Google for Arch-Garch family models

Now, the forecasting!

Forecasting can easily be performed with a single function forecast. You should only insert the model object given by the auto.arima() function and the periods forward to be foreseen.

farR <-forecast(arR, 15)

## Fancy way to see the forecasting plot

plot_forecast(farR)

The function plot_forecast of the TSstudio package is a nice way to see the plot forecasting. If you pass the mouse cursor over the plot you cold see the cartesian position of the fund share.

Now you can go to predict everything that you want. Good luck!

--

--

Shafqaat Ahmad, PMP
Analytics Vidhya

Working as a Data Scientist for manufacturing/sales industry in Canada.