Forecasting with R: Trends and Seasonality

Useful Functions, from Decomposition to Autocorrelation

Jake Batsuuri
Computronium Blog
8 min read · Jan 6, 2020


R is a statistical computing language. What does that mean? Much like the wave-particle duality, R can be thought of as both a programming language and a “Swiss army knife” calculator tool.

It’s a statistical analysis tool, used in finance, machine learning and anywhere we need to deal with quantifiable uncertainties.

Before we delve into the nuts and bolts of forecasting, we’ll do a quick refresher of the basics of R.

R as a Calculator

Relatively straightforward, just like using your computer’s calculator. A little more useful, in fact, since you can write whole expressions and you get lots of additional math functions.

# ADDITION
> 2+2
[1] 4
# EXPRESSION
> (2+2)/4-1
[1] 0
# WEIRD MATH
> 1/0
[1] Inf
> 0/1
[1] 0
> 0/0
[1] NaN
# MATH FUNCTIONS
> sqrt(3)
[1] 1.732051
> 4+5^2
[1] 29
> y <- max(1, 2)
> y
[1] 2
# INTEGER DIVISION AND REMAINDER
> 7%2
Error: unexpected input in "7%2"
> 7 %/% 2
[1] 3
> 7 %% 2
[1] 1
# VARIABLES
> x <- sqrt(2)
> x
[1] 1.414214
# BOOLEANS
> flag <- TRUE
> falg
Error: object 'falg' not found
> flag
[1] TRUE

R as a Programming Language

R also allows for programming statements, which are very useful for calculations like:

runningTotal <- 0
for (i in 1:20) {
  runningTotal <- runningTotal + i^3
}
runningTotal
[1] 44100

for (i in 1:20) {
  print("Hello R")
}
[1] "Hello R"
[1] "Hello R"
...
[1] "Hello R"

if (runningTotal > 40000) {
  print("runningTotal is big")
} else {
  print("runningTotal is smol")
}
[1] "runningTotal is big"

Here are the conditional and boolean operators:

  • <
  • >
  • ==
  • !=
  • <=
  • >=
  • &
  • |
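
The comparison operators return logical values, and the boolean operators combine them. A quick console sketch:

```r
# COMPARISONS RETURN LOGICAL VALUES
> 3 < 5
[1] TRUE
> 3 == 5
[1] FALSE
> 3 != 5
[1] TRUE
# BOOLEAN OPERATORS COMBINE LOGICAL VALUES
> (3 < 5) & (2 >= 1)
[1] TRUE
> (3 > 5) | (2 >= 1)
[1] TRUE
```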

RStudio

So where do you write your R code? RStudio, the IDE, is the choice for most people. In order to install it, you have to install R first.

Here are some descriptions of the tabs:

Environment

Where all variables and objects are saved and presented

History

A record of previously run commands

Files

Provides access to the file system

Plots

Produced graphs are displayed here

Packages

Where packages can be installed, uninstalled, updated and loaded

Help

Provides documentation for the packages and functions

To set the working directory:

  • Main Menu
  • Session
  • Set Working Directory
  • Choose Directory

Or type

setwd("C:/MyFolder/")

Data Structures

R can handle a lot of data structures.

Vectors or Arrays

# DEFINE A VECTOR
> y = c(4, 6, 2, 42)
> y
[1]  4  6  2 42
# IMPORT FROM A FILE
> z = scan("z_vectors.txt")
# ACCESSING ARRAY ELEMENT
> y[2]
[1] 6 # notice that R is 1-indexed, unlike many other programming languages
# ACCESSING ARRAY RANGE
> y[1:3]
[1] 4 6 2
# VECTOR MATH
> y+z #vector addition, each element of y adds to respective element in z
> y*z
> y/z
> y+3 #scalar addition, each element of y gets 3 added to it

Matrices

# A MATRIX OF SIZE 3x5, READ FROM A FILE
> m <- matrix(scan("m_matrix.txt"), nrow=3, ncol=5, byrow=TRUE)
# OR DEFINED INLINE
> m <- matrix(c(4,5,6,3,1,6,10,11,5,2,9,7,-4,0,1), nrow=3, ncol=5, byrow=TRUE)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    4    5    6    3    1
[2,]    6   10   11    5    2
[3,]    9    7   -4    0    1

Array of Any Size

# DEFINE A 3D ARRAY WITH RANDOM NUMBERS
# AKA 3D TENSOR
> b <- array(sample(30), c(3,5,2))
> b
, , 1

     [,1] [,2] [,3] [,4] [,5]
[1,]   23   20   14    7   15
[2,]   12   17    3   27    9
[3,]   25    6   16   21   18

, , 2

     [,1] [,2] [,3] [,4] [,5]
[1,]   10   24   19   30    8
[2,]    5   28   29    1   13
[3,]   11    4   26   22    2

Data Frames

> name<- c("Johniffer", "Jassica", "Petrovich")
> age<- c(54, 34, 2)
> data.frame(name, age)
       name age
1 Johniffer  54
2   Jassica  34
3 Petrovich   2
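
Columns of a data frame can be accessed by name with the `$` operator and rows can be filtered with logical conditions. A minimal sketch reusing the same example (the name `df` is just an assumption for illustration):

```r
# BUILD THE SAME DATA FRAME AND ACCESS ITS COLUMNS
df <- data.frame(name = c("Johniffer", "Jassica", "Petrovich"),
                 age = c(54, 34, 2))
df$age             # the age column as a vector: 54 34 2
mean(df$age)       # 30
df[df$age > 10, ]  # only the rows where age exceeds 10
```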

Reading in Data, Inputting Your Data

Let’s load some data and transform it into a time series and plot it. Here’s the example data:

http://bit.ly/2SXXLjJ

Load Quarterly Data

# LOAD DATA AND TURN INTO TIME SERIES
> ts1 <- ts(scan("ts1.txt"), start=c(2011,1), frequency=4)
Read 20 items
> ts1
Qtr1 Qtr2 Qtr3 Qtr4
2011 3430 3440 3015 2947
2012 3513 3342 3447 4942
2013 4112 3436 2881 3332
2014 3956 3887 4095 3277
2015 3355 3061 3682 2701

# PLOT IT
> plot(ts1, main="Stationary Quarterly Data", ylab="Demand")

Load Monthly Data

> ts2 <- ts(scan("ts2.txt"), start=c(2011, 1), frequency=12)
Read 60 items
> plot(ts2, main="Trend and Seasonal Monthly Data", ylab="Demand")

Intermittent Monthly Data

> ts3 <- ts(scan("ts3.txt"), start=c(2011, 1), frequency=12)
Read 60 items
> plot(ts3, main="Intermittent Monthly Data", ylab="Demand")

Some Useful Functions

Some useful and generic functions include the sequence, repeat, minimum, maximum, summation, arithmetic mean, median, median absolute deviation, variance and standard deviation.

Sequence

> seq(1, 3, 0.25)
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00

Repeat

> rep(4, 10) 
[1] 4 4 4 4 4 4 4 4 4 4

Remember that:

y = c(4, 6, 2, 42)

Min

> min(y)
[1] 2

Max

> max(y)
[1] 42

Mean

> mean(y)
[1] 13.5

Median

> median(y)
[1] 5

Mode

# FIND IT YOURSELF HAHA
> table(y)
y
2 4 6 42
1 1 1 1
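
Base R has no built-in statistical mode function, but one can be sketched on top of table(): tally the values and take the most frequent one (ties resolve to the first entry). The helper name stat_mode below is made up for this example.

```r
# A MINIMAL MODE HELPER BUILT ON table()
stat_mode <- function(v) {
  counts <- table(v)                          # frequency of each value
  as.numeric(names(counts)[which.max(counts)]) # value with the highest count
}
stat_mode(c(4, 6, 2, 42, 6))  # 6, since it appears twice
```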

Median Absolute Deviation

> mad(y)
[1] 2.9652

Variance

> var(y)
[1] 363.6667

Standard Deviation

> sd(y)
[1] 19.07005

Time Series Exploration

Now let’s do some basic Time Series exploration.

Packages and Functions

We will be using the forecast and TStools packages. Install and load them with:

# INSTALL AND LOAD forecast
install.packages("forecast")
library(forecast)
# TStools IS NOT ON CRAN, SO INSTALL IT FROM GITHUB VIA devtools
if (!require("devtools")) { install.packages("devtools") }
devtools::install_github("trnnick/TStools")
library(TStools)

You can also enable the packages from the Packages tab in RStudio. (Much of TStools has since moved to CRAN as the tsutils package, which you can enable the same way.)

Time Series Components

Let’s do some elementary exploration, looking for the presence of trend and seasonality.

> ts2 <- ts(scan("ts2.txt"), start=c(2011, 1), frequency=12)
Read 60 items
> y <- ts2
> plot(y)
just the data

Central Moving Average, CMA, cmav()

# THIS FUNCTION CAN TAKE IN VECTORS OR TS OBJECTS
cma <- cmav(y, outplot=1)
print(cma)
you can see the trend here
# CMA WITHOUT BACK AND FORE-CASTING OF MISSING VALUES
cmav(y, outplot=1,fill=FALSE)
no extrapolation here

Cox-Stuart

# COX STUART ON THE CMA TO TEST FOR SIGNIFICANT TREND
> coxstuart(cma)
$H
[1] 1
$p.value
[1] 9.313226e-10
$Htxt
[1] "H1: There is change (upwards or downwards)."

Seasonality

# WE CAN TEST FOR SEASONALITY VISUALLY FROM THE PLOT
> seasplot(y)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
these are repeated enough times that it is recognized as a pattern

Seasonal Box Plot

> seasplot(y, outplot =2)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
different type of plot

Seasonal Subseries Plot

> seasplot(y, outplot =3)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
another type of plot

Seasonal Distribution

> seasplot(y, outplot =4)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
another type of plot

SeasPlot from Forecast

> seasonplot(y)

Time Series Decomposition

The idea behind decomposition is that when we decompose the data into its constituent parts, we can get better forecasts. We generally decompose into three components: trend, seasonality and irregularity.
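
Base R illustrates the same idea with decompose() from the stats package: under an additive model the series is split so that the observed data equals trend + seasonal + irregular. A minimal sketch:

```r
# ADDITIVE DECOMPOSITION WITH BASE R (y IS A MONTHLY ts OBJECT)
y.parts <- decompose(y, type = "additive")
plot(y.parts)  # observed, trend, seasonal and random panels
# THE PARTS SUM BACK TO THE ORIGINAL SERIES
# (UP TO THE NA EDGES LEFT BY THE MOVING-AVERAGE TREND)
recombined <- y.parts$trend + y.parts$seasonal + y.parts$random
```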

Decomposition

decomp(y, outplot=1)
the trend is up, there is clear seasonality, and there’s some irregularity as well

Pure Seasonal Decomposition

Classical decomposition, using seasonal smoothing to estimate the seasonal indices and providing forecasts for the next 12 periods:

> y.dc <- decomp(y, outplot=1, type="pure.seasonal", h=12)
here we see that if we decompose the series, we can get separate forecasts for each component and then combine them into a final composite forecast

Control Chart of the Residuals

Different shades define different ranges of standard deviation.

> y.dc <- decomp(y, outplot=1, type="pure.seasonal", h=12) 
> residout(y.dc$irregular)
the red points are outliers, while the black points fall within the closer standard-deviation bands

STL Time Series Decomposition

Base R’s stl() performs a more advanced decomposition based on loess smoothing. (The seasonal package also offers an interface to X-13ARIMA-SEATS for even more advanced decomposition.)

> y.stl <- stl(y, s.window = 7)
> plot(y.stl)

Autocorrelation and Partial Autocorrelation Functions

Autocorrelation is the idea of self-correlation. Where correlation measures the strength of the relationship between two variables, autocorrelation shifts the time series by some number of periods (the lag) and measures the correlation of the series with itself at that lag.

Autocorrelation Function

> acf(y, lag.max = 36)
generally the ACF of a trended series starts at a large value and decays slowly, because nearby observations tend to be similar in value; seasonality shows up as peaks at the seasonal lags (here 12, 24 and 36)
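
The lag-1 autocorrelation is essentially the correlation of the series with a copy of itself shifted by one period; a hand-rolled sketch, which should be close to the first bar acf() draws (they differ slightly because acf() uses the overall mean and variance):

```r
# LAG-1 AUTOCORRELATION BY HAND
n <- length(y)
cor(y[1:(n - 1)], y[2:n])  # compare with acf(y, plot = FALSE)$acf[2]
```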

Partial Autocorrelation Function

> pacf(y, lag.max = 36)
partial autocorrelations basically do the same thing as the ACF, but with one important difference: the PACF removes the effect of correlations already explained by the shorter lags

Differencing

Differencing subtracts each observation from the one before it (or, for seasonal differencing, from the one a season before it), removing trend and seasonality so that the series fluctuates around a stable mean.

> plot(diff(y, 1))

ACF on the Differenced Function

> acf(diff(y, 1), lag.max = 36)

Differencing a Differenced Series

First difference at lag 12 to remove the seasonality, then at lag 1 to remove the remaining trend:

> plot(diff(diff(y, 12), 1))

Next Up

This article became too long, so go check out the next one… And the one after that… I’ll eventually work my way from forecasting to machine learning and lots of other stuff. If you would like me to write another article explaining a topic in depth, please leave a comment.

