Forecasting with R: Trends and Seasonality
Useful Functions to Decomposition to Autocorrelation
R is a statistical computing language. What does that mean? Much like the wave-particle duality, R can be thought of as a programming language and a “Swiss army knife” calculator tool.
It’s a statistical analysis tool, used in finance, machine learning and anywhere we need to deal with quantifiable uncertainties.
Before we delve into the nuts and bolts of forecasting, we’ll do a quick mind refresher of all basic R stuff.
R as a Calculator
Relatively straightforward, just like using the computer calculators. A little more useful in fact, since you can do expressions and you get lots of additional math functions.
# ADDITION
> 2+2
[1] 4# EXPRESSION
> (2+2)/4_1
Error: unexpected input in "(2+2)/4_"
> (2+2)/4-1
[1] 0# WEIRD MATH
> 1/0
[1] Inf
> 0/1
[1] 0
> 0/0
[1] NaN# MATH FUNCTIONS
> sqrt(3)
[1] 1.732051
> 4+5^2
[1] 29
> y<-max(1,2)
> y
[1] 2# INTEGER DIVISION AND REMAINDER
> 7%2
Error: unexpected input in "7%2"
> 7%/% 2
[1] 3
> 7%% 2
[1] 1# VARIABLES
> x<- sqrt(2)
> x
[1] 1.414214# BOOLEANS
> flag<-TRUE
> falgError: object 'falg' not found
> flag
[1] TRUE
R as a Programming Language
R also allows for programming statements. Which is very useful for calculating things like:
runningTotal<- 0
for (i in 1:20) {
runningTotal<- runningTotal + i^3
}
runningTotal [1] 44100
for (i in 1:20) {
print("Hello R")
} [1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"[1] "Hello R"
if (runningTotal > 40000) {
print("runningTotal is big")
}else {
print("runningTotal is smol")
} [1] "runningTotal is big"
Here are the conditional and boolean operators:
- <
- <
- ==
- ! =
- < =
- = <
- &
- |
RStudio
So where do you do R stuff? RStudio is the choice for most people. In order to install it, you gotta install R first. RStudio is the IDE.
Here are some descriptions of the tabs:
Environment
Where all variables and objects are saved and presented
History
Past commands previously run
Files
Provides access to the file system
Plots
Produced graphs are displayed here
Packages
Where packages can be installed, uninstalled, updated and loaded
Help
Which provides ddocumentation for the packages and functions
To set the working directory:
- Main Menu
- Session
- Set Working Directory
- Choose Directory
Or type
setwd("C:/MyFolder/")
Data Structures
R can handle a lot of data structures.
Vectors or Arrays
# DEFINE A VECTOR
> y = c(4, 6, 2, 42)
> y[1] 4 6 2 42# IMPORT FROM A FILE
> z = scan("z_vectors.txt")# ACCESSING ARRAY ELEMENT
> y[2]
[1] 6 # notice that R is not 0 indexed like other programming languages# ACCESSING ARRAY RANGE
> y[1:3]
[1] 4 6 2# VECTOR MATH
> y+z #vector addition, each element of y adds to respective element in z
> y*z
> y/z
> y+3 #scalar addition, each element of y gets 3 added on it
Matrices
> m <- matrix(scan("m_matrix.txt"), nrow=3, ncol5, byrow=TRUE)# A MATRIX OF SIZE 3x5
> m <- matrix(c(4,5,6,3,1,6,10,11,5,2,9,7,-4,0,1), nrow=3, ncol=5, byrow=TRUE)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 4 5 6 3 1
[2,] 6 10 11 5 2
[3,] 9 7 -4 0 1
Array of Any Size
# DEFINE A 3D ARRAY WITH RANDOM NUMBERS
# AKA 3D TENSOR> b <- array(sample(30), c(3,5,2))
> b
, , 1
[,1] [,2] [,3] [,4] [,5]
[1,] 23 20 14 7 15
[2,] 12 17 3 27 9
[3,] 25 6 16 21 18
, , 2
[,1] [,2] [,3] [,4] [,5]
[1,] 10 24 19 30 8
[2,] 5 28 29 1 13
[3,] 11 4 26 22 2
Data Frames
> name<- c("Johniffer", "Jassica", "Petrovich")
> age<- c(54, 34, 2)
> data.frame(name, age)
name age
1 Johniffer 54
2 Jassica 34
3 Petrovich 2
Reading in Data, Inputting Your Data
Let’s load some data and transform it into a time series and plot it. Here’s the example data:
Load Quarterly Data
# LOAD DATA AND TURN INTO TIME SERIES
> ts1 <- ts(scan("ts1.txt"), start=c(2011,1), frequency=4)
Read 20 items
> ts1
Qtr1 Qtr2 Qtr3 Qtr4
2011 3430 3440 3015 2947
2012 3513 3342 3447 4942
2013 4112 3436 2881 3332
2014 3956 3887 4095 3277
2015 3355 3061 3682 2701
# PLOT IT
> plot(ts1, main="Stationary Quarterly Data", ylab="Demand")
Load Monthly Data
> ts2 <- ts(scan("ts2.txt"), start=c(2011, 1), frequency=12)
Read 60 items
> plot(ts2, main="Trend and Seasonal Monthly Data", ylab="Demand")
Intermittent Monthly Data
> ts3 <- ts(scan("ts3.txt"), start=c(2011, 1), frequency=12)
Read 60 items
> plot(ts3, main="Intermittent Monthly Data", ylab="Demand")
Some Useful Functions
Some useful and generic functions include the sequence, repeat, minimum, maximum, summation, arithmetic mean, median, mean absolute deviation, variance and standard deviation.
Sequence
> seq(1, 3, 0.25)
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00
Repeat
> rep(4, 10)
[1] 4 4 4 4 4 4 4 4 4 4
Remember that:
y = c(4, 6, 2, 42)
Min
> min(y)
[1] 2
Max
> max(y)
[1] 42
Mean
> mean(y)
[1] 13.5
Median
> median(y)[1] 5
Mode
# FIND IT YOURSELF HAHA
> table(y)
y
2 4 6 42
1 1 1 1
Absolute Deviation
> mad(y)
[1] 2.9652
Variance
> var(y)
[1] 363.6667
Standard Deviation
> sd(y)
[1] 19.07005
Time Series Exploration
Now let’s do some basic Time Series exploration.
Packages and Functions
We will be using forecast and TStools packages. Load them by:
# THE STRUGGLE TO DOWNLOAD THIS SHIT
library(forecast)install.packages("forecast")
library(forecast)
# THE STRUGGLE TO DOWNLOAD THIS SHIT
library(TStools)install.packages("devtools")
devtools::install_github("trnnick/TStools")if (!require("devtools")){install.packages("devtools")}
devtools::install_github("trnnick/TStools")then enable in packages tab
then enable tsutils
Time Series Components
Let’s do some elementary exploration where we look for the presence of trend and seasonality
> ts2 <- ts(scan("ts2.txt"), start=c(2011, 1), frequency=12)
Read 60 items
> y <- ts2
> plot(y)
Central Moving Average, CMA, cmav()
# THIS FUNCTION CAN TAKE IN VECTORS OR TS OBJECTScma <- cmav(y, outplot=1)
print(cma)
# CMA WITHOUT BACK AND FORE-CASTING OF MISSING VALUES
cmav(y, outplot=1,fill=FALSE)
Cox-Stuart
# COX STUART ON THE CMA TO TEST FOR SIGNIFICANT TREND
> coxstuart(cma)
$H
[1] 1$p.value
[1] 9.313226e-10$Htxt
[1] "H1: There is change (upwards or downwards)."
Seasonality
# WE CAN TEST FOR SEASONALITY VISUALLY FROM THE PLOT
> seasplot(y)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
Seasonal Box Plot
> seasplot(y, outplot =2)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
Seasonal Subseries Plot
> seasplot(y, outplot =3)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
Seasonal Distribution
> seasplot(y, outplot =4)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
SeasPlot from Forecast
> seasplot(y, outplot =4)
Results of statistical testing
Evidence of trend: TRUE (pval: 0)
Evidence of seasonality: TRUE (pval: 0)
Time Series Decomposition
The idea behind decomposition is that, when we decompose the data into it’s constituent parts, we can getter better forecasts. We generally decompose into 3 things: trends, seasonality, irregularity.
Decomposition
decomp(y, outplot=1)
Pure Seasonal Decomposition
Classical decomposition using seasonal smoothing to estimate the seasonal invidices and providing forecasts for the next 12 periods
> y.dc <- decomp(y, outplot=1, type="pure.seasonal", h=12)
Control Chart of the Residuals
Different shades define different ranges of standard deviation.
> y.dc <- decomp(y, outplot=1, type="pure.seasonal", h=12)
> residout(y.dc$irregular)
STL Time Series Decomposition
Also seasonal package offers an interface for ARIMA for a more advanced time series decomposition.
> y.stl <- stl(y, s.window = 7)
> plot(y.stl)
Autocorrelation and Partial Autocorrelation Functions
Auto-correlation is the idea of self correlation. Where correlation is the measure of the difference between 2 distributions or the strength of the relationship between 2 variables. Autocorrelation is the idea of moving the time series a period up or down and comparing it with itself, just at different times.
Autocorrelation Function
> acf(y, lag.max = 36)
Partial Autocorrelation Function
> pacf(y, lag.max = 36)
Differencing
Differencing removes the trend and seasonality so that you get a nice mean.
> plot(diff(y, 1))
ACF on the Differenced Function
> acf(diff(y, 1), lag.max = 36)
Differencing a Differencing Function
> plot(diff(diff(y, 12), 1))
Next Up
This article became too long so go check out the next one… And the one after that… I’ll eventually work my way from forecasting to machine learning and lots of other stuff. If you would like me to write another article explaining a topic in depth, please leave a comment.
For table of content and more content click here.