Introduction To Time Series Analysis - Part 1

Dr. Vipul Joshi
Time Series Using Python
9 min read · Apr 17, 2024
https://datasciencelk.com/time-series-analysis/

What is Time Series and why is it special?

In very simple terms, a time series is a collection of observations of a phenomenon recorded over time. For example, the price of an iPhone (variable y) with respect to time in days (variable t). Thus, you can represent it as:

y = f(t) meaning y (price of iPhone) is a function of t (time).

Time series data is different from other types of data for the following reasons:

  • Chronology of Data Points: The observations (y) are recorded as time passes, at successive time steps (t1, t2, t3, ...). Any model or analysis has to respect this chronology, which means data preparation for training/testing/validation must preserve the order of observations. For example, the price of an iPhone might drop by 1% each week:

1. Week 1: USD 1000
2. Week 2: USD 990
3. Week 3: USD 980.1
4. Week 4: USD 970.299
5. Week 5: USD 960.6

  • Trends and Seasonality: Time series data will often have trends (price has gone up, down, or stayed constant with time) or seasonality (price goes down every year near Christmas) in them. It’s extremely important to analyze the effect of these factors BEFORE we try to do any forecasting.
  • Time Steps: A time series is also characterized by its Time Step, the interval between consecutive observations. In our example, the Time Step was 1 week. Frequency, defined as 1/Time Step, describes how often new observations arrive. In the iPhone example, the price changed once per week, dropping by 1% each time.
  • Auto-Correlation: The data points of a time series may depend on its previous values, which means there can be correlation between an observation made now and observations recorded earlier in time. For example, price at Week 6 = price at Week 5 - 0.01 * price at Week 5, i.e., the price in Week 6 depends on the price in Week 5 (see the short sketch below).
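
Here is a minimal sketch of that weekly price example in Python; it just reproduces the 1% drop and makes the recursive dependence explicit:

import numpy as np

# Hypothetical weekly iPhone prices: each week's price is 1% below the previous week's
weeks = np.arange(1, 6)
prices = 1000 * 0.99 ** (weeks - 1)
for week, price in zip(weeks, prices):
    print(f"Week {week}: USD {price:.3f}")

# Equivalently, written recursively (this is the auto-correlation structure):
# price[t] = price[t-1] - 0.01 * price[t-1]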

The goal of this tutorial is to go through the above properties of time series in a very simple, and easy to follow manner.

Let’s begin with creating a SIMPLE Time Series.

#### Any time series requires two things: a. Time Steps (t1, t2, t3, ...) and b. observations at those time steps (y1, y2, y3, ...)
#### Let us assume that we are walking on top of a mountain. The height of the mountain (around 1000 ft) follows a sine curve that goes 10 ft above and below 1000 ft, as shown in the figure, completing one full cycle every 50 minutes (angular frequency omega = 2*pi/50).
#### We measure the height roughly every 1 minute for 100 minutes, so we ascend and descend twice over those 100 minutes.
#### We will also capture the times at which we reach the peaks and the troughs
import numpy as np
import matplotlib.pyplot as plt

omega = 2 * np.pi / 50             # angular frequency: one full cycle every 50 minutes, i.e. two cycles in 100 minutes
t = np.linspace(0, 100, 100)       # roughly one measurement per minute for 100 minutes
y = 1000 + 10 * np.sin(omega * t)  # altitude oscillates 10 ft above and below 1000 ft

# Marking peaks and troughs
peaks = t[1:-1][(y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])] # Find local max
troughs = t[1:-1][(y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])] # Find local min

#start plotting data

plt.figure(figsize=(10, 5))
plt.scatter(peaks, 1000 + 10 * np.sin(omega * peaks), color='r', zorder=5) # Red for peaks
plt.scatter(troughs, 1000 + 10 * np.sin(omega * troughs), color='g', zorder=5) # Green for troughs

# Annotating peaks and troughs
for peak in peaks:
    plt.annotate(f'Peak\n({peak:.2f}, {1000 + 10 * np.sin(omega * peak):.2f})',
                 (peak, 1000 + 10 * np.sin(omega * peak)), textcoords="offset points", xytext=(0, -8), ha='center')
for trough in troughs:
    plt.annotate(f'Trough\n({trough:.2f}, {1000 + 10 * np.sin(omega * trough):.2f})',
                 (trough, 1000 + 10 * np.sin(omega * trough)), textcoords="offset points", xytext=(0, -10), ha='center')

plt.plot(t, y)
plt.title("Time Series of Altitude Changes Over Time")
plt.xlabel("Time (minutes)")
plt.ylabel("Altitude (ft)")
plt.grid(True)
plt.show()
Output: the altitude time series over 100 minutes, with the two peaks marked in red and the two troughs in green.
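
As a quick check, the period and frequency can be read straight off the detected peaks. A small sketch reusing the peaks array from the code above:

# Time between consecutive peaks gives the period of the oscillation
peak_spacing = np.diff(peaks)
print(peak_spacing)                # roughly [50.] -> one full cycle about every 50 minutes
print(1 / peak_spacing.mean())     # frequency in cycles per minute (about 0.02)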

Trend & Seasonality — Walking over a rough, steep mountain

A trend in time series data represents a long-term increase or decrease in the data. It shows whether something is generally going up or down over a period. For example, climbing up a mountain means the elevation has an upward trend.

Seasonality refers to patterns that repeat over a known, fixed period. For example, while climbing up the mountain, we will also be ascending and descending through local peaks and troughs. This can be called the "seasonality" of the elevation.

Noise refers to random fluctuations in the data. For example, the surface of the mountain is not smooth; its roughness adds some noise to your elevation measurements over time.

Let’s simulate a time series data where we Hike over a rough, steep mountain:

  • You hike up a mountain over 100 minutes.
  • The elevation gain increases as you hike further, representing a positive trend.
  • There are periodic ups and downs along the way, representing seasonal effects (one full cycle every 20 minutes).


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

# 100 minutes of hiking
minutes = np.arange(0, 100)

# let's say we started at 1000ft : initial height = 1000ft
initial_height = 1000

# and we gain 3ft/minute
linear_height_increase = 3 * minutes

# let's say the mountain has local ascents and descents (like in the earlier example): amplitude = 50 ft, and every cycle completes in 20 minutes
# so we will gain a "seasonal" height of
seasonal_height = 50 * np.sin(2*np.pi * minutes/20.0)

# Let's be a little more realistic and add some random roughness to the elevation gained
roughness_elevation = np.random.normal(0, 10, size=100)


elevation_gain = initial_height + linear_height_increase + seasonal_height + roughness_elevation

# Creating DataFrame
df = pd.DataFrame({
    'Minute': minutes,
    'Elevation': elevation_gain
})

# Plotting the data
plt.figure(figsize=(10, 5))
plt.plot(df.Minute, df.Elevation, label='Elevation Gain', marker='o', linestyle='-')
plt.title('Elevation Gain as You Hike Up a Mountain')
plt.xlabel('Time (Minutes)')
plt.ylabel('Elevation (Feet)')
plt.grid(True)
plt.legend()
plt.show()

Visualising Trend, Seasonality & Residuals

In the above example, we have already computed the following:

  1. Trend Component — linear_height_increase
  2. Seasonality Component — seasonal_height
  3. Residuals — roughness_elevation
  4. Offset — initial_height

Decomposition: Breaking down a time series into its Trend, Seasonal, and Residual components is called decomposing the time series.

If we were given only the time vs. elevation data and asked to find its trend, seasonality, and residuals, we would have to "decompose the data". The result would look like this:

plt.figure(figsize = (15,10))
plt.subplot(511)
plt.plot(df.Minute, df.Elevation, label='Elevation Gain', linestyle='-')
plt.legend()

plt.subplot(512)
plt.plot(minutes, np.full(100, initial_height), label='Initial height', linestyle='--')
plt.legend()

plt.subplot(513)
plt.plot(minutes, linear_height_increase, label='Trend', linestyle='--')
plt.legend()

plt.subplot(514)
plt.plot(minutes, seasonal_height, label='Seasonality', linestyle='--')
plt.legend()

plt.subplot(515)
plt.plot(minutes, roughness_elevation, label='Residuals', linestyle='--')
plt.legend()
plt.show()

Time Series Decomposition

In the previous example, we simulated a hike up a mountain by explicitly creating the Trend, Seasonal, and Residual components. In most real-world cases, however, you will encounter an existing time series: you don't have the luxury of designing your data, and you must instead analyze what is given to you to understand its underlying patterns, i.e., identify the trend, seasonality, and random variations within the dataset. How do you approach this?

  1. Plot: Plot the data and perform a visual inspection to get an idea of the trend, seasonality, and noise.
  2. Decomposition: Decomposition techniques can be used to separate the time series into its components. But first you have to understand the "nature of composition" of the time series:
  • Additive Model: If the Trend, Seasonality and Residuals simply add together and the seasonal fluctuations stay roughly constant in size over time, it's called an Additive Model of the time series: y(t) = Trend(t) + Seasonality(t) + Residual(t). The elevation gain while hiking above is an example of an additive model.
  • Multiplicative Model: If the size of the seasonal fluctuations changes with the level of the series (say the local peaks and troughs of the terrain grow larger as you climb higher), the series can no longer be expressed as an additive model. For such cases it's useful to consider a Multiplicative Model: y(t) = Trend(t) * Seasonality(t) * Residual(t) (see the sketch below).

In reality, the time series may be even more complex.
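
To make the distinction concrete, here is a minimal sketch (the component names and numbers are illustrative, not from the article) that builds an additive and a multiplicative series from the same trend. A handy fact: taking the logarithm of a multiplicative series turns it into an additive one, which is a common trick before decomposition.

import numpy as np

t = np.arange(0, 100)
trend = 1000 + 3 * t                                   # long-term increase

# Additive: seasonal swings have a fixed amplitude (+/- 50 ft) regardless of the level
additive = trend + 50 * np.sin(2 * np.pi * t / 20) + np.random.normal(0, 10, size=t.size)

# Multiplicative: seasonal swings are a fixed percentage (+/- 5%) of the current level,
# so they grow as the trend grows
seasonal_factor = 1 + 0.05 * np.sin(2 * np.pi * t / 20)
noise_factor = np.random.normal(1, 0.01, size=t.size)
multiplicative = trend * seasonal_factor * noise_factor

# log(multiplicative) = log(trend) + log(seasonal_factor) + log(noise_factor), i.e. additive
log_series = np.log(multiplicative)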

To decompose the time series, we can use the seasonal_decompose function from the very useful statsmodels library:

from statsmodels.tsa.seasonal import seasonal_decompose
# Decomposing the time series
result = seasonal_decompose(df['Elevation'], model='additive', period=20)  # period = number of observations per seasonal cycle (here, 20 minutes)

# Plotting the decomposed components
fig = result.plot()
fig.set_size_inches(15, 8)
plt.show()
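
The decomposition result also exposes the estimated components directly, so you can inspect or reuse them. A small sketch, assuming the result object from above (note that the trend estimate has NaNs at both ends because statsmodels uses a centred moving average):

# Each component is a pandas Series aligned with the original index
print(result.trend.dropna().head())     # estimated trend (NaN at the edges)
print(result.seasonal.head())           # repeating seasonal pattern
print(result.resid.dropna().head())     # whatever is left over (residuals)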

Auto-Correlation Function

In our hiking example, it is established that subsequent peaks are separated by a PERIOD of 20 minutes. Similarly, subsequent troughs are also separated by a PERIOD of 20 minutes. This means that if I am at a peak right now, then 20, 40, 60, ... (any multiple of 20) minutes later I can expect to find the subsequent peaks. Isn't it fascinating? I can now predict the peaks and troughs of my journey ✌

How can we find this “PERIOD” for any time-series?

The Autocorrelation Function (ACF) measures how correlated a time series is with a lagged copy of itself, for different time lags. If the series and its lagged version line up perfectly at some lag, the ACF at that lag is 1; if they are exact opposites of each other, the ACF at that lag is -1.

For seasonal data, you often see clear peaks at lags corresponding to the seasonality period.

To find this period, we'll first compare the series with lagged copies of itself by hand, then compute the autocorrelation across a range of lags and look for the lag at which there is a clear peak, indicating the periodicity. (The plot_acf function from the statsmodels library, shown at the end of this section, does the same visualization in one call.)

plt.figure(figsize=(10, 15))

# Compare the series with a lagged copy of itself at a few different lags
for i, lag in enumerate([1, 5, 10, 15, 20], start=1):
    plt.subplot(5, 1, i)
    y_lagged = np.roll(elevation_gain, lag)
    plt.plot(elevation_gain[lag:])
    plt.plot(y_lagged[lag:])
    corr = np.corrcoef(elevation_gain[lag:], y_lagged[lag:])[0, 1]
    plt.title(f"Lag = {lag}\tcorrelation = {corr:.3f}")

plt.tight_layout()
plt.show()

import numpy as np
import matplotlib.pyplot as plt

# Compute auto-correlations for different lags
auto_correlations = [np.corrcoef(elevation_gain[i:], np.roll(elevation_gain,i)[i:])[0, 1] for i in range(1, 60)]

# Plotting the auto-correlation
plt.figure(figsize=(10, 5))
plt.stem(range(1, 60), auto_correlations)  # note: the use_line_collection argument has been removed in recent matplotlib versions
plt.title('Auto-correlation of Elevation Gain Over Time')
plt.xlabel('Lag (Minutes)')
plt.ylabel('Auto-correlation')
plt.grid(True)
plt.show()
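
The plot_acf function mentioned above does the same job in one call and adds confidence bands. A minimal sketch, assuming elevation_gain from the earlier simulation is still in scope:

from statsmodels.graphics.tsaplots import plot_acf

fig, ax = plt.subplots(figsize=(10, 5))
plot_acf(elevation_gain, lags=60, ax=ax)   # autocorrelation up to a lag of 60 minutes
ax.set_xlabel('Lag (Minutes)')
plt.show()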

In the above graph of the ACF, you can see that peaks occur at lags of 20 and 40 minutes (and would keep recurring at every multiple of 20). This means that the elevation gain and the elevation gain lagged by 20 or 40 minutes line up almost exactly, and thus they are strongly correlated.

QUESTION: At lag 10 (previous figure), the peaks and troughs of the original and lagged time series align opposite to each other. This means their correlation should have been close to -1. However, it comes out to about 0.6. WHY? EMAIL ME THE ANSWER AT just.do.python@gmail.com

In the next Lab, we will see how we can use this knowledge to build models capable of predicting the FUTURE.

Till Then 😴😴

Keep Coding 🤟
