Smoothing Time Series in Python: A Walkthrough with Covid-19 Data

Joe McHugh
4 min read · Aug 18, 2020

This brief tutorial shows how to code moving averages for time series in Python. More complicated techniques such as Hodrick-Prescott (HP) filters and Loess smoothing will not be covered.

Being able to smooth out volatile time series data is a crucial tool in a data scientist’s toolbox. When volatile data is smoothed, long-term trends become clearer. To demonstrate, here is a time series before and after smoothing:

When one reviews the Covid-19 data, what becomes evident is that a sinusoidal pattern exists in the daily new cases data. Whilst baffling at first, the cause is quite intuitive: habitually, fewer individuals leave the house on the weekends and thus fewer people are being tested on the weekends. This is why we see a drop and subsequent rise in new cases every seven days:

Daily New Covid-19 Cases

This data series is a prime example of when data smoothing can be applied. With the constant “jitteriness” in the data, it can be difficult to discern emerging trends in the number of new Covid-19 cases. Smoothing solves this problem.

The following code demonstrates how to do this with a moving average. As always, the first thing I do in Python is import all the packages I’m going to use:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic that renders plots inline; remove this line when running as a plain script
%matplotlib inline
import seaborn as sns
from datetime import datetime

The next step is to read the data into Python using pandas. The data is stored as a csv file that I’ve downloaded and saved on my local hard drive:

df_nat = pd.read_csv('covid-19-data/us.csv')

I do some brief data cleaning by converting the date column and cases column (which are both strings) into a datetime object and numeric object respectively. I then feature engineer two columns by taking the first and second differences of the cumulative number of cases, which are the discrete equivalents of the first and second derivative:

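# Parse the date strings and make sure the case counts are numeric
df_nat['date'] = pd.to_datetime(df_nat['date'])
df_nat['cases'] = pd.to_numeric(df_nat['cases'])
# First difference: daily new cases from the cumulative total
df_nat['new_cases'] = df_nat['cases'].diff()
# Second difference: day-over-day change in daily new cases
df_nat['growth_new_cases'] = df_nat['new_cases'].diff()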
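To make the two differencing steps concrete, here is a minimal sketch on a tiny made-up table. The numbers and the `toy` name are hypothetical and only stand in for the real csv; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical cumulative case counts for five consecutive days (not real data)
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05"]
    ),
    "cases": [10, 15, 25, 40, 50],
})

# First difference: how many cases were added each day
toy["new_cases"] = toy["cases"].diff()
# Second difference: how much the daily count changed from the day before
toy["growth_new_cases"] = toy["new_cases"].diff()

print(toy)
```

Each call to `diff()` leaves one more missing value at the start of the column, because the first row has no previous day to compare against.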

When I plot new_cases it looks like the image I showed earlier, very jittery and volatile:

sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.lineplot(
    x='date',
    y='new_cases',
    data=df_nat)
plt.show()

I calculate the moving average by feature engineering a new column using pandas’ built-in rolling method, taking the mean over each window. I chose a window of seven days because the wavelength of the sinusoidal pattern in the data is one week (since new cases rise and fall around the weekends):

df_nat['mov_avg'] = df_nat['new_cases'].rolling(7).mean()
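To see why a seven-day window cancels a weekly cycle, here is a small sketch on a synthetic series that repeats the same week three times. The values are invented for illustration. It also contrasts `.mean()`, which gives the moving average, with `.sum()`, which gives a rolling seven-day total instead.

```python
import pandas as pd

# Hypothetical daily counts with a dip on the last two days of every week
weekly = [100, 100, 100, 100, 100, 50, 50]
s = pd.Series(weekly * 3, dtype=float)

mov_avg = s.rolling(7).mean()  # trailing 7-day average
mov_sum = s.rolling(7).sum()   # trailing 7-day total, seven times the average

# Every full window contains exactly one copy of each weekday,
# so the average is flat even though the raw series is not
print(mov_avg.dropna().round(2).tolist())
print(mov_sum.iloc[6])
```

The first six values of `mov_avg` are missing because a full seven-day window is not yet available; the `min_periods` argument of `rolling` relaxes that requirement if partial windows are acceptable.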

Now when I graph the smoothed data calculated with the moving average, the series looks like this:
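A minimal sketch of this plotting step follows. It assumes nothing about the real csv: a synthetic series with a weekend dip stands in for `df_nat`, and the plot is drawn with matplotlib directly rather than seaborn so the snippet has fewer dependencies. The raw series is kept in the background for comparison.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for df_nat: 10 weeks of daily counts with a weekend dip (not real data)
dates = pd.date_range("2020-03-01", periods=70, freq="D")
trend = np.linspace(1000, 3000, 70)
weekend_dip = np.where(dates.dayofweek >= 5, 0.6, 1.0)  # Saturday=5, Sunday=6
df = pd.DataFrame({"date": dates, "new_cases": trend * weekend_dip})

# Trailing 7-day moving average of the daily counts
df["mov_avg"] = df["new_cases"].rolling(7).mean()

fig, ax = plt.subplots(figsize=(11.7, 8.27))
ax.plot(df["date"], df["new_cases"], alpha=0.4, label="new_cases (raw)")
ax.plot(df["date"], df["mov_avg"], linewidth=2, label="mov_avg (7-day)")
ax.set_xlabel("date")
ax.set_ylabel("new cases")
ax.legend()
plt.show()
```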

We can now see clearly how the number of new cases trended downward during the lockdown, accelerated rapidly during the reopening, and now appears to be trailing off again.

We can also perform this smoothing on the second derivative, i.e., examine the growth in daily new cases to discern any emerging trends:

Growth in Daily New Cases

As one can see, the graph of the second derivative of Covid-19 cases looks a mess. There are huge spikes above and below zero, with the series looking almost like white noise. However, once smoothing is applied with the same 7-day moving average, the data becomes much clearer:

Smoothed Growth in Daily New Cases
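Smoothing the second derivative follows the same pattern as before. Here is a minimal sketch on synthetic data: the `growth_mov_avg` column name is hypothetical, and the noisy, rising daily counts are generated only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic cumulative counts built from rising, noisy daily cases (not real data)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-03-01", periods=60, freq="D")
daily = np.linspace(100, 400, 60) + rng.normal(0, 40, 60)
df = pd.DataFrame({"date": dates, "cases": np.cumsum(daily)})

df["new_cases"] = df["cases"].diff()             # first difference
df["growth_new_cases"] = df["new_cases"].diff()  # second difference

# Same 7-day trailing window, applied to the growth series
df["growth_mov_avg"] = df["growth_new_cases"].rolling(7).mean()

# The smoothed series should vary far less than the raw one
print(df[["growth_new_cases", "growth_mov_avg"]].std())
```

Differencing amplifies noise, which is why the raw growth series looks so erratic; averaging over seven days pulls that noise back down.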

We can now see how the growth in daily new cases (a crucial leading indicator for public health officials) is changing over time. There is a huge period of new case growth during March, a relatively stable period of growth during the lockdown, another spike in growth during the reopening, followed by another drop. What’s encouraging is that the current growth in new cases has fallen below the level it reached during the lockdown.

The moving average is a simple and powerful data smoothing technique. One can go much further and implement more complex methods that are more robust and can address certain problems that the moving average can’t. These include HP filters, Loess smoothing, and various others. However, for those who are looking for a quick and effective method without too much code or calculation, the moving average is a great way to get started.
