Rolling Sum in Scala

Alexey Novakov
SE Notes by Alexey Novakov
3 min readNov 21, 2018

Let’s try to implement the rolling sum calculation over time series data in pure Scala. We will see that the whole algorithm can be based on lists manipulation and aggregations using fold with one more nested loop.

Goal

Calculate rolling sum on time series data windowing data within 60 seconds.

Input Data

Input structure consists of 2 columns: Time and Value.

Time       Value
1355271796 1.80295
1355271836 1.80275
1355271882 1.80295
1355271953 1.80275
1355272046 1.80285
1355272074 1.80275
1355272124 1.80245
1355272179 1.80265
1355272280 1.80235
1355272281 1.80245
1355272313 1.80235

Output Data

Time       Value    Count Sum        Min         Max
1355271795 1.80275 5 9.01395 1.80265 1.80305
1355271796 1.80295 6 10.81690 1.80265 1.80305
1355271836 1.80275 5 9.01395 1.80265 1.80295
1355271882 1.80295 2 3.60570 1.80275 1.80295
1355271953 1.80275 1 1.80275 1.80275 1.80275
1355272046 1.80285 1 1.80285 1.80285 1.80285
1355272074 1.80275 2 3.60560 1.80275 1.80285
1355272124 1.80245 2 3.60520 1.80245 1.80275
1355272179 1.80265 2 3.60510 1.80245 1.80265
1355272280 1.80235 1 1.80235 1.80235 1.80235
1355272281 1.80245 2 3.60480 1.80235 1.80245

Output structure consists of 2 input columns plus 3 new columns which we need to calculate:

  • Count — shows how many observations are in the time window.
  • Sum — sum of the values within the time window.
  • Min / Max — minimum/maximum value in the the time window.

Solution

Our key is Time column which is number of seconds passed since since 1st of January 1970 (Unix epoch time). In order to frame 60 seconds period we need to keep difference between left and right window border when iterating over the input list. We can split whole algorithm into 2 main steps:

  1. Using fold, we iterate over Iterator[Observation], where an Observation is a single instance of an event we observe.
  2. Folding function accumulates two values we carry on going until the end of the observations list:
  • 1st accumulator: obs: List[Observation] — list of observations inside the 60 seconds window. This list is changing once we found next window. We remove list head, i.e. moving left window border to the right.
  • 2nd accumulator: WindowStats(sum, min, max, count) — contains 4 columns we need to calculate to reach our goal. WindowStats keep only statistics which is actual for the current event(observation). We will print current stats to a file before we go to the next window.

Our model case classes for this implementation:

2.1 We go over Iterator[Observation] maintaining temporary List obs. Once next observation.time is greater than head of obs, then we move left border of the obs, by removing its head. Updating window statistics.

Idea of rolling sum is that windows are overlapping, that means next rolling sum for next event contains values from previous events which are inside current 60 seconds time window.

2.2 In case next observation.time is still inside the window, we append it to our obs accumulator and updating count, sum, min, max of the current window statistics.

Bonus Part. Visualize time series via Pandas

Although I do not like Python, because it does not have normal type system, let’s use its library — Pandas, to use already available function for rolling sum. (I will have look for similar Spark implementation next time :-) )

import pandas as pddf = pd.read_csv("/Users/an/dev/git/time-series/data.txt", 
sep='\t', header=None, index_col='Time',
names=['Time', 'Value'])
// horrible mutable code
df.index = pd.to_datetime(df.index, unit='s')
rdf = df.rolling('60s').agg(['count', 'sum', 'min', 'max'])

rdf — is exactly the output data frame which we need:

First rows of Panda

Now let’s visualize columns of this data frame:


import
matplotlib.pyplot as plt
plt.figure()
rdf.plot(subplots=True, figsize=(12, 12));

We can note from this particular data set that rolling sum graph and count graph are quite similar. This means that value contribution is approximately the same over the entire time series, even though the frequency of those events within one minute is different.

Summary

We have seen how one can write Rolling Sum algorithm in Scala and calculate some statistics over simple time series data. We wrote the solution in pure functional way: no mutation, no side-effects, except the function to print the result into outer world like file/console.

Links:

  1. Full Source Code: https://github.com/novakov-alexey/time-series

--

--