Computing Time Series metrics at scale in Google Cloud

Data Surfing
Google Cloud - Community
4 min read · Jan 13, 2021

This blog post shows how data scientists and engineers can use GCP Dataflow to compute time series metrics at scale, either in real time or in batch to backfill data, for example to detect anomalies in market data or IoT devices. It compares basic statistical metrics (moving averages and standard deviation) derived from Forex market data, computed with both Python Pandas and the new Apache Beam Time Series library, in an easy-to-follow Jupyter Notebook.

What are we trying to solve?

By time series data we mean data points that are indexed by time: each point carries a timestamp that is relevant for the insights or metrics we want to generate.

Time series example

Quite often we need to generate metrics from these time-indexed data points, also referred to as features in Machine Learning, to be used as inputs for our predictive or inference models.

Other times these same metrics are relevant to end users directly, to identify trends or abrupt changes. Businesses expect to spot such anomalies rapidly, so they can capitalize on them (e.g. price or customer demand changes) or identify risks before they become an issue (e.g. cyberattacks, such as detecting bots being prepared for a network denial-of-service attack).

In this context, Streaming Analytics aims to compute these metrics continuously, as the data is streamed into the system from unbounded sources, such as telemetry from connected devices, log files generated by customers using web applications, ecommerce transactions, or information from social networks or geospatial services.

Understood, but what are the challenges?

When computing metrics from your time series data, such as moving averages, standard deviations or other statistical metrics, you often face the following challenges:

  • As a user I want to compute these metrics in real time over my unbounded datasets, rather than calculating them in retrospect, so I can capitalize on opportunities or identify risks in near real time.
  • I often need to take snapshots of these metrics, that is, to have all metric values calculated in “sync” across the same time periods.
  • To produce these snapshots I need to compute many metrics in parallel and at scale, so I can keep up as more events arrive or more metrics are added.
  • I need to fill gaps when no events are received, as I cannot assume the value is 0 by default, which would skew statistical calculations such as moving averages.
  • I need to compute the same metrics historically at scale, both to backfill my data with new metrics and to bootstrap missing input data when I cold start my pipelines.

Hmm, how do you solve that then?

When it comes to obtaining metrics in real time, streaming analytics frameworks such as Apache Beam already provide the necessary primitives, including rich APIs like Timers and Looping Timers, for processing time series data as it arrives. What this Time Series library offers is a set of wrappers that encapsulate these primitives with specialized aggregations and data objects, so that you as a developer can focus on the business value rather than learning and optimizing how best to use those primitives yourself.
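
To give a flavour of what those primitives look like when used directly, here is a rough sketch of a looping timer written against the Beam Python SDK. It is not the library's code: the DoFn name, the (key, value) element shape and the 60-second interval are all made up for illustration.

```python
import apache_beam as beam
from apache_beam.coders import FloatCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)
from apache_beam.utils.timestamp import Duration


class LoopingTimerFn(beam.DoFn):
    """Illustrative only: re-emits the last value seen for a key at a fixed
    interval, even when no new elements arrive for that key."""

    INTERVAL = Duration(seconds=60)  # made-up interval

    LAST_VALUE = ReadModifyWriteStateSpec('last_value', FloatCoder())
    LOOP_TIMER = TimerSpec('loop_timer', TimeDomain.WATERMARK)

    def process(self,
                element,  # expected shape: (key, value), e.g. ('EURUSD', 1.2345)
                timestamp=beam.DoFn.TimestampParam,
                last_value=beam.DoFn.StateParam(LAST_VALUE),
                loop_timer=beam.DoFn.TimerParam(LOOP_TIMER)):
        key, value = element
        last_value.write(value)
        # Arm (or re-arm) the timer to fire one interval after this element.
        loop_timer.set(timestamp + self.INTERVAL)
        yield element

    @on_timer(LOOP_TIMER)
    def on_loop(self,
                key=beam.DoFn.KeyParam,
                fire_timestamp=beam.DoFn.TimestampParam,
                last_value=beam.DoFn.StateParam(LAST_VALUE),
                loop_timer=beam.DoFn.TimerParam(LOOP_TIMER)):
        # Re-emit the last known value so downstream aggregations never see an
        # empty gap, then re-arm the timer so it keeps "looping".
        value = last_value.read()
        if value is not None:
            yield key, value
        loop_timer.set(fire_timestamp + self.INTERVAL)
```

Getting details like timer re-arming, state handling and gap filling right for every metric is exactly the boilerplate the library is meant to take off your hands.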

Distributed computing in Google Cloud solves the problem of snapshotting and computing all these metrics in “sync” and in parallel: the metrics are computed per key, distributed across many GCP Dataflow workers, and the results are then aggregated into scalable sinks such as GCP Pub/Sub or BigQuery.
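
As a sketch of how that fits together in a pipeline (the topic, table and field names below are placeholders, not the ones used in the notebook), the data is keyed by currency pair so Dataflow can spread the per-key work across workers and stream the results into BigQuery:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    _ = (
        pipeline
        # Unbounded source: one JSON tick per Pub/Sub message.
        | 'ReadTicks' >> beam.io.ReadFromPubSub(
            topic='projects/my-project/topics/fx-ticks')
        | 'Parse' >> beam.Map(json.loads)
        # Key by currency pair so each key's metrics can be computed on any worker.
        | 'KeyBySymbol' >> beam.Map(lambda tick: (tick['symbol'], tick['mid']))
        # ... windowing and metric aggregations go here, see the sketch in
        # "Computing in Time Series streaming framework" below ...
        | 'Format' >> beam.Map(lambda kv: {'symbol': kv[0], 'value': kv[1]})
        # Scalable sink for the computed metrics.
        | 'WriteMetrics' >> beam.io.WriteToBigQuery(
            'my-project:market_data.timeseries_metrics',
            schema='symbol:STRING,value:FLOAT',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```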

Google Dataflow

The Time Series library also provides mechanisms to fill the gaps when no events are received, for instance by defaulting to the last known value.

Last but not least, the same constructs can be used to recompute historical data to:

  • Backfill historical datasets, for example when new metrics are added or previously missed historical events arrive later.
  • Most importantly, hydrate the workers with previous events when cold starting a streaming pipeline.

Anything you don’t cover?

There are many engines and architecture patterns for batch and retrospective processing of time series data. The main objective of this Time Series library is to facilitate computing these metrics in real time and at scale when the data doesn't fit in memory, rather than focusing exclusively on batch, large-scale computation.

Let’s see!

Setup

Go to the Notebook on GitHub to follow the prerequisites and setup steps before trying the sections below yourself.

Computing in Time Series streaming framework
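
The notebook builds the streaming metrics with the Time Series library itself; since its API is not reproduced here, the snippet below is a plain Beam approximation of what it computes, a per-key moving average and standard deviation over sliding windows. The 60-second window, 10-second slide and the `keyed_prices` name (the (symbol, mid price) PCollection from the earlier skeleton) are assumptions.

```python
import math

import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows


class MeanStddevFn(beam.CombineFn):
    """Accumulates (count, sum, sum of squares) and emits the mean and the
    population standard deviation for each key and window."""

    def create_accumulator(self):
        return 0, 0.0, 0.0

    def add_input(self, accumulator, value):
        count, total, total_sq = accumulator
        return count + 1, total + value, total_sq + value * value

    def merge_accumulators(self, accumulators):
        counts, totals, totals_sq = zip(*accumulators)
        return sum(counts), sum(totals), sum(totals_sq)

    def extract_output(self, accumulator):
        count, total, total_sq = accumulator
        if count == 0:
            return {'moving_average': float('nan'), 'stddev': float('nan')}
        mean = total / count
        variance = max(total_sq / count - mean * mean, 0.0)
        return {'moving_average': mean, 'stddev': math.sqrt(variance)}


# 60-second windows that slide every 10 seconds, computed per currency pair.
metrics = (
    keyed_prices
    | 'SlidingWindows' >> beam.WindowInto(SlidingWindows(size=60, period=10))
    | 'MeanAndStddev' >> beam.CombinePerKey(MeanStddevFn()))
```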

Computing in Pandas
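
In Pandas the same two metrics come from a rolling window over the price series. The file name, column name and 60-observation window below are placeholders rather than the notebook's exact parameters.

```python
import pandas as pd

# Hypothetical FX mid prices indexed by timestamp.
df = pd.read_csv('eurusd_prices.csv', parse_dates=['timestamp'],
                 index_col='timestamp')

window = 60
df['moving_average'] = df['mid'].rolling(window).mean()
# ddof=0 gives the population standard deviation, matching the streaming
# sketch above; Pandas defaults to the sample standard deviation (ddof=1).
df['stddev'] = df['mid'].rolling(window).std(ddof=0)
```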

Comparing results
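
The notebook then lines up both outputs on the same timestamps and checks that the values agree. A minimal version of that check, assuming the Beam results have been loaded into a DataFrame called `beam_df` with matching column names, could look like this:

```python
import numpy as np

# Join the two result sets on timestamp so each row holds both computations.
comparison = beam_df.merge(df.reset_index(), on='timestamp',
                           suffixes=('_beam', '_pandas'))

# Values should match within floating-point tolerance; NaNs appear at the
# start of the series before a full window of data is available.
assert np.allclose(comparison['moving_average_beam'],
                   comparison['moving_average_pandas'],
                   atol=1e-9, equal_nan=True)
assert np.allclose(comparison['stddev_beam'],
                   comparison['stddev_pandas'],
                   atol=1e-9, equal_nan=True)
```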

Filling the gaps
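
On the Pandas side, the gap-filling behaviour described earlier (carry the last known value forward instead of assuming zero) can be reproduced with a resample followed by a forward fill. This continues from the Pandas snippet above (`df` and `window`); the 1-second grid is an assumption.

```python
import pandas as pd

# Snap the irregular ticks onto a fixed 1-second grid, then fill any empty
# slots with the last known value rather than zero, so the rolling
# statistics are not skewed by missing events.
filled = df['mid'].resample('1S').last().ffill()

filled_metrics = pd.DataFrame({
    'moving_average': filled.rolling(window).mean(),
    'stddev': filled.rolling(window).std(ddof=0),
})
```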

Next steps

As mentioned, run the Notebook and follow its steps to learn more about the library and how it can meet your needs, for example applying real-time analytics for anomaly detection. Feel free to contact your sales representative if you are interested in hearing more.

For more information about this and other reference patterns, please visit the official GCP website.

David Sabater Dinter — Product Manager @Google, Data nerd & co-founder of Data Reply UK