Analyzing time series data — SQL

Festus Nyarandi
6 min read · Jul 26, 2022


Introduction

More and more data is being collected from sensors, user interactions with web applications, and performance metrics. One thing all of these have in common is that they can be modeled as a series of events that happen over time. Analyzing time series data like this can be challenging, but SQL has evolved to include features that support just the kind of analysis we need.

By the end of this article, you will have learnt about time series data, ways of working with it using abstractions like sliding windows and tumbling windows, and how to use SQL constructs like OVER and PARTITION BY to simplify analysis. You'll also learn about the benefits of denormalizing time series data to avoid joins, and when to limit your use of denormalization. Finally, you'll be familiar with common analysis patterns, like moving averages, comparisons across periods, exponential smoothing, and forecasting with linear regression.

Let’s get started.

Characteristics of time series data

Time stamp. Time stamps usually include a date and then a time, sometimes down to the millisecond. In the time series data we're going to work with, the data will be generated at regular intervals, and each data point will have one or more measurements.

Now, when we talk about intervals, what we're really talking about is frequency: how often data points are sent to us. That can vary by application. For example, if you're measuring CPU utilization, you might be measuring in terms of seconds, or even microseconds. But if you're measuring something like births and deaths in the human population, annual data would probably be enough. Different types of time series data use different intervals, or different frequencies.

Unit of measure. In time series data we have measurements, and usually they're just numbers. But what do those numbers represent? That's the kind of thing you need to know when working with time series data, because the unit of measure is typically not included with the data. Since it's always the same, it would be redundant to carry the unit of measure along with each data point.

Some common units of measure are percentages, for example if you're looking at CPU utilization or free memory available. If you're looking at the number of units produced or customers served, the unit of measure is a count. If you're dealing with financial data, like company profit, you'll be in some kind of monetary unit, like dollars or euros. Again, the unit of measure will vary by application.

Types of measurements. A common one is a counter. Counters monotonically increase; for example, the number of cars that pass through a toll booth keeps going up and never goes back down. Another common metric type is a gauge, a numerical measure that can go up or down; the temperature of a room is a gauge. Summary values are calculated over some period of time, or time window, and these could be counts or rates. Less often, we'll see histograms, which are counts of items over buckets. We're going to work with time series in relational databases, so we'll be querying with SQL. I will briefly discuss some data modeling choices, and mostly we'll focus on time series analysis.
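To make the counter type concrete, here's a minimal sketch of turning a monotonically increasing counter into a per-interval count using the standard LAG window function; the toll_readings table and its columns are hypothetical.

-- cars_total is a counter: it only ever increases.
-- LAG fetches the previous row in time order, so the difference is the
-- number of cars seen during each interval (NULL for the first row).
SELECT
  reading_time,
  cars_total,
  cars_total - LAG(cars_total) OVER (ORDER BY reading_time) AS cars_this_interval
FROM toll_readings
ORDER BY reading_time;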

Writing time series data

Let’s look at some data modeling issues that we should consider when working with time series data.

The first phase of working with time series data is acquiring the data, often called the data ingest phase. Data is generated by a data source and arrives as a series of data points. We store those data points in a data store; in our case, a relational database. Typically, we want to write the data to disk in the order in which we receive it.
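As a sketch of what an ingest table might look like (the table, column names, and types here are illustrative, not prescribed):

-- Each row is one data point: a source, a time stamp, and a measurement.
CREATE TABLE cpu_metrics (
  host        VARCHAR(64)  NOT NULL,  -- the data source
  metric_time TIMESTAMP    NOT NULL,  -- the time stamp
  utilization NUMERIC(5,2) NOT NULL   -- the measurement; the unit of measure (a percentage) is implied
);

-- Data points are appended in the order they arrive.
INSERT INTO cpu_metrics (host, metric_time, utilization)
VALUES ('web-01', '2022-07-26 10:15:00', 42.50);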

Now, we have the potential for generating quite large tables, and one of the problems with that is that they can be difficult to query and manage. So one of the things we want to think about when working with time series data is how to horizontally partition these large tables, that is, split them by rows into smaller subtables. A partition is essentially a table in its own right.

There are different ways of partitioning. One common way is called range partitioning. When we use range partitioning, we break a large table up into partitions that don’t overlap. And typically, we’ll use the date or timestamp as the key to determine how to allocate data to different partitions. Now, we want to use partitioning by range when we often query the latest data, or if we’re doing a lot of comparative queries.

So for example, comparing average sales for this quarter to average sales in the same quarter last year. We also want to use range partitioning if we typically keep the data for a particular period of time and then drop it, or summarize the data but delete all of the detailed data. That's much easier to do when you have partitions organized by date. And all of these, querying the latest data, performing comparative queries, and dropping older data, are common in time series applications.
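Here's what that can look like using PostgreSQL's declarative partitioning, reusing the hypothetical cpu_metrics table from earlier; the exact syntax varies by database.

CREATE TABLE cpu_metrics (
  host        VARCHAR(64)  NOT NULL,
  metric_time TIMESTAMP    NOT NULL,
  utilization NUMERIC(5,2) NOT NULL
) PARTITION BY RANGE (metric_time);

-- Non-overlapping partitions, one per month, keyed on the time stamp.
CREATE TABLE cpu_metrics_2022_06 PARTITION OF cpu_metrics
  FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');
CREATE TABLE cpu_metrics_2022_07 PARTITION OF cpu_metrics
  FOR VALUES FROM ('2022-07-01') TO ('2022-08-01');

-- Dropping aged-out data becomes a cheap metadata operation instead of a
-- row-by-row DELETE.
DROP TABLE cpu_metrics_2022_06;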

Querying time series data

There are common query patterns we see when working with time series data. We often query the latest data. Another common practice is comparing time periods. For example, comparing the average revenue generated in a particular store on a particular day versus that same day in the prior month. Another common pattern is to summarize or perform some other aggregate function over a time window.
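Here are sketches of each of these patterns, again against the hypothetical cpu_metrics table, plus a hypothetical daily_sales(store_id, sale_date, revenue) table; note that DISTINCT ON is PostgreSQL-specific.

-- Latest data: the most recent reading per host.
SELECT DISTINCT ON (host) host, metric_time, utilization
FROM cpu_metrics
ORDER BY host, metric_time DESC;

-- Comparing time periods: each day's revenue next to the same day one month earlier.
SELECT cur.store_id,
       cur.sale_date,
       cur.revenue,
       prev.revenue AS revenue_prior_month
FROM daily_sales cur
JOIN daily_sales prev
  ON  prev.store_id = cur.store_id
  AND prev.sale_date = cur.sale_date - INTERVAL '1 month';

-- Aggregating over a time window: a three-point moving average per host,
-- using OVER and PARTITION BY.
SELECT host,
       metric_time,
       AVG(utilization) OVER (
         PARTITION BY host
         ORDER BY metric_time
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_avg
FROM cpu_metrics;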

Now, it's important to think about granularity when we talk about time series, and the granularity is defined by the frequency at which the data is generated. Typically, the latest data is very fine grained; for example, it might be at the second or minute level. Over time, as the data ages, we're less likely to want to query it at that fine grain. For example, we don't necessarily want to know what the CPU utilization was on a particular server six weeks ago at one in the afternoon. Instead, we're more likely to want summaries or aggregates over longer time intervals. And that's because we have different information drivers.

There are different questions we're trying to answer with that data. When we're working with the latest data, we want to know things like: what's the current CPU utilization, or why is throughput down? We're trying to get information we can use to address problems that need attention immediately. But when we're looking at older data, we're usually not trying to address immediate problems; we're looking at longer-range questions. For example: are we running efficiently? What's the average CPU utilization? That's the kind of question that can be answered with aggregate data, without the need to delve into really fine-grained, detailed data.
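A common way to act on this is to roll the fine-grained rows up into a summary table before dropping the detail. A sketch, assuming a hypothetical cpu_metrics_hourly summary table:

-- Summarize detail rows older than 30 days into hourly aggregates.
INSERT INTO cpu_metrics_hourly (host, hour_start, avg_utilization, max_utilization)
SELECT host,
       DATE_TRUNC('hour', metric_time),
       AVG(utilization),
       MAX(utilization)
FROM cpu_metrics
WHERE metric_time < NOW() - INTERVAL '30 days'
GROUP BY host, DATE_TRUNC('hour', metric_time);

-- The old detail rows (or entire date-range partitions) can then be dropped.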

….in progress


Festus Nyarandi

I'm a Data Specialist skilled in solving complex problems, with a focus on efficiency and performance optimization, currently working for Safaricom PLC.