Advanced SQL: Time Series Analysis

Mohammad Nurrokim
Analytics Vidhya
Published in
9 min readAug 31, 2021

--

Photo by Aron Visuals on Unsplash

Time Series

Time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.

Time series data is a sequence of data points. Each of data points includes a time stamp. Time stamps usually include a date and then a time.

Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.

While regression analysis is often employed in such a way as to test relationships between one or more different time series, this type of analysis is not usually called “time series analysis”, which refers in particular to relationships between different points in time within a single series.

Time Series Use Case

  • Stock market data
  • Tide time tables
  • Performance monitoring
  • Health monitoring
  • Population statistics
  • Business performance

Time Series Analysis Using SQL(?)

Somehow, Python or R are most often and knows as tools to conduct deep time series analysis. But, knowledge of how to work with time series data using SQL is essential. It’s important when working with very large datasets or data that is constantly being updated.

As Data Scientist, we already know that SQL is one of popular language to work with dataset. SQL is a widely known, well documented, and expressive querying language, and the 3rd most popular development language as of writing. It’s also easy for organizations to adopt and integrate with other tools. For these reasons, I we believe SQL is one of the best language for working with your time-series data.

In this post, we’ve rounded up a few of our top tips and reasons why we love SQL for time-series analysis, including technical resources and guidance to help you how to work with time series dataset. As I said before that SQL is easy to integrate with other tools. In this post, I wouldn’t use MySQL Workbench. I prefer to integrate Python and MySQL Server.

Time Series Dataset: Superstore

I have downloaded time series dataset from Kaggle. You also can download this dataset here: Superstore Time Series Dataset. In this post, I only use superstore_test.csv.

Data Preprocessing

Before we import this dataset (superstore.csv) to SQL Server, we need to do data preprocessing first.

  1. Convert Order Data and Ship Date using date format: %Y-%m-%d %H:%M:%S
  2. Rename columns name. In SQL we can’t have column name with space. So we need convert it with underscore.

After these data processing was done, save it to new csv file.

Import CSV File into SQL Server

This tutorial will guide you how to import superstore.csv (Data Preprocessing Result) to SQL Server. This guidance using MySQL Workbench and will guide you step by step.

Click New Schema button and type Schema’s Name. In this post, I use ‘Superstore’ name.
Click “Apply”
Right Click on “Superstore” Schema, and then Click “Table Data Import Wizard”
Browse where is your CSV file after Data Preprocessing. Then, click “Next”.
Click “Create New Table”. Then give table’s name. In this post, I would give “superstore” as my table’s name.
Don’t forget to change “File Type” in “Order Date” and “Ship Date” Columns. Then click “Next”.
Just wait. After finished, click “Next”.
See, there are 7762 records imported. We have already finished. Click “Finish”.
Try SELECT statement to see your new “Superstore” table.

Congrats! Your dataset is ready.

In this post, I don’t use MySQL Workbench to analyze time series data. But, I would to use Python MySQL Connector. How to use it? I’ll explain it below.

Python MySQL Connector

MySQL Connector enables Python programs to access SQL Server. First, we create connection to SQL Server as Python Function. Then, use this function to write SQL query.

Life is hard. Don’t make it harder.

To make it easier for us to write SQL queries, we create a function. See this code below.

Commonly Used Functions for Time Series Analysis

Time series dataset was ready. Function to write SQL queries in notebook can’t wait to use.

Working with time series dataset need some SQL Function. I will introduce LEAD, LAG, RANK, and PERCENT_RANK as your new weapons.

1. LEAD

LEAD() is a window function that provides access to a row at a specified physical offset which follows the current row. For example, by using the LEAD() function, from the current row, you can access data of the next row, or the row after the next row, and so on.

The LEAD() function can be very useful for comparing the value of the current row with the value of the following row.

In this query, we create new column named sales_lead. We used LEAD to show next sales value in sales_lead. See full queries below.

2. LAG

LAG() is a window function that provides access to a row at a specified physical offset which comes before the current row. In other words, LAG is same with LEAD, but different direction.

By using the LAG() function, from the current row, you can access data of the previous row, or from the second row before the current row, and so on.

In this query, we create new column named sales_lag. We used LAG to show previous sales value in sales_lag. See full queries below.

3. Rank

The RANK() function is a window function that assigns a rank to each row within a partition of a result set. The rows within a partition that have the same values will receive the same rank. The rank of the first row within a partition is one.

In query below, we add new column named sales_rank to create ranking from biggest sales (descending).

4. PERCENT_RANK

The PERCENT_RANK() is actually has same objective with RANK function. But, PERCENT_RANK calculates the percentile ranking of rows in a result set. This function returns a percentile ranking number which ranges from zero to one.

For a specific row, PERCENT_RANK() uses the following formula to calculate the percentile rank:

(rank - 1) / (total_rows - 1)

Based on this formula, the PERCENT_RANK() function always returns zero for the first row the result set. See sales_rank column below as PERCENT_RANK output.

Time Series Analysis

It’s better if we have a lot of tools in toolbox to working with time series dataset. Beside function I mentioned before, I think you also need to understand advanced SQL queries: Common Table Expression (CTE) and Window Function.

Basically, we will use CTE and Window Function in this time series analysis. I hope all of you get more bonus from me, so add how to compare previous value, moving average, and linear regression.

1. Common Table Expression (CTE)

A Common Table Expression (CTE) allows you to define a temporary named result set.

This result set available temporarily in the execution scope of a statement such as SELECT, INSERT, UPDATE, DELETE, or MERGE. The CTE can also be used in a View.

a) Monthly Sales Average using DATE_FORMAT Function

Query below create temporary table to show monthly sales average using CTE. And to change date format, we use DATE_FORMAT().

b) Monthly Sales Average using EXTRACT Function

This query is actually same with previous one, but we use different function to manipulate date format. See query below, we use EXTRACT function.

c) Daily Sales Average during 2014–2017

Clear enough. This query will show you Daily Sales Average.

d) Sales Average from 1st -31 every month

It’s quite different. This query will serve you sales averages in first day, second day, third day, and so on every month. See full query below.

2. Calculating Aggregates Over Windows Function

Window function is a function which uses values from one or multiple rows to return a value for each row.

This contrasts with an aggregate function, which returns a single value for multiple rows.

Window functions have an OVER clause. Please keep in mind that any function without an OVER clause is not a window function, but rather an aggregate or single-row (scalar) function.

Window functions operate on a set of rows and return a single aggregated value for each row. And “Window” term describes the set of rows in the database on which the function will operate. We define the Window (set of rows on which functions operates) using an OVER() clause.

Types of Window Functions:

  1. Aggregate Window Functions
    SUM(), MAX(), MIN(), AVG(). COUNT()
  2. Ranking Window Functions
    RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()
  3. Value Window Functions
    LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()

Average Daily Profit using Aggregate Window Function

This query below is aggregate Window Function example. We use AVG() and OVER() to show Daily Profit Average. “Daily” term used because we use Order_Date column in PARTITION BY.

3. Previous Day Comparison

How to compare aggregate value in day by day? In this query sample, we compare daily discount average and daily sales average with previous one using CTE.

Comparing Average Discount with Previous Day

Comparing Average Sales with Previous Day

4. Moving Averages

Moving Averages is advanced SQL use case. Calculating the moving average is useful to get an estimate about the trends in a data set. The moving average is the average of any subset of numbers over a period of time. The MOVING AVERAGE is a time series technique for analyzing and determining trends in data.

Sometimes Moving Averages called rolling means, rolling averages, or running averages. This functions are calculated as the mean of the current and a specified number of immediately preceding values for each point in time.

Moving Averages using = Window Function (OVER) function + ORDER BY() Function + BETWEEN (……A…….) AND (……B…….).

A and B can be filled by:

  1. CURRENT ROW: active row or latest row
  2. FOLLOWING : following/next row
  3. PRECEDING: previous row
  4. UNBOUDNED FOLLOWING: last row
  5. UNBOUNDED PRECEDING: first row

Moving Average and Summary of Profit

This query below, we create new columns: Cumulative Sum Profit and Cumulative Average Profit using Moving Overage. As we see in this query, Moving Average using Aggregate Window Function (SUM/AVG + OVER).

5. Forecasting with Linear Regression

Linear regression is a linear approach for modelling the relationship between dependent and independent variables. The case of one independent variables is called simple linear regression.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome’ or ‘response’ variable) and one or more independent variables (often called ‘predictors’, ‘covariates’, ‘explanatory variables’ or ‘features’).

The most common form of regression analysis is linear regression. This regression finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion.

Linear Regression Formula:

c = intercept
c is the constant value — this part of the function does not change.

m = slope
m is the gradient of the line.

In this case, discount is feature or x variable. Profit is target value or y variable. First move, we need to calculate x bar and y bar which is needed to find slope and intercept values.

Second move, find intercept and slope values using this query.

Finally, use slope and intercept values to show profit trend line. See this query.

How is the relationship between discount and trendline profit? Check out this scatter chart. The bigger discount we give to customers, it will lower our profit. It’s make sense.

So, how is profit trend line monthly? Let’s see this line plot.

So, it’s all functions in Advanced SQL and query example how to analyze time series dataset.

Thank you for reading this post. I hope you get actionable insight and easily guide you to analyze your dataset. See you in my next post.

--

--

Mohammad Nurrokim
Analytics Vidhya

Data Analytics | Contributing Writer at Analytics Vidhya | Senior Data Analyst at Tokopedia (GoTo Company)