Using SingleStore as a Time Series Database

Akmal Chaudhri
9 min read · Oct 11, 2021


Abstract

SingleStore is a very versatile database system. It is based upon Relational Technology and supports multi-model capabilities, such as Key-Value, JSON, Full-Text Search, Geospatial and Time Series.

This article will explore SingleStore’s support for Time Series data by using historical S&P 500 stock data from Kaggle. We’ll also build a quick dashboard to visualise candlestick charts using Streamlit.

The SQL scripts, Python code and notebook files used in this article are available on GitHub. The notebook files are available in DBC, HTML and iPython formats.

Introduction

Since the advent of Relational Database Technology, many new requirements for managing data have emerged. Luminaries, such as Martin Fowler, have proposed Polyglot Persistence as one solution for managing diverse data and data processing requirements, as shown in Figure 1.

Figure 1. Polyglot Persistence.

However, Polyglot Persistence comes with costs and has attracted criticisms, such as:

In an often-cited post on polyglot persistence, Martin Fowler sketches a web application for a hypothetical retailer that uses each of Riak, Neo4j, MongoDB, Cassandra, and an RDBMS for distinct data sets. It’s not hard to imagine his retailer’s DevOps engineers quitting in droves.
— Stephen Pimentel

and also:

What I’ve seen in the past has been is if you try to take on six of these [technologies], you need a staff of 18 people minimum just to operate the storage side — say, six storage technologies. That’s not scalable and it’s too expensive.
— Dave McCrory

In recent years, there have also been proposals for using micro-services to implement a Polyglot Persistence architecture. However, SingleStore can provide a simpler solution by supporting diverse data types and processing requirements in a single multi-model database system. This offers many benefits, such as lower TCO, less burden upon developers to learn multiple products, no integration pains, and more. In a series of articles, we’ll discuss SingleStore’s multi-model capabilities in more detail. We’ll start with Time Series data.

To begin with, we need to create a free Managed Service account on the SingleStore website, and a free Community Edition (CE) account on the Databricks website. At the time of writing, the Managed Service account from SingleStore comes with $500 of Credits. This is more than adequate for the case study described in this article. For Databricks CE, we need to sign up for the free account rather than the trial version. We are using Spark because, in a previous article, we noted that Spark was great for ETL with SingleStore.

If you don’t have an account at Kaggle, create one and download the all_stocks_5yr.csv file. The Kaggle website states that this file is 29.58 MB in size. The dataset consists of the following fields:

  • date: Spans a five-year daily period from 8 February 2013 until 7 February 2018. No missing values.
  • open: Opening price. 11 missing values.
  • high: High price. 8 missing values.
  • low: Low price. 8 missing values.
  • close: Closing price. No missing values.
  • volume: Total shares traded. No missing values.
  • name: Trading symbol. 505 unique values. No missing values.

For our initial exploration, we’ll select date, close and name.

Configure Databricks CE

A previous article provides detailed instructions on how to Configure Databricks CE for use with SingleStore. We can use those exact instructions for this use case.

Upload CSV file

To use the CSV file, we need to upload it into the Databricks CE environment. A previous article provides detailed instructions on how to upload a CSV file. We can use those exact instructions for this use case.

Create the Database Table

In our SingleStore Managed Service account, let’s use the SQL Editor to create a new database. Call this timeseries_db, as follows:

CREATE DATABASE IF NOT EXISTS timeseries_db;

We’ll also create a table, as follows:

USE timeseries_db;

CREATE ROWSTORE TABLE IF NOT EXISTS tick (
    ts DATETIME SERIES TIMESTAMP,
    symbol VARCHAR(5),
    price NUMERIC(18, 4),
    KEY(ts)
);

Each row has a time-valued attribute called ts. We’ll use DATETIME rather than DATETIME(6), since we are not working with fractional seconds in this example. SERIES TIMESTAMP specifies a table column as the default timestamp. We’ll create a KEY on ts as this will allow us to filter on ranges of values efficiently.

Fill out the Notebook

Let’s now create a new Databricks CE Python notebook. We’ll call it Data Loader for Time Series. We’ll attach our new notebook to our Spark cluster.

In a new code cell, let’s add the following code:

from pyspark.sql.types import *

tick_schema = StructType([
    StructField("ts", TimestampType(), True),
    StructField("open", DoubleType(), True),
    StructField("high", DoubleType(), True),
    StructField("low", DoubleType(), True),
    StructField("price", DoubleType(), True),
    StructField("volume", IntegerType(), True),
    StructField("symbol", StringType(), True)
])

This schema ensures that we have the correct column types.

We’ll create a new Dataframe in the next code cell, as follows:

tick_df = spark.read.csv("/FileStore/all_stocks_5yr.csv",
                         header = True,
                         schema = tick_schema)

This reads the CSV file and creates a Dataframe called tick_df. We also tell Spark that there is a header row and ask it to use the previously defined schema.

In the next code cell, let’s get the number of rows:

tick_df.count()

Executing this, we obtain the value 619040.

In line with our earlier decision, we’ll drop the columns we don’t need for the initial analysis, as follows:

tick_df = tick_df.drop("open", "high", "low", "volume")

and sort the data:

tick_df = tick_df.sort("ts", "symbol")

In the next code cell, we’ll take a look at the structure of the Dataframe:

tick_df.show(10)

The output should look like this:

+-------------------+-------+------+
| ts| price|symbol|
+-------------------+-------+------+
|2013-02-08 00:00:00| 45.08| A|
|2013-02-08 00:00:00| 14.75| AAL|
|2013-02-08 00:00:00| 78.9| AAP|
|2013-02-08 00:00:00|67.8542| AAPL|
|2013-02-08 00:00:00| 36.25| ABBV|
|2013-02-08 00:00:00| 46.89| ABC|
|2013-02-08 00:00:00| 34.41| ABT|
|2013-02-08 00:00:00| 73.31| ACN|
|2013-02-08 00:00:00| 39.12| ADBE|
|2013-02-08 00:00:00| 45.7| ADI|
+-------------------+-------+------+
only showing top 10 rows

We are now ready to write the Dataframe to SingleStore. In the next code cell, we can add the following:

%run ./Setup

In the Setup notebook, we need to ensure that the server address and password have been added for our SingleStore Managed Service cluster.
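If you don’t already have the Setup notebook from the previous article, a minimal sketch only needs to define the two variables referenced in the next code cell, with placeholders to be replaced by your own cluster details:

from pyspark.sql import SparkSession

# Setup notebook (minimal sketch).
# Replace the placeholders with the admin endpoint and password of your
# SingleStore Managed Service cluster.

cluster = "<TO DO>"
password = "<TO DO>"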

In the next code cell, we’ll set some parameters for the SingleStore Spark Connector, as follows:

spark.conf.set("spark.datasource.singlestore.ddlEndpoint", cluster)
spark.conf.set("spark.datasource.singlestore.user", "admin")
spark.conf.set("spark.datasource.singlestore.password", password)
spark.conf.set("spark.datasource.singlestore.disablePushdown", "false")

Finally, we are ready to write the Dataframe to SingleStore using the Spark Connector:

(tick_df.write
    .format("singlestore")
    .option("loadDataCompression", "LZ4")
    .mode("ignore")
    .save("timeseries_db.tick"))

This will write the Dataframe to the tick table in the timeseries_db database. We can check that this table was successfully populated from SingleStore.
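One quick way to check from the notebook is to read the table back through the SingleStore Spark Connector and count the rows. Here is a sketch, assuming the connector settings from the previous cell are still in effect:

# Sketch: read the tick table back from SingleStore and confirm that the
# row count matches the 619040 rows we loaded.
check_df = (spark.read
    .format("singlestore")
    .load("timeseries_db.tick"))

check_df.count()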

Example Queries

Now that we have built our system, we can run some queries. SingleStore supports a range of useful functions for working with Time Series data. Let’s look at some examples.

Average Aggregate

The following query illustrates how to compute a simple average aggregate over all Time Series values in the table:

SELECT symbol, AVG(price)
FROM tick
GROUP BY symbol
ORDER BY symbol;

The output should be:

+--------+---------------+
| symbol | AVG(price) |
+--------+---------------+
| A | 49.20202542 |
| AAL | 38.39325226 |
| AAP | 132.43346307 |
| AAPL | 109.06669849 |
| ABBV | 60.86444003 |
... ...

Time Bucketing

Time bucketing can aggregate and group data for different time series by a fixed time interval. SingleStore supports several functions:

  • FIRST: The value associated with the minimum timestamp. The documentation contains additional details and examples.
  • LAST: The value associated with the maximum timestamp. The documentation contains additional details and examples.
  • TIME_BUCKET: Normalizes time to the nearest bucket start time. The documentation contains additional details and examples.

For instance, we can use TIME_BUCKET to find the average time series value grouped by five-day intervals, as follows:

SELECT symbol, TIME_BUCKET("5d", ts), AVG(price)
FROM tick
WHERE symbol = "AAPL"
GROUP BY 1, 2
ORDER BY 1, 2;

The output should be:

+--------+-----------------------+--------------+
| symbol | TIME_BUCKET("5d", ts) | AVG(price) |
+--------+-----------------------+--------------+
| AAPL | 2013-02-08 00:00:00.0 | 67.75280000 |
| AAPL | 2013-02-13 00:00:00.0 | 66.36943333 |
| AAPL | 2013-02-18 00:00:00.0 | 64.48960000 |
| AAPL | 2013-02-23 00:00:00.0 | 63.63516667 |
| AAPL | 2013-02-28 00:00:00.0 | 61.51996667 |
... ... ...

We can also combine these functions to create candlestick charts that show the high, low, open, and close for a stock over time, bucketed by a five-day window, as follows:

SELECT TIME_BUCKET("5d") AS ts,
       symbol,
       MIN(price) AS low,
       MAX(price) AS high,
       FIRST(price) AS open,
       LAST(price) AS close
FROM tick
WHERE symbol = "AAPL"
GROUP BY 2, 1
ORDER BY 2, 1;

The output should be:

+------------+--------+----------+----------+----------+----------+
| ts | symbol | low | high | open | close |
+------------+--------+----------+----------+----------+----------+
| 2013-02-08 | AAPL | 66.8428 | 68.5614 | 67.8542 | 66.8428 |
| 2013-02-13 | AAPL | 65.7371 | 66.7156 | 66.7156 | 65.7371 |
| 2013-02-18 | AAPL | 63.7228 | 65.7128 | 65.7128 | 64.4014 |
| 2013-02-23 | AAPL | 63.2571 | 64.1385 | 63.2571 | 63.5099 |
| 2013-02-28 | AAPL | 60.0071 | 63.0571 | 63.0571 | 60.0071 |
... ... ... ... ... ...

Smoothing

We can smooth Time Series data using AVG as a windowed aggregate. Here is an example where we are looking at the price and the moving average of price over the current tick and the three preceding ticks:

SELECT symbol, ts, price, AVG(price)
OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS smoothed_price
FROM tick
WHERE symbol = "AAPL";

The output should be:

+--------+-----------------------+----------+----------------+
| symbol | ts | price | smoothed_price |
+--------+-----------------------+----------+----------------+
| AAPL | 2013-02-08 00:00:00.0 | 67.8542 | 67.85420000 |
| AAPL | 2013-02-11 00:00:00.0 | 68.5614 | 68.20780000 |
| AAPL | 2013-02-12 00:00:00.0 | 66.8428 | 67.75280000 |
| AAPL | 2013-02-13 00:00:00.0 | 66.7156 | 67.49350000 |
| AAPL | 2013-02-14 00:00:00.0 | 66.6556 | 67.19385000 |
... ... ... ...

AS OF

Finding a table row that is current AS OF a point in time is also a common Time Series requirement. This can be easily achieved using ORDER BY and LIMIT. Here is an example:

SELECT *
FROM tick
WHERE ts <= "2021-10-11 00:00:00"
AND symbol = "AAPL"
ORDER BY ts DESC
LIMIT 1;

The output should be:

+-----------------------+--------+----------+
| ts | symbol | price |
+-----------------------+--------+----------+
| 2018-02-07 00:00:00.0 | AAPL | 159.5400 |
+-----------------------+--------+----------+

Interpolation

Time Series data may have gaps. We can interpolate missing points. The SingleStore documentation provides an example stored procedure that can be used for this purpose when working with tick data.
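We won’t reproduce the stored procedure here, but as an illustrative sketch of the idea, here is how gaps in a small, made-up daily price series could be filled by linear interpolation using pandas:

import pandas as pd

# Illustrative sketch only: the values below are made up, and in practice the
# series would come from the tick table. Missing points are filled using
# time-weighted linear interpolation.
prices = pd.Series(
    [67.85, None, 66.84, None, None, 66.66],
    index = pd.date_range("2013-02-08", periods = 6, freq = "D"),
)

print(prices.interpolate(method = "time"))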

Bonus: Streamlit Visualisation

Earlier, candlestick charts were mentioned, and it would be great to see these in graphical rather than tabular form. We can do this quite easily with Streamlit. A previous article showed the ease with which we could connect Streamlit to SingleStore.

Install the Required Software

We need to install the following packages:

streamlit
pandas
plotly
pymysql

These can be found in the requirements.txt file on GitHub. Install the requirements as follows:

pip install -r requirements.txt

Example Application

Here is the complete code listing for streamlit_app.py:

# streamlit_app.py

import streamlit as st
import pandas as pd
import plotly.graph_objects as go
import pymysql

# Initialize connection.

def init_connection():
    return pymysql.connect(**st.secrets["singlestore"])

conn = init_connection()

symbol = st.sidebar.text_input("Symbol", value = "AAPL", max_chars = None, key = None, type = "default")
num_days = st.sidebar.slider("Number of days", 2, 30, 5)

# Perform query.

data = pd.read_sql("""
    SELECT TIME_BUCKET(%s) AS day,
           symbol,
           MIN(price) AS low,
           MAX(price) AS high,
           FIRST(price) AS open,
           LAST(price) AS close
    FROM tick
    WHERE symbol = %s
    GROUP BY 2, 1
    ORDER BY 2, 1;
    """, conn, params = (str(num_days) + "d", symbol.upper()))

st.subheader(symbol.upper())

fig = go.Figure(data = [go.Candlestick(
    x = data["day"],
    open = data["open"],
    high = data["high"],
    low = data["low"],
    close = data["close"],
    name = symbol,
)])

fig.update_xaxes(type = "category")
fig.update_layout(height = 700)

st.plotly_chart(fig, use_container_width = True)

st.write(data)

Create a Secrets file

Our local Streamlit application will read secrets from a file .streamlit/secrets.toml in our application’s root directory. We need to create this file as follows:

# .streamlit/secrets.toml

[singlestore]
host = "<TO DO>"
port = 3306
database = "timeseries_db"
user = "admin"
password = "<TO DO>"

The <TO DO> for host and password should be replaced with the values obtained from the SingleStore Managed Service when creating a cluster.

Run the Code

We can run the Streamlit application as follows:

streamlit run streamlit_app.py

The output in a web browser should look like Figure 2.

Figure 2. Streamlit.

On the web page, we can enter a new stock symbol in the text box and use the slider to change the number of days for TIME_BUCKET. Feel free to experiment with the code to suit your needs.

Summary

This article showed that SingleStore is a capable solution for working with Time Series data. Using the power of SQL and built-in functions, we can achieve a great deal. SingleStore has extended its support for Time Series with the addition of FIRST, LAST and TIME_BUCKET.

Acknowledgements

I thank Dr John Pickford for advice and pointers to suitable Time Series datasets.

I am also grateful to Part Time Larry for his excellent video on Streamlit — Building Financial Dashboards with Python and the GitHub code for inspiring the Streamlit Visualisation in this article.
