Using SingleStore as a Time Series Database
Abstract
SingleStore is a highly versatile database system. It is built upon Relational Technology and supports multi-model capabilities, such as Key-Value, JSON, Full-Text Search, Geospatial and Time Series.
This article will explore SingleStore’s support for Time Series data by using historical S&P 500 stock data from Kaggle. We’ll also build a quick dashboard to visualise candlestick charts using Streamlit.
The SQL scripts, Python code and notebook files used in this article are available on GitHub. The notebook files are available in DBC, HTML and IPython formats.
Introduction
Since the advent of Relational Database Technology, many new data management requirements have emerged. Luminaries, such as Martin Fowler, have proposed Polyglot Persistence as one solution for managing diverse data and data processing requirements, as shown in Figure 1.
However, Polyglot Persistence comes with costs and has attracted criticisms, such as:
In an often-cited post on polyglot persistence, Martin Fowler sketches a web application for a hypothetical retailer that uses each of Riak, Neo4j, MongoDB, Cassandra, and an RDBMS for distinct data sets. It’s not hard to imagine his retailer’s DevOps engineers quitting in droves.
— Stephen Pimentel
What I’ve seen in the past has been is if you try to take on six of these [technologies], you need a staff of 18 people minimum just to operate the storage side — say, six storage technologies. That’s not scalable and it’s too expensive.
— Dave McCrory
In recent years, there have also been proposals to implement a Polyglot Persistence architecture using micro-services. However, SingleStore can provide a simpler solution by supporting diverse data types and processing requirements in a single multi-model database system. This offers many benefits, such as lower TCO, less burden upon developers to learn multiple products, no integration pains, and more. In a series of articles, we'll discuss SingleStore's multi-model capabilities in more detail. We'll start with Time Series data.
To begin with, we need to create a free Managed Service account on the SingleStore website, and a free Community Edition (CE) account on the Databricks website. At the time of writing, the Managed Service account from SingleStore comes with $500 of Credits. This is more than adequate for the case study described in this article. For Databricks CE, we need to sign up for the free account rather than the trial version. We are using Spark because, in a previous article, we noted that Spark was great for ETL with SingleStore.
If you don’t have an account at Kaggle, create one and download the all_stocks_5yr.csv file. The Kaggle website states that this file is 29.58 MB in size. The dataset consists of the following fields:
- date: Spans a five-year daily period from 8 February 2013 until 7 February 2018. No missing values.
- open: Opening price. 11 missing values.
- high: High price. 8 missing values.
- low: Low price. 8 missing values.
- close: Closing price. No missing values.
- volume: Total shares traded. No missing values.
- name: Trading symbol. 505 unique values. No missing values.
For our initial exploration, we’ll select date, close and name.
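Before loading the data into Spark, it can be useful to sanity-check these field descriptions locally. The following is a minimal sketch using pandas, assuming the all_stocks_5yr.csv file has been downloaded to the current directory and that the column names match the field list above:

# Quick local sanity check of the Kaggle file with pandas.
import pandas as pd

df = pd.read_csv("all_stocks_5yr.csv", parse_dates = ["date"])

print(df.shape)                             # row and column counts
print(df["date"].min(), df["date"].max())   # expected range: 2013-02-08 to 2018-02-07
print(df.isna().sum())                      # missing values per column
print(df.nunique())                         # unique values per column, including trading symbols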
Configure Databricks CE
A previous article provides detailed instructions on how to Configure Databricks CE for use with SingleStore. We can use those exact instructions for this use case.
Upload CSV file
To use the CSV file, we need to upload it into the Databricks CE environment. A previous article provides detailed instructions on how to upload a CSV file. We can use those exact instructions for this use case.
Create the Database Table
In our SingleStore Managed Service account, let’s use the SQL Editor to create a new database. Call this timeseries_db, as follows:
CREATE DATABASE IF NOT EXISTS timeseries_db;
We’ll also create a table, as follows:
USE timeseries_db;
CREATE ROWSTORE TABLE IF NOT EXISTS tick (
ts DATETIME SERIES TIMESTAMP,
symbol VARCHAR(5),
price NUMERIC(18, 4),
KEY(ts)
);
Each row has a time-valued attribute called ts. We'll use DATETIME rather than DATETIME(6), since we are not working with fractional seconds in this example. SERIES TIMESTAMP specifies a table column as the default timestamp. We'll create a KEY on ts, as this will allow us to filter on ranges of values efficiently.
Fill out the Notebook
Let’s now create a new Databricks CE Python notebook. We’ll call it Data Loader for Time Series. We’ll attach our new notebook to our Spark cluster.
In a new code cell, let’s add the following code:
from pyspark.sql.types import *
tick_schema = StructType([
    StructField("ts", TimestampType(), True),
    StructField("open", DoubleType(), True),
    StructField("high", DoubleType(), True),
    StructField("low", DoubleType(), True),
    StructField("price", DoubleType(), True),
    StructField("volume", IntegerType(), True),
    StructField("symbol", StringType(), True)
])
This schema ensures that we have the correct column types.
We’ll create a new Dataframe in the next code cell, as follows:
tick_df = spark.read.csv("/FileStore/all_stocks_5yr.csv",
                         header = True,
                         schema = tick_schema)
This reads the CSV file and creates a Dataframe called tick_df. We also tell Spark that there is a header row and ask it to use the previously defined schema.
In the next code cell, let’s get the number of rows:
tick_df.count()
Executing this, we obtain the value 619040.
We’ll remove some of the columns based upon our earlier decision for the initial analysis, as follows:
tick_df = tick_df.drop("open", "high", "low", "volume")
and sort the data:
tick_df = tick_df.sort("ts", "symbol")
In the next code cell, we’ll take a look at the structure of the Dataframe:
tick_df.show(10)
The output should look like this:
+-------------------+-------+------+
| ts| price|symbol|
+-------------------+-------+------+
|2013-02-08 00:00:00| 45.08| A|
|2013-02-08 00:00:00| 14.75| AAL|
|2013-02-08 00:00:00| 78.9| AAP|
|2013-02-08 00:00:00|67.8542| AAPL|
|2013-02-08 00:00:00| 36.25| ABBV|
|2013-02-08 00:00:00| 46.89| ABC|
|2013-02-08 00:00:00| 34.41| ABT|
|2013-02-08 00:00:00| 73.31| ACN|
|2013-02-08 00:00:00| 39.12| ADBE|
|2013-02-08 00:00:00| 45.7| ADI|
+-------------------+-------+------+
only showing top 10 rows
We are now ready to write the Dataframe to SingleStore. In the next code cell, we can add the following:
%run ./Setup
In the Setup notebook, we need to ensure that the server address and password have been added for our SingleStore Managed Service cluster.
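If the Setup notebook is being created from scratch, it only needs to define these two values. Here is a minimal sketch, assuming the variable names cluster and password that we use below:

# Setup notebook (minimal sketch).
# Replace the placeholders with the host address and admin password
# of your SingleStore Managed Service cluster.
cluster = "<TO DO>"
password = "<TO DO>"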
In the next code cell, we’ll set some parameters for the SingleStore Spark Connector, as follows:
spark.conf.set("spark.datasource.singlestore.ddlEndpoint", cluster)
spark.conf.set("spark.datasource.singlestore.user", "admin")
spark.conf.set("spark.datasource.singlestore.password", password)
spark.conf.set("spark.datasource.singlestore.disablePushdown", "false")
Finally, we are ready to write the Dataframe to SingleStore using the Spark Connector:
(tick_df.write
    .format("singlestore")
    .option("loadDataCompression", "LZ4")
    .mode("ignore")
    .save("timeseries_db.tick"))
This will write the Dataframe to the tick table in the timeseries_db database. We can check that this table was successfully populated from SingleStore.
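As a quick check from the same notebook, we can also read the table back through the SingleStore Spark Connector and count the rows; the result should match the 619040 rows we loaded. This is a minimal sketch:

# Read the tick table back via the SingleStore Spark Connector and count the rows.
check_df = (spark.read
    .format("singlestore")
    .load("timeseries_db.tick"))

print(check_df.count())   # should print 619040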
Example Queries
Now that we have built our system, we can run some queries. SingleStore supports a range of useful functions for working with Time Series data. Let’s look at some examples.
Average Aggregate
The following query illustrates how to compute a simple average aggregate over all Time Series values in the table:
SELECT symbol, AVG(price)
FROM tick
GROUP BY symbol
ORDER BY symbol;
The output should be:
+--------+---------------+
| symbol | AVG(price) |
+--------+---------------+
| A | 49.20202542 |
| AAL | 38.39325226 |
| AAP | 132.43346307 |
| AAPL | 109.06669849 |
| ABBV | 60.86444003 |
... ...
Time Bucketing
Time bucketing aggregates and groups data for different time series over a fixed time interval. SingleStore supports several functions for this:
- FIRST: The value associated with the minimum timestamp.
- LAST: The value associated with the maximum timestamp.
- TIME_BUCKET: Normalizes time to the nearest bucket start time.
The documentation contains additional details and examples for each of these functions.
For instance, we can use TIME_BUCKET to find the average time series value grouped by five-day intervals, as follows:
SELECT symbol, TIME_BUCKET("5d", ts), AVG(price)
FROM tick
WHERE symbol = "AAPL"
GROUP BY 1, 2
ORDER BY 1, 2;
The output should be:
+--------+-----------------------+--------------+
| symbol | TIME_BUCKET("5d", ts) | AVG(price) |
+--------+-----------------------+--------------+
| AAPL | 2013-02-08 00:00:00.0 | 67.75280000 |
| AAPL | 2013-02-13 00:00:00.0 | 66.36943333 |
| AAPL | 2013-02-18 00:00:00.0 | 64.48960000 |
| AAPL | 2013-02-23 00:00:00.0 | 63.63516667 |
| AAPL | 2013-02-28 00:00:00.0 | 61.51996667 |
... ... ...
We can also combine these functions to create candlestick charts that show the high, low, open, and close for a stock over time, bucketed by a five-day window, as follows:
SELECT TIME_BUCKET("5d") AS ts,
symbol,
MIN(price) AS low,
MAX(price) AS high,
FIRST(price) AS open,
LAST(price) AS close
FROM tick
WHERE symbol = "AAPL"
GROUP BY 2, 1
ORDER BY 2, 1;
The output should be:
+------------+--------+----------+----------+----------+----------+
| ts | symbol | low | high | open | close |
+------------+--------+----------+----------+----------+----------+
| 2013-02-08 | AAPL | 66.8428 | 68.5614 | 67.8542 | 66.8428 |
| 2013-02-13 | AAPL | 65.7371 | 66.7156 | 66.7156 | 65.7371 |
| 2013-02-18 | AAPL | 63.7228 | 65.7128 | 65.7128 | 64.4014 |
| 2013-02-23 | AAPL | 63.2571 | 64.1385 | 63.2571 | 63.5099 |
| 2013-02-28 | AAPL | 60.0071 | 63.0571 | 63.0571 | 60.0071 |
... ... ... ... ... ...
Smoothing
We can smooth Time Series data using AVG as a windowed aggregate. Here is an example where we are looking at the price and the moving average of the price over the current and preceding three ticks:
SELECT symbol, ts, price, AVG(price)
OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS smoothed_price
FROM tick
WHERE symbol = "AAPL";
The output should be:
+--------+-----------------------+----------+----------------+
| symbol | ts | price | smoothed_price |
+--------+-----------------------+----------+----------------+
| AAPL | 2013-02-08 00:00:00.0 | 67.8542 | 67.85420000 |
| AAPL | 2013-02-11 00:00:00.0 | 68.5614 | 68.20780000 |
| AAPL | 2013-02-12 00:00:00.0 | 66.8428 | 67.75280000 |
| AAPL | 2013-02-13 00:00:00.0 | 66.7156 | 67.49350000 |
| AAPL | 2013-02-14 00:00:00.0 | 66.6556 | 67.19385000 |
... ... ... ...
AS OF
Finding a table row that is current AS OF a point in time is also a common Time Series requirement. This can easily be achieved using ORDER BY and LIMIT. Here is an example:
SELECT *
FROM tick
WHERE ts <= "2021-10-11 00:00:00"
AND symbol = "AAPL"
ORDER BY ts DESC
LIMIT 1;
The output should be:
+-----------------------+--------+----------+
| ts | symbol | price |
+-----------------------+--------+----------+
| 2018-02-07 00:00:00.0 | AAPL | 159.5400 |
+-----------------------+--------+----------+
Interpolation
Time Series data may have gaps. We can interpolate missing points. The SingleStore documentation provides an example stored procedure that can be used for this purpose when working with tick data.
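If we prefer to fill small gaps outside the database, the same idea can be sketched in pandas by resampling the data for one symbol to a fixed frequency and interpolating the missing prices. This is a minimal sketch, assuming the tick data for a single symbol has already been pulled into a DataFrame called df with columns ts and price:

# Interpolate gaps for one symbol in pandas, as an alternative to the
# stored procedure approach described in the SingleStore documentation.
import pandas as pd

df = df.set_index("ts").sort_index()

# Resample to a daily frequency; days with no tick become NaN,
# then fill them by linear interpolation between neighbouring prices.
interpolated = df["price"].resample("1D").mean().interpolate(method = "linear")

print(interpolated.head())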
Bonus: Streamlit Visualisation
Earlier, we mentioned candlestick charts, and it would be great to see these in graphical rather than tabular form. We can do this quite easily with Streamlit. A previous article showed how easily we can connect Streamlit to SingleStore.
Install the Required Software
We need to install the following packages:
streamlit
pandas
plotly
pymysql
These can be found in the requirements.txt file on GitHub. We can install them all as follows:
pip install -r requirements.txt
Example Application
Here is the complete code listing for streamlit_app.py:
# streamlit_app.py

import streamlit as st
import pandas as pd
import plotly.graph_objects as go
import pymysql

# Initialize connection.
def init_connection():
    return pymysql.connect(**st.secrets["singlestore"])

conn = init_connection()

symbol = st.sidebar.text_input("Symbol", value = "AAPL", max_chars = None, key = None, type = "default")
num_days = st.sidebar.slider("Number of days", 2, 30, 5)

# Perform query.
data = pd.read_sql("""
    SELECT TIME_BUCKET(%s) AS day,
           symbol,
           MIN(price) AS low,
           MAX(price) AS high,
           FIRST(price) AS open,
           LAST(price) AS close
    FROM tick
    WHERE symbol = %s
    GROUP BY 2, 1
    ORDER BY 2, 1;
""", conn, params = (str(num_days) + "d", symbol.upper()))

st.subheader(symbol.upper())

fig = go.Figure(data = [go.Candlestick(
    x = data["day"],
    open = data["open"],
    high = data["high"],
    low = data["low"],
    close = data["close"],
    name = symbol,
)])

fig.update_xaxes(type = "category")
fig.update_layout(height = 700)

st.plotly_chart(fig, use_container_width = True)
st.write(data)
Create a Secrets file
Our local Streamlit application will read secrets from a file called .streamlit/secrets.toml in our application's root directory. We need to create this file as follows:
# .streamlit/secrets.toml
[singlestore]
host = "<TO DO>"
port = 3306
database = "timeseries_db"
user = "admin"
password = "<TO DO>"
The <TO DO> for host and password should be replaced with the values obtained from the SingleStore Managed Service when creating a cluster.
Run the Code
We can run the Streamlit application as follows:
streamlit run streamlit_app.py
The output in a web browser should look like Figure 2.
On the web page, we can enter a new stock symbol in the text box and use the slider to change the number of days for TIME_BUCKET. Feel free to experiment with the code to suit your needs.
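For example, a small tweak would be to overlay a simple moving average of the closing price on the candlestick chart. Here is a sketch of the extra lines that could be added to streamlit_app.py after the candlestick figure is created and before it is plotted (the column name smoothed_close is just an illustration):

# Optional tweak: overlay a 3-bucket moving average of the closing price.
data["smoothed_close"] = data["close"].rolling(3).mean()

fig.add_trace(go.Scatter(
    x = data["day"],
    y = data["smoothed_close"],
    name = "Moving average",
))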
Summary
This article showed that SingleStore is a capable solution for working with Time Series data. Using the power of SQL and built-in functions, we can achieve a great deal. SingleStore has extended its support for Time Series with the addition of FIRST, LAST and TIME_BUCKET.
Acknowledgements
I thank Dr John Pickford for advice and pointers to suitable Time Series datasets.
I am also grateful to Part Time Larry for his excellent video on Streamlit — Building Financial Dashboards with Python and the GitHub code for inspiring the Streamlit Visualisation in this article.