Leveraging the Stargate Document API for Time Series Data in Cassandra
Author: Saurabh Verma
In this blog, you’ll learn how to use the Stargate Document API for Apache Cassandra® and build a simple TimeSeries DB for weather monitoring on top of the DataStax Astra DB. The demo focuses on the data model, data extraction, storage, and retrieval queries for the Weather monitoring app.
Introduction
Stargate for Apache Cassandra is a data API gateway deployed between the client applications and a database. The Stargate Document API modifies and queries data stored as unstructured JSON documents in collections, which gives the application developers native JSON support without having to give up any of the reliability and scalability goodness of Cassandra.
This Stargate Document API lets most Cassandra distributions (Cassandra 3.11, Cassandra 4.0, K8ssandra.io, DataStax Enterprise 6.8) work with JSON through a REST API. This deep dive details how Stargate Document API stores the JSON as C* column family data within Cassandra.
Time Series Data with the Cassandra Database
Cassandra’s data model works well with data in a sequence. That data can be variable in size, and Cassandra handles large amounts of data excellently.
Cassandra uses Log Structured Merge Trees (LSM-Trees) to store its data, not B-Trees. The LSM tree data structure is perfect for high write volumes, turning updates and deletes into new writes. With the wrong data model, reads are often an order of magnitude slower than writes, which is exactly the opposite of B-Trees. Thus, Cassandra’s underlying storage mechanism is highly compatible with Time Series Data.
When writing data to Cassandra, data is sorted and written sequentially to disk. When retrieving data by row key and then by range, you get a fast and efficient access pattern due to minimal disk seeks. Time series data is an excellent fit for this type of pattern. For these examples, we’ll use a weather station that creates temperature data every minute. You’ll see how using the row key and sequence can be a powerful data modeling tool.
Building a Weather Monitoring App with Cassandra and Stargate
Let’s consider the data retrieval needs of a weather monitoring app built to consume data produced by the weather station. This application would provide reports such as those in the table below.
Cassandra is well suited for supporting these types of reports, especially when the format of the underlying weather data is relatively static over time. However, let’s consider the case where the contents of weather reports can vary.
What Purpose Does the Stargate Document API Serve?
The data provided by weather.com has a considerable amount of location, time, and weather attributes. New attributes can be added or removed from the WeatherAPI.com response at any time. This example focuses on temperature-related characteristics for the locations, but can add additional attributes easily. Let’s start by querying for hierarchical attributes like:
- location (country -> region -> city)
- time granularity (monthly -> weekly -> daily)
This requires that the data storage format allows us to do at least two things:
- Accept any provided attributes at write time, and impose schema at read time.
- Store and query the data at hierarchical levels and not force us to flatten it before storing it in the data store.
Given these requirements, a Document-oriented approach with JSON formatted data would be suitable for this use case, which is where the Stargate Document API comes into play. The Document API allows the application to store arbitrary JSON documents in a collection, while under the hood, still leveraging the efficiencies Cassandra provides for reading and writing time series data.
Weather Application Architecture
Let’s look in more detail at a sample weather application.
API Specification
The application is a Next.js app located here in Github. Below are the API endpoints exposed by this application and a sample query for each endpoint.
- /cities → POST endpoint to add a new city to the application for weather monitoring.
curl --location --request POST 'localhost:3000/api/cities' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "Austin"
}'
2. /fetcher → POST endpoint that triggers async pull of weather data for a given city and a date.
curl --location --request POST 'localhost:3000/api/fetcher' \
--header 'Content-Type: application/json' \
--data-raw '{
"city": "Seattle",
"numDays": 7,
"forecastDate": "2022-10-01"
}'
3. /report1 → POST endpoint fetching the weekly temperature report by day for a given location (Report 1).
curl --location --request POST 'localhost:3000/api/report1' \
--header 'Content-Type: application/json' \
--data-raw '{
"city": "London",
"date": "2022-11-01"
}'
4. /report2 → POST endpoint fetching the weekly temperature comparison for 2 given locations and a given week (Report 2).
curl --location --request POST 'localhost:3000/api/report2' \
--header 'Content-Type: application/json' \
--data-raw '{
"city1": "London",
"city2": "Seattle",
"date": "2022-11-01"
}'
Application Components
The figure below shows the components that make up our weather application and how it receives and stores data.
Let’s look at each of the numbered components below in more detail so you can run this example yourself.
(1) Data Source
The data source is WeatherAPI.com which provides JSON APIs for historical and forecast weather data for specified locations.
WeatherAPI.com API Keys and Authentication
In order to be able to access the weather data using the history API, you need to sign up at WeatherAPI.com and get an API Key, as shown below.
To view the request and the response format from the history API, open the API Explorer tool from Weather API and input your API Key as below. To request historical data, you must input the API Key for each API call.
Here’s an example response from the forecast API for London on 2022–11–09:
(2) Application Configuration
The application configuration is stored in files .env.* in the root folder and named according to the patterns described in the following table.
The table below describes the application configuration parameters.
To configure your instance of the application, you’ll populate the values shown in the .env.local file.
ASTRA_DB_ID=<ASTRA_DB_ID>
ASTRA_DB_REGION=<ASTRA_DB_REGION>
ASTRA_DB_APPLICATION_TOKEN=<ASTRA_DB_APPLICATION_TOKEN>
ASTRA_DB_KEYSPACE=<ASTRA_DB_KEYSPACE>
WEATHER_API_KEY=<WEATHER_API_KEY>
(3) Weather Data Extractor (/fetch API)
Source / raw data nearly always needs some transformation when applied to a new use case. This application is no exception. The queries mentioned here are often fetching the data at weekly granularity, whereas the data made available by the weather.com API is at a daily level.
The /fetch API can solve this, by fetching the data from the weatherapi.com API. Then, it converts day-level data as a member in the weekly granularity document for the particular city in Apache Cassandra as shown below.
The /fetch API uses the Stargate Document API Collections to store the weather data. Additionally, the /fetch API stores the weekly weather document with the key as <cityName_weekNumber>.
The document stored in C* via the Stargate Document API has the key as <city_weekNumber> seen below for city Boston and week 26:
{
"documentId": "Boston_26", ←— documentKey aka primary key
"data": {
"city": "Boston",
"week": 26,
"forecast": [
{
"date": "2022-10-24",
"date_epoch": 1666569600,
"stats": {
"day": {
"mintemp_f": 50,
"maxtemp_f": 68,
"humidity": 85
}
}
}
]
}
}
This makes weekly querying for city a point lookup using the documentKey.
(4) TimeSeries Queries (/report1, /report2 APIs)
Now let’s examine the querying capabilities of the Stargate Document API via a set of TimeSeries queries for each report from here. Want to try the queries yourself? They are available as a Postman collection on the DataStax Teamspace on the Postman API Network.
Report 1: Weekly temperature report by day for a given location and a given week
Report 1 is a point query using the primary key on the data model from the developers’ point of view and exposed as /report1 API endpoint. For example, you could use the following query to query daily min and max temperatures for Boston for week 26.
curl --location --request POST 'localhost:3000/api/report1' \
--header 'Content-Type: application/json' \
--data-raw '{
"city": "Seattle",
"date": "2022-11-01"
}'
The response to this query will look something like this:
{
"forecast": [
{
"date": "2022-10-30",
"stats": {
"day": {
"maxtemp_f": 53.2,
"mintemp_f": 47.8
}
}
},
{
"date": "2022-10-31",
"stats": {
"day": {
"maxtemp_f": 55.2,
"mintemp_f": 45.7
}
}
},...
]
}
The resulting plot in the app looks like this:
Report 2: Weekly temperature comparison for two given locations and a given week
Report 2 is a bulk get with filters example and exposed as /report2 API endpoint. For example, you could use the following query to get a weekly temperature comparison for Boston and Seattle for week 26.
curl --location --request POST 'localhost:3000/api/report2' \
--header 'Content-Type: application/json' \
--data-raw '{
"city1": "London",
"city2": "Seattle",
"date": "2022-11-01"
}'
The response to this query will look something like this:
{
"forecast": [
{
"date": "2022-10-30",
"stats": {
"city1":
{
"city": "London",
"day": {
"maxtemp_f": 53.2,
"mintemp_f": 47.8
}},
"city2":
{
"city": "Seattle",
"day": {
"maxtemp_f": 43.1,
"mintemp_f": 37.8
}}
}
},
{
"date": "2022-10-31",
"stats": {
"city1":
{
"city": "London",
"day": {
"maxtemp_f": 52.2,
"mintemp_f": 45.8
}},
"city2":
{
"city": "Seattle",
"day": {
"maxtemp_f": 47.1,
"mintemp_f": 39.8
}}
}
},...
]
}
Report 3: Min and max temperatures for a location and a week
Report 3 is an example of an aggregation query since you have to compute the min and max across the week. What we have stored is the min and max per day. As of today, aggregations are not supported by the Stargate Document API and thus the onus lies on the client to do any aggregations. For example, to get the weekly min and max temperatures for Boston for week 26, the client would execute /report1 via the Document API to get the temperature values for the week and then compute the min and max locally.
Time Series Related Improvements for the Stargate Document API
Based on the analysis above, we can see several possible ways in which extending the Document API could support additional functionality and take on some of the work currently performed by the application:
- Support for aggregation querying functions on the returned documents.
- Namespace and Collection level metadata. For example, the number of documents in a collection, the number of rows, or a list of distinct keys. This data could be useful for computing page sizes or deciding whether to retrieve data synchronously or asynchronously at runtime.
- Server-side sorting on a specific field.
- Precomputed key level aggregates — average, median, min/max, etc. during ingestion of the documents to Stargate.
Conclusion
Time series is one of the compelling data models for Cassandra. It’s a natural fit for the big table data model and scales well under various variations. There are lots of machine-generated data these days, so there’s no shortage of use cases similar to the examples above. For those users, the problem of storing data at machine-generating speeds and still keeping it organized in a useful manner is no longer a challenge.
The Stargate Document API transforms Apache Cassandra into a Document DB, a full-fledged native JSON data service over Cassandra. With no upfront schema definition or data modeling, developers can handle unstructured, semi-structured, and structured JSON while leveraging Cassandra’s write speed, reliability, and scalability guarantees for a production-ready data store.
The Document API provides a wide range of write and query operators — for equality, range queries, array contains, server-side array push/pop. Additionally, the developers can also query at all the nested levels without local computations.
Try this out on the Astra DB free plan in the sample app gallery, or locally as a Node JS app.
Follow the DataStax Tech Blog for more developer stories. Check our YouTube channel for tutorials and DataStax Developers on Twitter for the latest news about our developer community.
Resources
- Data Modeling by Example: Time Series Data Model
- Advanced Time Series Data Modelling
- Time Series Modeling at Massive Scale
- DateTieredCompactionStrategy (DTCS)
- On Demand Webinar: Time Series Data Management at Scale with Apache Cassandra™
- Apache Cassandra and Timeseries: BFF’s?
- Join Stargate Discord
- Stargate Twitter
- Stargate Github discussions
- DataStax Certifications
- DataStax Workshops