Leveraging the Stargate Document API for Time Series Data in Cassandra

Author:

In this blog, you’ll learn how to use the Stargate Document API for Apache Cassandra® and build a simple TimeSeries DB for weather monitoring on top of the DataStax Astra DB. The demo focuses on the data model, data extraction, storage, and retrieval queries for the Weather monitoring app.

Introduction

for Apache Cassandra is a data API gateway deployed between the client applications and a database. The Stargate Document API modifies and queries data stored as unstructured JSON documents in collections, which gives the application developers native JSON support without having to give up any of the reliability and scalability goodness of Cassandra.

This Stargate Document API lets most Cassandra distributions (Cassandra 3.11, Cassandra 4.0, K8ssandra.io, DataStax Enterprise 6.8) work with JSON through a REST API. This details how Stargate Document API stores the JSON as C* column family data within Cassandra.

Time Series Data with the Cassandra Database

Cassandra’s data model works well with data in a sequence. That data can be variable in size, and Cassandra handles large amounts of data excellently.

Cassandra uses (LSM-Trees) to store its data, not B-Trees. The LSM tree data structure is perfect for high write volumes, turning updates and deletes into new writes. With the wrong data model, reads are often an order of magnitude slower than writes, which is exactly the opposite of B-Trees. Thus, Cassandra’s underlying storage mechanism is highly compatible with Time Series Data.

When writing data to Cassandra, data is sorted and written sequentially to disk. When retrieving data by row key and then by range, you get a fast and efficient access pattern due to minimal disk seeks. Time series data is an excellent fit for this type of pattern. For these examples, we’ll use a weather station that creates temperature data every minute. You’ll see how using the row key and sequence can be a powerful data modeling tool.

Building a Weather Monitoring App with Cassandra and Stargate

Let’s consider the data retrieval needs of a weather monitoring app built to consume data produced by the weather station. This application would provide reports such as those in the table below.

Cassandra is well suited for supporting these types of reports, especially when the format of the underlying weather data is relatively static over time. However, let’s consider the case where the contents of weather reports can vary.

What Purpose Does the Stargate Document API Serve?

The data provided by weather.com has a considerable amount of location, time, and weather attributes. New attributes can be added or removed from the response at any time. This example focuses on temperature-related characteristics for the locations, but can add additional attributes easily. Let’s start by querying for hierarchical attributes like:

  • location (country -> region -> city)
  • time granularity (monthly -> weekly -> daily)

This requires that the data storage format allows us to do at least two things:

  1. Accept any provided attributes at write time, and impose schema at read time.
  2. Store and query the data at hierarchical levels and not force us to flatten it before storing it in the data store.

Given these requirements, a Document-oriented approach with JSON formatted data would be suitable for this use case, which is where the Stargate Document API comes into play. The Document API allows the application to store arbitrary JSON documents in a collection, while under the hood, still leveraging the efficiencies Cassandra provides for reading and writing time series data.

Weather Application Architecture

Let’s look in more detail at a sample weather application.

API Specification

The application is a Next.js app located in Github. Below are the API endpoints exposed by this application and a sample query for each endpoint.

  1. /cities → POST endpoint to add a new city to the application for weather monitoring.
curl --location --request POST 'localhost:3000/api/cities' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "Austin"
}'

2. /fetcher → POST endpoint that triggers async pull of weather data for a given city and a date.

curl --location --request POST 'localhost:3000/api/fetcher' \
--header 'Content-Type: application/json' \
--data-raw '{
"city": "Seattle",
"numDays": 7,
"forecastDate": "2022-10-01"
}'

3. /report1 → POST endpoint fetching the weekly temperature report by day for a given location ().

curl --location --request POST 'localhost:3000/api/report1' \
--header 'Content-Type: application/json' \
--data-raw '{
"city": "London",
"date": "2022-11-01"
}'

4. /report2 → POST endpoint fetching the weekly temperature comparison for 2 given locations and a given week ().

curl --location --request POST 'localhost:3000/api/report2' \
--header 'Content-Type: application/json' \
--data-raw '{
"city1": "London",
"city2": "Seattle",
"date": "2022-11-01"
}'

Application Components

The figure below shows the components that make up our weather application and how it receives and stores data.

Figure 1: High-level data flow of the weather application

Let’s look at each of the numbered components below in more detail so you can run this example yourself.

(1) Data Source

The data source is which provides JSON APIs for and weather data for specified locations.

WeatherAPI.com API Keys and Authentication

In order to be able to access the weather data using the history API, you need to sign up at and get an API Key, as shown below.

Figure 2: Weather API API key

To view the request and the response format from the history API, open the from Weather API and input your API Key as below. To request historical data, you must input the API Key for each API call.

Figure 3: Weather API interactive API explorer

Here’s an example response from the forecast API for London on 2022–11–09:

Figure 4: An example of Weather API JSON forecast response

(2) Application Configuration

The application configuration is stored in files .env.* in the root folder and named according to the patterns described in the following table.

The table below describes the application configuration parameters.

*Instructions on how to generate an application token .

To configure your instance of the application, you’ll populate the values shown in the .env.local file.

ASTRA_DB_ID=<ASTRA_DB_ID>
ASTRA_DB_REGION=<ASTRA_DB_REGION>
ASTRA_DB_APPLICATION_TOKEN=<ASTRA_DB_APPLICATION_TOKEN>
ASTRA_DB_KEYSPACE=<ASTRA_DB_KEYSPACE>
WEATHER_API_KEY=<WEATHER_API_KEY>

(3) Weather Data Extractor (/fetch API)

Source / raw data nearly always needs some transformation when applied to a new use case. This application is no exception. The queries mentioned are often fetching the data at weekly granularity, whereas the data made available by the weather.com API is at a daily level.

The API can solve this, by fetching the data from the weatherapi.com API. Then, it converts day-level data as a member in the weekly granularity document for the particular city in Apache Cassandra as shown below.

The API uses the to store the weather data. Additionally, the API stores the weekly weather document with the key as <cityName_weekNumber>.

Figure 5: An example of JSON response stored in Stargate Collection

The document stored in C* via the Stargate Document API has the key as <city_weekNumber> seen below for city Boston and week 26:

{
"documentId": "Boston_26", ←— documentKey aka primary key
"data": {
"city": "Boston",
"week": 26,
"forecast": [
{
"date": "2022-10-24",
"date_epoch": 1666569600,
"stats": {
"day": {
"mintemp_f": 50,
"maxtemp_f": 68,
"humidity": 85
}
}
}
]
}
}

This makes weekly querying for city a point lookup using the documentKey.

(4) TimeSeries Queries (/report1, /report2 APIs)

Now let’s examine the querying capabilities of the Stargate Document API via a set of TimeSeries queries for each report from . Want to try the queries yourself? They are available as a on the DataStax Teamspace on the Postman API Network.

Report 1: Weekly temperature report by day for a given location and a given week

Report 1 is a point query using the primary key on the data model from the developers’ point of view and exposed as API endpoint. For example, you could use the following query to query daily min and max temperatures for Boston for week 26.

curl --location --request POST 'localhost:3000/api/report1' \
--header 'Content-Type: application/json' \
--data-raw '{
"city": "Seattle",
"date": "2022-11-01"
}'

The response to this query will look something like this:

{
"forecast": [
{
"date": "2022-10-30",
"stats": {
"day": {
"maxtemp_f": 53.2,
"mintemp_f": 47.8
}
}
},
{
"date": "2022-10-31",
"stats": {
"day": {
"maxtemp_f": 55.2,
"mintemp_f": 45.7
}
}
},...
]
}

The resulting plot in the app looks like this:

Figure 6: A graph showing a weekly temperature report

Report 2: Weekly temperature comparison for two given locations and a given week

Report 2 is a bulk get with filters example and exposed as API endpoint. For example, you could use the following query to get a weekly temperature comparison for Boston and Seattle for week 26.

curl --location --request POST 'localhost:3000/api/report2' \
--header 'Content-Type: application/json' \
--data-raw '{
"city1": "London",
"city2": "Seattle",
"date": "2022-11-01"
}'

The response to this query will look something like this:

{
"forecast": [
{
"date": "2022-10-30",
"stats": {
"city1":
{
"city": "London",
"day": {
"maxtemp_f": 53.2,
"mintemp_f": 47.8
}},
"city2":
{
"city": "Seattle",
"day": {
"maxtemp_f": 43.1,
"mintemp_f": 37.8
}}
}
},
{
"date": "2022-10-31",
"stats": {
"city1":
{
"city": "London",
"day": {
"maxtemp_f": 52.2,
"mintemp_f": 45.8
}},
"city2":
{
"city": "Seattle",
"day": {
"maxtemp_f": 47.1,
"mintemp_f": 39.8
}}
}
},...
]
}

Report 3: Min and max temperatures for a location and a week

Report 3 is an example of an aggregation query since you have to compute the min and max across the week. What we have stored is the min and max per day. As of today, aggregations are not supported by the and thus the onus lies on the client to do any aggregations. For example, to get the weekly min and max temperatures for Boston for week 26, the client would execute via the Document API to get the temperature values for the week and then compute the min and max locally.

Time Series Related Improvements for the Stargate Document API

Based on the analysis above, we can see several possible ways in which extending the Document API could support additional functionality and take on some of the work currently performed by the application:

  • Support for aggregation querying functions on the returned documents.
  • Namespace and Collection level metadata. For example, the number of documents in a collection, the number of rows, or a list of distinct keys. This data could be useful for computing page sizes or deciding whether to retrieve data synchronously or asynchronously at runtime.
  • Server-side sorting on a specific field.
  • Precomputed key level aggregates — average, median, min/max, etc. during ingestion of the documents to Stargate.

Conclusion

Time series is one of the compelling data models for Cassandra. It’s a natural fit for the big table data model and scales well under various variations. There are lots of machine-generated data these days, so there’s no shortage of use cases similar to the examples above. For those users, the problem of storing data at machine-generating speeds and still keeping it organized in a useful manner is no longer a challenge.

The Stargate Document API transforms Apache Cassandra into a Document DB, a full-fledged native JSON data service over Cassandra. With no upfront schema definition or data modeling, developers can handle unstructured, JSON while leveraging Cassandra’s write speed, reliability, and scalability guarantees for a production-ready data store.

The Document API provides a wide range of write and query operators — for equality, range queries, array contains, server-side array push/pop. Additionally, the developers can also query at all the nested levels without local computations.

Try this out on the Astra DB free plan in the , or locally as a .

Follow the for more developer stories. Check our channel for tutorials and DataStax Developers on for the latest news about our developer community.

Resources

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
DataStax

DataStax is the company behind the massively scalable, highly available, cloud-native NoSQL data platform built on Apache Cassandra®.