Serving huge amounts of data from ADX without killing your server

Llazi Gusho
Published in DataReply
Feb 26, 2024 · 9 min read

What is Azure Data Explorer?

Azure Data Explorer, sometimes abbreviated to ADX, is a fast and highly scalable data exploration service provided by Microsoft Azure. It is designed to handle and analyze large volumes of data in real-time, making it particularly well-suited for log and telemetry data, as well as other streaming and batch data sources. Azure Data Explorer supports a query language called Kusto Query Language (KQL), which allows users to perform complex queries and aggregations on their data. The service provides powerful analytical capabilities, including the ability to visualize and gain insights from large datasets quickly. It is commonly used in scenarios such as monitoring, diagnostics, and analyzing machine-generated data, offering a robust solution for organizations dealing with high-velocity and high-volume data streams.

Working with time-series data

Now that we have the description out of the way, we can go into more detail. The main use case for Azure Data Explorer is time-series data, and we have been using it with telemetry data. There are various storage solutions on offer for time-series data; the best-known ones are probably InfluxDB, Prometheus or Cassandra. Sometimes one is better off using files directly, if all of one's use cases revolve around the visualization of stored data.

Selecting the correct database

Selecting the most suitable time-series database for a specific project requires a nuanced understanding of the differences among various options.

InfluxDB, known for its simplicity and scalability, excels in handling high write and query loads, making it an excellent choice for real-time analytics and monitoring. On the other hand, Prometheus, an open-source solution, is favored for its robust monitoring and alerting capabilities, seamlessly integrating with Grafana for visualization. Elasticsearch, originally designed for full-text search, has evolved into a versatile analytics engine, particularly adept at handling log data. Graphite, a lightweight option, is renowned for its straightforward storage of time-series data and ease of integration with Grafana. OpenTSDB, built on Apache HBase, is celebrated for its scalability in big data scenarios. TimescaleDB leverages the reliability of PostgreSQL, while Apache Cassandra offers horizontal scalability for distributed environments. ClickHouse, a columnar database, stands out for its exceptional performance in analytical queries on extensive datasets. Each of these time-series databases brings unique strengths to the table, catering to diverse needs in the realm of data analytics and management. The choice ultimately hinges on factors such as scalability requirements, ease of use, community support, and alignment with the specific demands of the project at hand.

Where does Azure Data Explorer fit in this picture? It has a little bit of everything of the alternatives mentioned above. You can ingest high loads of data by using the scaling capabilities of the database, keep it in tiered storage, and run a wide array of analytical queries thanks to its own query language (KQL). When one is looking for a data storage solution that also gives users the possibility to run complex queries on that data, Azure Data Explorer might be the right choice.

Types of time-series data

Common examples of time-series data include stock prices over time, temperature readings at regular intervals, daily website traffic, and sensor data from IoT devices. Analyzing time-series data is crucial for understanding historical patterns, making informed decisions, and predicting future trends or events. Normally a user would analyze all of this data within the Azure Data Explorer environment; however, this is not always the case. Loading data into external systems for visualization or analysis purposes is also a common use case. And here is where the first bottlenecks start to appear. Exposing the required data to external systems poses a challenge of its own, especially when it is high-frequency data that has to be served in a timely manner. The sample data we are going to work with represents sensor data coming in from various sensors. To keep things simple, for this example we can generate some sample data in an ADX cluster using the following KQL query:

// one row per millisecond over 100 seconds of data -> 100,000 rows
range timestamp from 1700000000001 to 1700000100000 step 1
// convert the epoch milliseconds to a datetime, add a random value and a sensor label
| extend timestamp = unixtime_milliseconds_todatetime(tolong(timestamp)), value = rand(), sensor = "sensor1"

The above query can be tried on a free cluster, which you can connect to using the Azure Data Explorer web UI or Kusto Explorer. We will be showing screenshots from Kusto Explorer, but the results are displayed in a similar way in the Azure Data Explorer web UI as well.

This will generate random values with associated timestamps at 1000 Hz (one sample per millisecond), mimicking data captured from a sensor. The sample should look like this:

Sample records generated from the sample above

Proprietary applications

Connecting proprietary applications with ADX involves a strategic integration process to leverage the full potential of data analytics and insights. By seamlessly integrating proprietary applications with ADX, organizations can unlock a wealth of data stored within their applications, enabling more informed decision-making. ADX's capabilities in handling large volumes of diverse data make it an ideal platform for processing and analyzing data from various sources. Through this integration, businesses can establish a unified data ecosystem, allowing for real-time querying, visualization, and reporting. This can be achieved in two separate ways: either by giving the application direct access to the cluster, or by introducing an interface that serves data from the ADX cluster. We follow the second option for better modularity and decoupling of layers.

So an overview of the current solution should look like the following:

Sample data flow for data ingestion and querying

A separate layer ingests the data from the various sources, while an API server acts as a management layer for the queries run on the cluster.

Exposing data via a REST API Server

One of the first issues we saw when exposing this type of data to external systems was the memory usage of the various ADX clients. Due to how the clients for ADX work, the query results have to be buffered on the server itself before being served as a response to the client application. Our servers, built with Java and Kotlin and mainly serving data via a REST API interface, were seeing huge numbers of queries from users for various sensor data, and this caused problems with memory usage, frequently allocating and de-allocating a lot of memory for each API call. So we set out to analyze where the issue was.
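For context, a heavily simplified version of such an endpoint is sketched below: run the query, let the client materialize the primary result, map it to DTOs and serialize those back to JSON. This is only a sketch of the pattern, not our actual code; SensorDataEndpoint, SensorPoint, getSensorData and the Jackson ObjectMapper are illustrative placeholders, and the package and method names of the azure-kusto-data SDK are given as we remember using them and may differ between SDK versions.

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.microsoft.azure.kusto.data.Client;
import com.microsoft.azure.kusto.data.ClientFactory;
import com.microsoft.azure.kusto.data.KustoOperationResult;
import com.microsoft.azure.kusto.data.KustoResultSetTable;
import com.microsoft.azure.kusto.data.auth.ConnectionStringBuilder;

// Illustrative row DTO mirroring the columns of the sample query.
record SensorPoint(String timestamp, BigDecimal value, String sensor) { }

public class SensorDataEndpoint {

    private final ObjectMapper objectMapper = new ObjectMapper();
    private final Client kustoClient;
    private final String database;

    public SensorDataEndpoint(String clusterUrl, String database) throws Exception {
        // createWithUserPrompt is handy for experiments against a free cluster;
        // a production setup would authenticate with an AAD application instead
        this.kustoClient = ClientFactory.createClient(
                ConnectionStringBuilder.createWithUserPrompt(clusterUrl));
        this.database = database;
    }

    // Naive implementation: the whole primary result is materialized as Java
    // objects before a single byte is written back to the caller.
    public String getSensorData(String kqlQuery) throws Exception {
        KustoOperationResult result = kustoClient.execute(database, kqlQuery);
        KustoResultSetTable rows = result.getPrimaryResults();

        List<SensorPoint> points = new ArrayList<>();
        while (rows.next()) {
            // every cell is deserialized here; for our data the numeric values
            // end up as BigDecimal instances
            points.add(new SensorPoint(
                    rows.getString("timestamp"),
                    rows.getBigDecimal("value"),
                    rows.getString("sensor")));
        }

        // ...and the whole list is serialized a second time on the way out
        return objectMapper.writeValueAsString(points);
    }
}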

API Server behavior when data is retrieved from ADX

After the analysis it became clear that the Kusto Java Client was using a lot of memory to hold all of the information we request from the ADX cluster, and due to the nature of the data, most of it ended up as BigDecimals. So we had to come up with alternatives for how to get the data out of ADX and send it to the clients. The first thing we tried was reducing the payload. Since the data has a fixed rate, we could return a smaller payload by packing the data and returning only the necessary information, such as the start and end timestamps, the frequency and the data points. An example of such a payload can be generated with the following KQL query:

range timestamp from 1700000000001 to 1700000100000 step 1
| extend timestamp = unixtime_milliseconds_todatetime(tolong(timestamp)), value = rand(), sensor = "sensor1"
// pack the fixed-rate series per sensor: time boundaries, frequency and the list of values
| summarize minTimestamp = min(timestamp), maxTimestamp = max(timestamp), frequency = 1000, data = make_list(value) by sensor

The resulting data would look like this:

Sample structure for packing fixed rate time-series data

The API servers would then pack the data into a nice JSON and serve it to the application. While this reduced the payload, it still did not solve our issues with memory usage, as users started requesting more and more data.
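On the server side, this packed shape maps to a much smaller object graph than one DTO per sample. A purely illustrative record matching the columns produced by the summarize query above could look like this (the class name is our own):

import java.time.Instant;
import java.util.List;

// One object per sensor instead of one object per sample; field names match
// the columns produced by the summarize query above.
public record PackedSensorPayload(
        String sensor,
        Instant minTimestamp,
        Instant maxTimestamp,
        int frequency,
        List<Double> data) { }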

Then we decided to move the whole payload-packing operation to the ADX cluster itself. Basically, the JSON payload would be constructed in the cluster, so that the servers would not use so much memory. The following KQL query packs the data into a separate JSON object for each sensor:

range timestamp from 1700000000001 to 1700000100000 step 1
| extend timestamp = unixtime_milliseconds_todatetime(tolong(timestamp)), value = rand(), sensor = "sensor1"
| summarize minTimestamp = min(timestamp), maxTimestamp = max(timestamp), frequency = 1000, data = make_list(value) by sensor
// pack each per-sensor row into a single dynamic (JSON) column
| project payload = pack_all()

The resulting data looks like this:

Sample structure for the above KQL query

However, memory was still an issue on the servers, so we dug deeper to check what was going on. From the screenshots below of the memory usage recorded with Java Flight Recorder, one can see that occupying more than 900 MB of memory for a single call is not sustainable in the long run.

Memory usage for a single thread execution
Memory usage for the Kusto Java Client when the response is deserialized fully

This was mainly caused by the Kusto Client deserializing the stream and mapping the JSON to an internal data structure. Most of the memory, for example, is used by the BigDecimal data type due to the nature of the data.

Optimizing resource usages

We had to come up with a solution to skip this deserialization step of the Kusto Java Client. So we decided to move all of the formatting and transformations of the data to the ADX cluster. This would effectively mean that the server is only used to buffer the payload. No deserialization means fewer objects in memory and lower memory consumption.

The resulting dataset is wrapped inside a predefined structure that ADX uses for its responses. That wrapper also holds additional information, and we would like that info not to be returned. So we had to come up with a hybrid option.

That hybrid option deserializes the wrapper, but treats the dataset itself as a collection of Strings. This way we would not need to deserialize the whole response, but only the outer layers where the wrapper is.

The only way to do this is to return a single column where each cell already contains the data formatted in the desired schema. The Kusto Client then needs to treat each cell in the result as a String, but there is no setting on the Client itself to force this behavior. What the Client does, however, is read the data type from the resulting table.

We can check this data type by inspecting the schema of the resulting table. To do that, we append the getschema operator to our query, as shown below:

range timestamp from 1700000000001 to 1700000100000 step 1
| extend timestamp = unixtime_milliseconds_todatetime(tolong(timestamp)), value = rand(), sensor = "sensor1"
| summarize minTimestamp = min(timestamp), maxTimestamp = max(timestamp), frequency = 1000, data = make_list(value) by sensor
| project payload = pack_all()
// return the schema of the result instead of the rows
| getschema

You can see that the payload column is of type dynamic:

Resulting table schema executing the query above

This instructs the Kusto Client to still deserialize it. What we need is to force the column to have a string ColumnType, so that the Client treats each cell as a string and does not deserialize it. The KQL query above would be rewritten as follows:

range timestamp from 1700000000001 to 1700000100000 step 1
| extend timestamp = unixtime_milliseconds_todatetime(tolong(timestamp)), value = rand(), sensor = "sensor1"
| summarize minTimestamp = min(timestamp), maxTimestamp = max(timestamp), frequency = 1000, data = make_list(value) by sensor
// cast the packed payload to string so the client will not deserialize it
| project payload = tostring(pack_all())

The results would look like the following, with one line per sensor; in our case we only have one:

Requesting data as a string
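On the server side, the handling then becomes essentially pass-through: the client still deserializes the response wrapper, but each payload cell is read as plain text and concatenated into the response body. Below is a minimal sketch of this read loop, written as an additional method on the illustrative SensorDataEndpoint class from the earlier example; the JSON-array framing and the method name are our own choices, not part of the SDK.

// Partial deserialization: the payload column now has ColumnType string, so
// the Kusto client hands back the already-formatted JSON text for each sensor
// instead of mapping it into nested Java objects.
public String getPackedSensorData(String kqlQuery) throws Exception {
    KustoOperationResult result = kustoClient.execute(database, kqlQuery);
    KustoResultSetTable rows = result.getPrimaryResults();

    StringBuilder body = new StringBuilder("[");
    boolean first = true;
    while (rows.next()) {
        if (!first) {
            body.append(',');
        }
        body.append(rows.getString("payload")); // no further parsing happens here
        first = false;
    }
    body.append(']');

    // the buffer can be written to the HTTP response as application/json as-is
    return body.toString();
}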

After changing the implementation to use this query structure, we saw some surprising results in memory usage for the same dataset:

— Initial default setup: 906.63 MB of memory usage

— Partial deserialization: 30.57 MB of memory usage

This also had implications for the CPU usage of our application, which we will not cover here.

Memory usage for the Kusto Java Client when the response is deserialized partially

Results and outcome

In the end we observed a reduction of over 90% in memory usage for the same data, making the server more efficient and saving resources. The server now needs only a fraction of the memory it would have needed before, and can serve multiple clients at once within the same memory and CPU limits. Using the capabilities of the ADX cluster itself, and not relying on Java to manage the resulting payload, saved us a lot of headaches in the long run.
