Exploring the structure and execution speed of Machbase Neo’s Transforming Query Language (TQL).


Machbase · 8 min read · Dec 14, 2023

● Introduction
● Overview of TQL
● Internal Structure and Execution Speed
● Conclusion

Introduction

Machbase Neo has a data conversion language inside the database engine, so you don’t need to install a separate data conversion application when you input or output data. In this post, I would like to briefly explain what this data conversion language is, and also address a concern some readers may have: whether it affects the speed of the database engine.

Overview of TQL

TQL stands for Transforming Query Language. To build a data platform for sensor data, you first have to be able to properly collect and process the data coming from the sensors. And when you need to transfer data from your database to other systems, you have to do it in the right format.

The purpose of TQL is to allow you to do this by converting data formats without developing additional applications.

We’ll talk about this in more detail below. As application developers, we typically follow a similar pattern when building applications on top of a database. The process begins with querying the database and retrieving data in tabular form (rows and columns), then converting it into the desired structure, manipulating it, and finally presenting the result in the requested format, such as JSON, CSV, or a chart.

TQL simplifies this process to just a few lines of script. Additionally, other applications can call a TQL script via an HTTP endpoint, treating it as if it were an executable API.
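As a rough illustration of that endpoint-style usage, here is a minimal Go sketch. It assumes machbase-neo is running locally on its default HTTP port (5654 in the docs I’ve seen) and that saved scripts are served under the /db/tql/ path; the script name my-report.tql is hypothetical and error handling is kept to a bare minimum.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumption: machbase-neo listens on 127.0.0.1:5654 and serves saved TQL
	// scripts under /db/tql/. "my-report.tql" is a hypothetical script whose
	// SINK is CSV(), so the response body is plain CSV text.
	resp, err := http.Get("http://127.0.0.1:5654/db/tql/my-report.tql")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The transformed result, ready to hand off to another system.
	fmt.Println(string(body))
}
```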

TQL (Transforming Query Language) serves as a DSL for data manipulation. A record consists of a key and a value. The key is typically an auto-generated sequential integer, similar to ROWNUM in a query result, while the value is a tuple containing the actual data fields.

A TQL script starts with a SRC (source) function, which defines how to retrieve data and generates records from the raw input. It must end with a SINK function, which defines how to output the records.

In some cases a TQL script needs to transform records along the way, whether that means mathematical calculations, simple string concatenation, or interaction with external databases. Those tasks are defined in MAP functions.

So, a TQL script must start with a SRC, end with a SINK, and can have zero or more MAP functions in between.

SRC

There are several SRC functions. For example, SQL() produces records by querying the machbase-neo database, or even external (bridged) databases, with a given SQL statement. FAKE() generates artificial data. CSV() reads CSV data, and BYTES() reads arbitrary binary data from the file system, a client’s HTTP request, or an MQTT payload.

SINK

The most basic SINK function is INSERT(), which writes the incoming records into the machbase-neo database. The CHART() family of functions renders a chart from the incoming records, while JSON() and CSV() encode the incoming data into the corresponding formats.

MAP

MAP functions are the core of transforming data. Functions such as MAPVALUE() rewrite individual fields of a record, and FILTER() drops records that do not satisfy a condition; a small end-to-end sketch combining SRC, MAP, and SINK follows below.
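For instance, the following Go sketch posts a three-node script for ad-hoc execution: SQL() as the SRC, MAPVALUE() as a MAP, and CSV() as the SINK. The /db/tql endpoint, the table name example, and the unit conversion are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// A three-node chain: SQL() is the SRC, MAPVALUE() is a MAP that converts
	// the third column (index 2) from Celsius to Fahrenheit (an invented
	// transformation), and CSV() is the SINK. The table "example" is assumed.
	script := strings.Join([]string{
		"SQL(`select name, time, value from example limit 10`)",
		"MAPVALUE(2, value(2) * 1.8 + 32)",
		"CSV()",
	}, "\n")

	// Assumption: ad-hoc scripts can be POSTed to /db/tql on the default port.
	resp, err := http.Post("http://127.0.0.1:5654/db/tql", "text/plain", strings.NewReader(script))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // CSV rows produced by the SINK
}
```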

For more information about the machbase-neo TQL language, please refer to the official technical documentation, which covers it in more detail.

Internal Structure and Execution Speed

Now let’s talk about whether or not TQL affects the speed of the database engine.

Before we get into the details, let’s define a few terms so that we are talking about the same things. Strictly speaking, you might prefer to draw the flow horizontally, but I personally like to think of a TQL script as flowing from top to bottom, as shown above. The first diagram shows that each TQL script starts with a SRC, goes through 0 to n MAPs, and ends with a SINK, as we’ve explained frequently in the Neo documentation. The second diagram, on the right, shows that in the actual code implementing TQL, the SRC, MAP, and SINK are realized as objects called Nodes. This is why the individual expressions in a TQL script (SRC, MAP, and SINK) are referred to as nodes.

The structure of a Node can be described as shown in the figure above. Each Node has its own mailbox; when a job arrives in the mailbox, the node takes it out, processes it, and places the result in the mailbox of the next Node. SRC and SINK are special kinds of nodes: as illustrated in the figure, a SRC reads INPUT data (rather than a mailbox) to create records, while a SINK writes the records out in a defined format (CSV, JSON) as OUTPUT instead of passing them to another node. Each node runs in its own lightweight thread (for simplicity, let’s just call it a thread) with its own dedicated mailbox, so there is no concern about interference from other threads while it executes. This design is inherently lock-free.
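The real machbase-neo source is more elaborate, but the core idea of “a node is a lightweight thread with a mailbox” can be sketched in a few lines of Go. The type and function names below are mine, not the engine’s.

```go
// Package tqlsketch is not the actual machbase-neo code, only a sketch of
// the node/mailbox idea described above.
package tqlsketch

// Record mirrors the key/value pair described earlier: an auto-generated
// sequential key and a tuple of data fields as the value.
type Record struct {
	Key   int
	Value []any
}

// RunNode is one MAP-style node. It drains its own mailbox (inbox), applies
// its transformation, and drops the result into the next node's mailbox.
// It never touches another node's state, so no locks are needed.
func RunNode(inbox <-chan Record, next chan<- Record, transform func(Record) Record) {
	for rec := range inbox {
		next <- transform(rec) // blocks while the next mailbox is still full
	}
	close(next) // signal end-of-stream downstream
}
```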

This concept is borrowed from the Actor Model, a programming model for lock-free asynchronous processing. Akka, one of the real-world implementations of this model, makes it possible to build concurrent, distributed clusters with very complex and varied settings, including the size of mailboxes, what happens when a mailbox is full, shared mailboxes between multiple nodes (actors), and message passing between remote nodes.

In TQL, the mailbox is exceptionally simple, typically having a size of 1. A size of 1 implies that, for instance, in the provided illustration, if Node B is engaged in a computationally intensive task and takes more time, while Node A is performing a simple and fast operation, B’s mailbox will consistently be full. Consequently, A has to wait because it cannot place its finished work into B’s mailbox. If A waits for B’s mailbox to empty, A’s mailbox will also become full, causing the SRC to wait as well, unable to deliver the next job.

While this might seem like an inefficient scenario in which everyone is waiting, it is not. Each node is implemented as a goroutine and each mailbox as a Go channel, so this waiting does not actually consume OS threads. Furthermore, once you grasp how to measure speed in concurrency constructs, which is the crucial question that started this discussion, you will likely stop worrying about TQL speed.
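The effect is easy to reproduce with plain goroutines and capacity-1 channels. The sketch below is not machbase-neo code, and the stage durations are invented purely for illustration: the fast stage waits on the slow one, yet the total time tracks the slowest stage rather than the sum of the stages.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const n = 20 // number of records pushed through the chain

	start := time.Now()

	// Each mailbox has capacity 1, like the TQL mailboxes described above.
	aToB := make(chan int, 1)
	bToC := make(chan int, 1)

	// Node A: a fast stage (10 ms per record).
	go func() {
		for i := 0; i < n; i++ {
			time.Sleep(10 * time.Millisecond)
			aToB <- i // blocks whenever B's mailbox is still full
		}
		close(aToB)
	}()

	// Node B: a slow stage (50 ms per record), the bottleneck of the chain.
	go func() {
		for v := range aToB {
			time.Sleep(50 * time.Millisecond)
			bToC <- v
		}
		close(bToC)
	}()

	// Node C (the SINK) simply drains the results.
	count := 0
	for range bToC {
		count++
	}

	// Total time is roughly n * 50 ms (the slowest stage), not n * 60 ms,
	// because the stages overlap; blocked goroutines cost no OS threads.
	fmt.Printf("processed %d records in %v\n", count, time.Since(start))
}
```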

Let’s compare this to the traditional way of implementing a program in C, Java, or Go that queries a database, sequentially performs defined operations, and prints out the results. For the sake of comparison and explanation, let’s assume that the records we’re fetching from the database have five columns, and we’re conducting operations on the values in each column.

What this process looks like when implemented in a sequential program is illustrated in the figure on the left. Here, we execute five operations on a single record retrieved from the ResultSet within a for loop. If each operation takes 1 second, then it requires 5 seconds per record. Consequently, processing N records takes N × 5 seconds.

Let’s assume that each operation in the TQL model takes 2 seconds, as shown above. Given the back-and-forth of jobs and the involvement of the interpreter, and for the sake of simplicity, let’s assume that it takes 10 seconds to process one record end to end. Will it then take N × 10 seconds to process N records? No, it will not. As explained earlier, the records don’t flow through one at a time; they flow through the whole chain simultaneously, like an assembly line, with all the mailboxes from Node 1 to Node 5 kept full. So it takes roughly 10 + N × 2 seconds to process all N records. This implies that even if individual nodes are twice as slow as the sequential code, the arithmetic shows the overall processing is more than twice as fast as N gets larger. Conversely, it is more than twice as slow as N approaches 1.

Now, consider the case of adding a sixth operation (task #6): the sequential version takes N × 6 seconds, adding N seconds, while TQL takes 12 + N × 2 seconds, adding just 2 seconds. The simplified version of this story is that the speed at which TQL processes a large number of records is independent of the number of nodes that make up the work chain; it is determined only by the slowest node among them.
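To put concrete numbers on this: with N = 100 records, the sequential version takes 5 × 100 = 500 seconds while the TQL pipeline takes roughly 10 + 2 × 100 = 210 seconds, and after adding the sixth task the totals become 600 seconds versus roughly 212 seconds.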

Conclusion

If you follow this reasoning, you’ll see that each node in TQL, that is, each individual function like MAPVALUE() and FILTER() (referred to as a MAP function), should never be made heavy. In terms of final speed, it is highly advantageous to break these functions into smaller and smaller pieces and then combine them in TQL to process data. This philosophy aligns with functional programming, from which TQL borrows the term MAP. But we can delve deeper into that topic another time. Suffice it to say that if you’re writing TQL, there’s no need to worry about the number of nodes slowing you down.

🧭 Homepage 🚀 Machbase Neo 📍 Github 🗣️ LinkedIn 🎬 Youtube

📧 Email

Machbase stands as the world’s fastest timeseries database, offering an ideal solution for diverse environments. Whether it’s edge devices with limited resources or clusters processing massive amounts of data, Machbase excels in scalability to meet the demands of any scenario.
