Performance Testing and Comparison of Time-Series Databases: InfluxDB and Machbase

Machbase

Published in

machbase

8 min readAug 25, 2023

● Introduction
● How to test input
● Query testing and methods
● Compare performance
● Conclusion

Introduction

In this post, I will discuss the performance comparison results between InfluxDB and Machbase, including the trials and errors experienced during the process. I will focus solely on the aspects that allow for an equitable comparison between the two products.

How to test input

Set up your test environment and data

We conducted tests using an internal development setup with a 4-core processor, 32GB of RAM, and an SSD. We inserted TAG data (1 million TAGs, 1 billion records) into Machbase, then converted this data into a CSV file and inserted it into InfluxDB. We will compare the input performance, query performance, and ease of use between the two products..

Intel(R) Xeon(R) CPU E3–1231 v3 @ 3.40GHz (4core/8thread)
DDR3 32GB
Samsung 860Evo 1TB
centos-release-7–6.1810.2.el7.centos.x86_64
InfluxDB 1.7.7

The test data was generated artificially, with 1 million sensors producing data 10 times per second, resulting in a total of 1 billion records. Upon evaluating data input performance in Machbase, approximately 430,000 records were observed to be ingested per second (based on the debug version). This data was recorded into a CSV file using the CSV export tool provided within the Machbase package.

Machbase’s input rate: around 430,000 entries/second

Enter data into InfluxDB

Installing InfluxDB is straightforward. However, importing a CSV file generated from the input data in Machbase was far from easy. InfluxDB doesn’t offer tools like CSV import for CSV file ingestion. The InfluxDB website provides the following method to upload CSV files:

Input via Chronograf
Input via Telegraf
Input via Influx

The first method involves using the web UI of InfluxDB (Machbase uses MWA). However, if the data file size exceeds 30MB, the input process fails. It’s not possible to input a CSV file of around 50GB in size.

The second approach is to use a tool like Machbase’s Machcollector to input data in real-time from sources like log files, but we didn’t use this method because it’s slow.

We utilized the last method to input the data.

There isn’t a direct feature to input CSV files into Influx. To import data into Influx, it needs to be transformed into the text format used by Influx. Referring to a sample Python script provided by Influx, we wrote code to convert the CSV file into a text file compatible with Influx.

Using this code, we transformed the CSV file into a text file, which took nearly 20 hours. However, there was an issue: the directive part specifying database creation and the target for input needed a space between them, which was missing. 60GB text file was too large to edit, so we had to recreate it after identifying the problem.

During the input of approximately 5.5 million records, an error occurred. Upon investigation, it was discovered that the “max-values-per-tag” setting was at 100,000, necessitating an adjustment. We modified the configuration and proceeded with the input.

Query testing and methods

Measuring query performance in InfluxDB was also not straightforward. InfluxDB does not display the time taken to execute queries. As a result, most InfluxDB queries were compared to Machbase based solely on an intuitive sense of speed.

There were numerous inconvenient and challenging constraints when conducting query tests. Although seemingly trivial, the most frustrating and difficult aspect was having to enclose schema objects within double quotes (“). While SQL standards do allow for names enclosed in double quotes, it was initially challenging to adapt to this method, as it deviated from the habitual approach I was accustomed to.

SELECT "value" FROM "sensor_data".."tag_data" WHERE "tag_name" ='TAG_0001'

There are a number of other limitations, each of which is discussed below.

Limitations on using Group By

Data stored in InfluxDB includes a tag_key corresponding to indexed columns and a value_key corresponding to non-indexed columns. Columns eligible for the group by clause are limited to the time column and the tag_key column. Since there is no CREATE TABLE statement, it’s essential to determine whether these columns are indexed or not during data input and set them in the tag_key section accordingly. If you input the value into the value_key after insertion, you won’t be able to use it in the group by clause unless you delete and re-enter the data.

Tag key selection error

This issue was truly bewildering. When selecting only columns set as TAG KEY, the result doesn’t display anything.

When data is input in the following state:

tag_data,name=EQ0^TAG0 value=134.03,lot_no=1001 1483196400000000000

In the above input situation, the index is created on the tag_key “name”, and a query on the name column returns no results.

SELECT "name" FROM "tag_data"

No error message was displayed either. Later, upon investigation, I found that if you search for just “tag key” on its own, no results appear.

Similarly, attempting to search for values in the “time” column (in this case, an error is displayed) also doesn’t yield any results. The following query returns an error.

 select "time" from "tag_data" where "name"='EQ0^TAG0';

ERR: at least 1 non-time field must be queried

Abnormal shutdown

The following query causes Influx (not the server process influxd) to use unlimited memory and eventually terminates.

Select mean("value") from "tag_data" where "name"='EQ0^TAG0' group by time(1s)

Executing a similar query in Machbase yields results in just 0.01 seconds.

Mach> select name, avg(value), date_Trunc('second', t_time) as t from tag where name ='EQ0^TAG1' group by name, t;

While attempting to resolve the issue of query execution time not being displayed, I encountered a problem when performing queries using shell commands. If the string entered in the Group by clause lacks quotation marks, influxd consumes CPU and memory infinitely before terminating abnormally. To successfully execute the above query, it is essential to explicitly specify the Group by clause for the time value.

select mean("value") from "tag_data" where "name" = 'EQ0^TAG10'and "time" >= 1483196400000000000 and time <= 1483196462400000000 group by time(3s)

The following query leads to the server process influxd consuming CPU and memory without bounds, eventually resulting in an abnormal termination.

SELECT value FROM sensor_data..tag_data WHERE "name" = EQ0^TAG0

Here’s what the system looks like a few seconds after running the query.

Mem : 32729528 total, 11867052 free, 14498128 used, 6364348 buff/cache

KiB Swap: 16449532 total, 6253320 free, 10196212 used. 17659684 avail Mem

 PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND  

25412 root   20  0  35.7g 18.3g  5.2g R 464.8 58.6 10:48.17 influxd

The following query nearly exhausts the available memory during query execution and takes more than 10 seconds.

select "value", "name" from "tag_data" limit 10

Due to various anomalies during query operations and a lack of accurate execution times, we were unable to run a direct 1:1 performance comparison, but we were able to get the following results.

Machbase TAG table and InfluxDB have nearly equivalent query performance for “well-behaved” queries. The problem is that the number of queries that can be performed equivalently is small, and InfluxDB doesn’t show query execution times, so it’s hard to make an accurate comparison.

Compare performance

We used Machbase’s sensor data test suite to input 1 billion pieces of sensor data into Machbase and Influxdb and performed basic search tests.

Input performance

As mentioned earlier, we remove the time spent converting the CSV file for input from InfluxDB and only measure the time spent inputting using the converted text file.

The input performance of Machbase and InfluxDB can be seen in the chart below. Machbase’s input performance is about seven times higher than InfluxDB’s.

Performance of simple searches

This is a query to retrieve all data of a specific sensor from the entire dataset. I executed the query using the following request.

select count(*) from (select * from tag where name = 'EQ0^TAG14' and t_time

    between to_date('2017-01-01 00:00:00') and to_date('2017-01-01 01:00:00') ); -- machbase

SELECT * FROM tag_data WHERE "name" = 'EQ0^TAG1' AND time >= '2017-01-01T00:00:00Z' AND time < '2017-01-01T01:00:00Z' -- influx

When searching for data corresponding to a one-hour duration with the addition of a time condition, I used the following query. Please note that InfluxDB doesn’t include the result transmission time in the displayed execution time, so the query is slightly different.

select count(*) from (select * from tag where name = 'EQ0^TAG14' and t_time

    between to_date('2017-01-01 00:00:00') and to_date('2017-01-01 01:00:00') ); -- machbase

SELECT * FROM tag_data WHERE "name" = 'EQ0^TAG1' AND time >= '2017-01-01T00:00:00Z' AND time < '2017-01-01T01:00:00Z' -- influx

Here’s the result of the run

In the case of simple queries like this, it’s evident that Markbase exhibits better performance compared to InfluxDB.

Performing set operations

For a one day duration, querying the hourly average values is performed as shown below. Due to the difference in Timezone settings between InfluxDB and Machbase, there is a slight variation in the time values’ range.

select count(*) from (select /*+ ROLLUP(TAG, hour) */ t_time, value from tag where name ='EQ0^TAG1'

 and t_time between to_date('2017-01-01 00:00:00') and to_date('2017-01-02 00:00:00' )); -- machbase rollup

  

SELECT mean(value), time FROM tag_data WHERE "name" = 'EQ0^TAG1' AND

 time >= '2016-12-31T15:00:00Z' AND time < '2017-01-01T15:00:00Z' GROUP BY time(1h) -- influx

Upon executing the above query, the obtained results were as follows:

When performing simple retrieval queries and statistical queries, it’s evident that Machbase outperforms InfluxDB in terms of performance.

The following chart compares query execution times for InfluxDB versus Machbase.

Conclusion

Due to the distinct SQL language support approach of InfluxDB, traditional SQL users need to be cautious of functionalities or specific matters not supported by InfluxDB. InfluxDB, which exclusively supports the REST API, differs from the conventional APIs of DBMS in terms of usage. Furthermore, its performance is considerably lower compared to Machbase.

For customers considering the adoption of open-source time-series DBMS like InfluxDB, we recommend Machbase, which offers full compatibility with RDBMS and is optimized for time-series data queries.

Thank you.

Machbase CRO, Grey Shim

🧭Homepage 🚀Machbase Neo📍Github🗣️️LinkedIn 🎬Youtube

📧Email

Machbase stands as the world’s fastest timeseries database, offering an ideal solution for diverse environments. Whether it’s edge devices with limited resources or clusters processing massive amounts of data, Machbase excels in scalability to meet the demands of any scenario.