The 1 Billion Row Challenge with SingleStore

Anna Semjen
5 min read · Jan 31, 2024


January has arrived, the elves of Advent of Code have celebrated their Christmas, and I started to long for a new challenge, especially now that I have recently joined the SingleStore team as a solutions engineer.

SingleStore claims to be blazing fast, so I was grinning from ear to ear when a friend sent me the One Billion Row Challenge. The proof of the pudding is in the eating, so I thought I would enrol SingleStore in this challenge and see how blazing fast it actually is.

File generation:

git clone git@github.com:gunnarmorling/1brc.git
./mvnw clean verify
./create_measurements.sh 1000000000

The generated measurements.txt file grows to a monstrous 12.8 GB, so make sure you have enough disk space if you attempt to run this.
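
If you want to sanity-check the result before moving on, a couple of standard shell commands confirm the size and the row count (run from the 1brc directory):

# File size should be around 12.8 GB
ls -lh measurements.txt

# Row count should be one billion
wc -l measurements.txt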

Optionally, if you don’t want to set up a Java environment, you can follow our Notebook example here.

We pre-loaded the data to an S3 bucket, so all you need to do is check those statistics.

Uploading the file to AWS S3

SingleStore Cloud can easily ingest CSV files from AWS S3 via pipelines, so I had to make sure the file made it to a bucket. I strongly recommend using the AWS CLI to upload the file: because it is quite large, your session in the AWS console is likely to time out before the upload finishes.

aws s3 cp measurements.txt s3://asemjen2-bucket
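
To make sure the upload actually completed, a quick listing of the bucket (same bucket name as above) shows the object and its size:

aws s3 ls s3://asemjen2-bucket --human-readable --summarize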

SingleStore and the data

In this challenge, the task is to calculate the minimum, maximum, and average temperature per weather station, and this information drives how I defined my table. SingleStore is a distributed database and offers two main concepts for making sure query performance is outstanding.

The first is sort key selection. Sort keys define logical segments in a columnstore, and choosing one is the single most important consideration for columnstore tables. Each segment stores the minimum and maximum value of the sort key, so when a query filters on the sort key, segments that cannot contain matching rows are automatically skipped. Set correctly, this eliminates the need to sequentially check every single row.

The second is the shard key. Data is distributed across leaf nodes in SingleStoreDB, and the shard key defines how rows are spread across those nodes. Why is this important? If we choose wisely, SingleStoreDB Cloud only has to involve a single partition when the query filter matches the shard key, which greatly reduces workspace resource usage for high-concurrency workloads. In other words, our aggregator node has to do less work to provide the query results.

CREATE TABLE cities(
  city text(20) CHARACTER SET utf8 COLLATE utf8_general_ci,
  temperature text(6) CHARACTER SET utf8 COLLATE utf8_general_ci,
  SHARD KEY(city),
  SORT KEY(temperature));
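
If you are curious how the shard key plays out, running EXPLAIN on a query that filters on city is a quick sanity check; the plan should show that only a single partition needs to be touched. Budapest is just an arbitrary example value here, and the exact plan output depends on your workspace and engine version:

EXPLAIN SELECT min(temperature), max(temperature)
FROM cities
WHERE city = 'Budapest';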

Once the table is created, we are ready to pull the data in via a pipeline.

CREATE OR REPLACE PIPELINE onebrc AS
LOAD DATA S3 'asemjen2-bucket/measurements.txt'
CONFIG '{"region":"us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "********************",
"aws_secret_access_key": "*****************************"}'
INTO TABLE cities
FIELDS TERMINATED BY ";"
;
START PIPELINE onebrc;
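
If you want to sanity-check the pipeline configuration before (or instead of) kicking off the full load, TEST PIPELINE samples a few rows from the source without inserting anything, which is handy for verifying the S3 settings and the field delimiter:

TEST PIPELINE onebrc LIMIT 5;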

Collecting the stats

Loading the data in this scenario creates only one batch (as we had only one file), so we can simply grab the time it took SingleStore to insert the data from a system table.

SELECT batch_time/60 AS minutes, *
FROM information_schema.PIPELINES_BATCHES
WHERE pipeline_name = 'onebrc';

The size of the file heavily influences how long it takes to load. I split the large measurements.txt file into two and defined two separate pipelines, one per file. The time to load the data dropped to less than half of the initial 496 seconds. SingleStore is amazingly good at handling concurrency, so the fact that the two pipelines were writing into the same table did not slow down the operation.
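
For reference, here is a minimal sketch of how the split and upload might look, assuming GNU split (the part file names are placeholders; each part then gets its own pipeline, defined exactly like the one above but pointing at its own file):

# Split the file into two halves without breaking lines in the middle
split -n l/2 -d measurements.txt measurements_part_

# Upload both halves to the same bucket
aws s3 cp measurements_part_00 s3://asemjen2-bucket/
aws s3 cp measurements_part_01 s3://asemjen2-bucket/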

Query times with split files

The query

SELECT GROUP_CONCAT('{', city, '=', mint, '/', avgt, '/', maxt, '}'
  ORDER BY city)
FROM (
  SELECT city,
    min(temperature :> float(4,1)) AS mint,
    max(temperature :> float(4,1)) AS maxt,
    avg(temperature :> float(4,1)) AS avgt
  FROM cities
  GROUP BY city) AS stats;

The reason the query looks funky is that the Java challenge expects the output to use curly braces and slashes, with the stations ordered alphabetically. The inner query on its own returns the familiar tabular output, and performance-wise there is little to no difference.
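
For reference, the tabular version is simply the inner SELECT run on its own (with an ORDER BY added so the stations come out alphabetically):

SELECT city,
  min(temperature :> float(4,1)) AS mint,
  max(temperature :> float(4,1)) AS maxt,
  avg(temperature :> float(4,1)) AS avgt
FROM cities
GROUP BY city
ORDER BY city;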

So how long did it take to run?

The execution time is under 5 seconds, which is quite amazing given the size of the data. Although it doesn’t beat the amazing ~2 seconds where the leading Java solutions stand at the time of writing, it is still considerably fast compared to other database vendors on the market, and at the end of the day it is just a few lines of SQL.

An interesting note: the output starts with Abéché, while Gunnar’s expected output starts with Abha. This is most likely down to collation: utf8_general_ci treats é as e, so Abéché sorts before Abha, whereas the Java solutions compare the raw character values, where h comes before é.

All the scripts and queries were run on a relatively small S4 SingleStore workspace; after increasing the workspace size to S16, the query runs in only 3 seconds, which is very impressive.

Conclusion

There may be further ways to tune the pipeline performance or to write an even more efficient query. However, even without an overly complicated or polished solution, SingleStore delivers on its promise; that is why many of our existing customers choose us for AI use cases, data consolidation, and building a strategic data platform. Try SingleStore for Free!
