DataFrame and The One Billion Row Challenge

How to use a Java DataFrame to save developer time, produce readable code, and not win any prizes.

Vladimir Zakharov
Apr 2, 2024 · 8 min read

Introduction

This post shows how to implement a solution to The One Billion Row Challenge (1BRC) using a Java DataFrame library. The goal is not to create a competitive solution within the context of the challenge. Rather, 1BRC is used as a helpful use case to illustrate why one might want to consider using a DataFrame.

There are sections below explaining what 1BRC is about and what a DataFrame data structure is at a high level. Feel free to skip them if you are familiar with either of those.

What is The One Billion Row Challenge (1BRC)?

The One Billion Row Challenge (aka 1BRC) was organized by Gunnar Morling (@gunnarmorling) and ran from January 1st until January 31st 2024.

The purpose of the challenge was to explore how quickly one billion rows from a text file can be aggregated with Java. Quoting from Gunnar’s blog post announcing the challenge:

“Your mission, should you decide to accept it, is deceptively simple: write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!”

The details of the challenge, the scripts to generate test data, and the submissions are hosted on GitHub: https://github.com/gunnarmorling/1brc.

The times produced by the entries into 1BRC are nothing short of amazing. The entries placed first, second, and third completed the challenge in the reference environment (8 cores, 128 GB RAM) in 1.535, 1.587, and 1.608 seconds respectively!

I Have Thoughts

Looking at the 1BRC repo, you will likely notice the following about most submissions:

  1. They are many hundreds of lines of (well formatted) code.
  2. It is not immediately clear how the algorithms work, although some implementations have fairly helpful comments.
  3. It is not easy to tell what the code does just by looking at it, even though a lot of the solutions are well factored.

This makes sense in the context of the challenge and by itself doesn’t make these solutions “bad code”. Achieving absolute peak performance requires these “sacrifices”: advanced Java optimization techniques, low-level APIs, the latest language features, the latest JVM features, and some dark magic.

It also requires top expertise and an intimate understanding of the JVM and compiler behavior from the developer.

And it requires a good amount of the developer’s time (spent both writing and reading the code).

But what if these things — code readability, maintainability, developer time — mattered to us more than peak performance? What if the cycles spent developing, understanding, and maintaining this code by humans mattered more than achieving the absolute minimum of CPU cycles? What if we could deliver a working solution quickly and then optimize it if and when needed? Enter DataFrame.

What Is a DataFrame?

A data frame is a tabular data set that can be manipulated programmatically. Think of it as a database table, where different columns can contain values of different types (e.g., a string column, a numeric column, a date column).

Tabular data can be loaded into a data frame from a file, from a database result set, or from anything else that looks like a table or can be made to look like one. A data frame can also be created programmatically, either by specifying its values directly or by transforming existing data frames.
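To make the notion concrete, here is a deliberately naive sketch of the idea in plain Java. This is only an illustration of the concept, not how dataframe-ec (or any production library) stores data; real implementations use typed, often primitive, column storage:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A toy "data frame": named columns that share row indices.
// Real libraries use typed (often primitive) columns instead of List<Object>.
public class TinyFrame
{
    private final List<String> columnNames;
    private final List<List<Object>> columns = new ArrayList<>();

    public TinyFrame(String... names)
    {
        this.columnNames = Arrays.asList(names);
        for (String ignored : names)
        {
            this.columns.add(new ArrayList<>());
        }
    }

    public TinyFrame addRow(Object... values)
    {
        // One value per column, appended at the same row index
        for (int i = 0; i < values.length; i++)
        {
            this.columns.get(i).add(values[i]);
        }
        return this;
    }

    public Object get(String columnName, int rowIndex)
    {
        return this.columns.get(this.columnNames.indexOf(columnName)).get(rowIndex);
    }

    public int rowCount()
    {
        return this.columns.isEmpty() ? 0 : this.columns.get(0).size();
    }

    public static void main(String[] args)
    {
        TinyFrame df = new TinyFrame("Station", "Temperature")
                .addRow("Istanbul", 5.9)
                .addRow("Tauranga", 17.4);
        System.out.println(df.get("Station", 0) + ": " + df.get("Temperature", 0));
    }
}
```

Note the column-oriented storage: keeping data by column rather than by row object is what lets real data frame libraries back columns with primitive arrays and save memory.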

Java developers are well versed in data structures like the Java Collections Framework or its advanced replacements like Eclipse Collections. While great for working with primitive types or individual properties of objects, these rich collection types and APIs operate at a low level of abstraction.

Of course, we can also define domain classes and encapsulate data set operations resulting in more readable and better factored code, but this could be a heavyweight approach for simple and/or ad hoc scenarios.
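For instance, a domain-class take on our toy aggregation problem might use a record and the streams API. This is a sketch of the general approach; all the names here are mine, not part of any library:

```java
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DomainClassAggregation
{
    // A domain class representing one line of input
    record Measurement(String station, double temperature)
    {
        static Measurement parse(String line)
        {
            int separator = line.indexOf(';');
            return new Measurement(
                    line.substring(0, separator),
                    Double.parseDouble(line.substring(separator + 1)));
        }
    }

    // Group by station, collecting min/mean/max into summary statistics
    static Map<String, DoubleSummaryStatistics> aggregate(Stream<String> lines)
    {
        return lines
                .map(Measurement::parse)
                .collect(Collectors.groupingBy(
                        Measurement::station,
                        TreeMap::new,
                        Collectors.summarizingDouble(Measurement::temperature)));
    }

    public static void main(String[] args)
    {
        aggregate(Stream.of("Istanbul;5.9", "Istanbul;15.0", "Istanbul;13.2"))
                .forEach((station, stats) -> System.out.printf(
                        "%s=%.1f/%.1f/%.1f%n",
                        station, stats.getMin(), stats.getAverage(), stats.getMax()));
    }
}
```

Readable and correct, but each parsed line becomes a heap object, which is exactly the overhead discussed below.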

As a workaround, when looking for “flexibility”, developers may be tempted by “Map-Oriented Programming”: representing objects as maps, maps of maps, and so on.
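For illustration, the map-of-maps flavor of our toy problem could look something like this (a sketch; the running-statistics keys are arbitrary, which is rather the point):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapOrientedAggregation
{
    // Station name -> a map of running statistics: a map of maps
    static Map<String, Map<String, Double>> aggregate(List<String> lines)
    {
        Map<String, Map<String, Double>> statsByStation = new TreeMap<>();
        for (String line : lines)
        {
            int separator = line.indexOf(';');
            String station = line.substring(0, separator);
            double temperature = Double.parseDouble(line.substring(separator + 1));

            Map<String, Double> stats =
                    statsByStation.computeIfAbsent(station, key -> new HashMap<>());
            // merge() inserts the value on first sight, combines on every one after
            stats.merge("min", temperature, Math::min);
            stats.merge("max", temperature, Math::max);
            stats.merge("sum", temperature, Double::sum);
            stats.merge("count", 1.0, Double::sum);
        }
        return statsByStation;
    }

    public static void main(String[] args)
    {
        aggregate(List.of("Istanbul;5.9", "Istanbul;15.0", "Istanbul;13.2"))
                .forEach((station, stats) -> System.out.printf(
                        "%s=%.1f/%.1f/%.1f%n",
                        station,
                        stats.get("min"),
                        stats.get("sum") / stats.get("count"),
                        stats.get("max")));
    }
}
```

Note that the structure of the data (which keys exist, what the values mean) lives only in the developer’s head; that is both the flexibility and the fragility of this style.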

One of the drawbacks of both domain classes and maps is their storage overhead, which can become problematic when you deal with millions of entities.

We need something that combines the efficiency of collection frameworks with the ability to group data (the way domain objects, or maps, or relational tables, or spreadsheets do), and easily transform and organize data in our code.

Data frames offer the benefits of developer efficiency, flexibility, and code readability, and depending on the use case can offer memory savings and better performance than the alternative approaches. They are used in real-world scenarios such as data transformation, data enrichment, data validation, and reconciliation.

About dataframe-ec

We are about to see how we can use a DataFrame to solve 1BRC. The implementation I will use for this example is the dataframe-ec framework <disclaimer> for which I happen to be the maintainer </disclaimer>. This DataFrame implementation is based on the Eclipse Collections framework (thus the “ec” in the name). By the way, I have another blog post on dataframe-ec (“Set Operations With Data Frames”), which could be a fun side trail. It goes into a bit more detail about the framework and shows how to implement common set operations with it.

Another thing to note is that there are other Java DataFrame implementations, and the solution code for some of them will look very similar to what you’ll see below. Having said that, the dataframe-ec implementation of DataFrame was developed with specific goals in mind, and that explains some of the differences between it and other DataFrame libraries. Some of these goals:

  • Memory efficiency (for practical use cases, where it matters)
    - Uses highly memory efficient Eclipse Collections
    - Support for primitive types
  • Inspired by Eclipse Collections APIs and exposes Eclipse Collections types in its APIs
  • Intuitive, humane grammar for the expression DSL used for computed columns, filters, etc. (e.g., adding two numbers is expressed as “A + B”, as opposed to an internal DSL where you assemble an expression from Java method calls)

Solving 1BRC With a DataFrame

The 1BRC solution can be broken into three distinct steps:

  1. Load the data from the file.
  2. Perform aggregation and sorting.
  3. Show results on the console.

Let’s use a toy example to illustrate each step. Let’s say we have a file, measurements.txt, containing the following text:

New York City;34.1
New York City;24.3
San Francisco;22.9
Istanbul;5.9
New York City;-2.7
Istanbul;15.0
San Francisco;-5.4
Istanbul;13.2
San Francisco;35.0
Tauranga;17.4

The first step, loading the data from the file, is pretty straightforward. The file format is prescribed by the requirements of 1BRC. First, we describe the file layout using a schema object. The data is in a CSV file (well, almost: the separator is a semicolon ‘;’), and the file has no header. The file has two columns: “Station”, the location where the temperature was taken, and “Temperature”, the measurement itself. Once we define the schema, we create a data set object pointing to an actual file with the data described by the schema. Then we load the data set into a data frame.

String MEASUREMENT_PATH = "C:/projects/1brc";
String MEASUREMENT_FILE = "measurements.txt";

CsvSchema msSchema = new CsvSchema()
        .addColumn("Station", STRING)
        .addColumn("Temperature", FLOAT)
        .separator(';')
        .hasHeaderLine(false);

CsvDataSet msDataSet = new CsvDataSet(
        Path.of(MEASUREMENT_PATH, MEASUREMENT_FILE),
        "measurements",
        msSchema);

DataFrame measurements = msDataSet.loadAsDataFrame();

After this code fragment is executed, the data frame measurements will contain the data loaded and parsed from the file:

┌───────────────┬─────────────┐
│ Station       │ Temperature │
├───────────────┼─────────────┤
│ New York City │        34.1 │
│ New York City │        24.3 │
│ San Francisco │        22.9 │
│ Istanbul      │         5.9 │
│ New York City │        -2.7 │
│ Istanbul      │          15 │
│ San Francisco │        -5.4 │
│ Istanbul      │        13.2 │
│ San Francisco │          35 │
│ Tauranga      │        17.4 │
└───────────────┴─────────────┘

The second step is aggregating and sorting the data. We will use a version of the aggregateBy method to perform our aggregation. This version takes a list of aggregators as its first parameter. Here we use built-in factory methods to create the aggregators; each factory method takes two parameters: the name of the column containing the source data and the name of the column in the aggregated data frame that will store the aggregation results. The second parameter of aggregateBy is the list of columns (in this case just one) to group the aggregation by, similar to the GROUP BY clause in a SQL SELECT statement.

Sorting of the aggregated DataFrame is performed by calling the sortBy method, passing the list of columns (again, just one in this case) to sort by.

DataFrame aggregated = measurements
        .aggregateBy(
            Lists.immutable.of(
                min("Temperature", "Min"), avg2d("Temperature", "Mean"), max("Temperature", "Max")
            ),
            Lists.immutable.of("Station"))
        .sortBy(Lists.immutable.of("Station"));

After aggregateBy is called the result looks like this:

┌───────────────┬──────┬──────────┬──────┐
│ Station       │ Min  │ Mean     │ Max  │
├───────────────┼──────┼──────────┼──────┤
│ New York City │ -2.7 │ 18.56667 │ 34.1 │
│ San Francisco │ -5.4 │     17.5 │   35 │
│ Istanbul      │  5.9 │ 11.36667 │   15 │
│ Tauranga      │ 17.4 │     17.4 │ 17.4 │
└───────────────┴──────┴──────────┴──────┘

and after sortBy(Lists.immutable.of("Station")) the data frame aggregated looks like this:

┌───────────────┬──────┬──────────┬──────┐
│ Station       │ Min  │ Mean     │ Max  │
├───────────────┼──────┼──────────┼──────┤
│ Istanbul      │  5.9 │ 11.36667 │   15 │
│ New York City │ -2.7 │ 18.56667 │ 34.1 │
│ San Francisco │ -5.4 │     17.5 │   35 │
│ Tauranga      │ 17.4 │     17.4 │ 17.4 │
└───────────────┴──────┴──────────┴──────┘

And finally, iterating over the aggregated data and printing its contents is done using the forEach iterator method. This method takes a procedure (a Consumer) as a parameter. The procedure is called for each row in the data frame and receives an object representing the current row (a cursor) that can be used to extract column values from that row.

aggregated.forEach(c ->
    System.out.printf(
        "%s=%2.1f/%2.1f/%2.1f\n",
        c.getString("Station"), c.getFloat("Min"), c.getDouble("Mean"), c.getFloat("Max")));

This code will produce this output on the console:

Istanbul=5.9/11.4/15.0
New York City=-2.7/18.6/34.1
San Francisco=-5.4/17.5/35.0
Tauranga=17.4/17.4/17.4

And now let’s bring all these bits together into a complete solution:

package onebr;

import io.github.vmzakharov.ecdataframe.dataframe.DataFrame;
import io.github.vmzakharov.ecdataframe.dataset.CsvDataSet;
import io.github.vmzakharov.ecdataframe.dataset.CsvSchema;
import org.eclipse.collections.api.factory.Lists;

import java.nio.file.Path;

import static io.github.vmzakharov.ecdataframe.dataframe.AggregateFunction.*;
import static io.github.vmzakharov.ecdataframe.dsl.value.ValueType.FLOAT;
import static io.github.vmzakharov.ecdataframe.dsl.value.ValueType.STRING;

public class CalculateAverage
{
    private static final String MEASUREMENT_PATH = "C:/projects/1brc";
    private static final String MEASUREMENT_FILE = "measurements.txt";

    public static void main(String[] args)
    {
        CsvSchema msSchema = new CsvSchema()
                .addColumn("Station", STRING)
                .addColumn("Temperature", FLOAT)
                .separator(';')
                .hasHeaderLine(false);

        CsvDataSet msDataSet = new CsvDataSet(
                Path.of(MEASUREMENT_PATH, MEASUREMENT_FILE), "measurements", msSchema);

        DataFrame measurements = msDataSet.loadAsDataFrame();

        DataFrame aggregated = measurements
                .aggregateBy(
                    Lists.immutable.of(min("Temperature", "Min"), avg2d("Temperature", "Mean"), max("Temperature", "Max")),
                    Lists.immutable.of("Station"))
                .sortBy(Lists.immutable.of("Station"));

        aggregated.forEach(c ->
            System.out.printf(
                "%s=%2.1f/%2.1f/%2.1f\n",
                c.getString("Station"), c.getFloat("Min"), c.getDouble("Mean"), c.getFloat("Max")));
    }
}

For funsies, the code in the main method above can easily be written as a single statement, but I would not recommend it.
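In case you are curious, the single statement would look roughly like this; it simply inlines the three steps of the program above (same imports and constants assumed):

```java
// The whole pipeline chained into one statement: build the data set,
// load, aggregate, sort, and print. Correct, but harder to read and
// debug than the three-step version -- hence not recommended.
new CsvDataSet(
        Path.of(MEASUREMENT_PATH, MEASUREMENT_FILE),
        "measurements",
        new CsvSchema()
                .addColumn("Station", STRING)
                .addColumn("Temperature", FLOAT)
                .separator(';')
                .hasHeaderLine(false))
    .loadAsDataFrame()
    .aggregateBy(
        Lists.immutable.of(min("Temperature", "Min"), avg2d("Temperature", "Mean"), max("Temperature", "Max")),
        Lists.immutable.of("Station"))
    .sortBy(Lists.immutable.of("Station"))
    .forEach(c ->
        System.out.printf(
            "%s=%2.1f/%2.1f/%2.1f\n",
            c.getString("Station"), c.getFloat("Min"), c.getDouble("Mean"), c.getFloat("Max")));
```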

Conclusion

As I promised at the beginning, this implementation will not win any performance or memory efficiency prizes in anything like 1BRC. However, one metric by which it does really well is developer time and effort, and the resulting code readability. Following the “Make it work, make it right, make it fast” maxim, more optimization can be implemented later, if and when and where required. A DataFrame is not a universal solution to all problems, but it is a helpful solution to enough problems in enough contexts that it pays to have it as another tool in your development toolbox.
