Set Operations With Data Frames

If the only tool you have is a data frame…

Vladimir Zakharov
6 min readNov 21, 2023
A drawing of a hammer contemplating a collection of Venn diagrams

Introduction

Don Raab recently posted a blog comparing set operations between the Java Collections Framework that comes with the JDK and Eclipse Collections.

Paul King followed it up with a blog of his own, showing how to do same (and more) operations in Groovy.

I was curious what it would looks like to implement these operations using a data frame. A data frame is basically a tabular data set that can be manipulated programmatically. Think of it as a logical equivalent of a database table, where columns can contain values of different types (unlike, say, a two-dimensional array in Java).

This post will show how to perform the set operations using a Java data frame implementation — in this case I will use the dataframe-ec framework, <disclaimer> for which I happen to be the maintainer </disclaimer>. Dataframe-ec internally uses Eclipse Collections and exposes Eclipse Collection data structures, taking advantage of Eclipse Colections efficiency and rich APIs.

Note that if you are dealing with a single list (vector) of values then using a data frame for these operations is most likely an overkill and you are going to be better off using Eclipse Collections (if available in your environment) or one of the JDK based approaches.

In his blog Don used sets of emojis corresponding to the months of the Fall season. To make the examples here more data-frame-y we are going to have a data frame with two columns — emojis and their descriptions — for each of the three months.

Setting Up Data (Frames)

Following Don’s example we need to define three data frames to store three sets of emojis.

private static DataFrame DF_OCTOBER;
private static DataFrame DF_NOVEMBER;
private static DataFrame DF_DECEMBER;

To make it a tad more interesting we are going to use three different approaches for loading data into a data frame:

  1. Load data by columns
  2. Load data by rows
  3. Load data from a CSV file. For these examples we want the code to be self contained so we are going to “fake” having an actual file in the file system. Instead we will override the createReader() method on the instance of the CsvDataSet class used to create the data frame so that the data will be read from a literal string (see below) rather than from a file.

Methods 1 and 2 are especially convenient for loading unit test data. In real life scenarios you probably want to load data frame from a file (that is, like method 3 but with an actual file) or programmatically.

DF_OCTOBER = new DataFrame("October")
.addStringColumn(
"Emoji",
Lists.immutable.of("🍁", "🍃", "🎃", "🥧", "🍫", "🌝",
"👻", "🌽", "🍎", "🪵", "☕️"))
.addStringColumn(
"Text",
Lists.immutable.of("maple leaf", "leaf in the wind", "pumpkin", "pie", "chocolate", "moon face",
"ghost", "corn", "apple", "log", "coffee"));
DF_OCTOBER.seal();

DF_NOVEMBER = new DataFrame("November")
.addStringColumn("Emoji").addStringColumn("Text")
.addRow("🍂", "fallen leaf")
.addRow("🍁", "maple leaf")
.addRow("🥧", "pie")
.addRow("🦃", "turkey")
.addRow("🌽", "corn")
.addRow("🍎", "apple")
.addRow("🪵", "log")
.addRow("☕️", "coffee")
;

DF_DECEMBER = new CsvDataSet("", "December") {
@Override
protected Reader createReader() {
return new StringReader("""
Emoji,Text
🥧, "pie"
🦃, "turkey"
⛄️, "snowman"
🍎, "apple"
🌰, "chestnut"
🛷, "sled"
🪵, "log"
🎄, "tree"
☕️, "coffee"
""");
}
}.loadAsDataFrame();

Refresher: Set Operations

We will look at the following four set operations here

  1. Union. The union of a collection of sets is a set of all elements in them. For two sets A and B, the union of those sets, A B in the set algebra notation, is a set that contains elements that are either in A, in B, or in both A and B.
  2. Intersection. The intersection of sets two sets A and B, A B, is a set that contains elements in A that are also in B (or equivalently, elements in B that are also in A).
  3. Difference (Relative Complement). For two sets A and B the set difference of A and B, aka the relative complement of B in A, AB, is the set of elements in A that are not in B.
  4. Symmetric Difference. The symmetric difference of two sets A and B, A B, is the set of elements which are either in A or in B but not in their intersection.
Venn diagrams of the aforementioned four set operations

Union

The union operation implemented in dataframe-ec is more like the UNION ALL database operation in that it will not remove duplicate rows. To get rid of the duplicates we are going to call the distinct() method on the end result of union-ing our three data frames.

@Test
public void union()
{
DataFrame union = DF_OCTOBER
.union(DF_NOVEMBER)
.union(DF_DECEMBER)
.distinct();

DataFrameUtil.assertEqualsIgnoreOrder(
new DataFrame("expected")
.addStringColumn("Emoji").addStringColumn("Text")
.addRow("☕️", "coffee")
.addRow("⛄️", "snowman")
.addRow("🌝", "moon face")
.addRow("🌰", "chestnut")
.addRow("🌽", "corn")
.addRow("🍁", "maple leaf")
.addRow("🍂", "fallen leaf")
.addRow("🍃", "leaf in the wind")
.addRow("🍎", "apple")
.addRow("🍫", "chocolate")
.addRow("🎃", "pumpkin")
.addRow("🎄", "tree")
.addRow("👻", "ghost")
.addRow("🛷", "sled")
.addRow("🥧", "pie")
.addRow("🦃", "turkey")
.addRow("🪵", "log")
,
union
);
}

DataFrame.assertEqualsIgnoreOrder(...) is a utility method that does pretty much what it says on the tin.

Intersection

There is no specific intersection operation in dataframe-ec, however there is a join operation that can produce the same result. The join operation on a data frame takes as its parameters the data frame to join to and lists of columns in each data frame to join by (the join keys). This operation is a logical equivalent of a relational inner join, so rows for which there is no match in the other data frame will not be present in the join — which is exactly what we expect from the intersection of two sets. Here we are computing the intersection of all three data frames (OCTOBER NOVEMBERDECEMBER) so we are calling join twice.

@Test
public void intersection()
{
DataFrame intersection = DF_OCTOBER
.join(DF_NOVEMBER,
Lists.immutable.of("Emoji", "Text"),
Lists.immutable.of("Emoji", "Text"))
.join(DF_DECEMBER,
Lists.immutable.of("Emoji", "Text"),
Lists.immutable.of("Emoji", "Text"));

DataFrameUtil.assertEqualsIgnoreOrder(
new DataFrame("Expected")
.addStringColumn("Emoji").addStringColumn("Text")
.addRow("☕️","coffee")
.addRow("🍎","apple")
.addRow("🥧","pie")
.addRow("🪵","log")
, intersection
);
}

Difference (Complement)

This follows the example from the original blog that computes the difference of three sets (that is, for example, the elements in OCTOBER that are not in NOVEMBER nor in DECEMBER). dataframe-ec does not have a separate difference operation but it does have a more general joinWithComplements operation, which for two data frames computes the intersection and the respecitve complements of these sets based on the provided keys. The results are returned as a triplet (an Eclipse Collection data structure) of data frames, containing

  1. The complement of the data frame B in A (the difference A B)
  2. The intersection of A and B (A B, the join of two data frames)
  3. The complement of the data frame A in B (the difference B A)

So if we are only looking for the set difference we can just pick the first element of the triplet and ignore the rest:

@Test
public void difference()
{
// entires in october that are not in november
DataFrame octoberComplementInNovember = DF_OCTOBER
.joinWithComplements(
DF_NOVEMBER,
Lists.immutable.of("Emoji"), Lists.immutable.of("Emoji")
).getOne();

// entires in october that are neither in november nor december
DataFrame octoberComplementInNovemberAndDecember = octoberComplementInNovember
.joinWithComplements(
DF_DECEMBER,
Lists.immutable.of("Emoji"), Lists.immutable.of("Emoji")
).getOne();

DataFrameUtil.assertEquals(
new DataFrame("Expected")
.addStringColumn("Emoji").addStringColumn("Text")
.addRow("🌝","moon face")
.addRow("🍃","leaf in the wind")
.addRow("🍫","chocolate")
.addRow("🎃","pumpkin")
.addRow("👻","ghost"),
octoberComplementInNovemberAndDecember
);
}

Symmetric Difference

We are going to use joinWithComplements to compute the symmetric difference of two sets. The symmetric difference of two sets is equivalent to to the union of their respective relative complements or in set algebra notation:
A △ B =
(A B) ⋃ (B A)

So taking advantage of that, the symmetric difference between OCTOBER and NOVEMBER can be computed as follows:

@Test
public void symmetricDifference()
{
Triplet<DataFrame> split = DF_OCTOBER.joinWithComplements(
DF_NOVEMBER,
Lists.immutable.of("Emoji", "Text"),
Lists.immutable.of("Emoji", "Text"));

DataFrame symmetricDifference = split.getOne().union(split.getThree());

DataFrameUtil.assertEquals(
new DataFrame("Expected").addStringColumn("Emoji").addStringColumn("Text")
.addRow("🌝","moon face")
.addRow("🍃","leaf in the wind")
.addRow("🍫","chocolate")
.addRow("🎃","pumpkin")
.addRow("👻","ghost")
.addRow("🍂","fallen leaf")
.addRow("🦃","turkey")
, symmetricDifference
);
}

Conclusion

Data frames are useful data structures for ad hoc manipulation of tabular data, making it easy to try things like filtering, aggregation, transformation, enrichment — the types of things you might want to use excel tables for. Unlike excel based data manipulation though, these ad-hoc transformations can then be captured natively in your program in the same language as the rest of the application. There is no need to define classes or records, there is no need to know exactly what each column type is.

Dataframe-EC is an implementation of the data frame data structure in Java. To learn more about it see the project home on GitHub. For more realistic data frame operation scenarios and a way to learn data frame APIs take a look at the Dataframe-EC code kata.

While this implementation of the data frame data structure does not support set operations directly, all common set operations can be relatively easily performed with it. Of course for single lists of values the native implementation of sets results in more concise and more efficient code, especially if a framework like Eclipse Collections is used.

--

--