R data.table Joins

Scott Lyden
Published in Analytics Vidhya · 16 min read · Aug 26, 2021

Introduction

data.table is a powerful, modern update of the venerable data.frame. Under the hood, the package has been tuned for blazing speed and minimal memory usage, with a syntax that is sleek and spare. For the initiated, that terseness is a blessing, but it can pose an obstacle for new users.

One area where this has been true for me is in joining multiple data.tables. The design of the package makes data.table especially good at joins, so I have found it well worth the effort to come to grips with the syntax. As I was learning, I wasn’t able to find any resources that gathered all the information I wanted into a single place. This is the document that I wished I could find. I hope it will help you to master this powerful feature of this most powerful of R packages.

Getting Started

To get off to a quick start with the data.table package itself you really can’t do better than to read the introductory vignette on the project’s official homepage.

Here, I’ll just give the briefest of overviews of the basics of reading and manipulating data from a single data.table for people who may be new to the package or just rusty. Readers who are already equipped with basic knowledge should skip ahead to the next section.

As you’ll see when you read the official documentation, the creators of the data.table package encourage users to think of operating on individual tables much as you would think about querying a relational database. The comma-delimited fields of a data.table, DT, can be mapped to corresponding features of a SQL query.

[Figure (image by author): the fields of DT[i, j, by] map to SQL's WHERE/ORDER BY (i), SELECT (j), and GROUP BY (by).]

The first field, i, orders or filters the rows of tabular data. The second
field, j, selects columns of data for computations or display, while the final,
by field, serves a group-by function.

Before moving on to joins, which have a different grammar from this query grammar, let’s build a single query that exercises all of these SQL-like clauses.

First, let’s create some data to work with.

# Create a data.table to work with
library(data.table)
set.seed(42)
x <- data.table(
id = sample(LETTERS[1:3], 10, replace = TRUE),
value = sample(1:5, 10, replace = TRUE)
)
x
## id value
## <char> <int>
## 1: A 4
## 2: A 1
## 3: A 5
## 4: A 4
## 5: B 2
## 6: B 2
## 7: B 3
## 8: A 1
## 9: C 1
## 10: C 3

To illustrate a query that uses all the main SQL-like operations, let’s find (select) the mean of the value field for all (where) the B’s and C’s for each
(group by) id.

x[id > 'A', .(mean_value = mean(value)), by = .(id)]
## id mean_value
## <char> <num>
## 1: B 2.333333
## 2: C 2.000000

We filtered the rows of x to only give us rows where id is greater than A and
passed those rows to the select step. There we took the mean of the value
field, grouped by id field, and assigned the calculated values the name
mean_value.

That’s a (super compressed) run-through of the basic logic and syntax of querying a data.table. We’ll use some of this querying logic later on when we perform some basic operations on the merged data.tables that we’ll be generating. You shouldn’t feel bad if some of the material in this section washed over you. I’ve just tried to equip you with the minimum knowledge necessary to follow the rest of the piece without boring readers who want to jump to table joins. If you found this section tough to follow, the official introduction to the package will get you up and running in no time.

Simple Equi Joins

To demonstrate joins between tables that are predicated on equality (or the
negation thereof) of variables between tables, let’s create two data tables to
work with.

set.seed(42)
big <- data.table(
id = LETTERS[2:11],
a = sample(101:105, 10, replace = TRUE),
b = sample(200:300, 10)
)
small <- data.table(
id = LETTERS[1:5],
y = sample(1:5, 5, replace = TRUE),
z = sample(10:20, 5)
)
big
## id a b
## <char> <int> <int>
## 1: B 101 288
## 2: C 105 236
## 3: D 101 219
## 4: E 101 225
## 5: F 102 202
## 6: G 104 240
## 7: H 102 300
## 8: I 102 226
## 9: J 101 235
## 10: K 104 204
small
## id y z
## <char> <int> <int>
## 1: A 4 10
## 2: B 2 19
## 3: C 4 17
## 4: D 3 15
## 5: E 2 12

We’ll start with a very simple join with default parameters.

# Right outer join
big[small, on = .(id)]
## id a b y z
## <char> <int> <int> <int> <int>
## 1: A NA NA 4 10
## 2: B 101 288 2 19
## 3: C 105 236 4 17
## 4: D 101 219 3 15
## 5: E 101 225 2 12

We’ve joined the tables big and small by feeding small into the first
slot of big and telling data.table which field to join on with the keyword on.

If you are accustomed to the default behavior of merge.data.frame(), the results of this join may surprise you. By default, joining data.tables with this X[Y] syntax results in a right outer join, with big as the LHS table and small as the RHS table. Here, the row from small with id == 'A' shows up in the result set. Since there is no corresponding record in big, big's fields for that
record show up as NA. At this point the only other thing to note is that the
columns in the resulting table reflect the ordering of the tables from left to
right in the procedure call (i.e., big's columns are followed by small's).

If we switch the order in which we name the tables to be joined, the nature of the resulting join might be even more apparent.

# Right outer join
small[big, on = .(id)]
## id y z a b
## <char> <int> <int> <int> <int>
## 1: B 2 19 101 288
## 2: C 4 17 105 236
## 3: D 3 15 101 219
## 4: E 2 12 101 225
## 5: F NA NA 102 202
## 6: G NA NA 104 240
## 7: H NA NA 102 300
## 8: I NA NA 102 226
## 9: J NA NA 101 235
## 10: K NA NA 104 204

As we’ve seen, in this left_table[right_table] syntax, all joins are right outer joins by default. We need to set the nomatch parameter to change this behavior. To get an inner join you need to set the nomatch parameter to NULL or 0.

# Inner join
small[big, on = .(id), nomatch = NULL]
## id y z a b
## <char> <int> <int> <int> <int>
## 1: B 2 19 101 288
## 2: C 4 17 105 236
## 3: D 3 15 101 219
## 4: E 2 12 101 225

Inner joins are the default result of calls to merge.data.frame(). For compatibility, merge.data.table() also executes inner joins by default.

# X[Y, nomatch = NULL] == merge(X,Y)
all(small[big, on = .(id), nomatch = NULL] == merge(small, big))
## [1] TRUE

Since data.table has implemented its own merge() function, you may be asking yourself why you should leave the familiarity of merge() for this more spartan syntax. It’s a good question. In general, the non-merge X[Y] syntax is the preferred, more data.table-ish way of joining. The reason is that data.table automatically inspects the j expression to determine the minimum set of columns that need to be joined to carry out the computations specified there. Using merge(), on the other hand, would either merge all of the columns from the RHS table whether they’re needed or not (wasteful) or would require the user to manually winnow the columns of the RHS table down to the minimum necessary (verbose and unnecessary).

So, for example, the join/query below only merges column a from big because a is the only column from the RHS table that is needed for subsequent computations.

# data.table takes care of minimizing the number of 
# joined columns
small[big, on = .(id), .('a + z' = a + z)]
## a + z
## <int>
## 1: 120
## 2: 122
## 3: 116
## 4: 113
## 5: NA
## 6: NA
## 7: NA
## 8: NA
## 9: NA
## 10: NA

Obviously, the efficiency gains are minimal for this toy example, but the data.table package is all about efficiently processing large data sets where this sort of optimization might mean the difference between being able to run an analysis in memory or not.

There are two so-called joins, the anti-join and the semi-join, that don’t really return a combination of columns from two tables. To return only the rows of big that have no matching id in small is straightforward. To get this anti-join behavior you simply negate the RHS table.

# Anti-join 
# (return rows of big with no match in small)
big[!small, on = .(id)]
## id a b
## <char> <int> <int>
## 1: F 102 202
## 2: G 104 240
## 3: H 102 300
## 4: I 102 226
## 5: J 101 235
## 6: K 104 204

A semi-join is kind of like an anti-anti-join. It returns only the rows of the LHS table that have a match in the RHS table. The syntax for this is a little ungainly, but the elements of big that match ids in small can be obtained as shown below.

# Semi-join 
# (return only rows of big that have a
# match in small)
big[na.omit(big[small, on = .(id), which=TRUE])]
## id a b
## <char> <int> <int>
## 1: B 101 288
## 2: C 105 236
## 3: D 101 219
## 4: E 101 225

The which parameter instructs data.table to return, for each row of small, the index of the matching row in big, with NA where a row of small has no match. We then feed those row indexes, minus the NA values, into the i field of an outer query on big.
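To see what the inner query contributes on its own, it can help to run it in isolation. This sketch uses stripped-down, id-only stand-ins for big and small with the same ids as above:

```r
library(data.table)

big   <- data.table(id = c("B","C","D","E","F","G","H","I","J","K"))
small <- data.table(id = c("A","B","C","D","E"))

# For each row of small, which = TRUE returns the index of the
# matching row in big, or NA when there is no match
idx <- big[small, on = .(id), which = TRUE]
idx
## [1] NA  1  2  3  4

# A has no match; B..E sit in rows 1..4 of big. na.omit(idx)
# leaves exactly the rows of big that the semi-join keeps.
na.omit(idx)
```
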

Finally, what can you do if you want to join all available data from both tables? A clever idea for a full outer join with X[Y] syntax emerged from this classic Stackoverflow exchange.

Implementing the Stackoverflow idea requires us to take a slight detour to introduce the idea of keys. Up to now, we have specified the joining variable each time with the on parameter. We can make id the implicit joining variable in all subsequent joins by setting it as a key in both tables. This makes joins very fast at the (small) cost of an upfront referential ordering of the variables. It also streamlines the syntax available for selecting rows based on the keys. Let’s see that feature first.

setkey(small, id)
setkey(big, id)
# With keys set, we don't need to specify
# columns to join on
big[c('A','C', 'E' )]
## id a b
## <char> <int> <int>
## 1: A NA NA
## 2: C 105 236
## 3: E 101 225

Above, data.table “knew” that we wanted the rows of big that corresponded to ids A, C and E. This type of row indexing comes into play in the full outer join below.

# Full outer join
unique_ids = unique(c(big[, id], small[, id]))
big[small[unique_ids]]
## id a b y z
## <char> <int> <int> <int> <int>
## 1: B 101 288 2 19
## 2: C 105 236 4 17
## 3: D 101 219 3 15
## 4: E 101 225 2 12
## 5: F 102 202 NA NA
## 6: G 104 240 NA NA
## 7: H 102 300 NA NA
## 8: I 102 226 NA NA
## 9: J 101 235 NA NA
## 10: K 104 204 NA NA
## 11: A NA NA 4 10

Obviously, whether big is the LHS or the RHS table only affects the ordering of the columns in the joined tables when we perform a full outer join.
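For comparison, merge() delivers a full outer join in a single call with all = TRUE, at the cost of always merging every column; the X[Y] recipe above is what lets data.table prune columns that aren't needed. A self-contained sketch with simplified tables:

```r
library(data.table)

big   <- data.table(id = LETTERS[2:11], a = 1:10)   # ids B..K
small <- data.table(id = LETTERS[1:5],  y = 11:15)  # ids A..E

# all = TRUE keeps unmatched rows from both tables
full <- merge(big, small, by = "id", all = TRUE)
nrow(full)  # 11 rows: the union of ids A..K
```
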

Before we leave this section, let’s explicitly demonstrate a self join, the joining of a table on itself to effect some transformation. The first data.table we created in this tutorial, x, had duplicate values in the id field. Let’s say we want to pick off just the records with the first occurrence of each id value. That can be accomplished by setting the mult (for multiple matches) parameter to ‘first’. Other legal values for mult are ‘all’ (the default) and ‘last’.

# Self-join
setkey(x, id)
x[unique(id), mult = 'first']
## id value
## <char> <int>
## 1: A 4
## 2: B 2
## 3: C 1
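For symmetry, here is what mult = 'last' returns on the same keyed table: the final record for each id. The snippet rebuilds x inline with the values shown earlier so it runs on its own (setkey's sort is stable, so within each id the original row order is preserved):

```r
library(data.table)

# Recreate x with the values printed earlier in the article
x <- data.table(id    = c("A","A","A","A","B","B","B","A","C","C"),
                value = c(4L, 1L, 5L, 4L, 2L, 2L, 3L, 1L, 1L, 3L))
setkey(x, id)

# Last occurrence of each id
res <- x[unique(id), mult = 'last']
res
##        id value
## 1:      A     1
## 2:      B     3
## 3:      C     3
```
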

Both to illustrate how it’s done and because I prefer the explicitness of the on syntax for pedagogic purposes, I’m going to remove the keys from these tables. In real research, though, I would almost always use keys.

setkey(x, NULL)
setkey(small, NULL)
setkey(big, NULL)
# Now when we try to access rows as we did when
# id was a key, we get an error ...
big[c('A','C', 'E' )]
## Error in `[.data.table`(big, c("A", "C", "E")): When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

Before moving on to the next section, the truly well-crafted error message above deserves your reverence.

Non-Equi Joins

So far, all the joins we’ve looked at have involved equality between joining columns (or the negation of equality). Joins based on inequalities are also natively supported by data.table.

Let’s start exploring this powerful feature by loading the Red Sox’ home schedule for September 2021 into a data.table with fread(), the package’s freakishly quick, insanely smart file reader. In the snippet below, take a moment to appreciate fread()’s data-type recognition. It just brings a smile to my face every time. 😄

# Red Sox home schedule
fenway <- fread(
"Visitor, Start, End
Cleveland, 2021-09-03, 2021-09-05
Tampa Bay, 2021-09-06, 2021-09-08
Baltimore, 2021-09-17, 2021-09-19
NY Mets, 2021-09-21, 2021-09-22
NY Yankees, 2021-09-24, 2021-09-26")
fenway
## Visitor Start End
## <char> <IDat> <IDat>
## 1: Cleveland 2021-09-03 2021-09-05
## 2: Tampa Bay 2021-09-06 2021-09-08
## 3: Baltimore 2021-09-17 2021-09-19
## 4: NY Mets 2021-09-21 2021-09-22
## 5: NY Yankees 2021-09-24 2021-09-26

Let’s also generate a data.table with all the dates in September 2021.

# All the days of September 2021
september <-
data.table(date = seq(as.Date('2021-09-01'),
as.Date('2021-09-30'), 'days'))
head(september)
## date
## <Date>
## 1: 2021-09-01
## 2: 2021-09-02
## 3: 2021-09-03
## 4: 2021-09-04
## 5: 2021-09-05
## 6: 2021-09-06

By right outer joining the fenway table to the dates of the month using the range of start and end dates for each homestand and performing a little clean-up, we can obtain a full list of Red Sox home games!

# The dates of all Red Sox home games in September 2021
home_games <-
fenway[september, on = .(Start <= date, End >= date),
.(Visitor, Date = Start)][!is.na(Visitor)]
home_games## Visitor Date
## <char> <Date>
## 1: Cleveland 2021-09-03
## 2: Cleveland 2021-09-04
## 3: Cleveland 2021-09-05
## 4: Tampa Bay 2021-09-06
## 5: Tampa Bay 2021-09-07
## 6: Tampa Bay 2021-09-08
## 7: Baltimore 2021-09-17
## 8: Baltimore 2021-09-18
## 9: Baltimore 2021-09-19
## 10: NY Mets 2021-09-21
## 11: NY Mets 2021-09-22
## 12: NY Yankees 2021-09-24
## 13: NY Yankees 2021-09-25
## 14: NY Yankees 2021-09-26

For those who may not have seen it before, I should note that the code segment between the last two square brackets ([!is.na(Visitor)]) is a form of operation chaining that allows us to refine results in a previous operation, here removing rows where the Visitor field is empty.
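Chaining composes naturally with the grouping syntax from the first section. As a small illustration, here is a grouped count of home dates per visitor chained directly onto the join result (the snippet rebuilds a two-series version of the tables so it stands alone):

```r
library(data.table)

fenway <- fread(
"Visitor, Start, End
Cleveland, 2021-09-03, 2021-09-05
Tampa Bay, 2021-09-06, 2021-09-08")

september <- data.table(date = seq(as.Date('2021-09-01'),
                                   as.Date('2021-09-30'), 'days'))

# Non-equi join, drop the off days, then chain a grouped count
home_games <- fenway[september, on = .(Start <= date, End >= date),
                     .(Visitor, Date = Start)][!is.na(Visitor)]

home_games[, .N, by = Visitor]
##      Visitor     N
## 1: Cleveland     3
## 2: Tampa Bay     3
```
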

Before moving on to rolling joins, I need to note a big syntactical gotcha: the ordering of the terms in the inequalities (or the equalities, for that matter) is significant. This logically equivalent (and, to my mind, more intuitive) attempt to execute the operation will fail.

# The dates of all Red Sox home games in September 2021
fenway[september, on = .(date >= Start, date <= End),
.(Visitor, Date = Start)][!is.na(Visitor)]
## Error in colnamesInt(x, names(on), check_dups = FALSE): argument specifying columns specify non existing column(s): cols[1]='date'

Okay, that error message is a little hard to grok, but the left side of each relation in the join statement has to refer to a column of the LHS table, and the right side to a column of the RHS table. Start is a LHS field and date is a RHS field, so the ordering has to be, e.g., Start <= date. When you think about it, this behavior is perfectly sensible: the two tables may have columns with clashing names, and this ordering of terms allows easy disambiguation.
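A related wrinkle: in the result of a non-equi join, the columns named after the on clause hold the values from the i table, which is why Date = Start gave us calendar dates above. If you also want the LHS table's own range columns, data.table lets you reach them with the x. prefix in j. A sketch with a one-homestand version of the tables (the names game_date and homestand_opens are just illustrative):

```r
library(data.table)

fenway <- fread(
"Visitor, Start, End
Cleveland, 2021-09-03, 2021-09-05")

september <- data.table(date = seq(as.Date('2021-09-01'),
                                   as.Date('2021-09-10'), 'days'))

# Start in j holds september's date values; x.Start recovers
# fenway's own Start column
res <- fenway[september, on = .(Start <= date, End >= date),
              .(Visitor, game_date = Start, homestand_opens = x.Start)
             ][!is.na(Visitor)]
res
##      Visitor  game_date homestand_opens
## 1: Cleveland 2021-09-03      2021-09-03
## 2: Cleveland 2021-09-04      2021-09-03
## 3: Cleveland 2021-09-05      2021-09-03
```
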

Rolling Joins

A rolling join matches rows between tables and, where there is no exact match, carries the last available match forward (or the next one backward). Let’s continue to use our home_games and september tables to see an admittedly contrived example. We’ll build up a list of all dates in September and, for each date, we’ll track the last home opponent faced. It’s a little hard to think of why this might be useful, but it exercises the functionality.

By way of reminder, a simple right outer join between the two tables would produce gaps in the Visitor field, the gaps that I removed by tacking [!is.na(Visitor)] onto the end of the inequality join in the last section. It’s these gaps that we want to now fill in with the last populated value.

# Without !is.na(Visitor)
home_games[september, on = .(Date == date)]
##      Visitor       Date
## <char> <Date>
## 1: <NA> 2021-09-01
## 2: <NA> 2021-09-02
## 3: Cleveland 2021-09-03
## 4: Cleveland 2021-09-04
## 5: Cleveland 2021-09-05
## 6: Tampa Bay 2021-09-06
## 7: Tampa Bay 2021-09-07
## 8: Tampa Bay 2021-09-08
## 9: <NA> 2021-09-09
## 10: <NA> 2021-09-10
## 11: <NA> 2021-09-11
## 12: <NA> 2021-09-12
## 13: <NA> 2021-09-13
## 14: <NA> 2021-09-14
## 15: <NA> 2021-09-15
## 16: <NA> 2021-09-16
## 17: Baltimore 2021-09-17
## 18: Baltimore 2021-09-18
## 19: Baltimore 2021-09-19
## 20: <NA> 2021-09-20
## 21: NY Mets 2021-09-21
## 22: NY Mets 2021-09-22
## 23: <NA> 2021-09-23
## 24: NY Yankees 2021-09-24
## 25: NY Yankees 2021-09-25
## 26: NY Yankees 2021-09-26
## 27: <NA> 2021-09-27
## 28: <NA> 2021-09-28
## 29: <NA> 2021-09-29
## 30: <NA> 2021-09-30
## Visitor Date

We can carry forward the name of the last home opponent simply by adding roll = TRUE to the previous join.

# Roll the name of the last home opponent forward
home_games[september, on = .(Date == date), roll = TRUE]
## Visitor Date
## <char> <Date>
## 1: <NA> 2021-09-01
## 2: <NA> 2021-09-02
## 3: Cleveland 2021-09-03
## 4: Cleveland 2021-09-04
## 5: Cleveland 2021-09-05
## 6: Tampa Bay 2021-09-06
## 7: Tampa Bay 2021-09-07
## 8: Tampa Bay 2021-09-08
## 9: Tampa Bay 2021-09-09
## 10: Tampa Bay 2021-09-10
## 11: Tampa Bay 2021-09-11
## 12: Tampa Bay 2021-09-12
## 13: Tampa Bay 2021-09-13
## 14: Tampa Bay 2021-09-14
## 15: Tampa Bay 2021-09-15
## 16: Tampa Bay 2021-09-16
## 17: Baltimore 2021-09-17
## 18: Baltimore 2021-09-18
## 19: Baltimore 2021-09-19
## 20: Baltimore 2021-09-20
## 21: NY Mets 2021-09-21
## 22: NY Mets 2021-09-22
## 23: NY Mets 2021-09-23
## 24: NY Yankees 2021-09-24
## 25: NY Yankees 2021-09-25
## 26: NY Yankees 2021-09-26
## 27: NY Yankees 2021-09-27
## 28: NY Yankees 2021-09-28
## 29: NY Yankees 2021-09-29
## 30: NY Yankees 2021-09-30
## Visitor Date

The direction of the roll can be reversed by specifying roll = -Inf.
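Here is a small self-contained sketch of that backward roll, tracking the next home opponent instead of the last one. It uses a tiny two-game stand-in for home_games and a five-day slice of September:

```r
library(data.table)

home_games <- data.table(Visitor = c("Cleveland", "Baltimore"),
                         Date    = as.Date(c("2021-09-05", "2021-09-17")))
september <- data.table(date = seq(as.Date('2021-09-14'),
                                   as.Date('2021-09-18'), 'days'))

# roll = -Inf looks *ahead* to the next match instead of back
res <- home_games[september, on = .(Date == date), roll = -Inf]
res
##      Visitor       Date
## 1: Baltimore 2021-09-14
## 2: Baltimore 2021-09-15
## 3: Baltimore 2021-09-16
## 4: Baltimore 2021-09-17
## 5:      <NA> 2021-09-18
```

After the last game on 2021-09-17 there is nothing left to look ahead to, so the final date stays NA.
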

Overlapping Range Joins

We have already seen a data.table with data in the form of ranges. The fenway table listed Red Sox home games by date range. Let’s say we wanted to keep an eye on who the Yankees were playing during the home games. The data.table package provides the foverlaps() function to join tables by overlapping ranges. (The ‘f’ in the function name stands for fast. The same is true for the fread(), fwrite(), and pretty much all of the other functions in the package that start with f. Speed is the name of the game with this package.)

Let’s create a table with the first few September match-ups for the Yankees.

yankees <- fread(
"NYY Opponent, Start, End
Anaheim, 2021-09-01, 2021-09-01
Baltimore, 2021-09-03, 2021-09-05
Toronto, 2021-09-06, 2021-09-09"
)
setkey(yankees, Start, End)

Joining these two tables by overlapping dates couldn’t be easier. (What is hard is formatting the output to print well on Medium, but that’s a story for another time.)

foverlaps(fenway, yankees)
## NYY Opponent      Start        End    Visitor    i.Start      i.End
##       <char>     <IDat>     <IDat>     <char>     <IDat>     <IDat>
##    Baltimore 2021-09-03 2021-09-05  Cleveland 2021-09-03 2021-09-05
##      Toronto 2021-09-06 2021-09-09  Tampa Bay 2021-09-06 2021-09-08
##         <NA>       <NA>       <NA>  Baltimore 2021-09-17 2021-09-19
##         <NA>       <NA>       <NA>    NY Mets 2021-09-21 2021-09-22
##         <NA>       <NA>       <NA> NY Yankees 2021-09-24 2021-09-26
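As with the bracket syntax, foverlaps() accepts a nomatch argument; nomatch = NULL (or 0 in older versions of the package) drops the homestands with no overlapping Yankees series. A minimal sketch with one-series stand-ins for the two tables:

```r
library(data.table)

fenway <- fread(
"Visitor, Start, End
Cleveland, 2021-09-03, 2021-09-05
Baltimore, 2021-09-17, 2021-09-19")

yankees <- fread(
"NYY_Opponent, Start, End
Baltimore, 2021-09-03, 2021-09-05")
setkey(yankees, Start, End)

# nomatch = NULL drops rows of fenway with no overlapping range in yankees
res <- foverlaps(fenway, yankees, nomatch = NULL)
res
```

Only the Cleveland homestand survives, because it is the only one that overlaps a Yankees series in this cut-down table.
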

Conclusion

So that’s about all I know (or think I know) about joining data.tables.

Like it’s distinguished forbearer before it, R’s data.frame, the data.table package has inspired imitation in the Python world, so it’s possible that some of what we have talked about here will help you with Python work.

Whether in R or in Python, I hope this tutorial has been helpful.
