How to make R run really really fast!!

Piyush Pallav
Published in 1by137 · Sep 2, 2021 · 10 min read

R is one of the best tools I have come across when it comes to crunching and manipulating data. But plain vanilla R is relatively slow and, by my personal standards, a little less elegant (happy to debate that 😃). So the question is: how can you make R a serious data-crunching tool while keeping the code base as simple and elegant as possible?

I have highlighted the important areas to notice within the code blocks!

data.table for rescue!!

The data.table library has been one of the most popular libraries within the R community, loved for its insane memory management and ridiculous speed (every alternative library I have seen, it either matches or exceeds in execution speed, even the respected dplyr and pandas libraries!). In this article I will try to convince you that you only need 5–6 features to execute 99% of your data-related activity.

Also, I admit, seeing 30+ million (3 crore) rows or 10GB+ files getting processed in seconds on my humble 16GB-RAM office laptop, while someone else's Python code throws a memory-exhaustion error, gives me a guilty pleasure 😅.

Here are the industry accepted benchmarks for different data manipulation libraries out there from different languages:

On groupby:

data.table is second only to Polars (implemented in Rust) and obliterates the other famous libraries on groupby operations.

On joins:

For further read on groupby and join benchmarks, please refer to this link.

1. Installing the library

> install.packages("data.table")
> # Loading the library
> library(data.table)

2. Reading/Writing or converting existing dataframe to data.table object

Reading and writing with the data.table package is straightforward:

> # For reading a csv file
> dataset = fread("filename.csv")
> # For writing a csv file
> fwrite(dataset, "filename.csv")
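fread and fwrite also take useful optional arguments. A minimal sketch, assuming nothing beyond a throwaway temp file and made-up column names:

```r
library(data.table)

# Write a tiny table to a temporary csv, then read back only the
# columns we need -- `select` skips parsing the other columns entirely.
tmp = tempfile(fileext = ".csv")
dt  = data.table(AA = c("a", "b"), BB = c("d", "e"), CC = c(10, 1))
fwrite(dt, tmp)

back = fread(tmp, select = c("AA", "CC"))
print(back)
```

On large files, restricting the columns with select is often the single cheapest speed-up available.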

You can also convert an existing dataframe to a data.table object:

> dataset = data.table(dataframe_name)

You can also create your own custom dataset:

> dataset = data.table(AA = c('a','a','a','b','b','b','c','c','c')
,BB = c('d','e','d','e','d','e','d','e','d')
,CC = c(10,1,31,3,17,16,27,43,34)
,DD = c('2020-04-01','2020-04-02','2020-04-03','2020-04-02','2020-04-01','2020-04-03','2020-04-02','2020-04-03','2020-04-01'))

This will result in a table which looks like:

> dataset
   AA BB CC         DD
1:  a  d 10 2020-04-01
2:  a  e  1 2020-04-02
3:  a  d 31 2020-04-03
4:  b  e  3 2020-04-02
5:  b  d 17 2020-04-01
6:  b  e 16 2020-04-03
7:  c  d 27 2020-04-02
8:  c  e 43 2020-04-03
9:  c  d 34 2020-04-01

(If you are following along, please create the same table, as we will be using it again and again.)

3. Basics of data.table

The general layout of a datatable call is dt[i, j, by]. Here i stands for rows, j stands for columns, and by is used when we need to do some kind of aggregation (covered later in the article).

> # Fetching any specific row-column
> dataset[1,2]
BB
1: d
> # Fetching first two rows
> dataset[1:2]
AA BB CC DD
1: a d 10 2020-04-01
2: a e 1 2020-04-02
> # Fetching first column
> dataset[,1]
AA
1: a
2: a
3: a
4: b
5: b
6: b
7: c
8: c
9: c

Note that we only needed a leading , when we wanted to select a column; this is because the column is the 2nd index in dt[i, j, by].

While running code by providing index numbers is fine, most of the work can be done just by using column names. Heads up: Complexity addition ahead💀

4. Conditional Slicing, Dicing and Data Aggregation

> # Cases where AA column = 'a' and CC column > 5
> dataset[AA == 'a' & CC > 5]
AA BB CC DD
1: a d 10 2020-04-01
2: a d 31 2020-04-03
> # Cases where AA column in ('b','c') and BB column = 'd', get column CC and DD
> dataset[AA %in% c('b','c') & BB == 'd', .(CC,DD)]
CC DD
1: 17 2020-04-01
2: 27 2020-04-02
3: 34 2020-04-01

As you can see, the i component of dt[i, j, by] can be used for conditional filtering.
In the second example, one new thing introduced is the dt[, .()] notation, where all the selected column names are provided within .() .

Introducing a better way of looking at data.table: read dt[i, j, by] as dt[where, select, group by].

where is used for conditional filtering
select is used for selecting existing columns or creating new ones
group by is used for grouping data during aggregation

There are 2 nomenclatures used in the select slot: .( ) and := . The first nomenclature, .( ), creates a new dataset with the full set of modifications and allocates it to new memory, while the second nomenclature, := , makes modifications/additions/deletions to the existing dataset within its existing memory.

Let's do some aggregation; things will become a bit clearer. If you are finding it confusing, bear with me. This is the most important concept of the whole lot, and generally 70% of the work gets done by these operations alone.

> # Summing column CC by first nomenclature
> dataset[,.(EE = sum(CC))]
EE
1: 182
> # Summing column CC by second nomenclature
> dataset[,`:=` (EE = sum(CC))]
AA BB CC DD EE
1: a d 10 2020-04-01 182
2: a e 1 2020-04-02 182
3: a d 31 2020-04-03 182
4: b e 3 2020-04-02 182
5: b d 17 2020-04-01 182
6: b e 16 2020-04-03 182
7: c d 27 2020-04-02 182
8: c e 43 2020-04-03 182
9: c d 34 2020-04-01 182

As you can see, in both scenarios we created a new column named EE. But in the first case, the aggregation of CC happens and a new datatable gets created with only one entry, while in the second case the aggregation of CC happens and the result gets attached (recycled to every row) in the existing datatable.

So, as a rule of thumb: if I need to make a new datatable with all the modifications, I use the first nomenclature; if I need to make modifications to the existing datatable, I use the second.

Doing some complex manipulation and saving it to a new variable:

> # Max of CC, sum of CC grouped at AA where BB = 'd'
> # Nomenclature 1
> dataset2 = dataset[BB == 'd'
,.(EE = max(CC),FF = sum(CC))
, by = .(AA)]
AA EE FF
1: a 31 41
2: b 17 17
3: c 34 61
> # Nomenclature 2
> dataset2 = dataset[BB == 'd'
,`:=`(EE = max(CC),FF = sum(CC))
, by = .(AA)]
AA BB CC DD EE FF
1: a d 10 2020-04-01 31 41
2: a e 1 2020-04-02 NA NA
3: a d 31 2020-04-03 31 41
4: b e 3 2020-04-02 NA NA
5: b d 17 2020-04-01 17 17
6: b e 16 2020-04-03 NA NA
7: c d 27 2020-04-02 34 61
8: c e 43 2020-04-03 NA NA
9: c d 34 2020-04-01 34 61

Note, in the second case, since we used the := notation, no new memory allocation took place. The modifications were made to the dataset variable itself, and the dataset2 variable simply got mapped to dataset's memory address.
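This reference behaviour is worth seeing once in isolation. A minimal sketch (the table and column names here are illustrative, not the article's dataset): copy() is how you break the link when you genuinely want an independent table.

```r
library(data.table)

dt1 = data.table(AA = c("a", "b"), CC = c(10, 1))

dt2 = dt1            # dt2 points at the same object in memory
dt2[, EE := CC * 2]  # := modifies in place, so dt1 gains EE too
print("EE" %in% names(dt1))  # TRUE

dt3 = copy(dt1)      # copy() allocates a genuinely separate table
dt3[, FF := 0]
print("FF" %in% names(dt1))  # FALSE -- dt1 is untouched
```

If you find columns appearing in tables you never touched, a missing copy() is usually the culprit.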

5. Command Chaining

My personal favorite!
You can chain multiple sequential operations performed on the datatable in a single line. This can enormously reduce your code length while preserving code readability.

> # sum of (Max of CC, sum of CC grouped at AA where BB = 'd')
> dataset[BB == 'd'
,`:=` (EE = max(CC), FF = sum(CC))
, by = .(AA)][,`:=` (GG = EE+FF)]
AA BB CC DD EE FF GG
1: a d 10 2020-04-01 31 41 72
2: a e 1 2020-04-02 NA NA NA
3: a d 31 2020-04-03 31 41 72
4: b e 3 2020-04-02 NA NA NA
5: b d 17 2020-04-01 17 17 34
6: b e 16 2020-04-03 NA NA NA
7: c d 27 2020-04-02 34 61 95
8: c e 43 2020-04-03 NA NA NA
9: c d 34 2020-04-01 34 61 95
# If the objective is to create a new table and then operate on it
> dataset2 = dataset[BB == 'd'
,.(EE = max(CC), FF = sum(CC))
, by = .(AA)][,`:=` (GG = EE+FF)]
AA EE FF GG
1: a 31 41 72
2: b 17 17 34
3: c 34 61 95

6. Joining 2 tables

Joining 2 tables on some column key is one of the most common operations in a Data Analytics or Data Science exercise, as there can be many sources from which one may be pulling in data.

The pattern is Z = X[Y, on = "colname"]: for each row of Y, the matching rows of X are looked up, and Z is the datatable where everything gets put together. In other words, X[Y] keeps every row of Y and attaches X's columns to it; on provides the key column(s).

> # Creating 2nd table
> dataset2 = data.table(AA = c('a','a','b','b')
,BB = c('d','e','d','e')
,EE = c('alpha', 'gamma', 'beta', 'delta'))
AA BB EE
1: a d alpha
2: a e gamma
3: b d beta
4: b e delta
> # joining dataset2 to dataset1 on AA & BB column combination
> # Note: AA and BB combined make unique key pair
> dataset3 = dataset2[dataset, on = c('AA','BB')]
AA BB CC DD EE
1: a d 10 2020-04-01 alpha
2: a e 1 2020-04-02 gamma
3: a d 31 2020-04-03 alpha
4: b e 3 2020-04-02 delta
5: b d 17 2020-04-01 beta
6: b e 16 2020-04-03 delta
7: c d 27 2020-04-02 NA
8: c e 43 2020-04-03 NA
9: c d 34 2020-04-01 NA
> # joining dataset2 to dataset1 on the AA column only
> # Note: AA alone doesn't make a unique key, and BB is present in both
> # datatables. If you still want to join the two tables, you have to
> # explicitly allow the cartesian join, else it will throw an error
> dataset3 = dataset2[dataset, on = "AA", allow.cartesian = TRUE]
AA i.BB CC DD EE BB
1: a d 10 2020-04-01 alpha d
2: a d 10 2020-04-01 gamma e
3: a e 1 2020-04-02 alpha d
4: a e 1 2020-04-02 gamma e
5: a d 31 2020-04-03 alpha d
6: a d 31 2020-04-03 gamma e
7: b e 3 2020-04-02 beta d
8: b e 3 2020-04-02 delta e
9: b d 17 2020-04-01 beta d
10: b d 17 2020-04-01 delta e
11: b e 16 2020-04-03 beta d
12: b e 16 2020-04-03 delta e
13: c d 27 2020-04-02 NA NA
14: c e 43 2020-04-03 NA NA
15: c d 34 2020-04-01 NA NA
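If you need explicit control over left/right/full join behaviour, data.table also supports the familiar merge() interface. A sketch recreating the two tables above (with all = TRUE, unmatched rows from both sides are kept, i.e. a full outer join):

```r
library(data.table)

dataset  = data.table(AA = c('a','a','a','b','b','b','c','c','c')
                      ,BB = c('d','e','d','e','d','e','d','e','d')
                      ,CC = c(10,1,31,3,17,16,27,43,34))
dataset2 = data.table(AA = c('a','a','b','b')
                      ,BB = c('d','e','d','e')
                      ,EE = c('alpha','gamma','beta','delta'))

# all = TRUE -> full outer join; all.x / all.y give left / right joins
dataset3 = merge(dataset, dataset2, by = c("AA","BB"), all = TRUE)
print(dataset3)
```

The X[Y] form is faster for lookups, but merge() reads more explicitly when join direction matters.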

7. Data Pivoting

The ability to do data pivoting gives extra teeth to the data manipulation arsenal as sometimes data is not exactly in the correct orientation/format and might require some working to make it more consumable. Or this can be used to simplify the table structure and make it more reader-friendly.

Let's work with the following dataset:

> dataset4 = data.table(AA = c('a','a','a','a','b','b','b','b')
,BB = c('abc','abc','xyz','xyz','abc','abc','xyz','xyz')
,CC = c('2020-04-01','2020-04-02','2020-04-01','2020-04-02','2020-04-01','2020-04-02','2020-04-01','2020-04-02')
,DD = c(1,2,3,4,5,6,7,8))
> dataset4
   AA  BB         CC DD
1:  a abc 2020-04-01  1
2:  a abc 2020-04-02  2
3:  a xyz 2020-04-01  3
4:  a xyz 2020-04-02  4
5:  b abc 2020-04-01  5
6:  b abc 2020-04-02  6
7:  b xyz 2020-04-01  7
8:  b xyz 2020-04-02  8

7.1. Casting
Think of this as widening the table and reducing the height of the table. Let entries of AA and BB remain as rows and entries of CC become unique column names.

> dataset5 = dcast(dataset4, AA + BB ~ CC, value.var = 'DD')
AA BB 2020-04-01 2020-04-02
1: a abc 1 2
2: a xyz 3 4
3: b abc 5 6
4: b xyz 7 8

7.2. Melting
Think of this as narrowing the table and increasing its height. Let's use the newly created dataset5 and roll things back to how the data looked in dataset4.

> dataset6 = melt(dataset5, id.vars = c('AA','BB'), variable.name = "Date", value.name = "Values")
AA BB Date Values
1: a abc 2020-04-01 1
2: a xyz 2020-04-01 3
3: b abc 2020-04-01 5
4: b xyz 2020-04-01 7
5: a abc 2020-04-02 2
6: a xyz 2020-04-02 4
7: b abc 2020-04-02 6
8: b xyz 2020-04-02 8

8. A few honorable mentions

8.a. Renaming columns

> # Changing columns AA and BB to AB and BC
> setnames(dataset
,old = c("AA","BB")
,new = c("AB","BC"))

8.b. Setting order

> # Setting order, AA in ascending, CC in descending
> setorder(dataset,AA,-CC)

8.c. Cross join (all row combinations) when there is no common column

> # Creating dataset2
> dataset2 = data.table(YY = c('abc','def'),ZZ = c(1,2))
> # Cross join: every row of dataset paired with every row of dataset2
> dataset3 = setkey(dataset[, c(k = 1, .SD)], k)[dataset2[, c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL]
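That one-liner packs three steps together. Here is the same constant-key trick unpacked with comments, on small illustrative tables:

```r
library(data.table)

# Cross join between two tables that share no column: give both a
# constant dummy key k = 1, join on k, then drop k.
dataset  = data.table(AA = c("a", "b", "c"))
dataset2 = data.table(YY = c("abc", "def"), ZZ = c(1, 2))

d1 = dataset[,  c(k = 1, .SD)]   # add k = 1 alongside all columns (.SD)
d2 = dataset2[, c(k = 1, .SD)]

setkey(d1, k)                         # key d1 on the dummy column
out = d1[d2, allow.cartesian = TRUE]  # every row pairs with every row
out[, k := NULL]                      # drop the helper column
print(out)                            # 3 x 2 = 6 rows
```

allow.cartesian = TRUE is required because the join deliberately produces more rows than either input.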

8.d. Handling NA

> # Replacing whole table's NA with 0
> dataset[is.na(dataset)] = 0
> # Replacing case basis NA within columns with 1
> dataset[is.na(AA),`:=` (AA = 1)]

The first one replaces all NA entries within the datatable with 0; the second replaces all NA entries in the AA column with 1 (note that the replacement value must match the column's type).
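When different columns need different fill values, set() does the replacement by reference, one column at a time. A sketch with illustrative columns and fill values of my own choosing:

```r
library(data.table)

dt = data.table(AA = c(1, NA, 3), BB = c(NA, "x", "y"))

for (col in names(dt)) {
  # set() assigns by reference: rows where the column is NA get the fill
  fill = if (is.numeric(dt[[col]])) 0 else ""
  set(dt, which(is.na(dt[[col]])), col, fill)
}
print(dt)
```

Because set() avoids the overhead of the full [ machinery, this loop stays fast even across hundreds of columns.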
