Ridiculously Easy Code Optimizations in R: Part 1

Looking beyond read.csv()

Rahul Saxena
The Startup
3 min read · Apr 21, 2020


My last story generated quite a buzz at my university, as it was about how I did my part, as a student, to fight a global pandemic. Many of my friends were intrigued by a specific line of code in that story. That line went roughly like this (the variable name below is just illustrative):
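alumni_data <- readRDS("input_data/cleaned_alumni_2.rds")   # read an .rds file from the input_data folder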

Even with zero idea of what readRDS is, I think it's amply clear that here we are reading a data file named “cleaned_alumni_2” with a strange “.rds” extension from a folder named “input_data”.

If you arrived at that inference yourself, congratulations, it is absolutely correct. Okay, so “rds”, much like “csv”, is a file format. What makes it better, though, is that it is a native R data format (more on that later). Being native to R, doesn’t it make intuitive sense that loading an “rds” file into your R script/model/dashboard would be faster than loading a generic “csv” file?

Isn’t this getting exciting? A different file format, native to R, faster loading times, isn’t this so cool?

Now, your next logical question would most probably be: what should I do if I only have my data set as a “csv” file? Don’t worry, I’ll show you how to convert it to the “rds” format in just two lines of code.
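A minimal sketch, with placeholder file names (swap in your own):

my_data <- read.csv("my_data.csv")   # read the existing csv file into a data frame
saveRDS(my_data, "my_data.rds")      # save it out again in the rds format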

Well, that’s it, the entirety of it. The first argument of the “saveRDS()” function is the data set that you want to save as an “rds” file, and the second is the new file name. Smooth. Now, how do we read that data back?
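Like this, reusing the placeholder file name from above:

my_data <- readRDS("my_data.rds")   # loads the data frame back, ready to use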

So, that’s the end of it. That is how you replace your “read.csv()” function with the faster “readRDS()” function.

Benchmarking:

Well, it wouldn’t hurt to test our claims of a speed-up, right? So, here we go!
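Here’s a sketch of the comparison using the microbenchmark package, with the placeholder files from the earlier snippets; the exact numbers will of course vary from machine to machine:

library(microbenchmark)

benchmark_results <- microbenchmark(
  csv = read.csv("my_data.csv"),   # the usual way
  rds = readRDS("my_data.rds"),    # the rds way
  times = 10                       # repeat each read 10 times
)
print(benchmark_results)           # summary of the timings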

Benchmarking readRDS and read.csv functions

Well, the results obtained are just amazing. It’s a speed-up by a factor of almost 7.5x. That’s a huge benefit for changing a single line of code, isn’t it 😉.

You could try running the above test on your system too, using any “csv” files you have on hand. If no “csv” files are available, simply create a sample data set and use the write.csv() function to make one. As a bonus, check out the file size of your “rds” file and compare it with the “csv” file (hint: rds files are compressed too… :D).
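If you want a quick way to do all of that, something along these lines works (the object and file names are placeholders):

sample_data <- data.frame(x = rnorm(1e6), y = rnorm(1e6))      # a sample data set
write.csv(sample_data, "sample_data.csv", row.names = FALSE)   # write it as csv
saveRDS(sample_data, "sample_data.rds")                        # compressed by default
file.size("sample_data.csv")   # size in bytes
file.size("sample_data.rds")   # noticeably smaller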

Now let us visually inspect the difference between the two functions:

Load the “ggplot2” and “microbenchmark” packages.
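One way to do it, reusing the benchmark_results object from the sketch above (microbenchmark results have an autoplot() method once ggplot2 is loaded):

library(ggplot2)
library(microbenchmark)

autoplot(benchmark_results)   # plots the timing distributions for both functions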

The plot below visually represents what we observed earlier in our benchmarks.

Visual plot of benchmarks of read.csv() and readRDS()

Conclusion:

The compressed file size and faster loading times of “rds” provide huge benefits when your script has to load a data set again and again (say, in a deployed ML model or an R Shiny app). Reduced data size also means you use fewer resources on your hosting platform, which again is a good thing to have.

Being a computer science undergrad, I have a keen interest in optimizations, and we’ll keep exploring this domain together 😃. I haven’t touched upon the various other reading/loading options provided by external libraries here, because I’d promised an optimization in a single line. In future posts, we’ll explore those packages too.

You could also buy me a coffee to support my work.

Thank You and Godspeed.
