Ridiculously Easy Code Optimizations in R: Part 1
Looking beyond read.csv()
My last story generated quite a buzz at my university, since it covered how I did my part, as a university student, to fight a global pandemic. Many of my friends were intrigued by one specific line of code in that story. That line was:
test_df <- readRDS("input_data/cleaned_alumni_2.rds")
Even with zero idea of what readRDS is, I think it would be amply clear to you that here we are reading a data file named “cleaned_alumni_2”, with a strange “.rds” extension, from a folder named “input_data”.
If you arrived at the above inference yourself, congratulations, it is absolutely correct. Okay, so “rds”, much like “csv”, is a file format. What makes it better, though, is that it is a file format native to R: it stores a single R object as R itself represents it, column types included. Being a native format, doesn’t it make intuitive sense that loading an “rds” file into your R script/model/dashboard would be faster than loading a generic “csv” file? R does not have to parse text or guess column types along the way.
Isn’t this getting exciting? A different file format, native to R, faster loading times, isn’t this so cool?
Now, your next logical question would most probably be: what should I do if I only have my data set as a “csv” file? Don’t worry, I’ll show you how to convert it to the “rds” format in just two lines of code.
original_dataset <- read.csv("filename.csv")
saveRDS(original_dataset, "new_filename.rds")
Well, that’s it, the entirety of it. The first argument to the “saveRDS()” function is the data set that you want to save as an “rds” file, and the second argument is the new filename. Smooth. Now how do we read that data?
converted_dataset <- readRDS("new_filename.rds")
So, that’s the end of it. That is how you replace your “read.csv()” function with the faster “readRDS()” function.
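To see what “native to R” buys you beyond speed, here is a minimal sketch of a round trip through both formats. The data frame, its column names and the temporary file paths are made up for illustration; the point is only to compare the column classes that come back from each reader.

```r
# Hypothetical sample data: a factor, a Date and a numeric column.
sample_df <- data.frame(
  grade  = factor(c("A", "B", "A")),
  joined = as.Date(c("2020-01-15", "2020-02-20", "2020-03-25")),
  score  = c(91.5, 78.25, 88)
)

# Temporary files so the example does not touch your working directory.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(sample_df, csv_path, row.names = FALSE)
saveRDS(sample_df, rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

# Compare the column classes each reader gives back.
sapply(from_csv, class)  # the Date column is no longer a Date
sapply(from_rds, class)  # classes match the original data frame

# The rds round trip returns the same object we saved.
identical(sample_df, from_rds)
```

With “csv” you would have to convert such columns again after every load; with “rds” they come back the way you saved them.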
Benchmarking:
Well, it wouldn’t hurt to test our claims of speed-up, right? So, here we go!
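A comparison along these lines can be sketched with base R’s system.time(). This is not the exact benchmark behind the numbers quoted in this post: the generated data set, the row count and the number of repetitions are all assumptions, and the speed-up you measure will depend on your data and machine.

```r
# Hypothetical data set; size and columns are assumptions for illustration.
set.seed(42)
n_rows <- 200000
bench_df <- data.frame(
  id    = seq_len(n_rows),
  value = rnorm(n_rows),
  group = sample(letters, n_rows, replace = TRUE)
)

# Write the same data in both formats to temporary files.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(bench_df, csv_path, row.names = FALSE)
saveRDS(bench_df, rds_path)

# Time each reader several times and keep the elapsed seconds.
n_reps <- 5
csv_times <- replicate(n_reps, system.time(read.csv(csv_path))[["elapsed"]])
rds_times <- replicate(n_reps, system.time(readRDS(rds_path))[["elapsed"]])

# Median elapsed time per reader, and the ratio between the two.
median(csv_times)
median(rds_times)
median(csv_times) / median(rds_times)
```

Taking the median over a few repetitions keeps one slow run (for example, a cold disk cache) from dominating the comparison.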
Well, the results are just amazing: a speed-up of almost 7.5x. That’s a huge benefit for changing a single line of code, isn’t it? 😉
You can run the same test on your own system too, using any “csv” files you have. If you have none, simply create a sample data set and write it out with the write.csv() function. As a bonus, check the file size of your “rds” file and compare it with that of the “csv” file (Hint: rds files are compressed too… :D)
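If you want to try the file-size comparison without a “csv” of your own, a minimal sketch could look like this. The sample data frame is invented for illustration, and the exact sizes will differ for your data; saveRDS() compresses with gzip by default (compress = TRUE).

```r
# Hypothetical sample data set, created only for this comparison.
set.seed(1)
n_rows <- 50000
size_df <- data.frame(
  id     = seq_len(n_rows),
  city   = sample(c("Delhi", "Mumbai", "Chennai"), n_rows, replace = TRUE),
  amount = round(runif(n_rows, 0, 1000), 2)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(size_df, csv_path, row.names = FALSE)
saveRDS(size_df, rds_path)  # compress = TRUE is the default

# File sizes in bytes, and how many times larger the csv is.
csv_size <- file.size(csv_path)
rds_size <- file.size(rds_path)
csv_size
rds_size
csv_size / rds_size
```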
Now let us visually inspect the difference between the two functions:
The plot below visually represents what we observed earlier in our benchmarks.
Conclusion:
The compressed file size and faster loading times of “rds” provide humongous benefits when your script has to load a data set again and again (for example, in a deployed ML model or an R Shiny app). Reduced data size means that you use fewer resources on your hosting platform, which again is a good thing to have.
Being a computer science undergrad, I have a heavy interest in optimizations, and we’ll keep on exploring this domain together 😃. Here I’ve not touched upon the various other reading/loading options provided by some external libraries, because I’d promised optimization in a single line. In future posts, we’ll explore those packages too.
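For scripts that load the same data repeatedly, one way to get this benefit automatically is a small caching helper. The sketch below is only an illustration: load_cached() is a hypothetical function name, and the rule of rebuilding the “rds” file whenever the “csv” is newer is an assumption about how you would want the cache to behave.

```r
# Hypothetical helper: read the rds copy if it exists and is up to date,
# otherwise parse the csv once and save an rds copy for next time.
load_cached <- function(csv_path,
                        rds_path = sub("\\.csv$", ".rds", csv_path)) {
  # Guard against overwriting the csv when the path does not end in ".csv".
  stopifnot(rds_path != csv_path)

  if (file.exists(rds_path) &&
      file.mtime(rds_path) >= file.mtime(csv_path)) {
    return(readRDS(rds_path))
  }

  data <- read.csv(csv_path)
  saveRDS(data, rds_path)
  data
}

# Usage with a throwaway csv file.
demo_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), demo_csv,
          row.names = FALSE)

first_load  <- load_cached(demo_csv)  # parses the csv and writes the rds
second_load <- load_cached(demo_csv)  # served from the rds copy
identical(first_load, second_load)
```

The first call pays the read.csv() cost once; every later call goes through readRDS() until the “csv” file changes.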
Thank You and Godspeed.