Awesome write-up!
Matt Joseph

For larger data sets, you’d want to host the data somewhere other than GitHub, such as AWS. There are R packages that enable working with files on AWS, and you can spin up a large EC2 instance and try to load everything into memory.

Once the data no longer fits in memory on a single instance, I’d recommend using Spark to scale up to larger data sets. However, I’m not aware of anything similar to R Markdown for Spark, so you might need to create intermediate results files using Spark and then use R Markdown to produce the visualizations. Also, I’m not sure how to manage a Spark environment so that it is reproducible.
