Packages in Blog Posts that Use R to Analyze Covid-19 Data

Rees Morrison
Published in Analytics Vidhya
Sep 15, 2020

To help the Covid-19 research community and the R bloggers who contribute to it, I have been collecting blog posts that combine R programming and coronavirus data (“CovidRposts”). URLs for the seven previous articles are listed at the end of this one. My goal from the beginning has been to spread knowledge of programming tools, data sources and mathematical techniques. Another line of inquiry aims to help the R community find like-minded bloggers, whether by the bloggers’ work roles and countries (particularly those of the prolific contributors) or by the topics of their posts. Of late, I have realized that my research can say much about knowledge sharing in the community of R practitioners. If any reader would like to see the code I used for this article or to obtain the CovidRposts data, please write me at Rees(at)ReesMorrison(dot)com.

Because R researchers often want to see the code and techniques of others, for this piece let’s see how many of the posters provide their R code (either in the post itself or in a GitHub repository).

Each column of the plot represents what I refer to as the “Core Role” of the blogger, meaning where they work. Within each column the lower, gold segment shows the number of blog posts that provide R code. The top, dark segment shows how many of the posts offer no code.

Academics are more likely to provide their code than corporate bloggers (22% of the academics’ posts have no code, compared to 30% for the corporate types). Surprisingly, even among academic bloggers close to one out of five posts do not include code. One would think they would be much more conscious of reproducibility and knowledge sharing. Also notable is that bloggers working for corporations share their code a bit more than two-thirds of the time. One might have thought that companies would insist on their employees keeping proprietary code private (although many bloggers write as individuals, not as employees). The other three roles, and the five bloggers whose Core Role I have not been able to identify, have too few posts to support any conclusions. That group includes 8 no-code posts and 22 code posts. When all 242 posts are analyzed, almost exactly one out of four do not share their R code (61 out of 242).
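The shares quoted above are simple proportions. As a minimal sketch, using only the counts stated in this article (the variable names are mine, chosen for illustration), the arithmetic can be checked in base R:

```r
# Counts taken from the text above; variable names are illustrative.
total_posts   <- 242
no_code_posts <- 61

# Share of all posts that do not provide R code, as a percentage.
pct_no_code_all <- round(100 * no_code_posts / total_posts, 1)

# The combined "other roles and unidentified" group: 8 no-code and 22 code posts.
other_no_code <- 8
other_code    <- 22
pct_no_code_other <- round(100 * other_no_code / (other_no_code + other_code), 1)

print(pct_no_code_all)    # about one post in four
print(pct_no_code_other)
```

With so few posts in the combined group, its percentage should be read loosely.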

Among the posts that include code, consider next how many R packages the bloggers used. This plot breaks down the code-posts by their primary topic, which is my (admittedly subjective) assessment of the dominant theme of each post. To repeat, this plot only covers the 180 posts where code is available. The number of libraries (R packages) used in those posts appears on the x-axis. That number ranges from a single package up to a maximum of 29. We should note that these numbers have some squishiness because they come from a search for the number of times “library” appears in the text of a blog post. It could be that in a few instances someone used the word “library” other than in code for loading a package (“library(name of package)”). Incidentally, some bloggers use “require(name of package)”, but I edited my copy of the post to replace “require” with “library”.
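To make the counting approach and its squishiness concrete, here is a minimal sketch in base R. It is not the code used for this article: the sample text and the variable names are invented for illustration, and the stricter pattern is one possible way to reduce false positives.

```r
# Hypothetical lines from a blog post (invented for illustration).
post_text <- c(
  "library(dplyr)",
  "require(ggplot2)",
  "The tidyverse is a library of packages.",  # the word appears, but not as a call
  "library(\"lubridate\")"
)

# Treat require() as library(), mirroring the manual replacement described above.
normalized <- gsub("\\brequire\\s*\\(", "library(", post_text, perl = TRUE)

# Naive count: every occurrence of the word "library".
naive_hits <- sum(lengths(regmatches(
  normalized, gregexpr("library", normalized, fixed = TRUE)
)))

# Stricter count: only "library(" calls, which skips the word used in prose.
call_hits <- sum(lengths(regmatches(
  normalized, gregexpr("\\blibrary\\s*\\(", normalized, perl = TRUE)
)))

print(c(naive = naive_hits, calls = call_hits))
```

The gap between the two counts is the kind of overcount the naive word search can produce. Neither count removes duplicates, so a package loaded twice in one post is counted twice.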

Additionally, this plot colors the points by whether a mathematical technique was prominent or not. Approximately half of the blog posts that provided code also used a mathematical technique in a noteworthy way that the blogger stressed. For instance, the four text-mining posts that provided code all relied on math, often having to do with sparse matrices or linear algebra. About half of the epidemiology posts, however, did not employ a notable math technique. Those that did were mostly addressing SIR models and solving ordinary differential equations.

As a third perspective on the presence of R code, I created a mosaic plot. The left portion represents the blog posts that do not provide R code (either in the post proper or in a GitHub repository). The right portion represents the roughly two-thirds of posts that provide code. The horizontal bands indicate the primary topics of the posts (this plot omits the few posts that focus on text mining or statistics because the color-blind fill palette covers a maximum of eight colors). To take an example, in the code posts the largest rectangle falls under epidemiology (green); among the no-code posts, economics (blue) has almost as many posts as epidemiology.
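For readers unfamiliar with mosaic plots, the sketch below draws one with base R's `mosaicplot()`. The counts are made up purely to show the layout and are not the CovidRposts figures; the topic subset, colors and object names are likewise assumptions for illustration.

```r
# Toy counts, invented for illustration only (NOT the CovidRposts data).
toy <- matrix(
  c(5, 12,   # Epidemiology: no code, code
    4,  6,   # Economics:    no code, code
    2,  7),  # Mapping:      no code, code
  nrow = 2,
  dimnames = list(
    code  = c("No code", "Code"),
    topic = c("Epidemiology", "Economics", "Mapping")
  )
)
tab <- as.table(toy)

# Column widths follow the first dimension (code vs. no code);
# the bands within each column follow the second dimension (topic).
mosaicplot(
  tab,
  color = c("forestgreen", "steelblue", "goldenrod"),
  main  = "Toy example: code availability by topic"
)

# Share of toy posts in each column, i.e. the relative column widths.
code_share <- prop.table(margin.table(tab, 1))
print(round(code_share, 2))
```

In a plot of this kind, a wide right-hand column means most posts share code, and the tallest band within a column marks its most common topic.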

Previous articles in the series:

June 15, 2020 [1] dates, roles and countries; June 29, 2020 [2] packages; July 8, 2020 [3] data sources and math; July 14, 2020 [4] topics; July 29, 2020 [5] R and COVID-19 terms; August 6, 2020 [6] prolific posters’ pace and match of blog posts (215) to worldwide positive cases; and August 27, 2020 [7] numbers of characters in blog posts.
