R package updates and release dates statistics and another rise of R language

Tomaz Kastrun
Jan 25, 2022

R has been around for a long time, and its packages have evolved through the years as well: from initial releases, through updates, to new packages. Like many open-source, community-driven languages, R is no exception. And getting the first release dates of R packages requires a little bit of web scraping and is lots of fun.

CRAN (the Comprehensive R Archive Network) has put a lot of people, rules and hours of work into making the packages available to the general public in a tidy, ready-to-use and easy-to-use fashion.

My hypothesis is that with the emergence of new languages (and statistical languages), R has seen a decline in usage, but it might be recovering. I will try to support this with statistics on R packages: 1) the initial year of release and 2) the updates of packages per year.

Last R Package updates

First, let’s check the last package update dates. By loading rvest and reading the data from the CRAN web site, https://cran.r-project.org/web/packages/available_packages_by_date.html, we can turn the HTML table into a usable data.frame in R.
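A minimal sketch of this scraping step could look like the following. It assumes the CRAN page holds a single table with Date, Package and Title columns; the helper name cran_table is hypothetical and not part of rvest.

```r
library(rvest)

# URL taken from the text above
url <- "https://cran.r-project.org/web/packages/available_packages_by_date.html"

# Hypothetical helper: turn the first HTML table of a parsed page
# into a data.frame with a proper Date column.
# Assumption: the table has a column named "Date" in YYYY-MM-DD format.
cran_table <- function(page) {
  tab <- html_table(html_element(page, "table"))
  dd <- as.data.frame(tab)
  dd$Date <- as.Date(dd$Date)
  dd
}

# Usage (needs network access, so it is left commented out here):
# dd <- cran_table(read_html(url))
# head(dd)
```

Keeping the parsing in a small function makes it easy to try it on a saved copy of the page before hitting the CRAN server.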

Based on this graph, we can see that many of the R packages have been updated in recent years.

Statistics on R packages date of last update

So, how many packages were last updated in each year?

Grouping the packages by the year of their last update gives the buckets shown in the graph.
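The grouping could be sketched roughly as below. It assumes a data frame with a Date column, as produced by the scraping step; the helper name updates_per_year is hypothetical, while the column names PYear and nof follow the table output in this post.

```r
library(dplyr)

# Hypothetical helper: count packages by the year of their last update.
# Assumption: `dd` has a column `Date` that as.Date() can parse.
updates_per_year <- function(dd) {
  dd %>%
    mutate(PYear = as.numeric(format(as.Date(Date), "%Y"))) %>%
    count(PYear, name = "nof")
}

# Usage, once `dd` has been scraped:
# dd_y <- updates_per_year(dd)
# barplot(dd_y$nof, names.arg = dd_y$PYear, las = 2)
```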

The resulting table:

# A tibble: 17 × 2
   PYear   nof
   <dbl> <int>
 1  2006     2
 2  2007     1
 3  2008     6
 4  2009    19
 5  2010    19
 6  2011    39
 7  2012   294
 8  2013   367
 9  2014   482
10  2015   687
11  2016   970
12  2017  1085
13  2018  1573
14  2019  1918
15  2020  3545
16  2021  6817
17  2022   961

So out of 18785 packages (as of January 24th, 2022), 6817 were last updated in 2021 and an additional 3545 in 2020. I am leaving out the year 2022 for now.

By running a simple statistics summary over this data:

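dd_y %>%
  mutate(cumsum = cumsum(nof)             # running total of packages
        ,percY = nof/cumsum(nof)          # share of the running total contributed by this year
        ,percC = cumsum(nof)/sum(nof))    # cumulative share of all packages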

we can see how actively the packages have been updated.

# A tibble: 17 × 5
   PYear   nof cumsum  percY    percC
   <dbl> <int>  <int>  <dbl>    <dbl>
 1  2006     2      2 1      0.000106
 2  2007     1      3 0.333  0.000160
 3  2008     6      9 0.667  0.000479
 4  2009    19     28 0.679  0.00149
 5  2010    19     47 0.404  0.00250
 6  2011    39     86 0.453  0.00458
 7  2012   294    380 0.774  0.0202
 8  2013   367    747 0.491  0.0398
 9  2014   482   1229 0.392  0.0654
10  2015   687   1916 0.359  0.102
11  2016   970   2886 0.336  0.154
12  2017  1085   3971 0.273  0.211
13  2018  1573   5544 0.284  0.295
14  2019  1918   7462 0.257  0.397
15  2020  3545  11007 0.322  0.586
16  2021  6817  17824 0.382  0.949
17  2022   961  18785 0.0512 1

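So the majority of the packages (about 70%, since percC reaches only 0.295 by the end of 2018) were last updated in the last 4 years, 2019 to 2022 (in order to fit the latest R engine updates). A simple correlation between the year and the number of packages also supports this: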

#simple correlation
cor(dd_y)[1,2]

with the value of 0.69.
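As a quick check of these figures, the yearly counts from the table above can be typed in by hand and the shares recomputed. This is only a sketch; the variable names other than dd_y, PYear and nof are made up for illustration.

```r
# Yearly counts copied from the table above
dd_y <- data.frame(
  PYear = 2006:2022,
  nof   = c(2, 1, 6, 19, 19, 39, 294, 367, 482, 687, 970,
            1085, 1573, 1918, 3545, 6817, 961)
)

total <- sum(dd_y$nof)   # 18785 packages in total

# Share of packages last updated in 2019 or later (about 0.70)
share_since_2019 <- sum(dd_y$nof[dd_y$PYear >= 2019]) / total

# Same share without the incomplete year 2022 (about 0.65)
share_2019_2021 <- sum(dd_y$nof[dd_y$PYear %in% 2019:2021]) / total

# Pearson correlation between year and count, as in cor(dd_y)[1,2]
r_year_count <- cor(dd_y$PYear, dd_y$nof)   # about 0.69
```

Note that the correlation is computed over all 17 rows, including the partial year 2022, which pulls the value down somewhat.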

The months (the distribution within the year), however, do not seem to play any particular role. Since the year 2022 has only just started, including it would be slightly unfair. So, to check the distribution of R package updates over months, I have excluded the year 2022 and everything up to and including 2010:

#check distribution over months
dd_ym2010 <- dd_ym %>%
  filter(PYear > 2010 & PYear < 2022)
boxplot(dd_ym2010$nof~dd_ym2010$month_name, main="R Packages update over months", xlab = "Month", ylab="Number of Packages")

Distribution of updates among months

So there is no clear pattern based on central values, although there are somewhat more updates in spring. Even the result of the correlation

cor(dd_ym2010)[2,3]

of 0.06 supports this, making it hard to draw any concrete conclusions.
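The construction of the monthly table dd_ym is not shown in this post. One way it might be built is sketched below; the helper name updates_per_month and the column name PMonth are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical helper: count last updates per year and calendar month.
# Assumption: `dd` has a column `Date` that as.Date() can parse.
updates_per_month <- function(dd) {
  dd %>%
    mutate(Date   = as.Date(Date),
           PYear  = as.numeric(format(Date, "%Y")),
           PMonth = as.numeric(format(Date, "%m"))) %>%
    count(PYear, PMonth, name = "nof")
}

# Usage, once `dd` has been scraped:
# dd_ym <- updates_per_month(dd)
# dd_ym$month_name <- factor(month.abb[dd_ym$PMonth], levels = month.abb)
# cor(dd_ym$PMonth, dd_ym$nof)
```

Storing the month name as a factor with levels in calendar order keeps the boxplot from sorting the months alphabetically.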

Initial dates of R Package Release

To get the complete picture, not just the last updates of the packages but also the first (initial) release dates of all the packages, some further digging was involved. Again, the dates and the number of updates have been scraped from the CRAN archive web pages in order to prepare these statistics.

A loop over all the package archives resulted in the final data frame.
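The loop itself is not reproduced here, so the following is only a sketch of how it might work. It assumes that each package's archive listing lives under https://cran.r-project.org/src/contrib/Archive/<package>/ and that every archived .tar.gz file appears on one line of the HTML together with a "YYYY-MM-DD HH:MM" timestamp. The helper names archive_summary and scrape_archives are hypothetical, and counting one update per archived file is also an assumption.

```r
# Hypothetical helper: summarise one package's archive listing.
# `lines` is the raw HTML of the archive index page, one element per line.
archive_summary <- function(pkg, lines) {
  rows  <- grep("\\.tar\\.gz", lines, value = TRUE)
  stamp <- regmatches(rows,
                      regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}", rows))
  if (length(stamp) == 0) return(NULL)   # no archived versions found
  dates <- as.Date(substr(stamp, 1, 10))
  data.frame(name = pkg,
             firstRelease = min(dates),      # oldest archived version
             nofUpdates = length(dates))     # assumption: one per archived file
}

# Hypothetical driver: loop over package names and bind the results.
# Packages without an archive folder raise an error and are skipped.
scrape_archives <- function(pkgs,
                            base_url = "https://cran.r-project.org/src/contrib/Archive/") {
  res <- lapply(pkgs, function(p) {
    lines <- tryCatch(readLines(paste0(base_url, p, "/"), warn = FALSE),
                      error = function(e) character(0))
    archive_summary(p, lines)
  })
  do.call(rbind, res)
}

# Usage (needs network access and takes several minutes):
# myData <- scrape_archives(dd$Package)
```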

After running for roughly 10 minutes, the code had scraped all the archives of the CRAN web repository. But not all packages have an archive folder yet, which should mean that there are no updates for these packages yet (correct me if I am wrong, thanks). So some additional data wrangling was needed:

# Packages without an archive folder: only their current release exists
myDataNonArchive <- dd$Package[!dd$Package %in% myData$name]
# Use the current release date as the first release and count it as one
myDataNonArchive2 <- cbind(dd[dd$Package %in% myDataNonArchive,c(2,1)],1)
names(myData) <- c("Name","firstRelease","nofUpdates")
names(myDataNonArchive2) <- c("Name","firstRelease","nofUpdates")
finalArchive <- data.frame(rbind(myData, myDataNonArchive2))

And the final graph of the initial release year of packages can be plotted.

library(lubridate) # provides year()
hist(year(finalArchive$firstRelease),
     main = "Histogram of First year of R Package Release"
     ,xlab="Year",ylab="Number of Packages"
     ,col="lightblue", border="Black"
     ,xlim = c(1995, 2025), las=1, ylim=c(0,10000))

Initial release year of R packages

We can now put the year of initial release and the updates per year in a single table to get a sense of the development in the R community.

#Combined statistics:
finalArchiveG <- finalArchive %>%
  group_by(firstYear = year(firstRelease)) %>%
  summarise(
    nof_packages = n()
    ,numberOfUpdates = sum(nofUpdates))
finalArchiveG

We can conclude that in 2017 we did not see a positive trend in new package development compared with the previous years. But in 2019 this improved dramatically, which puts R back on the map in terms of new package development and updates of existing packages (maintaining and improving packages).

Conclusion for R language

If 2016 and 2017 were the “data science years” and golden years for R, the decline happened in 2018, but things improved in 2019 and R is again on a positive trend. There are many other statistics available to support this.

The complete code is also available on my GitHub repository.

Tomaz Kastrun

Data Platform MVP, Data scientist, Geek. Community is core to technology development.