Down the AUC Rabbit Hole and into Open Source: Part 2
On 2017-11-03, my eight-month journey into the world of open source software reached its first major milestone when I became the new maintainer of the Metrics package in R. The path I took to get there is full of fascinating statistics and the excitement of contributing to something bigger than the collection of files on my local machine. I want to tell you the story.
This article, the second in a two-part series, covers how I got started with the open source community. In case you missed it, the first article covers the motivation for the problem and gives an intuitive explanation of its solution. However, reading that article is not a prerequisite for reading this one.
Contributing to Open Source
While I was absorbing as much information about AUC as I could, I also checked whether various open source tools were aware of this fast algorithm. Although I had never contributed to an open source project before, it was something I had always wanted to do.
Of all of the different packages I looked at, one stood out in particular: the Metrics package in R. It had a function that calculated `auc` like this:
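Here is a sketch of that function, reconstructed from the description that follows; the published source may differ in minor details:

```r
# Reconstructed sketch of the package's rank-based AUC
# (not a verbatim copy of the original source).
auc <- function(actual, predicted) {
  r <- rank(predicted)
  n_pos <- sum(actual == 1)
  n_neg <- length(actual) - n_pos
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
```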
Other than the use of `rank` instead of `data.table::frank`, the function from the Metrics package was almost identical to `fast_auc` from above. It was extremely satisfying to see this algorithm used elsewhere, as if it validated the effort I had invested in understanding it. However, there was one problem: on large datasets, the function raised an integer overflow error.
The integer overflow was caused by the fact that `sum(actual == 1)` produces an integer, `length(actual)` produces an integer, and subtracting two integers also produces an integer. As a result, `n_pos` and `n_neg` are both represented as integers when `n_pos * n_neg` is evaluated. On large datasets, this product can exceed `2^31 - 1`, which is the value of `.Machine$integer.max` on most platforms.
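A quick way to see the failure mode in an R session (illustrative values, not from a real dataset):

```r
# 60,000 positives and 60,000 negatives is enough to overflow,
# since 60000 * 60000 = 3.6e9 > 2^31 - 1
n_pos <- 60000L
n_neg <- 60000L
n_pos * n_neg              # NA, with a warning about integer overflow
as.numeric(n_pos) * n_neg  # 3.6e9, computed safely as a double
```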
My First Pull Request
Fortunately, the function could be fixed in a single line. All I needed to do was change `n_pos <- sum(actual == 1)` to `n_pos <- as.numeric(sum(actual == 1))`. This casts `n_pos` from an integer to a double, and because arithmetic between a double and an integer yields a double, the subsequent calculation can no longer overflow.
When I submitted a pull request containing my fix, the package failed to build on Travis CI. However, the problem wasn’t with my PR. After looking at the other issues and pull requests in the repository, I realized that the person who had originally authored the package was no longer actively maintaining the repository or any of its tests. In fact, an issue had recently been created informing the package maintainer that CRAN had changed the maintainer status of the package to “ORPHANED”, which happens when the package maintainer is unresponsive to emails.
Initially, I was disappointed that my PR wouldn’t be accepted and that the integer overflow bug wouldn’t be fixed. However, when I read through CRAN’s policy on orphaned packages, that disappointment quickly dissipated:
Everybody is more than welcome to take over as maintainer of an orphaned
package. Simply download the package sources, make changes if necessary
(respecting original author and license!) and resubmit the package to
CRAN with your name as maintainer in the DESCRIPTION file of the
package.
It seemed like I could become the maintainer of the Metrics package and fix the bug myself!
Becoming the Maintainer
In the fall of 2017, I began my effort to fix the `auc` bug and the other problems that had caused `Metrics` to become orphaned, after which I would resubmit it to CRAN. I identified four tasks that needed to be completed for this process.
- Repository infrastructure
- Improving documentation
- Adding new functions
- Fixing bugs
Repository Infrastructure
Many open source projects use Travis CI to continuously test the code as new commits are added. When I set this up for `Metrics`, I made sure that the package was tested with `R CMD check --as-cran` and that warnings were treated as errors. This gave me confidence that the package met CRAN’s standards with each new change.
Next, I overhauled the package’s unit testing infrastructure. The previous maintainer had used `RUnit`, which is based on the xUnit family of unit testing frameworks, but I am much more familiar with the `testthat` package in R. While porting all of the tests from one framework to the other was a fair amount of manual work, it forced me to learn how every function in the package works.
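To give a flavor of the migration, here is a hypothetical `testthat` test (not taken from the package’s actual suite), with a minimal rank-based `auc` standing in for `Metrics::auc`:

```r
library(testthat)

# minimal rank-based AUC, standing in for Metrics::auc
auc <- function(actual, predicted) {
  r <- rank(predicted)
  n_pos <- as.numeric(sum(actual == 1))
  n_neg <- length(actual) - n_pos
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

test_that("auc is 1 when every positive outranks every negative", {
  expect_equal(auc(c(1, 1, 0, 0), c(0.9, 0.8, 0.2, 0.1)), 1)
})
```

The equivalent `RUnit` test would use `checkEquals` inside a plain function; the assertion logic carries over almost mechanically.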
Improving Documentation
I strongly believe in the value of good documentation. It’s a hill that I am willing to die on. As a result, I spent a few days thinking about the best way to communicate how the functions within the `Metrics` package work to someone seeing them for the first time. I added working examples to every function, linked related functions together, and clarified confusing concepts. Writing documentation in R is really easy with the `roxygen2` package. Thanks to Hadley Wickham and the RStudio team for providing this package and so many other useful packages.
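As an illustration, a `roxygen2` block for a simple metric looks like this (a hypothetical example, not the package’s actual documentation):

```r
#' Mean Squared Error
#'
#' Computes the average squared difference between actual and
#' predicted values.
#'
#' @param actual numeric vector of ground-truth values
#' @param predicted numeric vector of predictions
#' @return the mean of the squared errors
#' @examples
#' mse(actual = c(1, 2, 3), predicted = c(1, 3, 5))
#' @export
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}
```

Running `devtools::document()` turns these comments into `.Rd` help files and `NAMESPACE` entries, so the documentation lives right next to the code it describes.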
Adding New Functions
`Metrics` is a simple package that provides implementations of common machine learning metrics for regression, time series, classification, and information retrieval problems. While the set of functions provided in `Metrics` is large, it does not exhaust the entire machine learning problem space.
For example, it provided a function called `f1`, which implements the F1 score in the context of information retrieval problems, but not for classification problems. It also provided a function called `mape` for the mean absolute percentage error in regression problems, but none of the variants of `mape` such as `smape` or `mase`.
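For instance, `smape` can be sketched as follows, assuming the common definition of symmetric mean absolute percentage error; the version that ended up in the package may differ in details such as scaling to a percentage:

```r
# symmetric MAPE: mean of 2 * |a - p| / (|a| + |p|)
smape <- function(actual, predicted) {
  mean(2 * abs(actual - predicted) / (abs(actual) + abs(predicted)))
}

smape(actual = c(100, 200), predicted = c(110, 180))
```

Unlike `mape`, this form is bounded (between 0 and 2) and treats over- and under-prediction symmetrically.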
In my effort to re-submit the package to CRAN, I had to balance my desire to add as many functions as possible with my fear that someone else would come along and become the maintainer before me.
Fixing Bugs
I wanted to become the maintainer of the `Metrics` package in order to fix a single bug in the `auc` function. Fortunately, there weren’t many other problems with the package, which allowed me to focus on improving the documentation and adding new functions.
Conclusion
Overall, the process of becoming the maintainer of the `Metrics` package was incredibly rewarding. I learned about the nuances of using the ROC curve to evaluate machine learning classifiers, gained valuable experience optimizing the speed of R code, and was extremely happy to contribute back to the open source community.
When I received the email from CRAN, I jumped out of my seat. Eight months of hard work had finally paid off.
But this journey is not over. My next steps are to improve my C++ skills so that I can implement a fast version of `rank` to be used within `Metrics::auc`.