Down the AUC Rabbit Hole and into Open Source: Part 2

Michael Frasco
Published in Convoy Tech
Apr 5, 2018

On 2017-11-03, my eight-month journey into the world of open source software reached its first major milestone when I became the new maintainer of the Metrics package in R. The path I took to get there is full of fascinating statistics and the excitement of contributing to something bigger than the collection of files on my local machine. I want to tell you the story.

This article, the second in a two-part series, covers how I got started with the open source community. In case you missed it, the first article covers the motivation behind the problem and an intuitive explanation of its solution. However, reading that article is not a prerequisite for this one.

Contributing to Open Source

While I was absorbing as much information about AUC as I could, I also checked whether various open source tools were aware of this fast algorithm. Although I had never contributed to an open source project before, it was something I had always wanted to do.

Of all of the packages I looked at, one stood out in particular: the Metrics package in R. It had a function that calculated auc essentially like this (a close reconstruction from the description that follows; exact details may have differed):
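    # A reconstruction of the package's auc at the time (not verbatim):
    auc <- function(actual, predicted) {
      # Rank the predicted scores; ties receive their average rank
      r <- rank(predicted)
      # Count the positive and negative examples (note: both are integers)
      n_pos <- sum(actual == 1)
      n_neg <- length(actual) - n_pos
      # Sum of the positive examples' ranks, minus the smallest possible
      # rank sum, normalized by the number of positive-negative pairs
      (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    }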

Other than its use of rank instead of data.table::frank, the function from the Metrics package was almost identical to the fast_auc function from the first article. It was extremely satisfying to see this algorithm used elsewhere, as if it validated the effort I had invested in understanding it. However, there was one problem: on large datasets, the function produced an integer overflow error.

The integer overflow was caused by the fact that sum(actual == 1) produces an integer, length(actual) produces an integer, and subtracting two integers also produces an integer. As a result, n_pos and n_neg are both represented as integers when n_pos * n_neg is evaluated. For large datasets, this product can exceed 2^31 - 1, which is exactly the value of .Machine$integer.max, and R cannot represent the result as an integer.
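The failure is easy to reproduce in an R console (the counts below are hypothetical, chosen only so that their product exceeds the integer maximum):

    n_pos <- 60000L
    n_neg <- 60000L
    n_pos * n_neg
    #> [1] NA
    #> Warning message: NAs produced by integer overflow
    .Machine$integer.max == 2^31 - 1
    #> [1] TRUE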

My First Pull Request

Fortunately, the function could be fixed in a single line. All I needed to do was change n_pos <- sum(actual == 1) to n_pos <- as.numeric(sum(actual == 1)). This would cast n_pos from an integer to a double, thereby preventing integer overflow in the subsequent calculation.
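In context, the change looks like this (only the first line differs from the original; everything downstream inherits the double type):

    n_pos <- as.numeric(sum(actual == 1))  # cast to double before any arithmetic
    n_neg <- length(actual) - n_pos        # integer - double yields a double
    n_pos * n_neg                          # computed in floating point; no overflow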

When I submitted a pull request containing my fix, the package failed to build on Travis CI. However, the problem wasn't with my PR: after looking through the repository's other issues and pull requests, I realized that the original author was no longer actively maintaining the repository or any of its tests. In fact, an issue had recently been created informing the maintainer that CRAN had changed the package's status to "ORPHANED", which happens when a package maintainer is unresponsive to emails.

Initially, I was disappointed that my PR wouldn’t be accepted and that the integer overflow bug wouldn’t be fixed. However, when I read through CRAN’s policy on orphaned packages, that disappointment quickly dissipated:

Everybody is more than welcome to take over as maintainer of an orphaned package. Simply download the package sources, make changes if necessary (respecting original author and license!) and resubmit the package to CRAN with your name as maintainer in the DESCRIPTION file of the package.

It seemed like I could become the maintainer of the Metrics package and fix the bug myself!

Becoming the Maintainer

In the fall of 2017, I began working to fix the auc bug and the other problems that had caused Metrics to become orphaned, so that I could re-submit the package to CRAN. I identified four tasks that needed to be completed:

  1. Repository infrastructure
  2. Improving documentation
  3. Adding new functions
  4. Fixing bugs

Repository Infrastructure

Many open source projects use Travis CI to continuously test their code as new commits are added. When I set this up for Metrics, I made sure that the package was tested with R CMD check --as-cran and that warnings were treated as errors. This gave me confidence that the package continued to meet CRAN's standards with every change.
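For reference, a minimal .travis.yml expressing that setup might look like the sketch below (an assumption based on Travis CI's documented R support, not the package's actual file):

    language: r
    r_check_args: --as-cran
    warnings_are_errors: true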

Next, I overhauled the package's unit testing infrastructure. The previous maintainer had used RUnit, which is based on the xUnit family of unit testing frameworks, but I am much more familiar with R's testthat package. While it took some manual work to port all of the tests from one framework to the other, doing so forced me to learn how every function in the package works.
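For illustration, a test migrated to testthat might look like this (a hypothetical test, not one copied from the package):

    library(testthat)
    library(Metrics)

    test_that("auc matches a hand-computed value", {
      actual <- c(1, 1, 0, 0)
      predicted <- c(0.9, 0.4, 0.6, 0.1)
      # 3 of the 4 positive-negative pairs are ordered correctly: AUC = 0.75
      expect_equal(auc(actual, predicted), 0.75)
    })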

Improving Documentation

I strongly believe in the value of good documentation. It's a hill I am willing to die on. As a result, I spent a few days thinking about the best way to explain the functions in the Metrics package to someone seeing them for the first time. I added working examples to every function, linked related functions together, and clarified confusing concepts. Writing documentation in R is easy with the roxygen2 package. Thanks to Hadley Wickham and the RStudio team for providing it and so many other useful packages.
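As a sketch of how this works, roxygen2 generates a function's man page from structured comments placed directly above it (hypothetical documentation, not the package's actual text):

    #' Area Under the ROC Curve
    #'
    #' \code{auc} computes the area under the receiver operating
    #' characteristic curve for a binary classification problem.
    #'
    #' @param actual A vector of ground-truth labels (0 or 1)
    #' @param predicted A numeric vector of predicted scores
    #' @examples
    #' auc(actual = c(1, 1, 0, 0), predicted = c(0.9, 0.4, 0.6, 0.1))
    #' @seealso \code{\link{f1}}
    #' @export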

Adding New Functions

Metrics is a simple package that provides implementations of common machine learning metrics for regression, time series, classification, and information retrieval problems. While the set of functions provided in Metrics is large, it does not exhaust the entire machine learning problem space.

For example, it provided a function called f1, which implements the F1 score in the context of information retrieval problems, but not for classification problems. It also provided a function called mape (mean absolute percentage error) for regression problems, but none of mape's variants, such as smape or mase.
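To make those variants concrete, here are sketches based on the standard textbook formulas (the package's actual implementations may handle edge cases differently):

    # Symmetric mean absolute percentage error
    smape <- function(actual, predicted) {
      mean(2 * abs(actual - predicted) / (abs(actual) + abs(predicted)))
    }

    # Mean absolute scaled error: forecast errors scaled by the mean
    # absolute error of a naive lag-1 forecast on the actual series
    mase <- function(actual, predicted) {
      mean(abs(actual - predicted)) / mean(abs(diff(actual)))
    }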

In my effort to re-submit the package to CRAN, I had to balance my desire to add as many functions as possible with my fear that someone else would come along and become the maintainer before me.

Fixing Bugs

I wanted to become the maintainer of the Metrics package in order to fix a single bug in the auc function. Fortunately, there weren’t that many other problems with the package, which allowed me to focus on improving the documentation and adding new functions.

Conclusion

Overall, the process of becoming the maintainer of the Metrics package was incredibly rewarding. I learned about the nuances of using the ROC curve to evaluate machine learning classifiers. I gained valuable experience in optimizing the speed of R code. I was extremely happy to contribute back to the open source community.

When I received the acceptance email from CRAN, I jumped out of my seat. Eight months of hard work had finally paid off.

But this journey is not over. My next steps are to improve my C++ skills so that I can implement a fast version of rank to be used within Metrics::auc.
