Citations, credit, and the yt method paper

This is an edited and somewhat rewritten version of an email I sent to the yt steering committee a few months ago. At the time, I mentioned that I’d written something up about software citation (from the perspective of a project creator/developer) and some folks expressed interest, so I’ve edited it and put it here; the postscript at the end explains some actual next steps which are a bit removed from the idealistic next steps in the main body.

Additionally, when I wrote this piece, it felt like one of the more important things on my agenda. Since that time, it has languished in draft state as I couldn’t bring myself to finish it up; instead, since the second week of November I’ve mostly been descending into obsession over the fractally horrific nightmare from which it is impossible to wake.

A couple months ago, I posted an article that touched on some issues around credit in software projects. I can’t get it off my mind. After an unexpected, extended conversation with one of the Data Curation Specialists at UIUC, Elizabeth Wickes, where she shared with me some of the intricacies of DataCite metadata and citation practices, I decided it was time to take some action. I’m going to frame this discussion in the context of yt, since that’s the project that I’m the most deeply involved in, but there’s a possibility it’ll apply elsewhere.

In recent memory, there has been no shortage of discussion about how to handle citation of software in the scholarly record. Look no further than the extensive discussions at the FORCE11 Software Citation Working Group (and the affiliated paper) or initiatives and journals such as JORS, JOSS, WSSSPE, ImpactStory, Depsy, and so many others, to see the outstanding contributions both prescriptive and descriptive around citations and citation principles for software. Almost all of these revolve around the fundamental concept of identifying and entity-izing what constitutes a piece of “software.”

The approach we took with yt doesn’t predate all of these efforts, but it certainly predates my awareness of them. In 2010, a few of us wrote up a “method paper” that described the algorithms, the motivation behind design decisions, some things I thought were innovative or novel at the time, and so on, and this was published in early 2011. At the time, we had relatively few contributors to the yt codebase, and I decided to include seven individuals on the paper’s author list. (At present, of those seven, three remain on the yt steering committee.) In the nearly six years since that paper was published, it has remained our primary method of citation, despite the fact that our community has grown to over 100 contributors. The number of data formats that yt supports has grown to several dozen, the entire underlying data system has been refactored, and we’ve expanded functionality in so many ways. And the reason we haven’t published a new paper is simple: it has not been pushed by the community leaders (including and especially me). The position that puts us in is that the definitive citation for yt includes a minority (by count and by volume of contribution) of the developers and community members.

This situation needs to be remedied. We want people to be recognized for their work, and to feel that they are recognized.

We identified a few criteria for “fixing” the problem of credit for contributions to yt.

  • Support our community members.
  • Don’t rely on an intentionally temporary solution. If we do this, let’s not do it in a way we already know we’ll have to redo.
  • Be a little aspirational. We can temper our goals with reality, but let’s also try to be a bit optimistic.

How do we do this?

Support the Community

Academia is, at least at present, built around a citation economy. The citation is a one-bit signal emitted by a paper or an author to indicate … well, to indicate one of a number of things. These signals get collected and boiled down into a couple of numbers: citation count, h-index, etc. Despite their deeply problematic and reductionist nature, these metrics are used to measure things like impact, and in the case of citation targets that are proxies for software tools, they are used as a measure of the utilization of a piece of software. (Citing software itself is an interesting point I will return to below, but for our purposes I feel that without additional changes, it’s a distinction without a difference.)

The way that yt has been used up to the present, though, includes a dangerous breaking of symmetry. If someone hears, “Person A works on yt,” it’s very easy to go to Person A’s “official” list of works and not see the yt method paper on that list. On the other hand, it’s also quite easy to go to the yt method paper and see the list of things it’s been used in. We’ve attempted to mitigate this asymmetry by building a “membership” system to recognize contributors to the project (both code and non-code), but this is not sufficient. The people who are a part of the community need some “official” recognition as well.

The primary developer community of yt is drawn from astronomy (although we have ambitions of broadening), which brings with it one of the greatest benefits possible: the SAO/NASA Astrophysics Data System (ADS). This is the digital library for astrophysics; you can log on, search, and find everything. Most importantly for metrics fans, it also provides a citation count. If you search for an author, it lists the citation counts of each paper; when you’re on the page for a paper, it tells you how many times it has been cited. (Here’s the yt method paper page. Note that the Google Scholar page for the same gives a slightly different citation count.)

In an ideal world, what we want is to be able to say, for every published paper that used yt, which version of the software was used. This is useful from the perspective of ensuring reproducible and replicable science, but for our purposes, the most important aspect of this clarity is that it describes who contributed to that version. In this ideal world (where, incidentally, they have all the best stuff) everyone would receive cumulative credit for their contributions: all papers that cite a version they contributed to, or any later version, would show up in their citation count and metrics. As it stands, not only do most of the contributors to the yt ecosystem not get credit, but this preferentially hurts the more vulnerable members of the community: the early-stage researchers, such as graduate students and postdocs, and especially those who have provided “free” labor.
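To make “cumulative credit” concrete, here is a minimal sketch in Python. It assumes we know the first release each person contributed to and how many papers cite each release. The release labels, names, and counts are all invented for illustration; none of this is real yt data.

```python
from collections import defaultdict

def cumulative_credit(versions, first_version, citations):
    """Credit each contributor with citations to the first release they
    contributed to and to every release after it.

    versions: release labels in chronological order (hypothetical).
    first_version: contributor name -> first release they contributed to.
    citations: release label -> number of papers citing that release.
    """
    position = {label: i for i, label in enumerate(versions)}
    credit = defaultdict(int)
    for person, first in first_version.items():
        # Slice from the person's first release through the latest one.
        for label in versions[position[first]:]:
            credit[person] += citations.get(label, 0)
    return dict(credit)

# Invented example data.
versions = ["1.0", "2.0", "3.0"]
first_version = {"alice": "1.0", "bob": "2.0", "carol": "3.0"}
citations = {"1.0": 40, "2.0": 25, "3.0": 10}

credit = cumulative_credit(versions, first_version, citations)
print(credit)  # {'alice': 75, 'bob': 35, 'carol': 10}
```

The earliest contributor accumulates credit from every later release, while a newcomer is credited only from the release they joined onward. A single fixed author list cannot express that.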

So here is what we want: people to be added to a list of contributors, and for their metrics to reflect that. Goal identified.

Avoid Stopgaps

The obvious answer, which was suggested by a surprising number of people, was to simply write another method paper. Publish another one, add everybody who has contributed as an author, then suggest everybody just start citing that one instead of the old one.

This is a fine idea from the perspective of contributing to the scientific record; publishing a paper allows not just a new citation bucket, but it also facilitates communication about algorithms, implementation details, and motivation. This goes beyond the intellectual contributions of the implementation; in some ways, the process of describing and publishing a paper is a process of curation, of adding metadata, and of providing context for the code. Publishing a new method paper is important.

Unfortunately, it won’t “fix” the problem we’re trying to fix. All it does is kick it down the road, so that we can deal with it again in another couple years, assuming the project makes it that far.

The same is true of many software publishing systems — publishing a chunk of code and making that the citation bucket is not going to improve the situation unless we have some method for allowing others to get a part of that citation bucket. In fact, it might even be worse at this point: having a large set of authors on a new citation bucket, all of whom are getting credit, might be more of an impediment to distributing credit to newcomers than having a relatively small number of authors. The perceived marginal gain is very small, comparatively.

A Brief Diversion into DOIs

Just about everywhere I go, people talk about DOIs. DOIs are Digital Object Identifiers; think of them as unique IDs that carry metadata. These IDs are persistent, citable, and they carry an enormous amount of information with them. For instance, through some fun APIs you can even get back information about relevant datasets, if that information is part of the record.
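As one hedged illustration of those APIs: the doi.org resolver supports content negotiation, where an Accept header asks for machine-readable metadata instead of a redirect to the landing page. The sketch below only builds the request, so it runs without network access. The DOI is a placeholder rather than a real record, and the helper name is my own.

```python
import urllib.parse
import urllib.request

# Media type for CSL JSON, one of the formats offered via DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"

def doi_metadata_request(doi, media_type=CSL_JSON):
    """Build (but do not send) a request for a DOI's metadata record."""
    url = "https://doi.org/" + urllib.parse.quote(doi)
    return urllib.request.Request(url, headers={"Accept": media_type})

# Placeholder DOI, used purely for illustration.
req = doi_metadata_request("10.1234/example")
print(req.full_url)
print(req.get_header("Accept"))

# To actually fetch the record (requires network access and a real DOI):
#     import json
#     with urllib.request.urlopen(req) as resp:
#         record = json.load(resp)
```

What comes back depends entirely on what the registrant put into the record, which is the point of the caveat above.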

A few dangerous topics come up when DOIs are discussed, though. For starters, be wary when someone says that getting a DOI means that whatever it points to is suddenly, magically permanent or published. The second thing is that DOIs are valuable for citations, but they are only as valuable as the method by which citations are tracked. And the third danger, which I fall into myself, is that DOIs are viewed as adding prestige to an object, rather than a future-proof reference. What DOIs certainly do, however, is provide a record that something existed, even if it doesn’t anymore, so that it can be remembered.

As an example, if I have a DOI for a dataset (issued by DataCite) and I cite that in my paper, that citation is only tracked if both the journal and the citation tracking agency register that information. If someone cites a DOI registered in DataCite, but that DOI either doesn’t show up in Google Scholar or ADS or isn’t tracked by the journal, it might as well have been a tree falling in a lonely forest. And, how DOIs are cited and formatted in journals takes on many different forms: do they exist in the narrative, in footnotes, in the bibliography?

This is important to keep in mind: in the practical world where we need to be true to our community members, and they only care about ADS-tracked or Scholar-tracked citations, it doesn’t really help matters too much to ask folks to throw citations into a void. And as of right now, I am told that the ADS and Google Scholar do not harvest the DataCite DOI metadata and data repositories.

One important note: the metadata for DOIs at DataCite also includes items such as relatedIdentifier, which can provide semantic information about relationships to other DOIs via attributes such as IsNewVersionOf and IsPreviousVersionOf. Parsing these for a canonical set of citations to a given item would require recursive calls to the metadata resolution API, which citation trackers seem unlikely to make. There’s further discussion of this by the FORCE11 SCWG, but I believe we need to have an approach that is flatter (fewer sets of recursive linkages) and that more directly addresses the issue of person-to-software linkages, rather than paper-to-software.
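To show why this gets recursive, here is a small sketch that walks version links to collect every DOI in a “family.” The records are simplified, in-memory stand-ins keyed by made-up DOIs, and the lookup function is a hypothetical substitute for a call to a metadata API; the shape of real DataCite records may differ in detail.

```python
VERSION_RELATIONS = {"IsNewVersionOf", "IsPreviousVersionOf"}

# Hypothetical, simplified metadata records for three versions of one package.
RECORDS = {
    "10.1234/sw.v1": {"relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/sw.v2", "relationType": "IsPreviousVersionOf"},
    ]},
    "10.1234/sw.v2": {"relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/sw.v1", "relationType": "IsNewVersionOf"},
        {"relatedIdentifier": "10.1234/sw.v3", "relationType": "IsPreviousVersionOf"},
    ]},
    "10.1234/sw.v3": {"relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/sw.v2", "relationType": "IsNewVersionOf"},
    ]},
}

calls = []  # every metadata lookup, standing in for one API round trip each

def lookup(doi):
    """Stand-in for a metadata API call; records that a lookup happened."""
    calls.append(doi)
    return RECORDS[doi]

def version_family(doi, lookup, seen=None):
    """Collect every DOI reachable from `doi` through version links."""
    if seen is None:
        seen = set()
    if doi in seen:
        return seen  # already visited; the links point both ways
    seen.add(doi)
    for rel in lookup(doi).get("relatedIdentifiers", []):
        if rel["relationType"] in VERSION_RELATIONS:
            version_family(rel["relatedIdentifier"], lookup, seen)
    return seen

family = version_family("10.1234/sw.v2", lookup)
print(sorted(family))
print(len(calls), "lookups")
```

Even in this toy case, a tracker has to make one lookup per version before it can begin summing citations across the family, and the cost grows with every release.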

We Can Aspire to More

“But,” you say, “maybe the world isn’t ideal, and we should hope that it’ll get better?”

And, I agree!

The community is changing rapidly; for instance, in astronomy the American Astronomical Society has recently issued a statement about papers relating to software. This is great news!

But let’s take it a step further. Isn’t there something we can do, that meets all of these needs? That we can add new people to? That can track citations across all time? That is recognized by the places people refer to, so that hiring and tenure committees can see it?

Kinda seems like the answer is “no,” to be honest.

So what do we do?

I’m glad you asked that! After looking through all of our options, and that serendipitous conversation with Elizabeth that I alluded to up at the top, we’ve identified a potential solution. It’s a solution that doesn’t quite pay off yet, but it seems like it eventually will, and it meets the needs we have identified. The DataCite metadata for a registered DOI includes the ability to expand the set of “contributors” for a given record, and updating the metadata is a distinct operation from updating the content. (More discussion of this notion of a group that changes over time can be found over at Dan Katz’s blog post.)

Here is my proposal for how to distribute credit to yt contributors:

  1. Create a DOI for yt through our Library. Set the creator field of the DOI record to be “The yt project” and the landing page to be the yt homepage.
  2. Provide supplemental metadata: every contributor to the code base, the documentation, and the community. These will be listed under the “Contributors” metadata element. While this includes the option to specify a wide variety of roles (see also Project CRediT), we will likely not utilize these. These records can include ORCID IDs.
  3. With a regular cadence, and in collaboration with the Library, update the Contributors metadata element.
  4. Strongly encourage all citations of yt to be made to this DOI. (But actually! See below!)
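The steps above can be sketched as a small data structure plus an update function. This is a rough illustration loosely modeled on DataCite-style JSON, not the record we would actually register: the field layout is simplified, the landing page URL, names, and ORCID are placeholders, and using “Other” for every contributor’s role is my assumption, based on the plan not to use the finer-grained roles.

```python
import copy

def make_contributor(name, orcid=None):
    """Build one contributor entry; the role is deliberately generic."""
    entry = {"name": name, "nameType": "Personal", "contributorType": "Other"}
    if orcid:
        entry["nameIdentifiers"] = [{
            "nameIdentifier": f"https://orcid.org/{orcid}",
            "nameIdentifierScheme": "ORCID",
        }]
    return entry

def update_contributors(record, people):
    """Return a copy of `record` with any new people appended.

    Existing contributors are kept, names are not duplicated, and nothing
    else in the record changes: this is a metadata-only update.
    """
    updated = copy.deepcopy(record)
    known = {c["name"] for c in updated["contributors"]}
    for name, orcid in people:
        if name not in known:
            updated["contributors"].append(make_contributor(name, orcid))
            known.add(name)
    return updated

# Step 1: the creator is the project itself, not any individual.
record = {
    "creators": [{"name": "The yt project", "nameType": "Organizational"}],
    "titles": [{"title": "yt"}],
    "url": "https://example.org/yt",  # placeholder for the yt homepage
    "contributors": [],
}

# Steps 2 and 3: add contributors now, then again at the next regular update.
first = update_contributors(record, [
    ("Contributor A", "0000-0000-0000-0000"),  # placeholder ORCID
    ("Contributor B", None),
])
second = update_contributors(first, [
    ("Contributor B", None),  # already listed, so not duplicated
    ("Contributor C", None),
])
print([c["name"] for c in second["contributors"]])
```

The creator entry never changes between updates; only the contributor list grows, which is what lets newcomers be added without minting a new citation target.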

This isn’t perfect. The DOI will be registered with DataCite; as I noted above, DataCite isn’t currently ingested by either Scholar or ADS. There’s good reason to think that this is going to change, some day, or that there will be alternate methods for measuring impact that will parse DataCite.

There are two different fields we can utilize in the DataCite metadata to encode information about individuals that have been a part of, and should receive credit for, the software. The first is the Creator field, and the second is the Contributor field. The Creator field is what is typically thought of as the “author” field. But for a large community project, what should this field in fact be, and how can we future-proof it? I would strongly encourage this field to simply be the name of the project itself; rather than electing one or a set of representatives, it makes more sense to have it be a neutral, community-focused name. The Contributor field, on the other hand, provides a natural destination for everyone who deserves credit for having contributed. This is where we should put names.

But this brings up the odd issue that the Contributor field is not thought of the same way as the Creator field; does that mean that the DOI record won’t show up when a search is conducted for a person who is only listed under Contributor? I don’t know. But I don’t think it’s unreasonable to hope that it will. And it still leaves an open problem: this intentional attempt to flatten the hierarchy of contributions may dilute credit across the community. I feel that, right now, the benefits of doing so outweigh the drawbacks. And, we’re doing this manually, in collaboration with experts in the field of data curation. With the shifting ecosystem of credit and citation, this will provide us with the opportunity to collaborate, identify and potentially even migrate the DOI itself to a better or emerging environment or metadata standard.

From a personal perspective, I’m convinced this is the right way to go. It’s simply unworkable, immoral, and negligent not to more broadly distribute credit among the community members who have propelled yt to its current state.

It’s simply wrong not to take a new path, and it’s a wrongness that hurts some more than others. So let’s do it, or let’s figure something else out, but let’s not let things stay where they are.

Postscript: We Take a Few Steps Back

I wrote this post largely in isolation, following on discussions among a number of people from the community and the yt steering committee. After sharing it, and asking for feedback, the point was made to me that there’s a big distance between where we are as a community and where we want to be. And because of that, we need to compromise, and maybe hedge our bets a little bit.

So we’re adding on a bullet point: we will be writing a new method paper, and reasonably soon, and as part of a transition, we’ll be requesting dual citations of the new method paper and of the software DOI. And then, once the technology has caught up (and it is likely to take longer than a standard postdoc job application cycle) we will revisit.

One of the harshest — and most useful — pieces of feedback I got was that the ideal that I pushed for in here is actually not really going to “get what we want,” which also included pointing out that it would be incredibly easy to “game.” That’s true. It’s not going to get what we want, and it will be very easy to “game,” but I think it’s still worth trying, so I want to keep that option open, and maybe even explore it.

But for now, on to the method paper!

Thanks to Britton Smith, Elizabeth Wickes, Daniel S. Katz, August Muench, Nathan Goldbaum, Carly Strasser and Heidi Imker for their comments, suggestions and guidance on a draft of this post.