Breaking with the past: making backwards-incompatible changes in the RDKit

Note: as usual, I’m looking for feedback on this document. Please get in touch if you have thoughts, tips, or concerns.

Background

Over the years that the RDKit has been around, I have placed a strong emphasis on maintaining backwards compatibility. Things have been added to the API, but not much has been changed or taken away. With a few exceptions, code that worked 10 years ago will still (modulo bug fixes) work today. This approach has some obvious benefits for RDKit users and those who build software on top of it, but there is a cost: we’re stuck with the consequences of bad (or, perhaps, poorly thought through?) design decisions I made 15 years ago.

A concrete example of this is the way implicit hydrogens (those not actually present in the graph) are handled in the toolkit. The meaning of the methods `Atom.GetNumImplicitHs()` and `Atom.GetNumExplicitHs()` is a recurring source of confusion and, possibly, error. There’s really no good reason to have that second method, which usually returns the number of (implicit) hydrogens that were specified inside square brackets for an atom. That number might be needed within the SMILES parser, but it’s likely not useful to others.

Another example is the handling of double bond stereochemistry. The method `Bond.GetStereo()` is only useful if the full CIP perception has been carried out (an expensive operation). Contrast this with atoms, where `Atom.GetChiralTag()` returns useful information (assuming you know how to interpret it) without additional work being required. On top of this, whether the double bond is `E` or `Z` is only useful information if you know the relative CIP ranks of the atoms bonded to either end. Doing anything useful in terms of geometry requires using the bond directionality flags on neighboring single bonds. It’s confusing and creates additional work for developers.

These are just two examples that have come up recently, there are others — one way to find some of them is to search the `RDKit-discuss` mailing list for the string “historical reasons”. :-S

I’d like to start cleaning up some of these problems, but there’s really no way to do it without potentially breaking other peoples’ code. I’m normally extremely reluctant to do this, but it’s the only way to move the toolkit forward and fix these things. The question is how to do it.

The good news, I think, is that having a long-term release (LTR) available really helps here. This allows users to get bug fixes without having to worry about new features or backwards-incompatible changes. I was concerned about having to maintain and support a bunch of backwards compatibility branches, but it’s probably fair to not do that as long as there’s an LTR around.

Concrete Proposal

We will open the door to start this as part of the 2017.03 release. Due to the other changes happening to allow modern C++ (see the corresponding post), we may end up too busy to actually get anything real done as part of that release, but we will certainly finalize the process for making backwards incompatible changes. No matter when it starts, this is going to be a deliberate process; this probably scares me more than it scares anyone reading this post. ;-)

Things to do

  • Introduce a Github tag for changes that break APIs. Can be applied to PRs or issues so that people can get advance warning. Facilitates automation of the release note changes.
  • Add a new section to the release notes detailing things that broke along with documentation explaining how to identify/fix broken code.
  • Add a variable and the appropriate preproccesor definitions containing the API version in C++ (accessible from Python/Java/etc) that indicates breakage with major number increases. Start with 1.0 (today), normally just increment the minor version on a new release, increment major version when APIs change. I don’t think we need a patch version in the version number, but we should think about that.
  • Decide whether or not to put warnings in the code for deprecations (when there is a deprecation and something doesn’t just change). This seems like a good idea, but given how often people just ignore warnings, I’m not sure it’s actually worth the trouble.
Show your support

Clapping shows how much you appreciated Greg Landrum’s story.