Documenting Zelig

This post refers to Zelig version 5.0-13

In a previous post I discussed the goal to improve Zelig’s tests. A related goal that I’m also focusing on is to improve Zelig’s documentation. Clear documentation enables:

  • users to easily use Zelig in their work.
  • maintainers to understand the source code more quickly, and so to maintain and improve it. Compare the Rule of Robustness: “robustness is the child of transparency and simplicity.”
  • better unit and integration tests. Documentation specifies expected software behaviour that can guide the development of unit and integration tests. Also, documentation built using the literate programming paradigm can itself test software behaviour when compiled.

Dueling Zelig 4 and 5 documentation

Zelig is currently documented in a number of ways. The general Zelig approach is documented in journal articles, including those HERE and HERE. Zelig syntax is documented using the package’s internal documentation (also available on CRAN) and a website.

Currently, these sources provide “dueling documentation” between Zelig versions 4 and 5 syntax that may confuse some users.

Zelig 5 (released in 2015) introduced a substantial change to Zelig’s syntax and behaviour. The package now relies on R’s Reference Classes (RC), which were introduced in R version 2.12. RC have a number of key advantages for code development and readability. They use conventions that would be familiar to users of other object-oriented programming languages such as Python.

The following code gives a sense of how Zelig uses reference classes to run a simple regression on R’s included swiss data set:
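A minimal sketch of that workflow, assuming the Zelig package is installed and using its `zls` least squares class. The formula `Fertility ~ Education` and the fitted values 5 and 15 are illustrative assumptions, and the `requireNamespace` guard is only there so the snippet can be run safely on a machine without Zelig:

```r
# Sketch of the Zelig 5 Reference Class workflow (assumes Zelig 5 is installed).
# The formula and the Education values are illustrative assumptions.
if (requireNamespace("Zelig", quietly = TRUE)) {
  library(Zelig)

  # Initialise a least squares Zelig object
  z5 <- zls$new()

  # Estimate the model; the result is stored inside z5
  z5$zelig(Fertility ~ Education, data = swiss)

  # Set fitted values for two scenarios
  z5$setx(Education = 5)
  z5$setx1(Education = 15)

  # Simulate quantities of interest, then plot them
  z5$sim()
  z5$graph()
}
```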

Though the general steps are the same (estimate, set fitted values, simulate, plot), the Zelig 5 RC syntax differs from previous versions. You probably noticed that the result of each step is not assigned to a new object with the assignment operator <-. Instead, each method call mutates the RC object in place.
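For readers who have not met Reference Classes before, the in-place behaviour can be seen in a small base R sketch. The `Counter` class is hypothetical and has nothing to do with Zelig’s internals; it only shows that method calls change the object itself and that assignment copies a reference rather than the object:

```r
library(methods)

# Hypothetical class, used only to illustrate Reference Class semantics.
Counter <- setRefClass(
  "Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function() {
      # `<<-` modifies the field of this object in place
      count <<- count + 1
      invisible(.self)
    }
  )
)

counter <- Counter$new(count = 0)
counter$increment()      # no assignment needed; the object itself changes
counter$count            # 1

# Assignment copies the reference, not the object
alias <- counter
alias$increment()
counter$count            # 2, because alias and counter are the same object

# An explicit copy is needed to get an independent object
independent <- counter$copy()
independent$increment()
counter$count            # still 2
```

This is the main habit that users coming from ordinary R functions need to adjust: with RC objects, forgetting `$copy()` means two names share one state.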

My suspicion is that most R users are unfamiliar with Reference Classes as, in my experience, not many packages use them, at least not at the user interface level. Therefore documentation with lots of examples and explanation of Zelig’s Reference Class behaviour is crucial for most people to be able to use the package.

Zelig 5 also has wrappers for its key functions that allow users to continue following Zelig 4 syntax. For example, a similar operation to the one we just saw using Zelig’s wrappers that imitate version 4’s syntax would be written:

Using Reference Classes enables code development and readability improvements and having Zelig 4 wrappers enables compatibility with code written in previous versions. However, this state of affairs creates documentation challenges.

When users attempt to access Zelig’s help files on functions such as zelig or sim using R’s help operator (?), they are typically directed to the documentation for the Zelig 4 wrapper. To access the Zelig 5 documentation they must invoke the help method on their RC object, e.g. zls$help(). This will launch the user’s web browser and take them to the relevant documentation on the Zelig website. This difference could be confusing.

While the Zelig 4 wrappers and Zelig 5 methods have broadly similar functionality, they do occasionally differ in non-obvious ways. For example, notice that in the Zelig 4 wrapper code above the x and x1 fitted values are set within the same setx call, whereas there are separate Zelig 5 methods for these. The functionality of the two approaches also differs: Zelig 4 setx will accept a range of values for a variable, while in Zelig 5 syntax you need the setrange method.

Ideally, these different behaviours would be more clearly documented and accessible to users with little knowledge of Reference Classes.

Updated Documentation Toolkit

Zelig currently uses literate programming in the form of the Sphinx Python package and reStructuredText to create its online documentation. There are two ways that this could be improved.

The first is to convert to using R Markdown, roxygen2, and pkgdown. Zelig already uses roxygen2 to generate its internal documentation, though this contains more information about Zelig 4 and differs from the Zelig 5-focused online documentation. At the time the Zelig documentation was being developed, R Markdown was in its infancy and it was unclear how successful it would be. In the time since then, these tools have been widely embraced by the R developer community and they are fairly full featured.

pkgdown and R Markdown will enable:

  • Close integration with the internal documentation, which is currently different from the online documentation in ways that are potentially confusing.
  • Closer test integration as the online docs could be tested every time Zelig goes through a build check. This would enable more accurate documentation (e.g. docs that don’t include errors) and provide another “test surface” for finding errors.
  • A simpler package development toolchain.
  • Easier community contributions as Markdown is now widely used in the R community.

Documentation Build Error Handling

An important part of this toolchain change is to enable more robust documentation. Currently, when the documentation is built by Sphinx, errors do not break the build. Instead, error messages are output into the documentation. For example:

Zelig Documentation for probitsurvey as of 2017–01–16

Needless to say, this does not produce useful documentation. It also does not alert package maintainers to possibly broken code that needs to be fixed.

While converting the documentation to R Markdown, I’m adding the knitr global option:
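The option in question is the one named later in this post, `stop_on_error`. A sketch of a setup chunk, assuming knitr is installed; the second call, using the chunk option `error`, is an addition here and not something this post’s toolchain is confirmed to use:

```r
# Placed in a setup chunk near the top of an R Markdown file.
# Package-level option named in this post:
knitr::opts_knit$set(stop_on_error = 2L)

# Assumption: the chunk option `error = FALSE` is an alternative way to
# ask knitr to stop at the first chunk error instead of printing it.
knitr::opts_chunk$set(error = FALSE)
```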

This causes the build to fail if any of the code chunks return an error.

Dynamically Built README file

Some potential Zelig users will likely encounter the package via the GitHub site. This could be a good place to inform users of crucial information, especially the new Zelig 5 syntax, and provide a quick start example. To ensure that the example runs without problems, we could also add knitr::opts_knit$set(stop_on_error = 2L).

Faster Crashes, More Informative Error Messages

Developing a fuller test suite will help us identify more errors that users might trigger. When users are given an error, the error messages should be informative about what caused the problem and how to resolve it.

Error messages in Zelig could be improved on these points. For example, if you want to use Zelig 4 syntax but forget to specify the model type, then in version 5.0-13 you would receive the error:

Error in models4[[model]] : invalid subscript type ‘symbol’.

A more informative message would be something like:

Estimation model type not specified.
Select estimation model type with the model argument.
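A check of this kind could be sketched as follows. `zelig_wrapper` and `known_models` are hypothetical stand-ins, not Zelig’s actual internals; the point is only that validating the `model` argument up front makes a clear message possible:

```r
# Hypothetical stand-in for a Zelig 4 style wrapper; not Zelig's real code.
known_models <- c("ls", "logit", "probit")

zelig_wrapper <- function(formula, model, data) {
  # Fail immediately, with guidance, if the model type is absent
  if (missing(model)) {
    stop("Estimation model type not specified.\n",
         "Select estimation model type with the model argument.",
         call. = FALSE)
  }
  if (!is.character(model) || length(model) != 1 || !(model %in% known_models)) {
    stop("Unknown estimation model type. Available types: ",
         paste(known_models, collapse = ", "), ".",
         call. = FALSE)
  }
  # Estimation would happen here; this sketch just echoes the choice
  invisible(model)
}

# Capture the message instead of halting the script
msg <- tryCatch(
  zelig_wrapper(Fertility ~ Education, data = swiss),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

Using `call. = FALSE` keeps the internal call out of the message, so the user sees only the guidance.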

A related issue with the Reference Class Zelig implementation is that no error message is produced when a user skips or does not successfully complete a step in the estimate, set, simulate, plot Zelig workflow. For example, the following code is what we saw above, but omits the setx step that sets fitted values needed to find meaningful quantities of interest from the estimation model.

No messages or errors are returned until the final line is called. The returned warning is:

Warning message:
In par(old.par) : calling par(new=TRUE) with no plot

There are two issues here:

  • Zelig did not fail early enough.
  • Zelig did not fail informatively.

A user might believe their problem is in the graph method and spend time trying to decipher the oblique warning message, when in fact their sim call should have returned an informative error that prevented them from progressing to the graph call. If a user does not supply required information, then the Zelig workflow should fail immediately and provide an informative warning or error message.
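The fail-early behaviour described here can be sketched with a small Reference Class that tracks which steps have been completed. The `Workflow` class, its fields, and its messages are hypothetical and are not Zelig’s implementation:

```r
library(methods)

# Hypothetical class that mimics the estimate, set, simulate workflow.
# It is not Zelig's implementation.
Workflow <- setRefClass(
  "Workflow",
  fields = list(estimated = "logical", fitted_set = "logical"),
  methods = list(
    initialize = function(...) {
      estimated <<- FALSE
      fitted_set <<- FALSE
      callSuper(...)
    },
    estimate = function() {
      estimated <<- TRUE
      invisible(.self)
    },
    setx = function() {
      if (!estimated) {
        stop("No estimated model found. Call estimate() before setx().",
             call. = FALSE)
      }
      fitted_set <<- TRUE
      invisible(.self)
    },
    sim = function() {
      # Check prerequisites before doing any work
      if (!fitted_set) {
        stop("Fitted values not set. Call setx() before sim().",
             call. = FALSE)
      }
      invisible(.self)
    }
  )
)

w <- Workflow$new()
w$estimate()
# setx() is skipped, so sim() stops here rather than at the plotting step
result <- tryCatch(w$sim(), error = function(e) conditionMessage(e))
print(result)
```

Because each method checks the state left by the previous step, the error appears at the first call that cannot proceed and names the step that was missed.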

NEWS

Many R packages include a plain text file called NEWS that documents in detail all of the changes made in a given release. These files are included on the package’s CRAN page. Currently Zelig lacks such a file. It would be useful to add a NEWS file to update users about changes, especially breaking changes, made in Zelig releases.