Validations with Automunge
Stable version release 8.30
Looking back, I think the Automunge development process was somewhat unique in the field for the sheer pace of iterative development throughout its multi-year span of focus. It is not surprising that in this context the interest of a user base was not exactly what I would consider compounding in the exponential sense. After all, anyone who wanted to take a swing would need to identify a starting point, not to mention follow along to evaluate whether subsequent iterations were worth updating to. Now that the library has reached what I would consider a stable and robust implementation, I wanted to use this essay to quickly highlight some of the validations that have gone into our assessment of that qualification.
Obviously the complexity of various import dependency updates is one of the primary challenges for ongoing maintenance of a library. In general, one can expect that most updates will have backward compatibility. The general convention for most libraries is that updates with bug fixes are the most frequent and are associated with iterating the third number in a version number, i.e. x.x.(x+1). Less frequent are updates with new functionality, which generally iterate the second number and reset the third to zero, i.e. x.(x+1).0. Generally the backward compatibility breaking updates are reserved for iterating the first version number, i.e. (x+1).0.0. A further bit of minutiae is that most libraries, when rolling out a version with new functionality, will continue to support prior functionality paths, such that you may later see simultaneous branch updates to e.g. both 3.8.(x+1) and 3.7.(x+1). Usually there will be some intended obsolescence date for each branch, which may be extended in cases of continued popularity of a flavor. For example, Python version 2 maintained support well beyond its original intended obsolescence date.
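As a small illustration of that numbering convention, here is a minimal sketch that classifies the step between two version strings. The function names are hypothetical and it assumes plain three-part numeric versions; real version strings can carry suffixes that this does not handle.

```python
def parse_version(version):
    """Split a 'major.minor.patch' string into a tuple of three ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_update(old, new):
    """Describe the kind of update implied by moving from old to new."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        return "not an upgrade"
    if new_v[0] > old_v[0]:
        # (x+1).0.0 : reserved for backward compatibility breaking changes
        return "major (may break backward compatibility)"
    if new_v[1] > old_v[1]:
        # x.(x+1).0 : new functionality
        return "minor (new functionality)"
    # x.x.(x+1) : bug fixes
    return "patch (bug fixes)"


print(classify_update("3.7.2", "3.7.3"))
print(classify_update("3.7.9", "3.8.0"))
print(classify_update("2.7.18", "3.0.0"))
```

Comparing the parsed tuples rather than the raw strings matters here, since a string comparison would rank "3.10.0" below "3.9.0".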
The Colaboratory platform for cloud based Jupyter notebooks takes the wise approach of standardizing on stable, fixed versions for its default installs over prolonged periods, even at times independent of ongoing updates to a library. (They currently have a basis of Python 3.7.13, Pandas 1.3.5, and Numpy 1.21.6.) In practice, most productionized rollouts of an application in industry will be implemented in a static environment over prolonged periods without updates, such as would be hosted in a Docker container. Thus a static use case need not worry about ongoing inter-library dependencies; it just needs to establish a fixed basis.
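To make the idea of a fixed basis concrete, the following is a minimal sketch of checking an environment against a pinned set of versions. The pinned values are the ones quoted above; the helper name and the example "installed" values are hypothetical and only for illustration.

```python
# Pinned basis quoted in the text; everything else here is illustrative
PINNED = {"python": "3.7.13", "pandas": "1.3.5", "numpy": "1.21.6"}


def find_mismatches(pinned, installed):
    """Return {name: (expected, found)} for each package off the pinned basis."""
    mismatches = {}
    for name, expected in pinned.items():
        found = installed.get(name)  # None when the package is absent
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Hypothetical environment where only pandas has drifted from the basis
example_installed = {"python": "3.7.13", "pandas": "1.4.2", "numpy": "1.21.6"}
drift = find_mismatches(PINNED, example_installed)
print(drift)
```

In a live session the installed values could be gathered from platform.python_version(), pandas.__version__, and numpy.__version__ rather than typed in by hand.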
A library maintainer, on the other hand, has a not insignificant additional burden of keeping an eye on the various configuration scenarios that may be deployed by various users. Automunge has attempted to keep this scope manageable first by keeping imports to a bare minimum, primarily Pandas, Numpy, and Scikit-learn, each of which is a very mature and stable library without expected future dramatic changes to its API. As a general practice, when significant updates to these dependencies are rolled out (or when our own library updates are rolled out) we have relied on our validation notebooks to ensure maintained functionality.
The scope of the validation notebooks is broadly based on an artificial tiny data set (dataframe) that seeks to be representative of the various data type and missing data scenarios that may be found in the columns of live tabular data sets. The primary validation is performed by cycling through each dataframe column configuration with each transform in the library, and comparing the output of train and test data as prepared in automunge(.), as well as comparing the train data to test data prepared in postmunge(.). This type of comprehensive cycling is then repeated in a scenario of activating most supplemental options, as well as in a comparison of input data to the result of an inversion. In both cases validation is based on the premise that edge cases will result in either the presence of NaN points (missing data) or otherwise halt the operation of the functions. (A limitation of this approach is associated with elements of stochasticity that may interfere with train / test comparisons, which we address by the simple tactic of deactivating stochastic elements.)
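The shape of that cycling can be sketched in a few lines. To be clear, this is not the validation notebook and does not call Automunge: the two transforms are hypothetical stand-ins (one robust, one deliberately fragile), and plain Python lists stand in for dataframe columns so that the sketch runs without dependencies.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing data."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_ordinal(train_col):
    """Stand-in transform: integer codes, with 0 reserved for missing or unseen."""
    categories = sorted({str(v) for v in train_col if not is_missing(v)})
    codes = {c: i + 1 for i, c in enumerate(categories)}
    return lambda col: [0 if is_missing(v) else codes.get(str(v), 0) for v in col]


def fit_naive_scale(train_col):
    """Deliberately fragile stand-in: no handling of missing or non-numeric data."""
    low, high = min(train_col), max(train_col)
    return lambda col: [(v - low) / (high - low) for v in col]


def validate(columns, transforms):
    """Cycle every column through every transform and record the outcome."""
    report = {}
    for col_name, col in columns.items():
        for t_name, fit in transforms.items():
            key = (col_name, t_name)
            try:
                apply = fit(col)             # fit on the train data
                train_out = apply(col)       # train data path
                test_out = apply(list(col))  # same data passed again as test
            except Exception as error:
                report[key] = "halted: " + type(error).__name__
                continue
            if any(is_missing(v) for v in train_out + test_out):
                report[key] = "NaN in output"
            elif train_out != test_out:
                report[key] = "train/test mismatch"
            else:
                report[key] = "ok"
    return report


# Tiny artificial data set covering a few data type and missing data scenarios
columns = {
    "numeric": [1.0, 2.0, 3.0],
    "numeric_with_nan": [1.0, float("nan"), 3.0],
    "strings": ["a", "b", "a"],
    "mixed_with_none": [1.0, "b", None],
}
transforms = {"ordinal": fit_ordinal, "naive_scale": fit_naive_scale}

report = validate(columns, transforms)
for key, outcome in report.items():
    print(key, outcome)
```

Both stand-in transforms are deterministic, so the train / test mismatch branch is never reached here; it is included because that comparison is the one that stochastic elements would interfere with.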
Adjacent to these types of frequent validations performed with software rollouts, we have also intermittently performed various forms of benchmarking operations; see for example the paper Feature Encodings for Gradient Boosting with Automunge, as well as most of the academic works shared to arXiv by this author. We have also taken an iterative approach to documentation adjustments to the comprehensive README, with each software update being performed in conjunction with directly related tweaks to documentation. We thus feel we have some comfort with the robustness of implementation and veracity of documentation.
Now that the library has reached a stable point, we are a little unsure how to proceed. We want to demonstrate some level of stability to potential users, and recognize that the frequent iterative update cycles demonstrated previously are probably counterproductive to that goal, and so we are maintaining version 8.30 as the stable release for the foreseeable future. I offer here the validation notebook in all of its unpolished mass of glory, in the hope that it may be a source of comfort to any scrutinizing party considering whether to experiment with this library in the context of a mission critical application or otherwise. Additional resources, such as documentation and tutorial notebooks, are available on GitHub.