Thank you for your feedback

In which we argue our case

A few excerpts from discussions with reviewers, shared for transparency:

To: Reviewer 1

Contributions

I appreciate that you offered two specific criteria for software packages. I believe this software meets both of them, as follows:

Criterion one: “The software implements a scientifically novel algorithm, framework, model, etc.” I believe the family tree primitives as described in Figure 6 meet this criterion, for the reason that they formalize a fundamental aspect of processing tabular data by enabling a simple means of command line specification for multi-transform sets that may include generations and branches of derivations. I believe the family tree primitives to be somewhat fundamental: any set of data transformations applied to a single feature set of origination (as would be found in a “tidy data” set) can be universally expressed by way of these simple and novel primitives applied with recursion.
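To make this concrete, below is a minimal sketch of how a multi-transform set might be specified with the family tree primitives, loosely patterned on the conventions described in the paper; the category keys and transform names shown here are illustrative rather than a verbatim excerpt of the library's defaults.

```python
# A minimal sketch of family tree primitives, loosely patterned on the
# paper's conventions; keys and transform names here are illustrative.
# Upstream primitives (parents, siblings, auntsuncles, cousins) determine
# treatment of a source column at the current generation, distinguished
# by replace vs. supplement and with vs. without offspring. Downstream
# primitives (children, niecesnephews, coworkers, friends) are their
# analogs for subsequent generations.

transformdict = {
    'newt': {
        'parents':       ['newt'],   # replace source, with offspring;
                                     # entering 'newt' as its own parent
                                     # recursively applies its downstream
                                     # primitives at the next generation
        'siblings':      [],         # supplement source, with offspring
        'auntsuncles':   [],         # replace source, no offspring
        'cousins':       ['NArw'],   # supplement source, no offspring
                                     # (e.g. missing-data markers)
        'children':      [],         # downstream: replace, with offspring
        'niecesnephews': [],         # downstream: supplement, with offspring
        'coworkers':     [],         # downstream: replace, no offspring
        'friends':       ['mnmx'],   # downstream: supplement, no offspring
                                     # (e.g. a min-max scaling)
    },
}
```

In words: a column assigned this root category would be replaced by its transform, supplemented with missing data markers, and, by way of the recursive 'parents' entry, supplemented at the next generation with a min-max scaled derivation; the recursion terminates because the downstream primitives carry no further offspring.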

More particularly for string parsing considerations, the paper does not just introduce string parsing, it offers a comprehensive overview of the various permutations that may be applied for this purpose. We sought for this treatment of string parsing methods to be exhaustive. We believe that string parsing is appropriate for a machine learning conference because of just how fundamental the application is for tabular data applications of machine learning, which in practice generally comprise just two broad categories of feature set types: numeric and categoric. We have introduced a novel automated approach for encoding tabular categoric features.

Although the benefit of string parsing is expected to vary based on esoteric characteristics of target feature sets, the paper operated on the premise of a self-evident benefit to ML from the improved information retention of extracting grammatical structure that may be shared between categoric entries, as presented to a training operation, in comparison to coarse-grained representations. That being said, I am now working on an additional demonstration Jupyter notebook to be uploaded to the supplemental material, in which I intend to experimentally demonstrate the benefit as you suggested; I will advise when it is ready.

Criterion two: “the software package is so complex that a well-designed implementation in itself is of scientific significance.” I believe the simplicity of the package is deceptive for the amount of complexity that is abstracted away. I recently attended a data science conference at a high profile university where the keynote speaker described a project to apply machine learning to predict missing data infill for a specific tabular data application in industry. Automunge offers a generalized solution and abstracts away all of the complexities for any tabular data application. It is a push-button autoML solution for missing data infill, and all of the string parsing methods demonstrated have built-in support.

One of the most useful abstractions for purposes of hiding complexity is the manner in which the application of automunge(.) populates a python dictionary “fit” to properties of the train set, capturing all of the steps and parameters of transformations, such that for subsequent data, including streams of data for inference, consistent preparations may be applied quickly and efficiently in the postmunge(.) function with only the prerequisite of passing this dictionary. This practice of basing properties of transformations explicitly on properties from a designated train set is an improvement on the still-common mainstream practice of normalizing train/test/validation sets separately, which introduces issues of potential stochastic inconsistency and data leakage. We noted too in the Broader Impacts appendix that the ability of researchers to publish these populated dictionaries could benefit reproducibility of benchmarks and experiments.
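To illustrate the pattern in the abstract (this is a simplified stand-in, not the Automunge API itself, whose returned sets and dictionary contents are far more extensive): parameters are fit once on the designated train set, captured in a dictionary, and then reapplied verbatim to any subsequent data.

```python
import pandas as pd

def fit_transform(train_df, column):
    """Fit normalization parameters on the designated train set only,
    returning the transformed column plus a dictionary capturing the
    parameters needed to consistently process subsequent data."""
    fit = {'mean': train_df[column].mean(), 'std': train_df[column].std()}
    transformed = (train_df[column] - fit['mean']) / fit['std']
    return transformed, fit

def post_transform(df, column, fit):
    """Apply train-set-derived parameters to subsequent data, avoiding
    the stochastic inconsistency and leakage of refitting per partition."""
    return (df[column] - fit['mean']) / fit['std']

train = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({'x': [2.5, 10.0]})

# analogous to automunge(.) returning a populated dictionary
train['x_nmbr'], fit = fit_transform(train, 'x')

# analogous to postmunge(.) accepting that dictionary for new data
test['x_nmbr'] = post_transform(test, 'x', fit)
```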

Thank you for the recognition that you believe this library would be useful. That is our goal.

With respect to the “cons” that you noted:

Regarding novelty: we believe the push button automation of string parsing operations to be novel. We believe the integration of command line specification for multi-transform sets, autoML missing data infill, and various other features of the library to be novel. We believe the family tree primitives to be novel, and a particularly useful fundamental reframing of transformation set specification via recursion.

I hope you will forgive my writing style; you noted that the paper was not clearly written, which I suspect is partly a result of trying to cover a lot of ground. I believe the description of the family tree primitives benefits from taking into account the demonstrations of Figures 4, 5, 6, and 7, which illustrate their application in practice for the given ‘or19’ root category example.

Thank you again for your review. Happy to answer further questions. I hope you might reconsider your rating based on this feedback. If you are unsure, please consider reviewing my responses to the other two reviewers for context. Best regards.

To: Reviewer 4

ICLR applicability

Thank you, reviewer, for your interpretation of this work. I read the primary consideration of your review as being based on the premise that Automunge is off-topic for the ICLR conference. I will present here a short summary of why I do not believe that to be the case.

One of the papers we cited was “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al. from ICLR 2013, which rolled out the Word2Vec method for vocabulary vectorization, a precursor to the types of contextual vocabulary embeddings applied in NLP applications like GPT-3. Quoting Gary Marcus in his recent book Rebooting AI, “Rumors of the replacement of feature engineering have been somewhat exaggerated; the hard work that goes into crafting representations like Word2Vec still counts as feature engineering, just of a different sort…”. Even if string parsing is a type of feature engineering, so is Word2Vec. More particularly, we noted in our paper that vocabulary vectorization in the vein of Word2Vec is inaccessible for certain types of categoric features that may be found in a tabular data set. We gave two examples, addresses and serial numbers, in which language models trained on public text corpora may be insufficient for application to esoteric domains, especially considering that in tabular data sets categoric feature sets lack the context of a surrounding text corpus that might enable fine-tuning of a language model. These types of scenarios are not uncommon in real-world tabular data sets.

The premise of our work is that in a tabular data application, models will often benefit from the improved information retention of categoric encodings that extract grammatical structure shared between categoric entries, as opposed to the coarse-grained encodings of mainstream methods.
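As a toy illustration of that premise (a simplified stand-in for the library's string parsing transforms, not their implementation): entries such as ‘north-west’ and ‘south-west’ would be treated as fully distinct under one-hot encoding, discarding the structure they share, while a parsing pass can surface shared character subsets as candidate features.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Return the longest substring shared between two strings."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a : m.a + m.size]

# Unique entries from a hypothetical categoric feature set
entries = ['north-west', 'north-east', 'south-west', 'south-east']

# One-hot encoding would treat all four entries as unrelated. Parsing
# for shared grammatical structure recovers information otherwise lost:
shared = set()
for i, a in enumerate(entries):
    for b in entries[i + 1:]:
        overlap = longest_common_substring(a, b)
        if len(overlap) >= 4:  # minimum overlap threshold (illustrative)
            shared.add(overlap)

print(shared)  # {'north-', 'south-', 'th-west', 'th-east'}
```

The library's transforms go further, aggregating such overlaps into returned feature columns; the point here is only the information retention argument.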

We believe the string parsing operation to be beneficial both as a general purpose supplement to categoric encodings for less advanced users, and also for sophisticated users who might otherwise consider applying advanced NLP methods like BERT to tabular applications but recognize that there are some types of esoteric domains where BERT may not be as viable. Even with next generation technologies of the sort we are seeing from the likes of GPT-3, having a formal framework for tabular data processing will still be necessary: if we have an NLP model that can write software in python, that doesn’t mean we don’t need python. Thus Automunge is infrastructure that NLP applications could be built on top of. A foundation.

I appreciate your review of this work. I hope you might consider reading my responses to the other reviewers as well before making your final decision. Best regards.

To: Reviewer 2

Further clarifications

Thank you, reviewer, for your comments on this work. I appreciate your recognition of the benefit of automating the repetitive tasks that data analysts, especially in the NLP domain, deal with. Thank you for confirming that feature selection and string encoding will be beneficial for applied researchers with textual data.

I hope you are right about the potential for wide interest in the community; one of the consistent challenges for the library developers has been getting the word out, as Automunge is an open source library without resources for “public relations” or advertising. I believe that this library could be of broad benefit to both machine learning researchers and practitioners, as tabular data preprocessing conventions have yet to home in on a single mainstream standard.

With regards to a literature review and comparison to existing and relevant work, the omission is partly a result of research conducted by invention, building things from scratch (primarily on top of the Pandas library). The seed of this project originated from some beginner tabular data competitions on Kaggle, and the developers basically took the approach of building from the ground up what was seen as an unmet need: a tabular data standard. The development process has been incremental and evolutionary, and along the way I believe some material improvements have been incorporated over what is otherwise available in mainstream practice for tabular data preprocessing.

With respect to performance plots for different tasks, I am taking this as a good idea for further research. One of the drivers for various design decisions has been speed of application, particularly for subsequent data processed in the postmunge(.) function. Part of the challenge for benchmarking purposes is budgetary constraints associated with expensive licenses for commercial alternatives. We did include some benchmarking for speed in the Jupyter notebooks uploaded with the supplemental material (see for example the uploaded notebook “efficiency_tests_061020.ipynb”).

I believe one of the key advantages of this framework over a mainstream option like scikit-learn originates from the simplicity of populating a single dictionary “fit” to properties of a training set, which can be shared and published by researchers for fully consistent processing of additional data, as may benefit reproducibility of benchmarks and experiments. Other aspects of novelty, such as these string parsing methods and automated ML for missing data infill, are also material improvements.

With regards to the use of the term “string theory”, I hope you will grant this author a small indulgence. Its use is admittedly a little on the humorous side; we justified it based on an assumption that there was not likely to be any confusion about the overlap between these very different domains, especially in the context of the phrase “Parsed Categoric Encodings”. If you consider it beneficial for moving forward, we would certainly be willing to strike those two words from the title.

Again, I certainly appreciate your review and recognition of the potential benefit of this library to applied research. I have also publicly responded to the other reviewers, and will be uploading an additional validation demonstration notebook, which should follow in the next few days. Best regards.

Mozart’s Sonata VI — Nicholas Teague

For further reading please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

Nicholas Teague

Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.