Data-Centric AI with Automunge
NeurIPS 2021 Workshop Poster
The video discussion is based on a summary of the paper Tabular Engineering with Automunge, to be presented at the Data-Centric AI workshop at NeurIPS 2021. A transcript follows.
Transcript
Hi there. I’m Nicholas Teague, the founder of Automunge. We offer a Python library that automates the preparation of tabular data for machine learning.
The interface is channeled through two master functions. The automunge(.) function accepts raw tabular data and translates it to numeric encodings, applying infill to missing data. Along with the encoded dataframes, automunge(.) also returns a dictionary recording the parameters and steps of the derivations.
This dictionary can be used as a key to prepare additional corresponding data on a consistent basis. In general use, we expect the data returned from automunge(.) will be used to train and validate a model, and the data returned from postmunge(.) will be used for inference.
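This train-then-reapply pattern can be sketched in a few lines of plain Python. The sketch below is illustrative only: the function names and returned dictionary are our own simplified analog, not the library's actual automunge(.)/postmunge(.) interface. The key idea it demonstrates is that parameters fit on the training data are recorded in a dictionary and reused to prepare additional data on a consistent basis.

```python
# Minimal analog of the automunge(.)/postmunge(.) pattern (illustrative
# only -- these names and this logic are NOT the Automunge API):
# fit encoding parameters on training data, record them in a dictionary,
# then reuse that dictionary to prepare inference data consistently.

def fit_and_prepare(train_rows):
    """Normalize a numeric feature and record the fit parameters."""
    mean = sum(train_rows) / len(train_rows)
    var = sum((x - mean) ** 2 for x in train_rows) / len(train_rows)
    std = var ** 0.5 or 1.0  # guard against zero variance
    process_dict = {'mean': mean, 'std': std}
    prepared = [(x - mean) / std for x in train_rows]
    return prepared, process_dict

def prepare_additional(process_dict, new_rows):
    """Prepare new data using the recorded training-set parameters."""
    mean, std = process_dict['mean'], process_dict['std']
    return [(x - mean) / std for x in new_rows]

train_prepared, process_dict = fit_and_prepare([1.0, 2.0, 3.0, 4.0])
inference_prepared = prepare_additional(process_dict, [2.0, 5.0])
```

Because the inference data is encoded with the training-set statistics, the same raw value always receives the same encoding in both passes.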
Under automation, features are evaluated to select between types of encodings: numeric data is normalized, categoric data is binarized, high-cardinality categoric data is hashed, datetime data is segregated by time scale, and missing data receives ML infill and missing-marker aggregation.
Automunge may also serve as a platform for engineering data pipelines. An internal library of encodings includes options like parsed categoric encodings and noise injections, which can be mixed with custom-defined operations in sets of derivations targeting a feature. An inversion operation can recover the original form of the input.
The Automunge family tree primitives can be used for concise specification of transformation sets, including generations and branches of derivations. Entries to upstream primitives are applied as the first generation of derivations, and downstream primitives are treated as the upstream primitives for the next generation.
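A toy interpreter can make the generational structure concrete. The primitive names below ('parents' for upstream entries with downstream offspring, 'cousins' for upstream entries without offspring) follow the library's documented convention, but the interpreter, categories, and transform functions are simplified illustrations of our own, not the library's implementation.

```python
import math

# Toy interpreter for generations of derivations in the family-tree
# style (illustrative only, not the Automunge implementation).

functions = {
    'log0': lambda xs: [math.log(x) if x > 0 else 0.0 for x in xs],
    'mnmx': lambda xs: [(x - min(xs)) / ((max(xs) - min(xs)) or 1.0)
                        for x in xs],
}

family_trees = {
    'root': {'parents': ['log0'], 'cousins': ['mnmx']},
    'log0': {'parents': [], 'cousins': ['mnmx']},
    'mnmx': {'parents': [], 'cousins': []},
}

def apply_tree(category, column, name):
    """Apply a category's upstream entries; recurse through parents
    so each parent's own tree supplies the next generation."""
    out = {}
    tree = family_trees[category]
    for cousin in tree['cousins']:    # upstream entries without offspring
        out[f'{name}_{cousin}'] = functions[cousin](column)
    for parent in tree['parents']:    # upstream entries with offspring
        derived = functions[parent](column)
        out[f'{name}_{parent}'] = derived
        out.update(apply_tree(parent, derived, f'{name}_{parent}'))
    return out

result = apply_tree('root', [1.0, 10.0, 100.0], 'feat')
```

Here the root generation applies a min-max scaling and a log transform, and the log column then receives its own min-max scaling as a second generation, yielding the derived columns feat_mnmx, feat_log0, and feat_log0_mnmx.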
Our paper demonstrated a new feature in the library for categoric consolidations of label sets. By consolidating multiple labels into a single representation with automunge(.), a single classification model can be trained, and then after inference a postmunge(.) inversion can recover the original multi-label form.
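The round trip can be sketched as follows. This is a simplified illustration under our own assumptions, not the library's implementation: each distinct tuple of labels is mapped to a single consolidated categoric label for training, and a recorded inversion map recovers the original multi-label rows after inference.

```python
# Illustrative sketch (not Automunge internals) of categoric label-set
# consolidation and its inversion.

def consolidate(label_rows):
    """Map each distinct tuple of labels to one consolidated label."""
    encoding = {}
    consolidated = []
    for row in label_rows:
        key = tuple(row)
        if key not in encoding:
            encoding[key] = len(encoding)
        consolidated.append(encoding[key])
    inversion = {v: k for k, v in encoding.items()}
    return consolidated, inversion

def invert(consolidated, inversion):
    """Recover the original multi-label rows from consolidated labels."""
    return [list(inversion[c]) for c in consolidated]

labels = [['cat', 'small'], ['dog', 'large'], ['cat', 'small']]
consolidated, inversion = consolidate(labels)
recovered = invert(consolidated, inversion)
```

A single classifier trained on the consolidated labels can thus stand in for multiple per-label models, with the inversion map translating its predictions back to the original label columns.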
More information, including links to documentation, tutorials, and essays, is available at automunge.com.
For further reading, please check out A Table of Contents, Book Recommendations, and Music Recommendations.
* images copyright 2021