Getting Caught Up

This is what rapid progress looks like

Nicholas Teague
Automunge
14 min read · Dec 4, 2019


For those that haven’t been following along, I’ve been using this forum in recent months to document the development of Automunge, an open source platform for the preparation of tabular data for machine learning. The tool is a Python library available now for pip install, and serves a really useful function of automation for the final steps of data wrangling prior to the application of machine learning. One of the key sources of complexity in implementing any machine learning project is the management of the data pipelines for preprocessing of training data and consistent preprocessing for additional data such as to generate predictions. Automunge fully automates this pipeline such that a user needs only pass consistently formatted dataframes or arrays and Automunge handles all the rest. Through the application of the tool, numerical data is normalized, categoric data is numerically encoded, time series data is encoded - all based on categories of transformations inferred from evaluating properties of the data. A user doesn’t have to defer to automation, she can also selectively assign custom processing methods to distinct columns, which methods may be single transformations or even sets of transformations such as with generations and branches defined using our “family tree” primitives. The composition of these transformations functions may be sourced from our quite impressive internal library, or alternately be custom defined by the user, making use of just a few simple data structures, for incorporation into the platform. If you’d like an introduction to basic operation of the tool, a good starting point is an essay I published a few weeks ago:

By implementing feature engineering transformations within the platform, a user gets access to a whole host of useful built-in methods, which I won't detail in full here, but just to give you a flavor of a few of our features:

  • Numerical encoding: columns without user-specified sets of transformation functions are deferred to an automated evaluation of data properties for application of simple feature engineering functions, transforming raw data into a form suitable for the direct application of machine learning, including normalization of numerical data and boolean encoding of categoric data.
  • Library of transformations: oh, and just this really impressive catalog of transformation functions that may be assigned as a form of feature engineering and numerical encoding for various kinds of data, for tailored feature engineering methods to your heart's desire.
  • Tree of transformations: a user may pass custom defined sets of transformation functions for assignment to distinct columns, such as to apply a "tree" of transformations including generations and branches.
  • Feature Importance and Drift Report: the functions have automated methods to conduct evaluations such as feature importance of source columns and derived columns using the shuffle permutation method, as well as evaluation of data set properties for comparison between original training data prepared with the automunge(.) function and subsequent data prepared with the postmunge(.) function to identify any drift in data set properties.
  • Label Smoothing: regularization by conversion of label sets from boolean 1/0 encodings to a mix of floats between 0 and 1, such as for reducing the label certainty presented to the cost function, which may benefit training in the presence of noisy mislabeling in some portion of labels, which in many cases is inevitable.

Anyhoo I’m going try to be brief here, wanted to quickly highlight a few of the new methods rolled out in recent weeks in the interest of getting caught up on update disclosures.

1. Default categoric encoding:

We recently updated our default encoding for categoric data from one-hot encoding (e.g. 'text' in our library) to binary encoding (e.g. '1010' in our library). As a quick refresher, one-hot encoding refers to generating for each category entry a unique boolean column to indicate activations, while binary encoding refers to encoding into a set of columns wherein a category may be indicated by one or more simultaneous activations in the same row. For example, an activation for a six category set in one-hot encoding may look like 000100, while in binary encoding it may look like e.g. 110 (where each number represents a distinct column). The rationale here is that binary encoding is much more memory efficient than one-hot: one-hot encodes n categories into n columns, while binary encodes as many as 2^n categories into just n columns (in other words, the column count grows with the base 2 logarithm of the number of categories rather than linearly). Oh, and for now we're only changing the default for the training data; the default for label sets will remain one-hot encoding unless otherwise specified by the user.

a few of our options for categoric set encoding
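To make the column-count arithmetic concrete, here's a minimal sketch in plain pandas/numpy (a toy illustration, not the Automunge internals) contrasting the two encodings:

```python
# one-hot: n categories -> n columns; binary: n categories -> ceil(log2(n)) columns
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red', 'cyan']})

# one-hot encoding: a dedicated boolean column per category
onehot = pd.get_dummies(df['color'], prefix='color')

# binary encoding: integer-encode the categories, then write each index in base 2
codes, uniques = pd.factorize(df['color'])
width = int(np.ceil(np.log2(max(len(uniques), 2))))
binary = pd.DataFrame(
    {'color_1010_%d' % i: (codes >> i) & 1 for i in reversed(range(width))})

print(onehot.shape[1], binary.shape[1])  # 4 one-hot columns vs 2 binary columns
```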

One of the challenges of incorporating this default binary categoric encoding was to maintain support for our ML infill function, which, as a reminder, trains a unique machine learning model for each set of derived columns for purposes of predicting infill to missing or improperly formatted data based on properties of the rest of the data (in a fully generalized and automated fashion). Diving into the implementation weeds for a second, one of the key steps here is assembling subsets of data to serve as training data, labels, and features for generating predictions. The difficulty comes in when you consider that modern machine learning libraries more or less require either an ordinal encoding or a one-hot encoding for categoric labels, such that binary encodings like the one developed here aren't typically supported (by anyone, I don't think, but I haven't done a lot of digging so I could be mistaken). Our solution was simply to perform a conversion of the binary encoded labels prior to training our predictive model; more specifically, we convert the labels from binary to one-hot encoding prior to training the associated ML infill predictive model, and then after generating our predictions for infill we convert the one-hot encoding back to binary encoding for insertion. In practice it's kind of a trade-off between the overall memory efficiency of encoding categoric data in binary and the processing time for the conversions in ML infill. If some enterprising young researcher wants to conduct a study I'd be happy to help; the code is all on GitHub, it's a pretty open ecosystem.
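Here's a rough standalone sketch of that round trip (a toy illustration, not the Automunge source): binary-encoded labels are converted to one-hot for model training, and predicted one-hot activations are converted back to binary for insertion.

```python
import numpy as np

def binary_to_onehot(binary_labels):
    # collapse each row of bits into an integer index, then one-hot that index
    width = binary_labels.shape[1]
    powers = 2 ** np.arange(width)[::-1]
    idx = binary_labels @ powers
    onehot = np.zeros((len(idx), 2 ** width), dtype=int)
    onehot[np.arange(len(idx)), idx] = 1
    return onehot

def onehot_to_binary(onehot_preds, width):
    # argmax recovers the integer index, then re-expand the index into bits
    idx = onehot_preds.argmax(axis=1)
    return np.stack([(idx >> i) & 1 for i in reversed(range(width))], axis=1)

labels = np.array([[1, 0], [0, 1], [1, 1]])    # three binary-encoded rows
onehot = binary_to_onehot(labels)               # train the infill model on this
recovered = onehot_to_binary(onehot, width=2)   # insert predictions as this
assert (recovered == labels).all()
```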

2. Passing column-specific parameters to transformation functions:

We’ve introduced another really useful extension which greatly enhances the capacity of the platform to efficiently conduct feature engineering methods tailored to properties of the data. Whereas previously any variations on a transformation function (such as feature engineering transformation functions that may be either pre-defined in the internal library or user-defined and passed to an automunge call) — sorry I digress, whereas previously any variations on a transformation function intended for a distinct column which was not already available in the library would require defining an all new transformation function incorporating the variations, a user now has the ability to pass parameters to those transformation functions applied to distinct columns, or even set new parameters as a default for all columns, in an automunge(.) call via the ‘assignparam’ dictionary. Although currently our library of transformations does not have extensive support for parameter passed variations, the intent is to build in some more methods for these kind of variations throughout the library. Of course if some variation to a transformation function is useful or common enough it makes more sense to just create a new function available for assignment so that a user doesn’t have to think about parameters, kind of a trade-off between a user’s ability to navigate our continuously growing catalog of feature engineering methods vs simplicity of application for common transformations.

assignparam assembly demonstration

The specifics of implementation are fairly straightforward, but I'll quickly walk through them in the interest of thoroughness. In short, the dictionary should be populated such that the first layer keys are the transformation categories for which parameters are intended (each transformation category has a corresponding string identifier). The second layer keys are string identifiers for the columns for which the parameters are intended (for example, if the data is a pandas dataframe this would be the column header, or if the data is a numpy array this could be an integer for the column position). The third layer keys are the parameters whose values are to be passed. The parameters will only be passed to those columns assigned in the second layer keys; otherwise the transformation functions will make use of their default parameters. If a user wishes to assign new default parameters for a given category of transformation, they can do so by passing a first layer key with the string 'default_assignparam', which then accepts second layer keys of transformation categories and third layer keys of the desired new default parameters. (In the implementation this new default is then populated for every column in the form of the other assignparam entries, FYI.)
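To illustrate the nesting, here's a hedged example of an assignparam assembly; the category 'splt' and parameter 'minsplit' serve as stand-ins here, so consult the README for which transformation categories actually accept which parameters.

```python
assignparam = {
    'splt': {                     # first layer: transformation category
        'column1': {              # second layer: target column header
            'minsplit': 5,        # third layer: parameter name and value
        },
    },
    'default_assignparam': {      # optional: new default for all columns
        'splt': {
            'minsplit': 4,
        },
    },
}

# then passed along with the data in an automunge(.) call, e.g.
# am.automunge(df_train, assignparam=assignparam)
```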

3. A few new methods for preparing categoric data:

So we’ve been rolling out some really cool methods lately for extracting structure from categorical string sets by way of evaluation for overlaps between string compositions — using our ‘splt’ family of transformations. (As a quick asterisk these methods aren’t really intended for long form text, as they basically sequentially parse within embedded loops to identify overlaps in character compositions. However if you’ve got a bounded range of unique values in a categoric set, this can prove useful for identifying internal grammatical structures.) A few deviations include those that identify overlaps by way of an activation column for each train set identified overlap (such as splt/spl8), those that create a new column in which values with overlaps present are replaced with the (shorter length) overlap (such as spl2/spl9), or another deviation comparable to last but those values that are not replaced with overlaps are reset to zero (such as spl5/sp10) — with a key distinction being that in the cases of splt/spl2/spl5 the test set values are consistently parsed to the train set values to identify overlaps, and in spl8/spl9/sp10 the test set processing makes an assumption for efficiency that the set of unique values in the test set is the same or a subset of those values which were found in the train set. And using these functions we’ve assembled a few defined sets intended as an improvement for raw categoric embedding of bounded string sets of unknown structure by way of extracting grammatical patterns and encoding them for inference by the predictive algorithms, such that embeddings are based on properties of data found in the train set and consistently encoded for subsequent test set data.

An example of transformation trees for categoric sets; for more variations please see our README. The orange boxes represent the internal derivations and the blue boxes the returned outputs. I probably overuse this color scheme; you know, Florida Gators, it's kind of a thing.

I’ll demonstrate here what we’ve come up with in the family trees defined in our library for root categories or19 and or20. So basically if you want to apply these transforms, you can assign a column to the root category in the “assigncat” dictionary passed to an automunge(.) call. And the result will be a series of returned sets originating form that source column, with transforms based on property of data found in the train set, consistently applied for subsequent data in designated ‘test’ sets passed to automunge(.) or postmunge(.). As shown in the image above, the transform starts by converting all strings to uppercase characters via the UPCS function, which is based on the assumption that differences in string cases can be ignored (note there are also alternates of these methods in the library that omit the UPCS precursor). The set of distinct uppercase values are binary encoded via the ‘1010’ transform into a set of 1/0 designated columns. The ‘nmc8’ transform is another kind of string parsing in which unique values are parsed to extract any numeric character entries such as to return a numeric set, which in this case is followed by a z-score normalization via the ‘nmbr’ category. And then the string parsing for overlap methods are shown here making use of ‘spl9’ and ‘sp10’, wherein or19 has two tiers of overlap parsing, and or20 has three tiers of overlap parsing, in each case the parsing followed by an ordinal (integer) encoding with the ‘ord3’ transform which sorts entries by frequency of occurrence. The output would be distinct columns for each of the ord3 and nmbr encodings, and then the number of columns returned for the 1010 encoding would be a function of the number of distinct values found in the original set (after conversion to UPPERCASE).

4. New report options returned from postmunge(.)

Ok, just to rehash the fundamentals of how this stuff works: we have two primary functions built into the class for processing data. The automunge(.) function is intended for the initial transformations applied to training data meant to train a machine learning model and, if available, simultaneously any corresponding data intended to generate predictions from that model (a.k.a. the train and test sets). Through its application, a series of prepared (numerically encoded) sets are returned, including training sets, separately processed validation sets (such as to avoid data leakage), and any processed test data, each of these sets including data associated with training, index, and/or labels. In addition to a few other pieces, the automunge(.) function also returns a dictionary (the "postprocess_dict") capturing all of the information about the steps of transformation, normalization parameters, trained ML infill predictive models, etc. needed to consistently prepare additional data, such as by passing this postprocess_dict along with additional data to the postmunge(.) function.

demonstration of automunge(.) and postmunge(.) calls with default parameters
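In code, the workflow looks something like this (the return signatures are abbreviated and hedged here, as the actual calls return a longer tuple of prepared sets; consult the README for the full API):

```python
import pandas as pd
from Automunge import Automunger   # import path may differ by version

am = Automunger.AutoMunge()
df_train = pd.read_csv('train.csv')

# initial preparation of training data with default parameters; the final
# entry of the returned tuple is the postprocess_dict capturing the steps
# of transformation, normalization parameters, ML infill models, etc.
returned_sets = am.automunge(df_train, labels_column='label')
postprocess_dict = returned_sets[-1]

# later: consistently prepare additional data on the train set basis
df_new = pd.read_csv('new_data.csv')
returned_sets2 = am.postmunge(postprocess_dict, df_new)
```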

So what I wanted to highlight is that we've updated the returned sets from postmunge(.) to now include a dictionary called "postreports_dict", which contains the results of any optional assembly of postmunge reports for feature importance evaluations or a drift report. As a refresher, the feature importance evaluations are conducted by way of the shuffle permutation method and require the inclusion of a designated labels column. We currently have two primary metrics for the results: the first indicating the importance of a source column toward predictive accuracy, and the second indicating the relative importance of each of the derived columns originating from the same source column. And then we have the optional assembly of a "drift report", which evaluates properties of data passed to postmunge(.) for comparison to the original properties of the corresponding training data that was passed to automunge(.), which may prove useful, for instance, to evaluate consistency in the composition or distribution properties of subsequent data (as deviations could indicate that it might be time to retrain a model). Yeah, so these returned reports are accessible via printouts during the postmunge operation or in the postreports_dict object returned from postmunge. Good times.
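A quick sketch of requesting the optional reports; note the parameter names and dictionary keys shown here are illustrative stand-ins, so see the README for the exact spellings supported by the current version.

```python
returned = am.postmunge(postprocess_dict, df_new,
                        labelscolumn=True,   # labels required for feature importance
                        featureeval=True,    # shuffle-permutation feature importance
                        driftreport=True)    # compare properties vs original train data

postreports_dict = returned[-1]  # reports dictionary returned with the prepared sets

# e.g. inspect the two importance metrics and the drift evaluations
print(postreports_dict['featureimportance'])
print(postreports_dict['driftreport'])
```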

5. Label Smoothing

Finally, I wanted to quickly highlight a recent addition to our library giving options for a user to conduct Label Smoothing, a kind of regularization tactic in which categoric label sets are converted from boolean 1/0 activations (such as in a one-hot encoding) to a mix of reduced and increased float activations. For example, passing the float 0.9 to one of the Label Smoothing parameters would result in the conversion from the set of activations {1,0} to {0.9, #}, where # is a function of the number of categories in the label set: for a two label set it would convert the activations {1,0} to {0.9, 0.1}, while for the one-hot encoding of a three label set it would convert {1,0} to {0.9, 0.05}. We've given the user options to apply label smoothing to any or all of the returned sets, such as those labels corresponding to train, validation, or test data passed to automunge(.), or subsequent test data passed to postmunge(.). I've seen a few people cite the paper "Rethinking the Inception Architecture for Computer Vision" by Szegedy et al for this practice, but I was actually doing some digging, and according to Goodfellow et al's Deep Learning text this practice has been around since at least the 1980's, so all I'm really doing here is packaging to simplify the method (with a little help from Stack Overflow).

demonstration of Label Smoothing application
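Here's a minimal sketch of the smoothing arithmetic, just to make the {0.9, #} conversion concrete: the main activation becomes the passed float, and the off activations share the remaining probability mass.

```python
import numpy as np

def smooth_labels(onehot, main=0.9):
    onehot = np.asarray(onehot, dtype=float)
    K = onehot.shape[1]               # number of categories in the label set
    off = (1.0 - main) / (K - 1)      # e.g. K=3, main=0.9 -> off=0.05
    return np.where(onehot == 1.0, main, off)

print(smooth_labels([[1, 0, 0], [0, 1, 0]]))
# [[0.9  0.05 0.05]
#  [0.05 0.9  0.05]]
```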

Some of the (minor) complexities associated with the implementation arose from asking whether label smoothing, when applied, should apply to just the training data versus the corresponding validation or test data. I couldn't find any literature on this point, but my intuition tells me that label smoothing is intended primarily for the training operation, and when it's time to validate our data or generate predictions we prefer the boolean encodings for comparison to predictions. However, because the sets that we designate as train vs test etc. may at times be put to different uses (such as, for example, using the automunge(.) call on a subset of the training data to populate a postprocess_dict and then processing additional training data more efficiently in postmunge(.) without the overhead of the evaluation functions; sorry, long tangent), in other words, because what is passed as test data may actually be intended as training or validation data, we wanted to give the user options to conduct label smoothing on any or all of these sets. Long story short (too late), we have separate parameters for applying label smoothing to train, test, or validation sets in automunge(.) (with parameters LabelSmoothing_train / LabelSmoothing_test / LabelSmoothing_val) or for additional data processed in postmunge(.) (with parameter LabelSmoothing). A user can pass the default of False for no label smoothing, a float in the range 0–1 for the new activation value, or True to have label smoothing applied to a set consistently with whatever value was passed for the train set. Sorry, it occurs to me this was a little more long-winded than intended; just trying to be thorough.
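And a quick hedged sketch of passing these parameters (the labelscolumn argument and dataframe names here are illustrative):

```python
returned_sets = am.automunge(df_train, labels_column='label',
                             LabelSmoothing_train=0.9,  # smooth training labels
                             LabelSmoothing_val=False,  # keep validation labels boolean
                             LabelSmoothing_test=True)  # True: match the train setting

# and for additional data prepared in postmunge(.):
returned = am.postmunge(postprocess_dict, df_new,
                        labelscolumn=True, LabelSmoothing=0.9)
```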

Conclusion

Well it’s probably a little unorthodox to conduct software updates at quite the pace that Automunge has been maintaining for the last few months, I mean 3–4 updates a week is certainly kind of extreme. We are blessed to not have very many users at this point, which gives us a little more flexibility for rapid iterations as we hone in on product market fit. Of course this can’t go on indefinitely, and one of the triggers for dampening this pace down will certainly be as we get more people on board to incorporate into their workflow. We’re currently at the point where we are very very interested in feedback from early users. What here do you like, where might our implementations or documentation be causing any confusion? Are there any features or transformation functions that you’d like to see built into our library? Really even the slightest bit of acknowledgement that you’ve gotten some value here would be beneficial as it can help us identify channels for further targeted refinement. Really appreciate any and all retweets and shares.

Great, so next week is going to be pretty busy as I'll be attending the NeurIPS conference in Vancouver. If any readers would like to say hello, please feel free to look me up on Twitter or Whova or something under Nicholas Teague. Thanks sincerely for your attention and the opportunity to share. Cheers and God bless.

Books that were referenced here or otherwise inspired this post:

Deep Learning — Ian Goodfellow,‎ Yoshua Bengio,‎ and Aaron Courville


As an Amazon Associate I earn from qualifying purchases.

Photography was taken at the Charles Hosmer Morse Museum of American Art in Winter Park, FL.

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
