A Disclosure of Sorts

String parsing for the win

Nicholas Teague
Automunge
Nov 9, 2019


Unique Forms of Continuity in Space — Umberto Boccioni (1913), on display at New Orleans Museum of Art
U2 — Achtung Baby (album)

Ok, I have recently been rolling out some methods for automated string parsing in tabular data and am going to quickly throw together a disclosure of sorts for the methods. Kind of rushing this one since I’m trying to get a submission to the patent office on quick turnaround to secure precedence for a provisional patent, aka patent pending. Let’s give her the proverbial shot, as they say. (Patent pending)

Numerical String Parsing Methods

The ‘nmrc’ transform takes as input a tabular data set (such as a pandas dataframe) and evaluates a column containing categorical entries (such as a set of strings), returning a new column containing numerical (float) entries for any subset of string characters that was evaluated to be a grouping of numeric characters. The method gives precedence to groupings of longer character length, such that if a string contains more than one grouping of numeric characters the grouping of the longest length will be returned. The method may be applied to just a training data set in the automunge(.) function, both a training and test data set simultaneously in the automunge(.) function, or just a test data set in the postmunge(.) function. An alternate to the ‘nmrc’ transform is the ‘nmcm’ transform, which is identical save that the evaluation of string character groupings strips out comma characters, so as to recognize collections of numeric characters with comma separators (e.g. ‘1,234’), which a direct float conversion in python would not otherwise accept.

The parsing process is as follows (this can be applied to either the train or the test set independently):
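
Automunge embeds the actual routines internally, but roughly speaking the parsing resembles the following minimal sketch, where the function name, column names, and the decimal handling are illustrative simplifications rather than the library internals:

import pandas as pd

def parse_numeric_string(entry):
    # scan the string for groupings of consecutive numeric characters and
    # return the longest grouping as a float, or None if no numbers were found
    entry = str(entry)
    longest = ''
    current = ''
    for character in entry:
        if character.isdigit() or (character == '.' and current and '.' not in current):
            current += character
        else:
            if len(current) > len(longest):
                longest = current
            current = ''
    if len(current) > len(longest):
        longest = current
    return float(longest) if longest else None

# illustrative application to a train set column, with None entries
# subsequently subject to infill
df = pd.DataFrame({'column': ['asdf123', 'zip 32801-1234', 'no numbers here']})
df['column_nmrc'] = df['column'].apply(parse_numeric_string)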

An alternate method to nmrc, which we call nmr4 (with a similar counterpart to nmcm we call nmc4), is more efficient in application to the test set by way of replacing the string parsing of the test set, based on the assumption that the set of unique values found in the test set is the same as or a subset of the unique values in the train set. Thus we can simply evaluate for any cases of unique entries that did not meet this assumption (based on a comparison of the unique entries of the two sets) and populate them with infill, and then for the remaining entries reuse the overlap_dict that was populated from the train set, such as during simultaneous application in automunge(.) or in a subsequent operation in postmunge(.) making use of the postprocess_dict that was returned from the corresponding application of automunge(.).
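
As a rough illustration of the nmr4 idea (with toy object names standing in for the real data structures), the test set pass might look something like:

import pandas as pd

# a toy overlap_dict as might have been populated from the train set,
# mapping unique train strings to their parsed floats (names illustrative)
overlap_dict = {'asdf123': 123.0, 'zip 32801': 32801.0}

test_df = pd.DataFrame({'column': ['asdf123', 'brand new entry', 'zip 32801']})

# entries not seen in the train set get infill rather than fresh string parsing
unique_list_test = list(test_df['column'].astype(str).unique())
for unique_value in unique_list_test:
    if unique_value not in overlap_dict:
        overlap_dict[unique_value] = float('nan')

test_df['column_nmrc'] = test_df['column'].astype(str).map(overlap_dict)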

Another alternate to nmrc, which we call nmr7 (with a similar counterpart to nmcm we call nmc7), is more efficient than nmrc but has looser requirements for test data unique values matching those of the train data. In this version, instead of creating an all-new overlap_dict from scratch for the test set, we make a copy of the overlap_dict populated for the train set for use with the test set. Following a similar loop to that described above for nmrc, the test for whether a unique value from the test set is present in the overlap_dict will potentially save us several loops of string parsing for all those cases where the test set unique value was previously found in the train set used to populate the overlap_dict. Again this test set processing may take place simultaneous to the train set in automunge(.) or separately in postmunge(.) using an overlap_dict passed to postmunge in the postprocess_dict.
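
And a comparable hedged sketch of the nmr7 approach, again with illustrative names and a simplified stand-in for the full longest-grouping parse:

import copy
import pandas as pd

# start from a copy of the train set overlap_dict and only string parse those
# test set uniques not already present (toy data, names illustrative)
train_overlap_dict = {'asdf123': 123.0, 'zip 32801': 32801.0}
test_overlap_dict = copy.deepcopy(train_overlap_dict)

test_df = pd.DataFrame({'column': ['asdf123', 'suite 400', 'zip 32801']})

for unique_value in test_df['column'].astype(str).unique():
    if unique_value not in test_overlap_dict:
        # only this branch pays the cost of string parsing; here a simplified
        # stand-in that keeps all digit characters rather than the longest grouping
        digits = ''.join(c for c in unique_value if c.isdigit())
        test_overlap_dict[unique_value] = float(digits) if digits else float('nan')

test_df['column_nmrc'] = test_df['column'].astype(str).map(test_overlap_dict)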

String Overlap String Parsing Methods

We have other string parsing methods that, instead of checking for numeric portions of entries, check for overlaps of sets of characters between different unique entries in the set. As an example of how this might be useful, consider a set of entries containing mailing addresses including city, state, and zip code. By simply one-hot encoding the entire set, the machine learning training would not be exposed to the structure arising from points of commonality between the different entries in this column. One method, if the data had sufficient consistency in structure, could be to create an algorithm to extract the ordered groupings (such as those separated by a space) and return them in separate columns. That’s not what’s being demonstrated here though. Instead we are introducing a method to parse the sets of string characters in each unique value, and compare them to the parsed sets of string characters in each other unique value to identify any overlaps of commonality. In so doing we’ll establish a basis that overlaps of higher character length take precedence, and we’ll set a minimum length of overlap to avoid trivial overlaps. In the ‘splt’ method these overlaps are populated as activations in one-hot encoded sets for each identified overlap. In the ‘spl2’ method entries with identified overlaps are replaced with the overlap (of shorter character length) in a new column copied from the original. And in the ‘spl5’ method the replacement with overlaps is comparable to spl2, with the exception that those entries that were not replaced are set to 0, such as might be beneficial if this column is supplementing another encoding of the original column, so as to avoid distracting the training operation with redundant information.

The parsing process is as follows (this can be applied to either the train or the test set simultaneously in automunge(.), or to the test set independently in postmunge(.) by passing those objects derived from the train set, used to process the test set, in the postprocess_dict which is passed to postmunge):
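
To give a flavor of the train set pass, here is a hedged sketch with illustrative names and toy data, not the library implementation:

import pandas as pd

# toy train set column and a minimum overlap length (names illustrative)
train_df = pd.DataFrame({'column': ['Orlando FL 32801', 'Miami FL 33101', 'Winter Park FL 32789']})
minimum_length = 5

unique_list_train = [str(x) for x in train_df['column'].unique()]
overlap_dict = {}

for value in unique_list_train:
    found = False
    # walk candidate substrings from longest down to the minimum length so that
    # longer overlaps take precedence for this unique value
    for length in range(len(value), minimum_length - 1, -1):
        for start in range(0, len(value) - length + 1):
            candidate = value[start:start + length]
            # an overlap is recorded when the candidate also appears in some other unique value
            matches = [other for other in unique_list_train
                       if other != value and candidate in other]
            if matches:
                entries = overlap_dict.setdefault(candidate, [])
                for entry in [value] + matches:
                    if entry not in entries:
                        entries.append(entry)
                found = True
                break
        if found:
            break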

This populates the overlap_dict for the train set, such that the keys of the overlap_dict are the identified overlap subsets of characters, and the values for those keys are lists of unique entries from the train set which contained that overlap as a portion of their characters. For the test set we’ll only worry about those overlap portions that were previously identified in the train set. After creating a list of unique values from the test set column and converting it to strings (aka the unique_list_test), we’ll initialize an empty dictionary to serve as the overlap_dict for the test set (aka the test_overlap_dict). We’ll grab a list of the keys from the train set overlap_dict and sort them by length (descending), and then populate the test_overlap_dict with each of the keys from the train set overlap_dict, with value entries being empty lists (we’ll call these keys the train_keys). Then follows this loop:
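
In sketch form, with toy stand-ins for the objects described above rather than the library internals:

# a toy overlap_dict as populated from the train set, plus test set uniques
overlap_dict = {' FL 32': ['Orlando FL 32801', 'Winter Park FL 32789'],
                ' FL 3': ['Miami FL 33101', 'Orlando FL 32801']}
unique_list_test = ['Orlando FL 32801', 'Tampa FL 33602', 'Denver CO 80014']

# keys from the train set overlap_dict sorted by length, descending
train_keys = sorted(overlap_dict, key=len, reverse=True)
test_overlap_dict = {key: [] for key in train_keys}

for unique_value in unique_list_test:
    for key in train_keys:
        # record the test set unique value against any train set overlap it contains
        if key in unique_value:
            test_overlap_dict[key].append(unique_value)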

This populates the test_overlap_dict. Now that we have our entries to the overlap_dict and the test_overlap_dict, we’ll create a new column for each identified overlap serving as a key, and populate it with all zeros save for the one activations in the rows corresponding to those unique values that were included in the overlap_dict value lists for those keys, which completes the returned columns.
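
A minimal sketch of that column population step, with a toy overlap_dict and an illustrative column naming convention:

import pandas as pd

df = pd.DataFrame({'column': ['Orlando FL 32801', 'Miami FL 33101', 'Winter Park FL 32789']})
overlap_dict = {' FL 32': ['Orlando FL 32801', 'Winter Park FL 32789'],
                ' FL 3': ['Miami FL 33101', 'Orlando FL 32801', 'Winter Park FL 32789']}

for overlap, entries in overlap_dict.items():
    # one new column per identified overlap, with activations for the rows whose
    # unique value was recorded in that overlap's entry list
    new_column = 'column_splt_' + overlap.strip()
    df[new_column] = df['column'].isin(entries).astype(int)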

A variant of the splt transform, which we call ‘spl8’, is built on the assumption that the set of unique values in the test set is the same as or a subset of the unique values which were found in the train set. Using this assumption, we don’t need to populate a separate overlap_dict for the test set, and can just copy the overlap_dict from the train set, which saves processing time in postmunge for instance.

The ‘spl2’ transform, as discussed above, shares some similarities with the splt transform, and follows similar methods for the initial population of an overlap_dict. Upon completion, a second type of overlap_dict is populated, which we’ll refer to here as the spl2_overlap_dict, and which sort of trades the places of the keys and value list entries of the overlap_dict, such that instead of having the identified overlaps as the keys, the unique values are set as the keys, with the identified overlaps as the corresponding values. The process as currently implemented populates the overlap_dict first and then runs a loop to perform this conversion; in an alternate configuration this spl2_overlap_dict could be populated directly in this form by some changes to the splt loops described above. A separate spl2_overlap_dict is populated for the test set in a similar manner based on derivation from the test_overlap_dict. Once the spl2_overlap_dicts are populated, the source column is copied, converted to strings, and a replace operation is performed to replace those entries identified as keys in the spl2_overlap_dict with their identified overlaps.
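
A hedged sketch of that key/value inversion and the subsequent replace, with toy data and illustrative names:

import pandas as pd

# toy overlap_dict as populated from the train set
overlap_dict = {' FL 32': ['Orlando FL 32801', 'Winter Park FL 32789'],
                ' FL 3': ['Miami FL 33101']}

# trade the places of keys and value list entries: unique values become the keys,
# identified overlaps become the values
spl2_overlap_dict = {}
for overlap, entries in overlap_dict.items():
    for entry in entries:
        spl2_overlap_dict[entry] = overlap

df = pd.DataFrame({'column': ['Orlando FL 32801', 'Miami FL 33101', 'Lakeland FL 33801']})
# entries not found as keys (here 'Lakeland FL 33801') pass through unchanged
df['column_spl2'] = df['column'].astype(str).replace(spl2_overlap_dict)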

A variant of the spl2 transform, which we call ‘spl9’, is built on the assumption that the set of unique values in the test set is the same as or a subset of the unique values which were found in the train set. Using this assumption, we don’t need to populate a separate overlap_dict for the test set, and can just copy the overlap_dict from the train set, which saves processing time in postmunge for instance. Similarly, we don’t need to populate a separate spl2_overlap_dict for the test set, and can just reuse the version derived from the train set for the replace operation.

Another variant of the spl2 transform is ‘spl5’, which follows the methods of spl2 save for the additional step, after populating the spl2_overlap_dict, of populating a third overlap dict we call the spl5_overlap_dict, which simply serves the purpose of designating unique values that were not included as keys of the spl2_overlap_dict for replacement with 0. The spl5 transform also has a variant we call sp10, which relates to spl5 in the same way that spl9 relates to spl2, in that for sp10 we assume that the unique values in the test set are the same as or a subset of the unique values found in the train set for more efficient application (negating the need to parse string values from the test set).
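
In sketch form, the additional spl5 designation might look something like this (illustrative names and toy data, not the library internals):

# toy spl2_overlap_dict and train set uniques
spl2_overlap_dict = {'Orlando FL 32801': ' FL 32', 'Winter Park FL 32789': ' FL 32'}
unique_list_train = ['Orlando FL 32801', 'Winter Park FL 32789', 'Denver CO 80014']

# unique values without an identified overlap are designated for replacement with 0
spl5_overlap_dict = {value: 0 for value in unique_list_train
                     if value not in spl2_overlap_dict}
# the spl2 replace then runs as before, followed by a replace with spl5_overlap_dict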

Powers of Ten Binning Methods

We have also recently rolled out some new methods for binning numerical sets based on the power of ten associated with each value (e.g. 10⁰ for 1–9, 10¹ for 10–99, etc) with the transform we call ‘pwr2’. In our initial method for binning based on powers of ten, we only addressed positive values and subjected values <=0 to infill. In the new method we allow negative values to populate binned columns by powers of ten as well. We also introduced an ordinal encoded binning by powers of ten. One part of this method that makes it so useful is that the populated columns are based exclusively on those properties found in the train set. So, for instance, if a test set has a numeric value whose power of ten was not identified in the train set, that row will simply not have an activation in the binnings. Or if a power of ten identified from the train set is not present in any of the values from the test set, that column will still be populated, just with all zeros. Similarly, the ordinal encoding of the test set follows the conventions of the train set with similar considerations for consistency in binning designations.

Part of the new methods in pwr2 stems from the copying of the source column for separate treatment of negative values. After the column is copied, the set intended for negative values has values >=0 converted to an infill designation (here nan), and then the remaining values have an absolute value applied in aggregate. The copy of the column intended for evaluation of the positive values is similarly treated such that values <=0 are converted to an infill designation (here nan). The power of ten is identified by a logarithmic transform at base ten with a floor method applied to round any decimal places down to an integer. The sets of unique values from the negative column and the positive column of the train set are used to populate dictionaries pairing keys of the unique integer values with the strings we will use for the one-hot encoded column names, in this case assembled from an aggregation of the original column name, a designation of -10^ or (+)10^ depending on whether it is the negative or positive column, and then the integer serving as this entry’s key. Once the dictionary is assembled for the train set, a comparable dictionary can be assembled for the test set, taking account of consistency in that unique values from the test set not found in the train set are omitted. These dictionaries are used to replace the integer values with their corresponding column names as entries in the cells, which are then combined based on the fact that values in the positive version will correspond to infill rows (nan) in the negative version and vice versa. Once the positive and negative columns are combined, a one-hot encoding is performed to output a set of columns with boolean activations. After one-hot encoding, the set of columns returned for the test set is compared to the set of columns returned for the train set and any missing columns are inserted with all zeros as entries.
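
A simplified sketch of this flow using pandas and numpy, where the column naming convention and intermediate steps are illustrative approximations rather than the library implementation:

import numpy as np
import pandas as pd

df = pd.DataFrame({'column': [3.0, 42.0, -750.0, 0.0, 8200.0]})

# copy the column for separate treatment of negative and positive values,
# marking out-of-scope rows as nan and taking absolute values for the negative copy
negative = df['column'].where(df['column'] < 0).abs()
positive = df['column'].where(df['column'] > 0)

# power of ten via floor of log base ten
negative_powers = np.floor(np.log10(negative))
positive_powers = np.floor(np.log10(positive))

# label each row with a string built from the sign designation and power of ten
labels = positive_powers.map(lambda x: 'column_10^' + str(int(x)), na_action='ignore')
labels = labels.fillna(negative_powers.map(lambda x: 'column_-10^' + str(int(x)), na_action='ignore'))

# one-hot encode, leaving rows without a power of ten designation (here 0.0) all zeros;
# for a test set the returned columns would then be compared against the train set
# columns and any missing columns inserted with all zeros
onehot = pd.get_dummies(labels)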

The processing for the ordinal encoding equivalent ‘por2’ follows similar methods through the combining of the positive and negative columns. Once the columns are combined, an additional dictionary is populated to convert the string entries to ordinal integer encodings, with consistent integers used to encode the corresponding entries from the train and test sets, or otherwise populated with a default infill such as 0, 1, or an adjacent cell for instance.
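
And a minimal sketch of that ordinal step under the same caveats (illustrative names and toy labels):

import pandas as pd

# combined string labels as might result from the positive/negative assembly above
train_labels = pd.Series(['column_10^0', 'column_10^1', 'column_-10^2'])
test_labels = pd.Series(['column_10^1', 'column_10^5', 'column_-10^2'])

# consistent integers derived from the train set entries, reserving 0 for infill
ordinal_dict = {label: i + 1 for i, label in enumerate(sorted(train_labels.unique()))}

train_encoded = train_labels.map(ordinal_dict)
# test set entries not found in the train set fall back to the infill encoding of 0
test_encoded = test_labels.map(ordinal_dict).fillna(0).astype(int)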

Note that the processing of additional test data in the postmunge function is consistent, save for the fact that any data derived from the train set used to process the test set is passed to the postmunge function in the postprocess_dict returned from the application of automunge on the corresponding train set.

Conclusion

The Automunge software was designed as an open ended platform for the application of feature engineering transformations to tabular data. These are just a few examples of recent additions to our library of transformations. A user has the option to defer to automation for evaluation of column data properties in determination of appropriate feature engineering methods, or may alternately assign custom feature engineering methods, or even custom sets of feature engineering methods, to distinct columns, with such sets potentially including generations and branches of derivations. The feature engineering transformations may be sourced from our internal library of transformations, which we are continuing to build out, or may alternately be custom defined by the user, making use of just a few simple data structures, for incorporation into the platform. By making use of the platform for performing the final steps of feature engineering and data preparation prior to the application of machine learning in an external library, a user gets access to a whole host of useful push-button methods, such as segregated processing of training, validation, and test data to prevent data leakage; automated feature importance analysis based on the shuffle permutation method; machine learning derived infill in a fully generalized and automated fashion; automated evaluation of subsequent data for drift in properties relative to the original training data; and even automated preparation of data for oversampling in cases of class imbalance in the labels. In short, we make machine learning easy. More to come.

Albums that were referenced here or otherwise inspired this post:

Achtung Baby — U2


As an Amazon Associate I earn from qualifying purchases.

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

patent pending
