Automunge Complete

It’s Now or Never

Nicholas Teague
Automunge
37 min read · Jun 24, 2019


Technology leadership is not defined by patents, which history has repeatedly shown to be small protection indeed against a determined competitor, but rather by the ability of a company to attract and motivate the world’s most talented engineers.

— Elon Musk, All Our Patents Are Belong To You

You have to watch those epiphanies at Burning Man. They’re not necessarily what you should pursue.

— Elon Musk, Rocket Man

The Beatles — She Came in Through the Bathroom Window

For those who haven't been following along, I've been using this forum in recent months to document the development of automunge, a software tool intended to automate those steps of data preparation (aka wrangling or munging) immediately preceding the application of predictive algorithms to tabular data, whether for purposes of training a machine learning model or the corresponding consistent processing of subsequently available data to generate predictions from that trained model. The project started in a somewhat haphazard fashion as a learning and building experiment, but along the way, through the act of iteration and application, I believe it has made some material improvements to the traditional python-based data-science workflow: a simple interface for applying our library of data transformation functions (which we are continuing to build out) to pandas dataframes, as well as some very simple data structures that can be included with user-defined transformation functions. By applying data wrangling transformation functions within the platform, a user gets access to several simple and extremely useful automated methods, such as feature importance evaluation, dimensionality reduction, prediction of infill for missing or improperly formatted data using machine learning models trained on the rest of the set, and perhaps most importantly the potential for consistent processing of subsequently available data with just the simplest of function calls. The software can thus be considered not just a static tool but a platform for data wrangling. In short, we make machine learning easy.

(image via JetBrains 2018 Python developers survey)

Part of the fun of the development process has been the use of the software's progress as ample fuel for writing projects. I started blogging a few years ago and it has turned into a really fun hobby, and I think it would be fair to say that these two aspects of the project, software development and essay writing, have become mutually reinforcing, meaning supportive in both directions. Targeting a somewhat regular and, to be honest, at times kind of aggressive publishing schedule has forced a corresponding regularity in rolling out functional and tested updates to the software, and somehow the more creative the literary aspects the more robust the software functionality turns out. It's pretty cool how that's worked out. Of course the hope was that this unique and creative approach to communications could translate to some kind of differentiator in establishing a user base and/or readership audience (either would be nice), but well, here we are. It doesn't bother me entirely that this has not really panned out; the way I look at it, I'm at least establishing some track record of contribution, and hey, even if it's not exactly being recognized in real time, that doesn't preclude the possibility of some future user base materializing. The hope is that at a minimum this degree of transparency will establish credibility and build trust, a perhaps necessary deliverable given my established track record of a somewhat unorthodox career path. Anyhoo, presented here is a (detailed) explanation of how automunge works, intended to clarify the workings of the tool for purposes of establishing trust in a user base. To be honest you could probably just scan the diagrams and get most of the value here, but I'm trying to be thorough. Of course if you want the full implementation details you can always visit the python codebase on our GitHub, and the function arguments and parameters are separately documented in the GitHub README.

Man in the rudest state in which he now exists is the most dominant animal that has ever appeared on this earth. He has spread more widely than any other highly organized form: and all others have yielded before him. He manifestly owes this immense superiority to his intellectual faculties, to his social habits, which lead him to aid and defend his fellows…

— Charles Darwin, The Descent of Man

The Beatles — Golden Slumbers

The automunge function (3) has inputs including a "train" tabular data set (1) (intended for use to subsequently train a machine learning model) in pandas dataframe format and, if available, a consistently formatted and labeled "test" data set (2) intended to generate predictions from that model, along with a series of passed parameters. The function relies on Pandas dataframes for data wrangling, with included column labeling for tracking steps of transformation, so labeled columns are currently a prerequisite (a future extension may assign labels automatically if none are in place; haven't got around to that yet). If labels are available for the train or test set they are to be included as an adjoined designated column; note that some of the optional methods require the inclusion of a labels column. A user can also include a designated "ID" column in either of these sets, which will be carved out unedited and consistently partitioned and/or shuffled in the train and validation sets. Full details of the parameters that can be passed to automunge are documented in the README file available on GitHub. The function returns a partitioned version of the train set (24), segregated into train/validation1/validation2 with corresponding labels and ID columns per user specifications. Feature engineering transformations are applied to perform the numerical encoding necessary as a prerequisite for modern machine learning libraries, as well as to facilitate more efficient training or potentially improved model accuracy, with such transformations also available for the labels. These feature engineering methods may be user-specified per column from the automunge library, user defined, or alternatively applied automatically based on properties inferred from the data.
The function also returns consistently processed test data (25) and a dictionary we call the "postprocess_dict" (26) (I know, not exactly an elegant name, whatevs) that can be fed to the postmunge function to consistently process subsequently available data intended to validate or generate predictions from the trained model; more on this below. If elected, the function also returns a dictionary containing results from a feature importance evaluation (27), providing an estimate of the importance to predictions of the source columns as well as the derived columns.
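
To make the shape of this partitioning concrete, here is a minimal pandas sketch of carving out ID and labels columns and splitting off a validation set. The helper name, ratio, and column names are hypothetical illustrations, not automunge's actual implementation:

```python
import pandas as pd
import numpy as np

def split_with_id_carveout(df, labels_column, id_column, valid_ratio=0.25, seed=42):
    """Illustrative sketch (not the automunge implementation): shuffle,
    carve out the ID and labels columns unedited, and partition the
    remainder into train / validation feature sets."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_valid = int(len(shuffled) * valid_ratio)
    valid, train = shuffled.iloc[:n_valid], shuffled.iloc[n_valid:]
    def carve(part):
        ids = part[[id_column]]
        labels = part[[labels_column]]
        features = part.drop(columns=[id_column, labels_column])
        return features, labels, ids
    return carve(train), carve(valid)

df = pd.DataFrame({'ID': range(8), 'x': np.arange(8.0), 'y': [0, 1] * 4})
(train_X, train_y, train_ID), (val_X, val_y, val_ID) = \
    split_with_id_carveout(df, 'y', 'ID')
```

The ID column never enters the feature sets, but stays consistently shuffled alongside them so rows can be traced back after training.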

The postmunge function (29) is somewhat comparable to the automunge function; however, it is intended exclusively for consistent processing of subsequently available test data (28) and requires input of the postprocess_dict (26) returned from the original application of automunge. Note that the postmunge function requires test data formatted consistently (including consistent column labeling conventions) with the original train data (1) that was fed to automunge. The output of the postmunge function is simply a test set (43) derived from the set (28), processed consistently with the original application of automunge, with corresponding labels if a labels column was included.

Before proceeding with this narrative I'll just offer a quick explanation that automunge does not rely on saving any variables to an internal state; everything passed and derived is stored in, and shared between functions via, the postprocess_dict, which is subsequently returned from the function. I find this method simplifies troubleshooting, as any scenario can be recreated by redefining a custom data store, and it certainly ensures that no user lock-in is being attempted for downstream use. One could argue this might be kind of questionable business acumen, but you know what, the whole point here is to build trust. It should also be noted that within the postprocess_dict are stored a few subcategories of data storage dictionaries, including groupings we refer to as the ML_cmnd, transform_dict, process_dict, and column_dict. I'll provide more details on these items further below; for now, if you see any of these items mentioned, just know it is one of the groupings of parameters saved within the postprocess_dict used to pass objects and parameters between functions.
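
The stateless pattern can be illustrated with a toy normalization example, where everything needed for downstream consistency is returned in a plain dict serving as a stand-in for the postprocess_dict (function names here are hypothetical):

```python
def normalize_train(column_values):
    """Sketch of the stateless pattern: derive normalization parameters
    from the train set and return them in a dict (a stand-in for the
    postprocess_dict) alongside the transformed data."""
    mean = sum(column_values) / len(column_values)
    spread = (max(column_values) - min(column_values)) or 1.0
    postprocess_dict = {'mean': mean, 'spread': spread}
    normalized = [(v - mean) / spread for v in column_values]
    return normalized, postprocess_dict

def normalize_test(column_values, postprocess_dict):
    """Consistently process later data using only the returned dict,
    with no reliance on any internal state."""
    m, s = postprocess_dict['mean'], postprocess_dict['spread']
    return [(v - m) / s for v in column_values]

train_out, pp = normalize_train([0.0, 5.0, 10.0])
test_out = normalize_test([5.0], pp)   # reproducible from the dict alone
```

Because the dict is an ordinary returned object, any scenario can be recreated later simply by re-passing it, which is exactly the troubleshooting property described above.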

Within the automunge function (3), if elected, the feature importance evaluation (4) is conducted first for the train data and corresponding labels (1) passed to automunge, noting that the inclusion of a labels column is necessary to conduct it. Within the feature importance method (4), the train data is first fed to the automunge function (4a) with some feature-importance-specific parameters such as to process the data for predictive algorithms (note that within this internal automunge call, feature importance evaluation and PCA dimensionality reduction are turned off), returning processed train, validation, and corresponding labels sets. The returned train and label sets are used to train a predictive model (4b). The model training takes into account the type of labels passed, using category-specific label features accessed in the process_dict (4bi). The model is initialized (4bii), noting that the user has the ability to pass parameters to the model via the ML_cmnd parameters passed to the original application of automunge (4), and quite simply the model is trained (4biii). The accuracy of that model is evaluated (4c) using the validation sets from (4a), with that measured accuracy saved as "accuracy1", later used for our evaluation metrics. Once the base model is trained, for each column in the processed train set returned from (4a) a pair of column-specific accuracy metrics are derived. For the first metric, dubbed "metric", a new validation set is derived (4e) by first using the column's entry in the column_dict to access a list of columns derived from the same source column (4ei), which are each consistently shuffled in the validation set with the same random seeding (4eii) to return a revised validation "shuffle set" (4e). That shuffle set is used with the labels set from (4a) to derive an "accuracy2" (4f) using the same model trained in (4b).
The shuffle and evaluation steps are repeated with a slight variation to derive the second metric, "metric2", used to evaluate the relative feature importance between columns derived from the same source column. The shuffle set for this second metric is derived (4g) by accessing again the list of columns derived from the same source column in the column_dict (4gi) and, while leaving the current column unshuffled, shuffling the other columns that were derived from the same source column (4gii). This set is used to measure an "accuracy3" (4h) by passing the second shuffle set and the labels returned from (4a) to the model trained in (4b). The returned metrics are derived (4i) as metric = accuracy1 - accuracy2 and metric2 = accuracy1 - accuracy3, and are populated in the Feature Importance Results (27) returned from the automunge function. Note that larger "metric" values imply greater predictive significance of the source column, and smaller "metric2" values imply greater relative predictive importance among the columns derived from the same source column. As a final step of feature importance evaluation, for cases where a user passed parameters to the automunge function associated with use of feature importance evaluation for dimensionality reduction (either a percent of columns to return or alternatively a value of the feature importance metric "metric" used as a threshold), a list of columns is assembled (4j) which are to be preserved from the trimming operation (18) performed later in the automunge pipeline.
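
As a rough illustration of the shuffle-based metric, here is a sketch simplified to shuffling a single column at a time (rather than automunge's grouping of all columns derived from a source column), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sketch of the shuffle-based metric described above:
# metric = accuracy1 - accuracy2, where accuracy2 is measured after
# shuffling one validation column (a simplification of the real method).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)          # only column 0 is predictive
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
accuracy1 = model.score(X_va, y_va)    # base validation accuracy

def shuffle_metric(col):
    shuffled = X_va.copy()
    shuffled[:, col] = np.random.default_rng(1).permutation(shuffled[:, col])
    accuracy2 = model.score(shuffled, y_va)
    return accuracy1 - accuracy2

metric_signal = shuffle_metric(0)      # predictive column: large metric
metric_noise = shuffle_metric(1)       # noise column: metric near zero
```

Shuffling a column breaks its relationship with the labels while preserving its distribution, so the accuracy drop isolates that column's contribution to the predictions.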

Following the optional feature importance evaluation, automunge next performs some validations (5) to ensure that the passed train and test sets (1 & 2) have consistent properties, such as number of columns and consistent column labels (although note it is allowable to pass a labels and/or ID column to only the train set, for instance). The data sets are then segregated between train or test data, labels, and ID columns (6).

Automunge next begins a for loop through each column (7), noting that this loop addresses both the train set and, if included, the test set simultaneously. The column is checked for cells that will need to be subject to infill (8), with corresponding boolean identifiers of these rows saved into a separate dataframe called NArows. The column is then evaluated based on properties of the data to assign a category for automated feature engineering methods (9). For example, a column whose most common data type is numerical is a candidate for numerical processing methods, a numerical column with all positive values may be a candidate for a power law transform, a categorical column may be a candidate for one-hot encoding, and a column whose most common data type is date-time will be a candidate for time series methods. The function follows a series of logic steps to evaluate the column, and as currently implemented makes use of the python "collections" library for finding the most common data type, as well as the "datetime" library. A future extension of this method will likely extend evaluations to more fine-grained categories, such as numerical sets following different distributions or datetime data with or without time zone information. Some methods will likely also be developed to recognize text, such as to extract language properties like sentiment or names of people or places, all of which would facilitate more fine-grained approaches to feature engineering under automation.
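
A toy sketch of this kind of category inference, using the collections and datetime libraries mentioned above (the logic and the category labels are illustrative simplifications, not automunge's actual evalcategory):

```python
import collections
import datetime

def evalcategory(values):
    """Illustrative sketch of inferring a processing category from data
    properties (hypothetical simplification of automunge's evalcategory)."""
    types = collections.Counter(type(v).__name__ for v in values)
    most_common = types.most_common(1)[0][0]
    if most_common in ('int', 'float'):
        # all-positive numerical sets are power law transform candidates
        if all(v > 0 for v in values if isinstance(v, (int, float))):
            return 'pwrs'
        return 'nmbr'
    if most_common == 'datetime':
        return 'date'                  # time series candidate
    return 'text'                      # one-hot encoding candidate

cat1 = evalcategory([1.2, 3.4, 5.6])                    # all-positive numerical
cat2 = evalcategory([-1, 2, 3])                         # general numerical
cat3 = evalcategory(['a', 'b', 'a'])                    # categorical
cat4 = evalcategory([datetime.datetime(2019, 6, 24)])   # date-time
```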

The processing functions (10) are the means for applying a series of feature engineering transformations to each column. They simultaneously process the train data and test data (or, if no test data was passed, a small dummy test set created for this purpose), and the steps of processing include both the feature engineering methods applied to the dataframes and the maintenance of the corresponding entries in the postprocess_dict, which captures everything required for later consistent processing in the postmunge function (29). As a key for the methods applied they use a category, traditionally represented by a four-character string. Each category has a corresponding entry in the "transform_dict" capturing a "family tree" for the steps of processing to be applied, as well as a corresponding entry in the "process_dict" for passing the defined transformation function along with some properties associated with each category for purposes of predictive algorithms or infill. Note that a user can pass custom family trees for steps and branches of processing functions through the automunge parameters; more on that is discussed in some of our recent essays. As the first step of processing (10a), the processing functions check if a user assigned a specific category to the column, and if not, the category derived from the evalcategory function (9) is assigned. Also needed as input for the processing functions are the identifiers of rows subject to infill (10b) which were assembled in (8).

The family tree primitives are a means of identifying the steps for a "tree" of transformations. Each category, which can either be inferred from (9) or assigned in (10a), serves as a "rootkey" entry in the transform_dict, and for that rootkey each primitive may be assigned a list of category entries. Each primitive category entry has a corresponding process_dict entry which carries the associated transformation functions.

The process_dict entries have three different types of processing functions that can be assigned. A "dualprocess" function is intended for simultaneous processing of the train and test sets, for cases where one needs to pass normalization parameters derived from the train set to the test set. A dualprocess entry will always need a corresponding "postprocess" entry intended for processing just the test data, using the normalization parameters (from the original dualprocess application on the train set) stored in the postprocess_dict. If no normalization parameters are needed from the train set to process the test set, one can instead assign "None" to the dualprocess and postprocess entries and assign a "singleprocess" function, which will be applied to the train and test sets one at a time. For further detail of the entries in the process_dict, as well as the requirements for user-defined transformation functions, the composition and data structures are documented in the README available on GitHub and won't be further documented here in the interest of literary composition.
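A minimal sketch of what transform_dict and process_dict entries might look like, with a dualprocess/postprocess pair implementing a z-score normalization (entries and field names are simplified from the actual data structures documented in the README):

```python
# Hypothetical simplified entries, not automunge's actual data structures.
def dualprocess_zscore(train_col, test_col):
    """Derive mean/std from the train set, apply to both sets, and return
    the normalization parameters for later use in postmunge."""
    mean = sum(train_col) / len(train_col)
    std = (sum((v - mean) ** 2 for v in train_col) / len(train_col)) ** 0.5 or 1.0
    params = {'mean': mean, 'std': std}
    apply = lambda col: [(v - mean) / std for v in col]
    return apply(train_col), apply(test_col), params

def postprocess_zscore(test_col, params):
    """Process subsequently available test data using stored parameters."""
    return [(v - params['mean']) / params['std'] for v in test_col]

transform_dict = {'nmbr': {'parents': [], 'siblings': [],
                           'auntsuncles': ['nmbr'], 'cousins': ['NArw']}}
process_dict = {'nmbr': {'dualprocess': dualprocess_zscore,
                         'postprocess': postprocess_zscore,
                         'singleprocess': None}}

train_out, test_out, params = \
    process_dict['nmbr']['dualprocess']([0.0, 2.0, 4.0], [2.0])
later_out = process_dict['nmbr']['postprocess']([6.0], params)
```

The key point is that the postprocess entry never sees the train set; it relies entirely on the parameters returned by the dualprocess application.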

Through his powers of intellect, articulate language has been evolved. As Mr. Chauncey Wright remarks “a psychological analysis of the faculty of language shows, that even the smallest proficiency in it might require more brain power than the greatest proficiency in any other direction.”

— Charles Darwin, The Descent of Man

The Beatles — Carry That Weight

The processing functions are broken into three categories: process ancestors (10c) is intended for the great-grandparents and grandparents primitive categories, which are only accessed within the first generation of application. Process family (10d) is next called for processing the parents, siblings, auntsuncles, and cousins primitive categories. Finally, the circle of life function (10e) is called for trimming columns marked for deletion in (10c, 10d). Note that each of these operations requires maintenance of the associated column_dict data structures stored in the postprocess_dict, which is a small source of complexity in implementation.

In further detail, the process ancestors function (10c) takes the column category from (10a) and uses it to access an entry in the transform_dict for family tree primitives (10ci). For the greatgrandparents primitives it calls the process parent function (10cii), and for the grandparents primitives it calls the process cousin function (10ciii). (Forgive the naming convention for the (10cii) and (10ciii) process parent and process cousin functions; this is just intended to illustrate that the processing mechanisms are comparable to those of the parents or cousins primitives, with the difference being that, referring to the definition of primitives table shown above, the applied-to-generation actions (first vs all) and column action (supplement vs replace) take place outside of the process parent / process cousin functions, in this case within the process ancestors function (10c).) Similarly, the process family function (10d) uses the column category from (10a) to access an entry in the transform_dict for family tree primitives (10di). It processes the cousins primitives using the process cousin function (10dii), the siblings primitives using the process parent function (10diii), the auntsuncles primitives using the process cousin function (10div), and the parents primitives using the process parent function (10dv). The column action (supplement vs replace) step takes place in (10dvi), in which columns with primitives subject to replacement (parents and auntsuncles) are marked for deletion. That deletion takes place in the circle of life function (10e), which requires the maintenance of the column_dict data stores saved in the postprocess_dict (10ei), followed by the simple deletion of the column (10eii).

Thus each upstream primitive is processed, within either process ancestors (10c) or process family (10d), with either the process cousin function (10ciii, 10dii, 10div) or the process parent function (10cii, 10diii, 10dv). Expanding on these two approaches: in the process cousin function we use the passed primitive as a key to access the processing functions saved in the process_dict (a). As discussed earlier, some of the categories (those requiring normalization parameters to be derived from the train set for application to the test set) will be assigned two separate processing functions: a dualprocess function for the simultaneous processing of the train and test sets, coupled with a postprocess function for the subsequent processing of the test set in the postmunge function (29) based on normalization parameters saved in the postprocess_dict during the application of the dualprocess. Alternatively, a category may be assigned just a singleprocess function, which can be applied independently to either the train or test sets without the use of normalization parameters derived from the train set. The next step in the process cousin function is to apply the processing functions (b), either applying a dualprocess function simultaneously to the train and test sets or applying a singleprocess function separately to each of the train and test sets, based on the entry in the process_dict. Once the processing functions are applied, the associated column_dict entries in the postprocess_dict are updated correspondingly (c).

The process parent function (10cii, 10diii, 10dv) starts similarly to the process cousin function, in that we use the passed primitive as a key to access the processing functions saved in the process_dict (a), for which either the dualprocess or singleprocess function (depending on the process_dict entry for that category) is applied (b). However, primitives for application of the process parent function are by definition subject to downstream offspring. Those downstream offspring are any niecesnephews, children, coworkers, and friends primitives whose categories are stored in the transform_dict (c). Referring back to the definition of family tree primitives table, note that niecesnephews are treated as downstream siblings, children as parents, coworkers as auntsuncles, and friends as cousins, by which I'm referring to their treatment for column action (replace/supplement) and potential for offspring of their own. Coworkers and friends are processed with the process cousin function (d, e), and niecesnephews and children are processed with the process parent function recursively (f, g), followed by marking the columns subject to replacement for deletion (h).
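
The recursion through family tree primitives can be sketched as follows, recording only the order in which categories are visited (the category names and tree contents here are illustrative, not automunge's shipped trees):

```python
# Hypothetical family trees: 'root' applies a parent transform 'logn'
# which in turn carries a downstream friend 'nmbr'; 'NArw' is a cousin.
transform_dict = {
    'root': {'parents': ['logn'], 'cousins': ['NArw']},
    'logn': {'children': [], 'niecesnephews': [],
             'coworkers': [], 'friends': ['nmbr']},
    'nmbr': {},
}

def process_cousin(category, applied):
    applied.append(category)                  # apply transform, no offspring

def process_parent(category, applied):
    applied.append(category)                  # apply this transform
    tree = transform_dict.get(category, {})
    for friend in tree.get('friends', []):    # downstream cousins
        process_cousin(friend, applied)
    for child in tree.get('children', []):    # downstream parents, recursive
        process_parent(child, applied)

applied = []
root = transform_dict['root']
for cousin in root.get('cousins', []):
    process_cousin(cousin, applied)
for parent in root.get('parents', []):
    process_parent(parent, applied)
```

Parents recurse into their offspring primitives while cousins terminate the branch, which is what allows arbitrarily deep trees of transformations from a single rootkey.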

After the process functions are complete for a source column the associated entries are stored in the postprocess_dict (11) such as to support infill methods and to facilitate consistent processing of subsequently available test data in postmunge.

Having completed the for loop through columns for processing (7), the next for loop through columns is for purposes of addressing infill for missing or improperly formatted values (12) which were identified in (8), noting that some initial infill is applied within the process functions (10) such as to enable derivation of normalization parameters. Default infill is ML infill if the ML infill parameter to automunge is activated, and otherwise defaults to the mean for numerical values, the most common value for binary values, or a boolean identifier for multi-set categorical values. A user may also assign distinct infill methods to each column by passing column names via the assigninfill object passed to automunge, with current options including default, ML infill, zeroinfill (plugging infill with the value 0), and adjinfill (plugging infill with the value from an adjacent cell). For cases where ML infill is applied (13), a few support functions are called (14–17).
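
The non-ML defaults can be sketched in pandas (a simplification; the dataframe and column names are hypothetical):

```python
import pandas as pd
import numpy as np

# Sketch of the default infill conventions described above: mean for a
# numerical column, most common value for a binary column, with the
# NArows boolean identifiers captured before plugging the values.
df = pd.DataFrame({'num': [1.0, np.nan, 3.0, 4.0],
                   'bin': [0.0, 1.0, 1.0, np.nan]})

narows = df.isna()                              # infill targets, as in (8)
df['num'] = df['num'].fillna(df['num'].mean())  # mean infill
df['bin'] = df['bin'].fillna(df['bin'].mode()[0])  # most common value
```

Retaining the NArows identifiers is what later allows ML infill to replace exactly these plugged cells with predicted values.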

ML infill refers to the derivation of infill using predictive models trained for each column based on properties of the train set, which are used to predict infill for both the train and test sets. The createMLinfillsets function (14) uses as input the full (post-transform) datasets originating from (1, 2) as well as information about the category and which columns were derived from the process functions (10), information available from the postprocess_dict. The createMLinfillsets function generates a series of data subsets used as input to the predictinfill function (15). Those data sets are derived based on the missing rows designated in (8) and include training data to train a model to predict infill, the labels for training that model, and the features that will be applied to the trained model to predict infill for the train and test sets. The training data and labels originate from the train set and are derived by removing the rows corresponding to missing cells identified in (8), with all columns originating from the current column's source column stripped from the training data and only the current column left in the labels set (noting that the labels column may still be a multi-column set, such as if the labels are one-hot encoded). The type of labels (such as the distinction between numerical, single-column categorical, or multi-column categorical) is identified by the "MLinfilltype" entry in the process_dict (14a) for the current column's category, which was identified in (10a). The final of these sets, the features that will be applied to the trained model to predict infill, is generated for both the sets originating from the train and test sets. To avoid "data leakage", the training set associated with a specific post-transform column excludes any additional columns that originated from the same source column.

The preparation of these sets in (14) starts with access of the MLinfilltype entry in the process_dict (14a) for the current column's category, which was identified in (10a). Next, the boolean identifiers of rows subject to infill that were collected for the train and test sets (8) are concatenated onto the current train and test sets (14b), which we'll refer to as train set b and test set b. Train set b is copied as train set c, followed by deletion of columns originating from the same source column and deletion of rows with an activated identifier for NArow (subject to infill) (14c); train set c serves as the training set for ML infill. Train set b is copied to train set d, with rows deleted that have an activated identifier for NArow and all columns deleted except the current column; this serves as the labels for ML infill (14d). Train set b is copied as train set e, with columns deleted originating from the same source column and rows deleted without an activated NArow; train set e serves as the features set for predicting train set infill (14e). Test set b is copied as test set c, with columns deleted originating from the same source column and rows deleted without an activated NArow; test set c serves as the features set for predicting test set infill (14f). Finally, the included NArows column is deleted from each of these returned sets (14g).
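
A pandas sketch of this set construction for a single target column with no sibling columns to strip (a simplification of createMLinfillsets; the dataframe is hypothetical):

```python
import pandas as pd
import numpy as np

# Rows flagged for infill become the feature set to predict; the
# remaining rows become the training data and labels for the infill model.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [10.0, np.nan, 30.0, 40.0]})
narows = df['b'].isna()                 # boolean identifiers from (8)

train_c = df.loc[~narows].drop(columns=['b'])    # training features (14c)
labels_d = df.loc[~narows, 'b']                  # training labels (14d)
features_e = df.loc[narows].drop(columns=['b'])  # rows needing infill (14e)
```

Dropping the target column (and in the full implementation, all of its sibling columns) from the feature sets is what prevents the data leakage noted above.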

With these sets as input, the predictinfill function (15) then applies a machine learning model to predict suitable infill for the missing values in each column, based on a model trained from the rest of the train set. The implementation starts with access of the MLinfilltype entry in the process_dict (15a) for the current column's category, which was identified in (10a). The model is initialized (15b), noting that the user has the ability to pass parameters to the model via the ML_cmnd parameters passed to the original application of automunge, and quite simply the model is trained (15c) using the returned sets from (14c, 14d). This trained model is used to predict infill for the train and test sets (15d) using the returned feature sets from (14e, 14f). For numerical label data a regression model is applied, and for categorically encoded data a classifier model is applied. The same function can be applied using various machine learning architectures: for numerical data, linear regression, passive aggressive regressor, ridge regression, ridgeCV regression, support vector regression, random forest regression, or regression via gradient boosting; for categorically encoded data, logistic regression, stochastic gradient descent classifier, support vector classifier, random forest classifier, or classification via gradient boosting. Each of these architectures may require different considerations for hyperparameters. The predictinfill function also performs derivations from properties of the dataset such as to adjust hyperparameters for the associated architecture. For example, properties considering the relationship between the number of features and the number of rows may be used to adjust the architecture applied or the associated hyperparameters. For cases where training time becomes a performance issue due to the scale of the train data set from item 1, the function may select only a subset of rows for purposes of training the infill models.
As currently implemented the function makes use of Random Forest methods from the Scikit-learn library; a future extension could also easily incorporate automated ML frameworks such as to make use of automated hyperparameter tuning, architecture selection, or ensemble methods.
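
A minimal sketch of the predictinfill pattern with scikit-learn Random Forests, switching between regressor and classifier by label type (the function signature here is illustrative, not automunge's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def predictinfill(train_X, train_y, infill_X, numerical=True, seed=0):
    """Sketch: train a regression model for numerical labels or a
    classifier for categorical labels, then predict the infill values."""
    cls = RandomForestRegressor if numerical else RandomForestClassifier
    model = cls(random_state=seed).fit(train_X, train_y)
    return model.predict(infill_X), model

# Toy numerical example: predict the missing value at feature 1.5.
train_X = np.array([[0.0], [1.0], [2.0], [3.0]])
train_y = np.array([0.0, 10.0, 20.0, 30.0])
infill, model = predictinfill(train_X, train_y, np.array([[1.5]]))
```

The returned model would then be saved per column, as in (17), so that postmunge can reuse it for subsequently available test data.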

The output from the predictinfill function (15) is a set of infill corresponding to each row that originated from either the train or test sets (1, 2) with missing or improperly formatted data identified in (8). This infill serves as input to the insertinfill function (16) (which is applied to the train and test sets separately, so we'll refer to either set in this paragraph as the "target set"), along with other data about the column, such as the train or test infill set generated from (15d) and the column data saved in the postprocess_dict (11). The insertinfill function then incorporates that infill set into each of the corresponding rows in the current column, replacing the infill that was previously incorporated as part of the process functions (10). The implementation of the insertinfill function (16) starts with access of the MLinfilltype entry in the process_dict (16a) for the current column's category, which was identified in (10a).

If the MLinfilltype entry is a single-column set (16b), a new index value column is assigned to the target set (16c), and the target set is then concatenated with the NArow column identified in (8) (16d). A list of index numbers from (16c) is generated for those rows corresponding to an activated NArow column, which we'll call the infill index list (16e). A dictionary is assembled matching the infill index list (16e) with the infill values for the target set from (15d) (16f). The values in the index column from (16c) matching the keys in the dictionary from (16f) are replaced with the corresponding infill value from that dictionary (16g). The values from the index column are pasted over the current column for rows with an activated NArows column (16h). The support columns, such as the index value column and NArows column, are then deleted (16i).

If the MLinfilltype entry is a multi-column categorical set (16j), then for each column in the multi-column set (16k), a new index value column is assigned to the target set (16l), and the target set is then concatenated with the NArow column identified in (8) (16m). A list of index numbers from (16l) is generated for those rows corresponding to an activated NArow column, which we'll call the infill index list (16n). The index numbers from (16n) are concatenated as a column onto the infill set from (15d) (16o). A mask is applied to the infill set from (16o) to be active only in rows corresponding to infill for the current column from (16k) (16p). Using the masked column from (16p) as a key, an infill of 1 is inserted into the column from (16k) (16q). The support columns, such as the index value column and NArows column, are then deleted (16r).
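
For the single-column case, the insertion mechanics reduce, in pandas terms, to an index-aligned overwrite of only the flagged rows (a simplification of insertinfill; values and column names are hypothetical):

```python
import pandas as pd

# Overwrite only the rows flagged in NArows with the predicted infill,
# leaving all other rows of the target column untouched.
target = pd.DataFrame({'col': [1.0, -1.0, 3.0, -1.0]})  # -1.0 marks prior plug values
narows = pd.Series([False, True, False, True])
infill = pd.Series([2.0, 4.0])          # predictions from (15d), in row order

infill_index = target.index[narows]     # the infill index list
target.loc[infill_index, 'col'] = infill.to_numpy()
```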

The trained model from the predictinfill function (15) is saved for each column in the postprocess_dict (17), which is the final step in the Automunge infill operation.

Following the completion of the processing and infill functions, the next step in automunge is the application of optional methods for dimensionality reduction of the returned sets. Feature importance evaluation can be used for dimensionality reduction when a user passes the associated parameters to the automunge function, which specify either a percent of columns to return or alternatively a threshold value of the feature importance “metric” below which columns are trimmed. The identification of columns to be trimmed was performed in (4j); step (18) is simply the application of this trimming operation via the circle of life function (10e), so as to maintain the associated data structures for the removed columns.
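The trimming operation itself amounts to dropping columns whose importance metric falls below the threshold. Here's a minimal sketch of that idea (the helper name and importance values are hypothetical; the real version also updates the tracking data structures for the removed columns):

```python
import pandas as pd

def trim_by_importance(df, importances, metric_threshold):
    # identify columns whose feature importance metric falls below threshold
    trim = [col for col, imp in importances.items() if imp < metric_threshold]
    return df.drop(columns=trim)

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
importances = {'a': 0.70, 'b': 0.05, 'c': 0.25}
df = trim_by_importance(df, importances, metric_threshold=0.10)
# only columns 'a' and 'c' are retained
```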

The next optional dimensionality reduction method applied is via PCA (Principal Component Analysis) (19), which is currently implemented via the Scikit-learn library and is applied to the train and/or test sets simultaneously. The user-specified inputs for the PCA methods are passed as parameters to the automunge call, including those passed via the ML_cmnd object as well as PCAn_components (specifying the n_components argument to pass to Scikit-learn’s PCA implementation) and PCAexcl (a list specifying columns to exclude from the PCA dimensionality reduction). The evalPCA function (19a) evaluates properties of the train set to assign a default type of PCA for the dataset (such as PCA, Sparse PCA, or Kernel PCA). For example, Kernel PCA requires an all non-negative set and Sparse PCA is more memory efficient than regular PCA, although Scikit-learn’s PCA is the only option when the n_components value passed is a float between 0 and 1. The evalPCA function (19b) also uses heuristics, such as the ratio of the number of features to the number of observations, to evaluate whether to apply PCA automatically for cases where n_components is not specified by the user. Note that automated PCA application can be turned off in the automunge call via the ML_cmnd object. If PCA is to be applied (whether due to the automated option from (19b) or due to the PCAn_components parameter passed to automunge), the next step is to initialize the default parameters for the PCA model with the function populatePCAdefaults (19c). The initialization of the PCA model is performed in (19d), with user-passed parameters from the PCA_cmnd portion of the ML_cmnd object given priority over the default parameters from (19c). The PCA model is fit to the train set (19e) (noting that PCA is unsupervised learning, so no labels are required), and the model is saved in the postprocess_dict.
That model is then used to transform the train and test sets to a reduced number of columns with a new naming convention. Note that this process of fitting a model and using it to transform the train and test sets takes place after segregating the columns passed to automunge under PCAexcl, and once the PCA transformation has been applied those columns are reattached. The transformed columns from a PCA application are assigned new column names following the convention ‘PCA#’. The PCA application returns a processed version of the train and test sets (19g, 19h) as well as an updated postprocess_dict (19i).
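The overall shape of this operation can be sketched with Scikit-learn directly. This is a simplified stand-in (function name and data are hypothetical): the actual implementation also selects between PCA variants and saves the fit model in the postprocess_dict for later use in postmunge.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def apply_pca(train, test, n_components, pca_excl):
    # segregate the excluded columns, fit PCA only on the train set
    fit_cols = [c for c in train.columns if c not in pca_excl]
    pca = PCA(n_components=n_components)
    train_pca = pca.fit_transform(train[fit_cols])  # fit on train only
    test_pca = pca.transform(test[fit_cols])        # consistent transform of test
    # new column naming convention 'PCA#', then reattach excluded columns
    names = ['PCA' + str(i) for i in range(n_components)]
    train_out = pd.concat(
        [pd.DataFrame(train_pca, columns=names, index=train.index),
         train[pca_excl]], axis=1)
    test_out = pd.concat(
        [pd.DataFrame(test_pca, columns=names, index=test.index),
         test[pca_excl]], axis=1)
    return train_out, test_out, pca

rng = np.random.default_rng(0)
train = pd.DataFrame(rng.random((10, 4)), columns=['a', 'b', 'c', 'd'])
test = pd.DataFrame(rng.random((5, 4)), columns=['a', 'b', 'c', 'd'])
train_out, test_out, model = apply_pca(train, test, n_components=2, pca_excl=['d'])
# train_out columns: ['PCA0', 'PCA1', 'd']
```

Fitting on the train set alone and merely transforming the test set is the same train/test discipline applied throughout the library.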

The labels column (which was optionally included as a designated column in the train set and optionally also in the test set) was segregated from the train and test sets in (6). The processing of the labels (20) via feature engineering transformations is performed very similarly to the processing steps (8–11). Categories for processing can be assigned in the assigncat object passed to automunge, or alternatively a user can defer to the default methods based on the inferred properties of the data from the evalcategory function, as in (9). Note that as currently implemented the default label processing for numerical columns is simply to leave numerical data unchanged; a user can assign normalization methods, for instance, by passing a different category in assigncat. Also note that to support the reverse encoding of labels, such as from boolean designators back to the original category values, an object called “labelsencoding_dict” is returned from the automunge application. Note that automunge does not perform infill to label columns; instead it defaults to deleting any rows in the train or test sets corresponding to missing values in the labels column.
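To make the reverse encoding concrete, here's a toy sketch of recovering original category values from one-hot encoded label predictions. The column names and the mapping dictionary layout here are purely illustrative, not the actual labelsencoding_dict schema:

```python
import pandas as pd

# one-hot encoded label predictions returned from a trained model
onehot_labels = pd.DataFrame({'label_cat': [1, 0, 0],
                              'label_dog': [0, 1, 1]})

# mapping from encoded column name back to the original category value
encoding = {'label_cat': 'cat', 'label_dog': 'dog'}

# take the activated column per row, then map back to the original value
recovered = onehot_labels.idxmax(axis=1).map(encoding)
# recovered holds ['cat', 'dog', 'dog']
```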

Hat tip fast.ai lectures for introducing me to the class imbalance concept

Automunge has an optional method for oversampling in cases of frequency imbalance in the label categories via the LabelFrequencyLevelizer function (21). Note that this method is also available for numerical label sets if the labels processing includes returned standard deviation bins (“bins”) as part of the processing family tree. The method first accesses the “labelctgy” entry in the process_dict based on the category of label either assigned or inferred through the labels processing (20), which entry identifies, for instance, whether the label has a single numerical target, a single categorical target, or a multi-column categorical target set (21a). The count for each label class is collected (21b), again noting that in the case of numerical labels with standard deviation bins those counts will be for the bins. A label multiplier is derived for each class based on the ratio of the max class count to the count for each class (21c). A for loop is then run through each label class (21d), in which a set is carved out from the train set, ID set, and labels set corresponding to rows with that class present (21e); that set is then concatenated onto the train set and labels x number of times (21f), where x is the multiplier from (21c). The thus-expanded train set and correspondingly expanded labels and ID sets are then returned from the function, such that when they are used to train a machine learning model the training operation will oversample those labels with lower frequency in the original set.
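The levelizing logic can be sketched as follows. This is a hypothetical simplification (the real version also carries the ID sets and supports standard deviation bins for numerical labels):

```python
import pandas as pd

def levelize_labels(train, labels):
    # count each label class; the most common class sets the target frequency
    counts = labels.value_counts()
    max_count = counts.max()
    pieces_x, pieces_y = [train], [labels]
    for label_class, count in counts.items():
        # extra copies needed to bring this class up to roughly max frequency
        multiplier = max_count // count - 1
        mask = labels == label_class
        for _ in range(multiplier):
            pieces_x.append(train[mask])
            pieces_y.append(labels[mask])
    return pd.concat(pieces_x), pd.concat(pieces_y)

train = pd.DataFrame({'x': [1, 2, 3, 4]})
labels = pd.Series(['a', 'a', 'a', 'b'])
train_lev, labels_lev = levelize_labels(train, labels)
# class 'b' is now repeated so counts are balanced at 3 and 3
```

Training on the expanded set then naturally oversamples the originally under-represented classes.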

The automunge function then saves a few more global parameters (such as those parameters from the original function call) to the postprocess_dict (22) for later use in the postmunge function. The postmunge function (29), which serves the purpose of consistently processing subsequently available test data as per the application of automunge from which a postprocess_dict is derived, is also used within automunge such as to consistently process the validation sets (23). Ensuring that none of the validation data is used in the derivation of the normalization parameters derived from the train set is a protection against data leakage.

The resulting output from the automunge function (3) thus includes the following four items: the consistently processed train and validation sets (24) derived from (1) (consisting of train, labels, ID, validation1, validation1labels, validation1ID, validation2, validation2labels, and validation2ID based on user partitioning of validation ratios), the consistently processed test set (25) if one was passed to automunge as (2), the postprocess_dict (26) containing all of the tracked data necessary to consistently process subsequently available data in the postmunge function (29), and the feature importance results (27) derived in (4). The returned train and validation sets (24) are intended for use to train, tune, and validate a machine learning model, and the returned test set (25) is intended to generate predictions from that same trained machine learning model. The returned sets are suitable for direct application to a machine learning model in the framework of a user’s choice.

The postmunge function (29), which serves the purpose of consistently processing subsequently available test data as per the application of automunge from which a postprocess_dict was derived, has inputs including the postprocess_dict object (26) returned from automunge (3) and a new “test” tabular data set formatted consistently, with the same column labels, as the train set (1) originally passed to automunge (3). This test set may optionally include a designated label column consistently formatted as the label column passed with the train set (1) to automunge and/or a designated ID column. The steps of postmunge won’t be discussed here in as much detail, as they bear much similarity to the steps of automunge. However, a key difference is that throughout the postmunge application the methods are all based on accessing details such as normalization parameters and passed Automunge parameters from the passed postprocess_dict object (26). Walking through postmunge from a high level standpoint, first some validation steps are performed on the test data (30) to ensure consistency with the original train data passed to automunge (comparable to (5)); then a for loop is run through each column (31) in which we identify a key for accessing that column’s parameters from the postprocess_dict (32) (comparable to (7)), identify rows with missing cells which will be subject to infill (33) (comparable to (8)), and then apply the process functions (34) (comparable to (10)). Note that the application of the processing functions uses different functions for singular processing of the test data than the functions used in automunge, which processed both the train and test sets together; more specifically, when a processing function is accessed in the process_dict it will use either a postprocess or singleprocess entry (in comparison to the automunge (3) applications in (10), which make use of either dualprocess or singleprocess entries in the process_dict).
Also, I just want to quickly apologize for the admittedly somewhat confusing use of the terms postprocess_dict and process_dict — unfortunately the two objects do not bear the relation that one might infer from the naming convention. I guess this is an example of technical debt; just be aware that this could be an easy point of confusion. After processing, a for loop is run through the columns to apply infill (35) (comparable to (12)), and if the infill method is ML infill (36) (like (13)), we create the infill sets (37) (like (14)), predict infill (38) (like (15)), and insert infill (39) (like (16)). If dimensionality reduction was elected in automunge via either feature importance (18) or PCA (19), then it is consistently performed in postmunge (40, 41). If any designated labels were included in the test set (28), they are consistently processed (42) (like (20)). The postmunge function then returns a test set (43) consistently processed as per the train data returned from automunge (24).
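The central principle of postmunge — that transformations of new data reuse the parameters derived from the original train set — can be illustrated with a toy z-score normalization. The dict layout here is illustrative, not the real postprocess_dict schema:

```python
import pandas as pd

# at "automunge time": derive normalization parameters from the train set only
train = pd.Series([1.0, 2.0, 3.0, 4.0])
saved_params = {'mean': train.mean(), 'std': train.std()}  # stored for later

# at "postmunge time": apply the saved train-set statistics to new data
test = pd.Series([2.0, 5.0])
test_normalized = (test - saved_params['mean']) / saved_params['std']
# the new data is scaled with the train set's statistics, avoiding data leakage
```

Nothing about the new data influences the parameters, which is exactly the data leakage protection noted above for the validation sets.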

He has made rafts or canoes for fishing or crossing over to neighboring fertile islands. He has discovered the art of making fire, by which hard and stringy roots can be rendered digestible, and poisonous roots or herbs innocuous. This discovery of fire, probably the greatest ever made by man, excepting language…

— Charles Darwin, The Descent of Man

The Beatles — The End

A funny thing is that I started this essay with the intent to perform a strategic analysis of my firm, you know with respect to competitive environment, heck could have done like a Porter Five Forces analysis or something. But the truth of it is Automunge has never really gotten to the point where that could be more than fiction. We have no paying customers, no real user base to speak of, and heck no real team. It’s a one man shop. In my capacity as an entrepreneur I write open source software, and in my capacity as a blogger I write essays. I wish the circumstances surrounding the founding were some heroic epiphany that I could do better than the 9–5 world, but the truth is most of what has kept me going is the Kafkaesque medical establishment I found myself stuck in in the city of Houston, which has proven incapable of establishing some basis for medical accommodations, leaving me faced with the impossible choice of scheduled working hours that I can’t consistently meet. I became an entrepreneur out of necessity. But you know what? Now that I’m here, well heck, I’m finding I kind of like it. And what happened last time was as soon as I started to make material progress, I became employable again, which set the project back a few months. I won’t make that mistake again. I’m not ruling out taking another job, but it’s got a lot higher bar. That’s kind of part of the reason I’ve only haphazardly sought funding: trying to optimize for optionality. External funding is a commitment, and so I was really hoping for some resemblance to product-market fit before taking that step. Of course that’s where the essays come in. Sure I could cold call every mom and pop data science shop and get a user here or there, but it’s like Naval Ravikant said in his great tweetstorm on How to Get Rich (without getting lucky): I’m trying to apply leverage via code and media, something that works for me while I sleep (which I do a lot of), that’s what scales.
Of course that strategy presumes that I might eventually find a reader or two. It is not an exaggeration to say that I can’t even get my stuff published on Reddit — seriously, I tried to share an essay to the data science subreddit and they wouldn’t even publish it. It boggles my mind. I guess that’s where I’m asking you the reader to consider giving me a retweet or a share. This software is immensely useful, it has novel functionality, it makes the life of a data scientist very very easy — with no lock in! The whole idea was that we would make this software open source and able to run on a user’s local session free of charge, with any monetization coming from a subsequent implementation making use of external computing resources for larger data sets or more elaborate feature engineering methods. That is still the goal but something has to give — either I need the credibility of a funding source or some traction on a user base, and tbh I’m starting to get the impression that it’s a real chicken or the egg problem. The GNU GPL3 licensing approach, well, I’m starting to second-guess it. I did a survey of the mainstream data science libraries, you know like Pandas, NumPy, TensorFlow, Scikit, etc., and they all have some variant of BSD, Apache, or MIT. Automunge’s GPL3 is “copyleft”, meaning those who re-use the code must do so within the context of a comparable license. The thought was that this would allow me to claim intellectual property rights for seeking licensing revenue from those trying to incorporate it into a commercial offering. I mean that’s kind of another potential revenue stream — hey, if you’re an investor and you’re listening, this thing is pretty much turnkey, just needs to file for a patent before the clock runs out. I’m not very good at networking (is an understatement), I’m hoping a credible resource may approach me based on the essays.
If not there’s a real possibility this may end up BSD before much longer, ok here’s some game theory for you Fortune 100 Data Science competitors, you can simply wait me out and assume no one ever reads this thing and hey perhaps you might get this code for free — or I don’t know if you want to de-risk perhaps you could make me an offer. I remember a story about some investors missing out on Uber, don’t let that be you!

READ ME link. Lots of demonstration notebooks are linked in the preceding essays as well.

Ok this has nothing to do with anything, but there was an influential essay that kind of helped me get started on this journey that I’ll just offer as a quick recommendation — “6 Harsh Truths That Will Make You a Better Person” by David Wong (you may know him as author of John Dies at the End). Was kind of a wake up call. And these essays have been kind of my own attempt at making some contribution. I’m concerned that having put so much into the project there might be a little inconsistency avoidance tendency like Charlie Munger wrote about in his essay “The Psychology of Human Misjudgment.” Munger cited Charles Darwin as training himself to intensely consider any evidence tending to disconfirm a hypothesis of his, more so if he thought the hypothesis was particularly good. There is very strong evidence before me that this software, heck probably even this book of essays for that matter, will never amount to more than just a personal diary. (You know perhaps Einstein, who with his miraculous year in 1905 laid the foundation for modern physics, would have been more accepting of the principles of quantum mechanics if he had paid this inconsistency avoidance principle a little more heed.) I’ve always prioritized optionality, and pending any external funding I’m certainly not committed to this software, but like Jeff Bezos has said there’s a regret minimization framework at play. I know I’ll be able to live with myself much more if I follow through with this opportunity, and heck I wouldn’t be doing it if I didn’t think there was real opportunity here. Besides, even if any financial rewards may take a while to materialize, there’s a Nassim Taleb aphorism that applies: Being an entrepreneur is an existential not just a financial thing.
I’m a little worried that I might find myself in a scenario like Hemingway’s The Old Man and the Sea. I feel like this project could turn out to be a 500 pound marlin and I’m just in this tiny boat, and sure I’ve got it hooked, but getting it to shore? Well I’m the only one in this boat, and heck perhaps if I had a few crew-mates that might make it a little more feasible. I gave Texas a try, wasn’t very good fishing in those waters. I figure if we’re going to colonize Mars heck a little move to another state shouldn’t be too hard, just a matter of making sure the boat is big enough for the trip. Happy to be back in Florida — I know, not exactly Silicon Valley, but what can I say, being sort of sentimental is both a weakness and a strength.

As we enjoy great advantages from the inventions of others, we should be glad of an opportunity to serve others by any invention of ours; and this we should do freely and generously.

— Benjamin Franklin, The Autobiography of Benjamin Franklin

There is an hour of the afternoon when the plain is on the verge of saying something. It never says it, or perhaps it says it infinitely, or perhaps we do not understand it, or we understand it and it is as untranslatable as music…

— Jorge Luis Borges, The End

Hat tip to Stack Overflow, an invaluable tool as I taught myself python. One way to think about automunge is that it is just a continuation of my attempts at beginner Kaggle competitions such as previously documented in my essay “My First Kaggle”, so hat tip to Kaggle for inspiring this project. Hat tip to the TWiML podcasts for helping me keep up with current events (I’ve fallen a little behind, need to start listening more) as well as organizing study groups for fast.ai. Hat tip to fast.ai lectures by Jeremy Howard such as for introduction to shuffle permutation and class imbalance concepts as well as discussing time series data, honestly really well done. Hat tip to the Houston / University of Houston machine learning meetup organized by Yan Xu for giving me an excuse to re-read the Deep Learning text, and a stream of educational speakers. Hat tip to Medium for being such a polished publishing platform (although tbh their value as a distribution mechanism is somewhat less obvious to me).
Hat tip to Francois Chollet’s Deep Learning With Python for helping me think about basics of normalization and getting me started on my first Kaggle competition, Aurélien Géron’s Hands-On Machine Learning with Scikit-Learn and TensorFlow for getting me started with Scikit, Jason Brownlee’s Better Deep Learning for helping me think about data leakage between train and validation sets, Alice Zheng and Amanda Casari’s Feature Engineering for Machine Learning for introducing me to the Box Cox transformation, Min-Max scaling, and bins groupings of numerical sets, hat tip to Sebastian Raschka’s Python Machine Learning for helping me think about data preparation issues, Theodore Petrou’s Pandas Cookbook for introducing me to the principles of “tidy data”, Wes McKinney’s Python for Data Analysis for helping with my Pandas skills and introducing me to the term “munge”, and Ian Goodfellow, Yoshua Bengio, and Aaron Courville’s Deep Learning for being the best resource I’ve found for understanding the theory behind modern deep learning — that’s a direction I’d like to take this tool, more to come. In the meantime, hat tip to the scikit-learn library for being so simple to operate that I didn’t even need to read a book on it :), honestly the scikit library in my opinion is the gold standard for managing complexity of machine learning, so easy to use. Hat tip to Google’s Colaboratory for stripping all of the complexity out of cloud based coding, and hat tip to Anaconda for stripping all of the complexity out of desktop coding :). Hat tip to Saku Panditharatne’s MOOC for getting me started on AWS before I had Colaboratory to work with, and Andrew Ng’s MOOC for giving me such a strong foundation on core concepts of machine learning project workflow to build from.
Hat tip to Steve McConnell’s Code Complete for giving me some foundation in software development (one of these days I’m going to read the second half, I swear), hat tip to Brian Kernighan and Dennis Ritchie’s The C Programming Language for helping me realize that software can be understood, and Stephen Wolfram’s An Elementary Introduction to the Wolfram Language for helping me realize that software can be both simple to use and extremely powerful at the same time. Hat tip to Quantum Country for inspiring the Einstein quote I’m going to include below. Hat tip to the Essays of Paul Graham for inspiring entrepreneurship (wondering if that HBO series startup Pied Piper was a nod to those essays) and UF for getting me started all those years ago. Hat tip to Nassim Taleb and fellow graduates of the Real World Risk Institute for guiding my thinking on risk taking in the real world. Hat tip to my ridiculously inept networking skills which have made this journey necessary, hoping these essays may some day do the job for me.

*For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge:

The eternal mystery of the world is its comprehensibility… The fact that it is comprehensible is a miracle.

Albert Einstein

The Beatles — Her Majesty

Copyright Nicholas Teague 2019 / Patent Pending