Announcing Automunge Inversion

Our first press release

Nicholas Teague
Automunge
May 16, 2020

Attention all journalists of every major news organization, and bloggers too; let’s not forget the bloggers. Oh, I mean, if you’ve got a Twitter account you’re basically a blogger, so I guess we’re all journalists now, right? Where do you draw the line? I’ve got it, just had an idea: if your publication has a table of contents, that means it’s real. Very simple rule, let’s go by that. Sorry, newspapers.

Ummm, anyway, kind of rambling here. Automunge is hereby pleased to announce our very first press release. This is it, this is a press release, and it’s real, and this is happening. Ok, well, I just did a quick internet search for press release templates and, umm, couldn’t find anything quite up to my journalistic standards, so I’m just going to go off the cuff. This press release is hereby off the cuff.

For immediate release to the press:

Automunge is a Central Florida startup offering an open source Python library to prepare tabular data for machine learning. We have followed an incremental approach to software development, with frequent updates rolled out sequentially over the course of the preceding year. Core features of our library include automated numerical encodings and machine-learning-derived infill for missing data, which allow raw tabular data to be passed directly to machine learning algorithms without manual intervention. The library also serves as a platform for feature engineering, in which transformation functions may be assigned to distinct columns in sets, including generations and branches of derivations, using our “family tree” primitives. The transformation functions are “fit” to the properties of a designated training set, which enables consistent and efficient preparation of subsequent data at low computational overhead. Numerical encoding and infill are fundamental challenges of the data science workflow that until now have largely required manual intervention, and our automated methods address both. It is only a slight exaggeration to say that we have solved data science. (Ok, that might be more than a slight exaggeration, but I mean really, this software is a big step toward formalizing and standardizing the tabular data workflow, for which industry practice has until now been largely fragmented.)

Today’s Announcement:

Automunge is pleased to announce the rollout of our new inversion option. The inversion feature is a complement to our library that allows the recovery of original data formatting. To put that in context, the automunge(.) function prepares tabular data for machine learning by applying a series of data transformations. The inversion option, as applied in the postmunge(.) function, is now able to recover the form of the data preceding these transformations, for both the train/test sets and the label sets returned from them. In the context of the machine learning workflow, this type of operation may be particularly useful for recovering label set formatting after the generation of predictions.
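
To make the label recovery use case concrete, here is a minimal sketch in plain Python. It does not use the Automunge API; the function names (fit_zscore, forward, invert) and the choice of a z-score normalization are assumptions made purely for illustration. The idea is just that parameters are fit on training labels, and the same parameters later translate model predictions back to the original units.

```python
from statistics import mean, pstdev

def fit_zscore(train_values):
    """Learn normalization parameters from the training labels only."""
    std = pstdev(train_values)
    # Guard against a constant column, where the std would be zero.
    return {"mean": mean(train_values), "std": std if std > 0 else 1.0}

def forward(values, params):
    """Forward pass: z-score normalize using the saved train set properties."""
    return [(v - params["mean"]) / params["std"] for v in values]

def invert(values, params):
    """Inversion: map normalized values back to the original scale."""
    return [v * params["std"] + params["mean"] for v in values]

train_labels = [10.0, 20.0, 30.0]
params = fit_zscore(train_labels)         # saved at "automunge" time
encoded_labels = forward(train_labels, params)

# Pretend a model trained on encoded_labels produced these predictions.
predictions = [0.0, 1.0]
recovered = invert(predictions, params)   # back in original label units
```

A prediction of 0.0 in normalized space maps back to the training mean, which is the kind of translation a reader of model output usually wants.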

Part of the challenge of implementing an inversion function in the Automunge library arises from the nature of applying forward pass transformations in sets with generations and branches of derivations, which enables the presentation of feature sets to machine learning algorithms in multiple configurations of varying information content. The inversion function must thus select a path of inverse transformations from among the various returned columns originating from a single source column. Automunge has addressed this problem by populating a new data structure mirroring the application of transformations to each column of the data, including a forward pass mirror and a backward pass mirror. The forward pass mirror has at its first tier the root transformation categories and associated source columns as originally passed to automunge(.), which are then seeded with each subsequent generation of transformation-returned columns, entered as an embedded dictionary matching the preceding tier’s structure. The backward pass mirror instead has in its first tier the collection of columns returned from the set of forward pass transformations, further aggregated by the transformation categories associated with those entries, such that each of the returned columns has embedded within it a consistent data structure populated with the preceding layer of transformations and associated properties, in a fashion comparable to the inverse of the forward pass mirror. As this inverse tree is populated, the depth of transformations needed for each branch to recover its source column is calculated, and further information is passed through the tiers identifying those paths with the most information retention through transformations, as well as confirming the availability of inversion functions.
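
As a rough picture of the two mirrors, here is a simplified sketch using nested dictionaries. The column names, the suffixes, and the flattened layout of the backward mirror (a list of parents plus a depth, rather than fully nested tiers) are assumptions for illustration and are not the library's actual internal structure. The sketch also assumes only the leaf columns of each branch are returned.

```python
# Forward mirror: source column at the first tier, each generation of
# derived columns embedded as a dictionary in the tier above it.
forward_mirror = {
    "price": {
        "price_nmbr": {               # generation 1
            "price_nmbr_bins": {},    # generation 2, derived from price_nmbr
        },
        "price_log": {},              # generation 1, a second branch
    }
}

def build_backward_mirror(forward):
    """Index each returned (leaf) column by its path back to the source."""
    backward = {}

    def walk(node, children, ancestors):
        if not children and ancestors:
            # Leaf below a source column: record parents from nearest
            # to furthest, plus the number of transformations to undo.
            backward[node] = {
                "path": list(reversed(ancestors)),
                "depth": len(ancestors),
            }
        for child, grandchildren in children.items():
            walk(child, grandchildren, ancestors + [node])

    for source, children in forward.items():
        walk(source, children, [])
    return backward

backward_mirror = build_backward_mirror(forward_mirror)
```

Here backward_mirror is keyed by returned column, so the depth of each candidate inversion path can be read off directly.
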
The rule for selecting a path of inversion is supported by a heuristic that the fewest transformations will be the most efficient, and paths are further prioritized based on the degree of information recovery and the availability of inversion functions associated with a specific transformation path. Because Automunge records each step of forward pass transformations by adding a transformation-category-specific suffix appender to the returned column header, we can distinguish each step of transformation by a unique column header string. The methods are further supported by two new data points for each transformation category, available for optional inclusion in the ‘processdict’ data structure (deferring to previous write-ups for elaboration of what is meant by processdict): an entry confirming the degree of information retention of the forward pass transformation, and a populated inversion function corresponding to the forward pass transformations of that entry. Note that the inversion function may make use of corresponding train set properties that were evaluated during an automunge(.) call and saved in a normalization dictionary associated with a given forward pass transformation. The inversion process for each selected path is then performed by following the transformation category entries in an order consistent with the selected inverse mirror tree path, resulting in a recovery of source column formatting for those paths that had inverse transformation functions populated. Included with the returned recovery sets are simple data structures logging the results of the inversion path analysis.
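
The selection and inversion steps described above can be sketched as follows. Everything here is hypothetical: the category names, the "info_retention" and "inverse" keys, the priority order (availability first, then information retention, then fewest steps), and the assumption that source column names contain no underscores. It is one plausible reading of the heuristic, not the library's implementation.

```python
import math

# Hypothetical per-category entries: whether the forward transform
# retains full information, and an inverse function if one exists.
categories = {
    "nmbr": {"info_retention": True,
             "inverse": lambda x, p: x * p["std"] + p["mean"]},
    "log":  {"info_retention": True,
             "inverse": lambda x, p: math.exp(x)},
    "bins": {"info_retention": False, "inverse": None},
}

def path_for(header, source):
    """Read the suffix chain off a returned header, in forward order."""
    return header[len(source) + 1:].split("_")

def score(header, source):
    """Sort key: invertible first, then lossless, then fewest steps."""
    cats = path_for(header, source)
    invertible = all(categories[c]["inverse"] is not None for c in cats)
    lossless = all(categories[c]["info_retention"] for c in cats)
    return (not invertible, not lossless, len(cats))

def select_path(headers, source):
    """Pick the returned column to invert from, or None if none qualify."""
    best = min(headers, key=lambda h: score(h, source))
    return None if score(best, source)[0] else best

def invert_column(values, header, source, norm):
    """Undo each transformation, last applied first, using saved params."""
    current = header
    for cat in reversed(path_for(header, source)):
        inverse = categories[cat]["inverse"]
        params = norm.get(current, {})   # keyed by the unique header string
        values = [inverse(v, params) for v in values]
        current = current[: -(len(cat) + 1)]   # strip this step's suffix
    return values

headers = ["price_nmbr_bins", "price_log_nmbr", "price_nmbr"]
norm = {"price_nmbr": {"mean": 20.0, "std": 10.0},
        "price_log_nmbr": {"mean": 0.0, "std": 1.0}}

chosen = select_path(headers, "price")
recovered = invert_column([0.0, 1.0], chosen, "price", norm)
```

Keying the saved parameters by the full column header is what lets the same category (nmbr here) appear at different steps with different fitted properties.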

Of course, further information is always available in our rollout tweets and in the formal README documentation. All inquiries welcome. Patent pending.

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

Nicholas Teague
Automunge

Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.