Private Encodings with Automunge

Introducing anonymized dataframes

Nicholas Teague
Automunge
Oct 6, 2021 · 6 min read


For those who haven’t been following along, I’ve been using this forum to document the development of Automunge, an open source python library for preparing tabular data for machine learning. The library’s interface is channeled through two master functions: automunge(.) to prepare training data for machine learning, and postmunge(.) to consistently prepare additional corresponding data on the training set basis. When data is prepared, the automunge(.) function returns a series of dataframes for use as training data, validation data, or test data, with each grouping further segregated into separate dataframes for features, ID, and labels. automunge(.) also populates and returns a compact python dictionary (the postprocess_dict) recording the steps and parameters of data transformations, which may then serve as a key for preparing corresponding data in the postmunge(.) function. The data preparations may be minimal, such as numeric normalization, categoric binarization, and missing data infill under automation, or the tool may be applied as a platform for custom engineered pipelines of univariate transformations. There is an extensive internal library of data transformations, which may even be mixed with custom defined transformations in sets that may include generations and branches of derivations. The returned dataframes log the transformations applied by way of suffix appenders on the returned column headers.
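For orientation, here is a minimal sketch of the two master functions, using hypothetical input dataframes df_train and df_test with a hypothetical target column ‘targetcolumn’; the return tuples follow the documented conventions, which may vary slightly across versions:

    from Automunge import *
    am = AutoMunge()

    # prepare training data, returning features / ID / labels for
    # train, validation, and test groupings plus the postprocess_dict
    train, train_ID, labels, \
    val, val_ID, val_labels, \
    test, test_ID, test_labels, \
    postprocess_dict = \
        am.automunge(df_train, labels_column='targetcolumn')

    # consistently prepare additional data on the training set basis
    test, test_ID, test_labels, \
    postreports_dict = \
        am.postmunge(postprocess_dict, df_test)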

This essay introduces the new privacy_encode option for improved privacy preservation in the prepared encodings.

Being built on top of the Pandas library, Automunge is a dataframe centric platform. As a quick hand wave for outsiders, dataframes are a convention for representing tables of data with columns and rows, which differ from raw data (like what is provided with Numpy arrays) in that support for column headers and indexes is built into the structure. Column headers can be strings or numbers that serve as an address for accessing specific columns. Indexes can be sequential integers, non-sequential integers, or even less structured data like strings or floats, which can serve as an address for accessing specific rows. In some cases dataframes can even be populated with multiple index columns for accessing rows in multiple ways. Machine learning operations like backpropagation and inference usually ignore these dataframe centric addresses and treat dataframes comparably to arrays, although some learning libraries may support use of column headers to do things like designate categoric vs numeric features for special treatment.
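As a toy illustration of these conventions (plain pandas, nothing Automunge specific):

    import pandas as pd

    # a small dataframe with string column headers and a
    # non-sequential integer index
    df = pd.DataFrame({'feature': [1.2, 3.4, 5.6],
                       'category': ['a', 'b', 'a']},
                      index=[101, 205, 309])

    column = df['feature']  # access a column by header
    row = df.loc[205]       # access a row by index value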

Automunge has extended traditional dataframe conventions for column headers and index columns. When training data is prepared, the transformations applied are logged on the returned column headers by way of (generally 4 character) string suffix appenders with a leading underscore, such that, for example, an input column ‘inputcolumn’ might be returned as something like ‘inputcolumn_sfix’, where ‘sfix’ corresponds to the transformation category associated with a derivation. This results in a tradeoff with loss of validity of the original column headers, which we resolve by including a map between an input column and its list of returned columns. For this example, if we didn’t know the returned column configuration in the encoded dataframe ‘train’, we could access it as train[postprocess_dict[‘column_map’][‘inputcolumn’]]. (The returned columns are further mapped in the columntype_report by classification of contents, e.g. distinguishing between types like continuous floats vs categoric integers, or aggregated sets thereof.)
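In code, that lookup might look like the following (reusing the train set and postprocess_dict from the sketch above, with the hypothetical ‘inputcolumn’):

    # recover the list of returned columns derived from an input column
    returned_columns = postprocess_dict['column_map']['inputcolumn']
    feature_subset = train[returned_columns]

    # the columntype_report classifies returned columns by content type
    report = postprocess_dict['columntype_report']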

Automunge also has some special treatment for index columns, such that non ranged integer index columns are extracted from the training data and returned in the separate ID sets, supplemented by a generated ranged integer index returned as ‘Automunge_index’ (ranged integer meaning 0, 1, 2, …). When any shuffling operation or validation set partitioning is performed, it is conducted on the same basis across the feature, ID, and labels sets to maintain row correspondence. Thus even though the returned dataframes omit any special index conventions, those indexes can be recovered when a user has access to the ID sets.
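As a sketch of re-associating index conventions by way of the ID sets (assuming a hypothetical extracted index column named ‘original_index’):

    # row correspondence between feature and ID sets means the extracted
    # index can simply be reassigned to the feature set
    recovered = train.copy()
    recovered.index = train_ID['original_index'].values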

The privacy_encode option was developed with the thought that there may be scenarios where a user would benefit from dataframe encoded data while still enjoying the privacy aspects of arrays (which could be considered more private since they omit things like column headers and indexes). After all, when serving data as arrays, a user sacrifices several dataframe-unique aspects like column specific data types, index navigation, support for Pandas operations, and of course the inversion option available with postmunge(.) to recover the form of the data prior to encodings. Since the data returned from automunge(.) already has normalized numeric features and binarized categoric features, masking columns and rows helps to disguise the features serving as basis for a machine learning model.

Private encodings are now available by simple activation of the automunge(.) privacy_encode parameter, which results in a consistent anonymized form for data prepared in automunge(.) as well as additional data prepared in postmunge(.). There are three tiers of encodings available, in summary (with a minimal invocation sketched after the list):

privacy_encode =

  • False : public column headers for all sets, no shuffling
  • True : private feature headers, public label and ID headers, columns shuffled
  • ‘private’ : private feature and label headers, public ID headers, columns and rows shuffled
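A minimal invocation of the middle tier might then look like (same return conventions as the earlier sketch):

    # private feature headers, public label and ID headers, shuffled columns
    train, train_ID, labels, \
    val, val_ID, val_labels, \
    test, test_ID, test_labels, \
    postprocess_dict = \
        am.automunge(df_train, privacy_encode=True)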

Of course the big asterisk remains that all of these anonymizations can be circumvented with access to the postprocess_dict dictionary, which includes column header information as well as the ability to invert an encoded form. (Since the postprocess_dict is used to prepare additional data in postmunge(.), it includes all conversion information.) In order to close this back door, we’ve introduced a new option to encrypt the postprocess_dict using AES encryption with the pycrypto library, which requires a separate install to operate. (A thank you is owed to a tutorial from pythonprogramming.net for helping to get us started.)

Encryption can be activated by the automunge(.) encrypt_key parameter, which accepts either the integers 16/24/32 (bigger means stronger encryption) or a bytes object of comparable length serving as a custom encryption key. When passed as an integer, an encryption key is generated and returned in the closing automunge(.) printouts. The displayed encryption key should be copied and saved by the user, as it will be needed to prepare any additional data in postmunge(.) by way of passing that key to the postmunge(.) encrypt_key parameter.
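Sketching that workflow under stated assumptions (saved_key standing in for the bytes object copied from the printouts):

    # generate a 32-byte key and encrypt the returned postprocess_dict
    train, train_ID, labels, \
    val, val_ID, val_labels, \
    test, test_ID, test_labels, \
    postprocess_dict = \
        am.automunge(df_train, privacy_encode='private', encrypt_key=32)

    # preparing additional data then requires passing the saved key
    test, test_ID, test_labels, \
    postreports_dict = \
        am.postmunge(postprocess_dict, df_test, encrypt_key=saved_key)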

There are a few subtleties to highlight for the encryption scenario. One is that in some cases not all of the postprocess_dict is encrypted: a few public entries are provided to support data navigation, such as an anonymized version of the columntype_report. For the full privacy scenario (privacy_encode = ‘private’), even these public entries are encrypted. We also have a special convention for label sets, such that except for the full privacy scenario, postmunge(.) inversion can be applied to recover the input form without an encryption key. This label inversion can be applied, for example, to the predictions returned from an inference operation.
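As a sketch of that label inversion applied to model outputs (df_predictions is a hypothetical dataframe of predictions in the returned label encoding; the three-item return convention under inversion follows the documentation and may vary by version):

    # recover the pre-encoding form of label predictions, no key required
    # outside the full privacy scenario
    df_invert, recovered_list, inversion_info_dict = \
        am.postmunge(postprocess_dict, df_predictions, inversion='labels')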

We expect that anonymized dataframes will provide benefits in any industry where data privacy is an important consideration, as now Bob can train and evaluate machine learning models without identification of feature properties, with Alice able to recover the underlying information only when needed.

For further reading please check out A Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

Debussy’s Estampes — Nicholas Teague
