Faster Data Pre-Processing for Machine Learning

It’s not a quantum speedup, but hey it’s a start

Nicholas Teague
Automunge
11 min read · Sep 28, 2018

--

Stevie Wonder — Uptight

For those that haven’t been following along, I’ve been using this forum over the last few weeks to develop algorithms for automating the data wrangling process for structured data sets, a process that generally serves as a precursor to the application of machine learning algorithms. (Links to the full series, among a few other tidbits, are included at the conclusion of this post.) Along the way we’ve developed a tool we call automunge, which takes as input a pandas dataframe training set (which we’ll refer to hereon as the “train” set) intended for use in training a machine learning model, along with a comparable test data set (the “test” set) intended for generating predictions from that same model. The data in these two sets is consistently processed into numpy (Python’s numerical framework) arrays suitable for direct application to a machine learning algorithm in a Python framework of your choice, including scaling and normalization of a column’s numerical or time series data, binary encoding of a column’s dual categorical data, and one-hot encoding of a column’s categorical data. For cells in our dataframe with missing values, a user has the option of either infilling plug values using standard methods (such as the mean for a numerical set, the most common value for a binary set, or a new category for missing values in a categorical set) or alternatively deriving infill for these missing cells using machine learning models trained on the rest of the data set. The automunge function also derives a Python dictionary called postprocess_dict, which can be fed into a separate function we call postmunge for the purpose of consistently processing subsequent test data that was not available during the initial processing of the train set.
Note that if no test data is available for the initial application of automunge, we can still proceed by using a dummy test set with column naming conventions consistent with the train set.
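As a rough, hypothetical sketch of the kinds of standard transforms described above (not the actual automunge interface), here is how a normalization and a one-hot encoding might look in plain pandas:

```python
import pandas as pd

# toy train set with a numerical and a categorical column
train = pd.DataFrame({'age': [25.0, 32.0, 47.0, 32.0],
                      'color': ['red', 'blue', 'red', 'green']})

# z-score normalization of the numerical column
mean, std = train['age'].mean(), train['age'].std()
train['age'] = (train['age'] - mean) / std

# one-hot encoding of the categorical column
train = pd.get_dummies(train, columns=['color'])

# a comparable test set would be processed with the same mean, std,
# and category set derived here from the train data -- that reuse of
# derived properties is what the postprocess_dict / postmunge pairing
# is for
```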

In my last post we attempted some validations of the speedup associated with using the postmunge tool to process a test set in isolation, in comparison to the simultaneous processing of both a train and test set via the automunge tool. We determined that this type of speedup was a worthwhile goal given the investment of development resources (read ‘free time’) and code complexity, and indicated that some further improvements to processing efficiency were expected to be feasible given revisions to the machine learning model frameworks and the methods of data set iteration. This post addresses the second of these two ideas, methods of iteration. A key finding is that by trading the pandas dataframe .iterrows() call for methods such as .apply(), .loc(), and boolean masks, we were able to replace the various serial iterrows loops with vectorized methods that trade serial processing for parallel, facilitating material speedups. I won’t dive into the full architecture of the automunge tool here, but will note that there were two primary culprits for these iteration methods: one, a defined function we called evalcategory(.), used to evaluate each column in our dataset to determine the appropriate processing method; and two, a defined function we called insertinfill(.), used to insert infill into missing cells in the train and test data. I’ll provide here some demonstrations of the type of changes that were made to facilitate speedup in our update from version 1.2 to 1.3. There is also a companion Colaboratory notebook available here.

excerpt from evalcategory(.) function demonstrating speedup via vectorized method
excerpt from insertinfill(.) function demonstrating speedup via vectorized method
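The actual excerpts are embedded in the original post; as a simplified, hypothetical illustration of the general pattern (not the automunge source itself), here is an .iterrows() infill loop replaced by a boolean mask with .loc, which assigns all matching rows at once:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col': [1.0, np.nan, 3.0, np.nan, 5.0]})
infill = df['col'].mean()  # mean of non-missing values: 3.0

# serial approach: iterate row by row with .iterrows()
slow = df.copy()
for idx, row in slow.iterrows():
    if pd.isna(row['col']):
        slow.at[idx, 'col'] = infill

# vectorized approach: boolean mask with .loc replaces the loop
fast = df.copy()
fast.loc[fast['col'].isna(), 'col'] = infill

# both approaches produce the same result
assert slow.equals(fast)
```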

After incorporating these and a few other updates, we repeated the validation runs implemented in our last post, measuring processing time for the application of automunge and postmunge on some datasets of increasing scale, comparing versions 1.2 and 1.3.

validation of speedup via processing time delta from version 1.2 to 1.3
Tedeschi Trucks Band — Bound for Glory

So these updates clearly had a material impact on processing efficiency, and thus I believe the effort was a success. I expect there are other steps that could be taken for further speedup, but I will leave that for future iterations. Before wrapping up, though, I’ll take the opportunity for a couple of tangents; I hope you’ll humor me by sticking around for another few paragraphs. First I’ll offer that even though the code update demonstrations provided above were fairly succinct, in some cases only a line or two, I think it would be a mistake to characterize this as just a few simple edits. I’m providing the following for color, and because I think it gives a window into the type of troubleshooting effort that sometimes goes into software development that may not be obvious to the end user. (I don’t consider myself a Python expert; I’m kind of picking this stuff up as I go along, which probably contributes here as well.) This is a fairly long excerpt and I’m not expecting you to read all of it; I just want to demonstrate what kind of effort may go into a few simple line updates. The essay continues below.

demonstration of troubleshooting process via illustrative excerpt

One thing I’m not trying to claim is that the development of this tool was simply conjured into existence. The ideas here owe a lot to several sources, which I’ll try to elaborate on now. First note that my book recommendations page has always been intended as a kind of citation list for these essays. Some particularly helpful books for this development include Hands-On Machine Learning with Scikit-Learn and TensorFlow, Deep Learning With Python, Python Machine Learning, Data Science From Scratch, Python for Data Analysis, Data Wrangling with Python, and Pandas Cookbook. For this software development process, however, the single most important resource wasn’t a book or a paper; it was the website Stack Overflow. A crowd-sourced question and answer service meant to match interesting coding problems with experts sharing their knowledge, the site has reached a critical threshold where, for many mainstream applications, I believe it has successfully bridged the gap between natural language query and formal coding. I believe that eventually software developers will have the ability to interact with their creations in a fashion much closer to natural language, and I see no better resource to help that come about than the big data of this site. For now, if a coder can formulate a problem in natural language incorporating keywords for the fundamental aspects of the language, there is a good chance that a Google search will reveal an appropriate dialogue from Stack Overflow resolving their issue. Obviously the official Python and pandas documentation is also a suitable resource, but in truth I found Stack Overflow even more helpful than the official documentation.

I’m partly including this discussion of citations based on an interesting debate I saw on Twitter recently between two notable machine learning researchers. One school of thought holds that the primary purpose of research publications is to impart information, that legacy citation protocol in formal academic writing does nothing to serve this purpose, and thus that hyperlinks are not only sufficient but actually superior to plaintext sources that nobody reads. The other school, advocating for traditional academic citation protocol, points out that formal citations serve a purpose beyond the communication of the research: academic citations are tracked by automated tools and can materially contribute to a researcher’s opportunities, in domains such as job placement or even immigration. I definitely consider myself an outsider to academia and have never tried to formalize these essays to adhere to such research protocols; however, I want to be clear that I have always considered my book recommendations a kind of citation list for these writings, and for those few academic papers that otherwise contributed I have tried to embed hyperlinks and citation information in the essays.

Tedeschi Trucks Band — Everybody’s Talkin’

On another tangent, I want to talk a little about the use of this automunge tool, including the intentions of the current formation as well as where it might be headed. Although I see potential to fully automate the practice of data wrangling, the current iteration has a ways to go before we can make that claim. The practice of feature engineering to facilitate learning goes well beyond simple normalization protocols and categorical encodings. There is a whole range of features that can be extracted from a numerical set prior to training, whether simple features like mean, median, minimum, and maximum, or more complex sets of distribution parameters. I expect it may even be beneficial in some cases to translate between an inferred parameterized distribution and one more tractable for the application of neural networks — something I expect Mathematica may be able to help with. I have touched on this in some of the prior posts: our categorical encodings are currently evaluated only via distinct string character groupings, and some further feature engineering will certainly be possible through analysis of string character subsets. Categories could potentially be grouped based on inferred meaning encodings, whether those be sentiment or language or countless other types of inferences from text. Unless we want our model to relearn the English language, it just doesn’t make sense not to take advantage of other resources and APIs to extract features from our string character groupings. I would like the automation of that process to be a future aspect of this tool. Obviously this tool is not meant to address image, video, or audio data in this current iteration; I’ll have to give some thought to whether there is some potential there.

So given all of these feature engineering aspects that are currently beyond the reach of automunge, can this still be a useful tool for wide adoption? My argument would be yes, based on the following. It doesn’t matter what type of feature engineering a data scientist performs; in the end it should still be beneficial to normalize the numerical data and encode the categorical. It should be beneficial to address missing values in a cell. These applications are fundamental and universal to the preparation of structured data for machine learning. Thus automunge in its current iteration is not a replacement for feature engineering, but it is suitable as a simplification of the final steps of feature engineering.

The inclusion of the ML infill technique, which I have discussed in several prior essays of this series, I believe is a novel innovation, potentially of wide benefit to the industry. Unlike standard infill techniques, ML infill adapts the infill to the characteristics of the dataset, using the context of the adjacent cells in the same row to predict a suitable value in an automated fashion. This technique as presented works for both numerical and categorical data, the latter making use of one-hot encoding. It works both for the initial train and test data and for any subsequent test data not available at the time of training. In traditional methods, arbitrary infills (such as a numerical set’s mean) present an obstacle to efficient learning. By inferring the infill value we lower those barriers, and my expectation is that further validation of this technique will bear this out.
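As a minimal, hypothetical sketch of the idea — automunge’s actual ML infill trains machine learning models per column, while here a simple linear fit via numpy stands in — the adjacent column is used to predict values for the missing cells:

```python
import pandas as pd
import numpy as np

# toy set where column 'b' has a missing cell and 'a' carries context
df = pd.DataFrame({'a': [1., 2., 3., 4., 5., 6.],
                   'b': [2., 4., np.nan, 8., 10., 12.]})

missing = df['b'].isna()

# fit a simple model (here a degree-1 polynomial) on rows where 'b' is present
slope, intercept = np.polyfit(df.loc[~missing, 'a'], df.loc[~missing, 'b'], 1)

# predict infill for the missing rows from the adjacent column
df.loc[missing, 'b'] = slope * df.loc[missing, 'a'] + intercept
```

Because the infill is inferred from the rest of the row rather than assigned arbitrarily, it tracks the dataset’s own structure — here the missing cell lands near 6.0 rather than at the column mean.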

Tedeschi Trucks Band with Herbie Hancock — Space Captain

I’ve been writing these essays with some frequency for over two years now, addressing both high tech and creative writing. I’m not very good at marketing. I initially assumed that eventually, if the content were of suitable quality, someone might read it. I’ve gotten past that. The practice of writing is about more than just drawing an audience; it’s its own reward. I find that in putting pen to paper I am better able to organize my thoughts. If you follow the progression of the essays over these years, I think it is clear the practice has made me a better thinker.

I was browsing Google’s search algorithm guidelines (“Search Quality Rating Overview”) the other day (a goldmine if you’ve ever had any interest in SEO), and assuming other sites such as Medium follow a similar approach, I think it hints a little at what might be going on here. Google’s algorithms reward content with clarity of purpose and consistency of address. By promoting in their search rankings content of a certain style, that which allows for the most efficiently organized information (their mission, after all), I would offer that they are in a way promoting the web as a kind of encyclopedia: every entry a siloed entity addressing a distinct theme, progressively organized into distinct domains by themes of ever-increasing granularity.

Here’s my proposal to Google or any other portal that’s listening. Consider facilitating not just connections between people and information, but how about connecting people with other people. If you want to train your machines to think like people, you’re going to have to allow them to interact with information like people interact with information. People go on tangents. People get distracted. Sometimes people ramble and don’t even know if they have a point. The coalescing of an idea or thought into a succinct organized bit of information is a messy process, just consider the act of troubleshooting some software code. In the messiness lies the creativity. An annealing optimization starts from a state of high disorder. The Google mission to organize the world’s information, taken literally, is a mission for overfit.

The human mind … operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. … Selection by association, rather than indexing, may yet be mechanized. — Dr. Vannevar Bush

I would argue that the web as we now know it has reached this state of overfit. We are no longer interfacing with human creativity for its own sake; we are now exposed en masse to content competitively sourced to game the algorithms of distribution.

I recall some commentary by the author Nassim Taleb on the benefits of walking outdoors: how the high-dimensional bumpiness of an unpaved trail imparts stressors and distractions, which compared to the monotony of asphalt is analogous to the difference between strength training with free weights versus the single-motion contraptions offered by commercial gyms. The internet, or literature for that matter, is at its best when it catches you off guard. When themes are layered. When intersections are explored. When what started as a dry announcement of some boring software update turns into a musical journey addressing themes of interaction and overfit. Human thought is fractal. I see no reason why the web we interact with can’t be as well.

Go Astros. (It might be time to upgrade my camera phone.)

Previous entries in this series:

Joe Cocker — She Came in Through the Bathroom Window (live)
The Flaming Lips with Miley Cyrus & New Fumes — A Day in the Life
Playing for Change (various artists) — Ripple
Louis Armstrong — St James Infirmary
Grateful Dead — Blow Away (Philadelphia 7/7/89)
Kid Koala — Moon River
Jerry Garcia & Merl Saunders — It Takes a Lot to Laugh, It Takes a Train to Cry
Tom Petty — I Won’t Back Down (The Swamp version)

Books that were referenced here or otherwise inspired this post:

Essays — Michel de Montaigne


(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post, feel free to like, comment, or share — or I don’t know, consider hiring me or something :). I can also be reached on LinkedIn for professional inquiries or Twitter for personal.



Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.