READ ME

I’ve tried being subtle

Nicholas Teague

Published in

Automunge

5 min read · Jan 8, 2020

Phase One

Hello there! Thank you for clicking on this post. My name is Nicholas Teague. Last year I quit my job so that I could devote my full-time focus to Automunge, a Florida startup building an open source platform that lets data scientists automate a sizable chunk of the data preparation steps for machine learning pipelines built on tabular data. How, you might ask, does an open source software project plan to make money? Well, that’s a good question. The hope is that by providing as much value as we can to the machine learning community through an open source package, we will build enough visibility and trust that future commercial offerings for data processing (such as might make use of external computing resources) will have a clear path to market. That’s phase two though; right now we’re still focused on phase one. Since we are a fully bootstrapped enterprise, we have a little more flexibility for playing the long game, so to speak. That is not to say that we aren’t pursuing any revenue models. Hey, if you like what you see and want to show some support, maybe you can buy a book or album from our music shop or book store. Seriously, check it out: Vinyl Records for Fun and Profit / Recommended Further Reading.

If you’d like to learn more about our software, probably the best place to get started is the formal documentation available in our READ ME on GitHub. It’s admittedly a little dry. I mean, that’s how software documentation is supposed to be written, just trying to be normal. We’ve also put together a few demonstration notebooks. For example, if you’d like a walkthrough of implementation you can check out Automunge in the cloud (currently a Colaboratory demonstration notebook where you can upload your own data sets to try out if you like). Another good starting point might be the essay An intro to Automunge, which demonstrates several of the optional parameters and gives a good intuition for basic operation. Oh, and of course you can always check out the book of essays. I have been documenting the whole journey via a somewhat regular publishing schedule (along with various creative interests). I mean, I don’t want to overwhelm you or anything, but there are several ‘not dry’ essays in our blog. It’s a lot of fun, so seriously, if you ever have some spare time, check it out: the Automunge essays are collected in the Automunge publication, or for the full collection of essays, some a little more creative, check out the book From the Diaries of John Henry.

I’m So Tired — Nicholas Teague 01/07/20

Phase Two

To give you an idea of just what is possible with the Automunge software, let’s talk about the “feature store” aspects. I mean, at its core, Automunge is a platform for preparing tabular data for machine learning, including the application of infill (such as to address missing or improperly formatted data) and feature engineering transformations (such as to allow for more efficient extraction of properties by machine learning algorithms). The feature engineering transformations may serve the simple purpose of numerical encoding, meaning the minimal preparation necessary for applying machine learning, which may be performed under automation. Alternatively, they may be designated by the user, potentially as sets of transformations with generations and branches, so that feature sets are presented to the machine learning algorithm in multiple configurations. That is what we call “artificial learning.”
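To make the infill and numerical encoding ideas a little more concrete, here is a minimal sketch in plain Python. To be clear, this is not the Automunge API: the function names `mean_infill` and `one_hot` are made up for illustration, and mean infill and one-hot encoding are just two common choices assumed here, not necessarily what the library applies by default.

```python
from statistics import mean


def mean_infill(values):
    """Replace missing entries (None) with the mean of the present entries.

    Hypothetical helper for illustration; not part of Automunge.
    """
    present = [v for v in values if v is not None]
    fill = mean(present) if present else 0.0
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 columns, one per distinct category.

    Hypothetical helper for illustration; not part of Automunge.
    """
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


# A numeric column with a missing entry, and a categorical column.
ages = mean_infill([30, None, 50])
colors = one_hot(["red", "blue", "red"])

print(ages)
print(colors)
```

After these two steps every column is numeric with no gaps, which is the minimal preparation most machine learning libraries expect from tabular data.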

To give you an idea of what kind of feature engineering transformations we’re talking about, I recently reorganized the READ ME file’s Library of Transformations section to make it a little easier to navigate, and in the process I think I helpfully identified a few core categories of transformations that can be applied. For example, we’ve grouped the feature engineering transformations intended for numerical sets into several categories: Normalizations (to scale data within a designated range, like min-max scaling, mean scaling, z-score normalization, etc.), Transformations (things like log transform, power law transform, etc.), and what we call Bins and Grainings (think aggregations by number of standard deviations from the mean, powers of ten, or designated fixed width and fixed population bins, stuff like that). Oh, and also some numerical set methods intended specifically for Sequential Data, such as velocity/acceleration and stuff. It’s all very useful. And well, in the interest of brevity, I just want to close by offering that it would be a real shame if all of this really useful stuff didn’t get to benefit anyone. I mean, to be honest, we really don’t have any substantial user base growth to speak of. And I think a lot of you folks could get some real use out of this stuff. So would you do me a favor and like just try it out? Please? (You’re just agreeing with me so I’ll shut up.)
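P.S. For anyone who wants a concrete picture of the numerical categories named above, here is a rough sketch in plain Python. These are textbook versions written for illustration, with hypothetical function names; they are not the Automunge implementations, and choices like using the population standard deviation and flooring the bin index are assumptions made here.

```python
import math
from statistics import mean, pstdev


def min_max(values):
    """Normalization: scale values into the range [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


def z_score(values):
    """Normalization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def log_transform(values):
    """Transformation: natural log, defined only for positive values."""
    return [math.log(v) for v in values]


def std_dev_bins(values):
    """Bins: floor of the number of standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [math.floor((v - mu) / sigma) if sigma else 0 for v in values]


def powers_of_ten_bins(values):
    """Bins: order of magnitude of each positive value."""
    return [math.floor(math.log10(v)) for v in values]


def velocity(values):
    """Sequential data: difference between each entry and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


scaled = min_max([0, 5, 10])
standardized = z_score([2, 4, 6])
sigma_bins = std_dev_bins([2, 4, 6])
magnitudes = powers_of_ten_bins([5, 50, 500])
deltas = velocity([1, 4, 9, 16])
```

Applying `velocity` twice to a sequence gives a simple version of acceleration.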