No Silver Bullet

TWiML AI Platforms 2019

Nicholas Teague
Automunge
Oct 8, 2019


I recently had the pleasure of attending a conference hosted by the This Week in Machine Learning (aka TWiML) podcast, themed around AI platforms, the inaugural of these conferences in fact. There’s something to be said for learning from the experience of practitioners who are putting methods into production and dealing with real-world implementation challenges rather than academics, and there was no shortage of that type of expert on hand. Held on the UCSF campus in San Francisco, the venue was sort of home turf for the tech enterprises already deeply engaged in the high-stakes arena of applied machine learning, and to be honest it was perhaps a bit intimidating to be representing a bootstrapping startup in this field of Fortune 500 and Fortune 100 behemoths (there was a certain palpable dismissiveness emanating from a select few of these salary slaves towards ‘vendors’ that made me chuckle a little; they really don’t know what it’s like to live). I went with the hope of networking (which to be honest I’m really, really bad at), ideally drawing the interest of a few early users, and failing that, at a minimum catching up on current events with respect to potential future competitors. This essay is primarily inspired by the conference keynote session, and heck, since part of the goal is bootstrap marketing, I certainly intend to share a little about Automunge along the way. Oh, and fair warning: I usually like to include a soundtrack with these essays. Don’t worry, it’s not weird because I say it isn’t. So without further ado.

Gloria — Them

The conference started with a keynote (as these things generally do) featuring an interview of noted researcher Andrew Ng. Ng has made many contributions, but I suspect one of his most noteworthy may prove over time to be his Coursera class, which introduced the fundamentals of machine learning to a sizable portion of early practitioners (present company included). Some of my takeaways included his discussion of the traditional machine learning workflow, such as a practitioner coding and building models during the day and then letting them train overnight. I also found it an interesting analogy when he compared the machine learning workflow to traditional software debugging. It was through this discussion of the workflow that he cited the paper “No Silver Bullet” by Frederick Brooks (of The Mythical Man-Month fame), which he used as a guide in his call for us to identify ways to reduce the complexity of building ML tools, specifically calling out that we need better tools for editing and versioning of data processing for benchmarking and reproducibility. Of course it was a delight to hear the keynote speaker making a public call for the adoption of capabilities that are exactly what we’re building with Automunge. More on that to follow; first let’s take a closer look at Brooks’ paper and see how it might relate to the agenda of this conference.

via “No Silver Bullet” by Frederick Brooks

Brooks’ (1986) paper was built around the premise that the process of software development has certain irreducible complexities, and that order-of-magnitude gains in the practice are somewhat far-fetched given this reality. He offers that the hardest part of software engineering, and perhaps a primary source of this irreducibility, is found in the processes associated with specification, design, and testing of conceptual constructs, not necessarily the labor of building and testing the fidelity of representation. It’s a really interesting point in that he had the foresight to ask how rule-based programming (a 1980s precursor to machine learning), or what was then available for image or speech recognition, might be applied to this problem, and well, long story short (too late), he didn’t seem extremely optimistic. Instead he called for us to seek out those Mozarts among us who might have some ingrained talent for producing structures that are faster, smaller, simpler, cleaner, and produced with less effort; apparently he subscribes to the “10x engineer” meme (oh, and he says we should pay them well too :).

Within the context of these questions Brooks does offer a few recommendations beyond just hiring smart people (which, let’s face it, is kind of obvious on its own), and I think these are the kind of things that Ng was alluding to in his talk. First, I thought it was kind of cool to see Brooks talk about the productivity impact of time-sharing computing, as it seemed somewhat prescient of the impact that GPUs have had on the machine learning workflow. Brooks described this impact as preserving immediacy, and in fact it goes hand in hand with another of his recommendations: to employ rapid prototyping for requirements refinement in an iterative approach to software requirements. After all, the assumption that one can specify a system in advance, get firm bids for construction, and then have it built and installed is, in Brooks’ words, fundamentally wrong, a fallacy. Extending this finding to the application of machine learning to the problem, I think it suggests that trying to create a system whereby “automatic programming” is performed by simply stating problem specifications and generatively developing a solution, well, if this is a fallacy for human designers then, barring some superintelligence, it may prove the same for our algorithmic brethren.

Here comes the night — Them

While I’m not sure if he explicitly spells this out, I think what Brooks is alluding to here is that the barrier to this kind of solution is largely the need for troubleshooting and repairing bugs. After all, if we could eliminate bugs at the source, during the design phase, perhaps we wouldn’t need all of this iterative building; perhaps we could simply build a solution directly instead of incrementally growing it out from a designed skeleton. Unfortunately, as Brooks alludes to in the title of the essay, he believes there is “no silver bullet” for dealing with bugs at their source. Given Ng’s comparison of the machine learning workflow to troubleshooting, I interpret him as holding similar beliefs: that there are certain irreducible complexities to the machine learning workflow, just as there are in the software development process. In fact, turning back to Ng’s call for better tools for data versioning and benchmarking, these are exactly the kind of capabilities that would directly benefit Brooks’ solution of iterative building.

Anyway, while I have your attention, I’d of course be remiss not to try to draw some interest toward Automunge. This was a conference for AI platforms after all, and Automunge happens to be exactly that: an open source platform for the preparation of tabular data in the steps that immediately precede the application of machine learning. At its core Automunge is a tool for automated normalization and numerical encoding of tabular data, but it’s so much more than that: it’s a platform for feature engineering, a platform for data versioning and data processing experiment management, a feature store, a feature importance calculator, a new way of doing things. And it’s all available now as a simple, clean, and compact Python library available for pip install. But don’t base your impression on this essay alone; check out our README on GitHub for complete formal documentation.

I’ve been publishing essays about Automunge for a while now, so if you want to learn more there’s plenty of reading available (my last essay, An intro to Automunge, is a good start if you’d like a tutorial for the core functionality). Here I’ll just briefly highlight a few points relevant to the discussions from the conference.

First, I’d like to call attention to an often overlooked issue with training and validation data preparation: in mainstream frameworks, transformations such as normalization are often applied based on properties of an entire set prior to splitting training and validation data. This is a potential source of data leakage, in that properties of the validation set may leak into the training data. Automunge segregates training data from validation data prior to applying transformations in order to avoid this risk.

Another really useful aspect of the tool comes into play with data processing experiments and data versioning. Using the Automunge package, there is no need for a special versioning system for gigabyte-plus-scale data sets; the only versioning required is for a compact Python dictionary, returned by Automunge, containing the steps of transformations and the normalization parameters of a particular experiment, which can easily be applied to a single saved source data set and/or subsequent data such as data for generating predictions. In other words, data versioning reduces to versioning one small dictionary.

Another issue that came up at the conference was that the industry currently has no settled way to track drift of data properties between training data and subsequent data. Automunge now has a new feature just for this purpose: dataset properties from the original training data are saved for comparison to subsequent data with the new “driftreport” option.

There was also certainly a big call from presenters that the industry needs a “feature store”.
Guess what: Automunge has exactly that, a library of feature engineering transformations ranging from statistical distribution transformations to text parsing to several options for categorical encodings to time series data (our time series stuff is really neat, you should check it out), including inference of velocity / acceleration / jerk. It’s hard to list them all here, but in short we have a really robust library of feature engineering methods that we are continuing to build out. And what’s really cool is that the platform allows for programming steps of feature engineering such that a user can designate “family trees” of transformation sets. No other Python library does that! (patent pending) Couple that with built-in methods for automated ML prediction of infill for missing points, automated feature importance evaluation, automated dimensionality reduction, and, oh yeah, the simplest means for consistent processing of additional data with just a single function call. Automunge: Artificial Learning, Intelligent Machines.
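To make the fit-on-training-data-only and returned-dictionary ideas above concrete, here is a minimal pure-Python sketch of the pattern. To be clear, this is not the Automunge API; the function names fit_norm, apply_norm, and drift_report are purely illustrative, assumed for this example only.

```python
from statistics import mean, stdev

def fit_norm(train_col):
    # derive normalization parameters from the TRAINING split only,
    # so no properties of validation or later data leak into the fit
    return {"mean": mean(train_col), "std": stdev(train_col)}

def apply_norm(col, params):
    # apply previously fitted parameters to any split: validation,
    # test, or subsequent data arriving for inference
    return [(x - params["mean"]) / params["std"] for x in col]

def drift_report(new_col, params):
    # crude drift check: compare the stored training statistics
    # against the same statistics measured on subsequent data
    return {"mean_shift": mean(new_col) - params["mean"],
            "std_shift": stdev(new_col) - params["std"]}

# fit on the training split only; this small dict is the only
# artifact that needs versioning to reproduce the experiment
train = [1.0, 2.0, 3.0, 4.0]
params = fit_norm(train)

# the same dict reproduces the transform on later data
validation = [2.0, 5.0]
normalized = apply_norm(validation, params)
```

Versioning the small params dictionary, rather than the gigabyte-scale source data, is what makes an experiment reproducible: reapplying it to the saved source set or to newly arriving data yields consistently processed results.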

Baby Please Don’t Go — Them

I’ve had a few misgivings at times because, to be honest, Automunge didn’t start out with what I would consider best-in-class software design methodology. No systems engineering requirements assessment was performed; development was, well, very iterative in nature. (Just check out some of our essays from last year if you’re curious.) I think the reason we were able to make this work is that we had a very clear boundary for the inputs and outputs we wanted to achieve, namely the transformation from raw tabular data to numerically encoded sets ready for direct application of machine learning. And we had a clear idea of the mainstream machine learning workflow, thanks partly to Andrew Ng’s Coursera class as well as some of my early forays into Kaggle competitions. So we were never just building for the sake of building; we already had a skeleton of a set of requirements and a workflow, and all of the work in the last year has been the act of iteratively fleshing that out. Along the way I think it’s fair to say that we’ve made some material improvements to the traditional Python-based workflow for preparation of tabular data for machine learning. We’ve created the skeleton for reproducible feature engineering transformations, enabling things like data versioning for preprocessing experiments. We’ve created a “feature store” of feature engineering transformation methods. We’ve addressed an unmet need for tracking data property drift, built into the original processing of training data. And we’ve done all of this in the context of a simple and compact open source Python library. Now we’re just trying to get some early users to experiment with the tool. We welcome feedback, we welcome requests for specific types of feature engineering methods to build into our library, we welcome users.
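For a sense of what the “family trees” of transformations mentioned earlier might look like, here is a minimal sketch in plain Python. Again, this is not Automunge’s actual specification format; the tree layout and transform names below are hypothetical, chosen only to illustrate the idea of root transforms whose output columns feed designated offspring transforms.

```python
import math
from statistics import mean, stdev

# hypothetical family tree: each root transform lists offspring
# transforms to be applied to its output column
family_tree = {
    "log": {"offspring": ["zscore"]},  # log transform, then normalize the result
    "zscore": {"offspring": []},       # leaf: no further processing
}

transforms = {
    "log": lambda col: [math.log(x) for x in col],
    "zscore": lambda col: [(x - mean(col)) / stdev(col) for x in col],
}

def apply_tree(col, root):
    # apply the root transform, then recurse through its offspring,
    # each operating on the parent's output column
    out = {root: transforms[root](col)}
    for child in family_tree[root]["offspring"]:
        out.update(apply_tree(out[root], child))
    return out

# produces both a "log" column and a z-score-of-log column
columns = apply_tree([1.0, math.e, math.e ** 2], "log")
```

The appeal of the tree structure is that a whole pipeline of derived columns is declared as data rather than code, which is also what makes it easy to store, version, and reapply.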
This whole project started as a trust-building exercise; there is no attempt at user lock-in, and all of the parameters and steps of transformations are returned in an open and unencrypted Python dictionary. The hope is that we can meet an unmet need for data scientists, and who knows, perhaps down the road we can build a company off of adjacent services. That part is still to come, but for now Automunge is in a very useful and well-tested state. We invite all participants of the TWiML AI Platforms 2019 conference to try it out. Cheers.

Happy Halloween!

Books that were referenced here or otherwise inspired this post:

The Mythical Man-Month — Frederick Brooks, Jr.


As an Amazon Associate I earn from qualifying purchases.

Albums that were referenced here or otherwise inspired this post:

The “Angry” Young Them! — Them




Nicholas Teague
Automunge

Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.