Getting Creative With Constraints

No one said it was going to be easy

Nicholas Teague
Nov 13, 2019 · 10 min read
This is what quitting your job to start a company looks like.

There’s a tremendous kind of cognitive dissonance that comes into play in early stage entrepreneurship. Lacking the slightest bit of validation from a user base, the Automunge project has formed this intense kind of (non-quantum) superposition in my expectations, whereby it seems wholly within the realm of possibility that it could make me a millionaire. Or homeless.

I’ve Got a Feeling — The Beatles

Paul Graham has written an essay advising entrepreneurs to take stock of their circumstances and recognize whether their startup should be considered default alive or default dead. In other words, on their current trajectory, considering spend rate and revenue growth, will they make it to profitability on the money they have left? Well, expectations aside, the reality of circumstance is that Automunge is firmly in the camp of default dead at the moment.

I mean don’t get me wrong, one of the strengths of the project so far is just how far we’ve come with so few resources; our biggest expense for most of the way has been buying books and such. I think it says a lot that we’ve accomplished all of this without taking a penny of investment, and without a single hire. To show for it we have a robust software package with novel functionality meeting several unmet needs for data scientists in a machine learning space that, let’s be real, is going to transform our entire economy within the next decade. We have staked some claims on potentially valuable intellectual property. Oh, and along the way we’ve also self-published online a book of essays, From the Diaries of John Henry.

That’s not to say we haven’t had any expenses. Heck, just a few weeks ago we invested in our first GPU-powered data science rig (basically a gaming PC with a good GPU and some extra RAM, repurposed for training neural networks). I had high hopes for this PC, first of all to give me some needed simplicity in managing the complexities of running data processing experiments on large batch sets in the context of Kaggle competitions, but also because I was starting to notice a little slowdown in my IPython / Jupyter notebook development sessions on my MacBook, and was hoping that a beefier processor and more RAM might take care of that. I was a little surprised (and somewhat disheartened considering the expense) to discover that the PC notebooks were in fact even less functional. Where on the MacBook I had the inconvenience of a little slowdown, on the PC the ability to navigate and reasonably edit (my admittedly somewhat ridiculously large 20,000+ row) cells was fully impaired. I’m not sure if this difference has to do with operating system or browser (albeit the issue persisted across different browsers), but it was certainly a notable result. (To be fair, I don’t think anyone ever intended Jupyter notebook cells to exceed 20,000 lines of Python; I’m probably going to need to graduate to a more mature code editor before much longer.)

We also had another investment hit our books by way of an attempt at bootstrap marketing via attendance at the recent, and inaugural, AI Platforms conference hosted by the This Week in Machine Learning podcast. I’m probably guilty of not putting enough effort into building a professional network, and kind of starting over with a recent relocation to Central Florida didn’t help. The conference was partly an attempt at rectifying this circumstance, and it wasn’t a total failure; I did successfully walk away with the business cards of three new contacts after all. I mean I’ve said it before, networking is not really my strength; the hope has always been that these types of writings can contribute to building a network by proxy. Which was actually the big deliverable from the conference: a whole new essay inspired by the keynote interview of Andrew Ng (based on his references to the Fred Brooks essay No Silver Bullet). I was really happy with how the essay turned out; I mean usually I’m happy if I don’t accidentally, like, severely offend some whole group of people in the process, the old foot in the mouth sometimes catches up with me :). But yeah, if you’d like to check it out it’s available here:

I recognize that when you consider the cost of travel and lodging (a really swanky Airbnb, but I won’t get into that), the return on investment of a for-free creative writing project, now with up to seven readers, well it kind of raises the question “is this an appropriate use of resources?” I’ve had a long and hard talk about this with my financial team in the mirror, and the way I look at it, even if nothing comes of these contacts, or the essay for that matter, the validation of the project was quite valuable. I not only came away from the conference with reassurance that what I was working on was valuable and unmet by industry, heck, I even introduced a new feature based on some of the discussions. Rolled out in version 2.62 was a new option for the postmunge(.) function via the “driftreport” parameter, which reports drift of data set properties between the original training data and subsequent data (such as data intended to generate predictions from a trained model). The implementation is fairly straightforward: since we were already tracking and storing data set properties of the training data, all that was really needed was a comparable extraction of properties for subsequent data, using previously defined functions, for comparison. There is a slight bit of complexity originating from the fact that, based on properties of the data, there may be different columns returned (such as in a one-hot encoding for instance); we have what I believe is an elegant solution by way of reporting properties not based on the returned columns but aggregated by the original pre-transform columns. We currently offer a printout report of the assessment; the intent is to eventually return a version of this report in one of the objects returned from postmunge (I already have a thought about how I am going to do this, just waiting until I have some time to focus because I want to make sure I get it right. I will probably end up also returning the optional aggregated postmunge feature importance evaluation results in the same object; I’m kind of constrained by backwards compatibility, but no worries, I have a plan).
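Here’s a rough sketch of what calling the new parameter looks like. Treat it as illustrative: the import convention follows the README of the time, the file names are made up, and the exact contents and ordering of the returned tuples are documented in the README (so the indexing shown here is an assumption rather than gospel).

```python
import pandas as pd
from Automunge import Automunger

am = Automunger.AutoMunge()

df_train = pd.read_csv('train.csv')  # original training data (hypothetical file)
df_test = pd.read_csv('test.csv')    # subsequent data, e.g. for inference (hypothetical file)

# automunge(.) prepares the training data and returns a tuple of objects,
# including the postprocess_dict that records the training set's properties
# (per the README convention, postprocess_dict is the final returned object)
automunge_returns = am.automunge(df_train)
postprocess_dict = automunge_returns[-1]

# postmunge(.) with driftreport=True prints an assessment of drift in data
# set properties between the original training data and this subsequent
# data, aggregated by the original pre-transform columns
postmunge_returns = am.postmunge(postprocess_dict, df_test, driftreport=True)
```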

Speaking of constraints, another constraint came up when documenting my recent Kaggle competition entry to the IEEE fraud detection competition in the essay “isFraud?” (How’d you like that segue, pretty smooth right?) One of the challenges of this Kaggle competition was that the size of the training and test data sets, several gigabytes, started to push the evaluation towards the limits of in-memory operations. For example, I generally found that at about 1,000 feature columns and 500,000 rows the hyperparameter tuning with grid search / early stopping could only handle around 20% of the data at a time, although for final model training it could handle the whole set. But then when I increased the number of features (by way of additional supplemental features generated with Automunge), well, at about 2,300 feature columns (the most I tried) I was only able to run final model training on about half of the training data at a time. Funny enough, even when trained on only half of the data this expanded set of features was one of my better efforts; there are interesting tradeoffs between number of features and memory constraints. Generating predictions on the test set turned out to have similar issues, in that generating predictions (especially with xgboost) for the entire test data set at once ran into constraints. I ended up running an iterative method in which I iterated through chunks of the raw data set, prepared those chunks for generating predictions with postmunge(.), and then ran predictions with the trained model one chunk at a time. Because the returned set of predictions for each chunk was a much smaller size (just a single column), I could aggregate the predictions in memory while only carrying the input chunks one at a time. I’ll show a code sample here to demonstrate. Note that a version of this chunked iteration can be repurposed for Automunge to prepare training data as well, which may be beneficial given the difference in processing time between automunge(.) and postmunge(.): the automunge column evaluation functions carry some overhead, so one strategy is to populate the postprocess_dict with automunge(.) using just one subset of the training data and then run the rest of the training data through postmunge(.) to consistently prepare it.
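Here’s a rough sketch of that chunked inference loop, reusing the am and postprocess_dict objects from the sketch above. The chunk size, file name, and the trained xgboost model (here model, assumed to follow the scikit-learn style API with predict_proba) are illustrative assumptions, and the index into the postmunge(.) returned tuple follows the README convention of the time.

```python
import numpy as np
import pandas as pd

chunksize = 100000   # illustrative chunk size
predictions = []

# iterate through the raw test data in chunks so only one chunk of
# prepared features is held in memory at a time
for df_chunk in pd.read_csv('test.csv', chunksize=chunksize):

    # prepare the raw chunk consistently with the original training data
    postmunge_returns = am.postmunge(postprocess_dict, df_chunk)
    test_chunk = postmunge_returns[0]  # prepared feature set (first returned object)

    # generate predictions for just this chunk; the result is a single
    # column, which is cheap to keep in memory
    predictions.append(model.predict_proba(test_chunk)[:, 1])

# aggregate the per-chunk predictions into a single array
predictions = np.concatenate(predictions)
```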

Since making some awkward transitions here, I’ll just totally change directions and talk about another update rolled out recently (hey, these essays can’t all be a work of art). I recently rewatched some of the original Andrew Ng lectures from his Stanford Machine Learning course, and it turns out that when he made recommendations for numerical set feature scaling, his suggested default was actually different from what we have been using. More specifically, our default numerical scaling so far has been the category ‘nmbr’ for z-score normalization, in other words x’ = (x - mean) / (standard deviation). We also have options for min-max scaling as category ‘mnmx’, e.g. x’ = (x - min) / (max - min). It turns out Ng had recommended kind of a cross between z-score and min-max called “mean scaling”, although he did note that z-score was also a suitable option. We have now rolled out a new category for numerical mean scaling as ‘mean’, where x’ = (x - mean) / (max - min). Basically, mean scaling and min-max scaling both retain the distribution properties while shifting to a designated range, where in mean scaling +/- values are returned and in min-max scaling the distribution is fit within the range 0–1. Z-score normalization may also have +/- values, and its range will possibly be a little wider than mean scaling. Really it’s up to user preference. If you want to apply mean scaling instead of z-score normalization, you can either assign distinct columns to ‘mean’ in the “assigncat” object passed to automunge(.), or, if you prefer ‘mean’ as the default, you can repopulate the “transformdict” family tree primitives for ‘nmbr’ on the basis of ‘mean’ instead of ‘nmbr’. Here I’ll demonstrate:
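A rough sketch of both approaches follows. The column names are hypothetical, and the family tree entries are an illustrative assumption based on the primitive naming convention in the README (parents, siblings, auntsuncles, cousins, and so on), so check the documentation for the exact default ‘nmbr’ tree before copying this verbatim.

```python
# option 1: assign specific (hypothetical) columns to the 'mean' category
assigncat = {'mean': ['column1', 'column2']}

# option 2: redefine the 'nmbr' root category's family tree so the 'mean'
# transform serves as the basis instead of the default z-score 'nmbr'
# (primitive entries shown are a sketch, not the verbatim defaults)
transformdict = {'nmbr': {'parents'       : ['mean'],
                          'siblings'      : [],
                          'auntsuncles'   : [],
                          'cousins'       : ['NArw'],
                          'children'      : [],
                          'niecesnephews' : [],
                          'coworkers'     : [],
                          'friends'       : []}}

# pass either or both to automunge(.); the full returned tuple is
# documented in the README
automunge_returns = am.automunge(df_train,
                                 assigncat=assigncat,
                                 transformdict=transformdict)
```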

Of course the goal here is not just to overwrite default numerical feature scaling; it’s also to overwrite default startup status from default dead to default alive. (See what I did there? I know, this writing deserves a Pulitzer, I agree.) I’ve drawn from the essays of Paul Graham a few times in this book. As one of the original founders of Y Combinator, an “Ivy League” startup accelerator so to speak, he has provided through his writings a valuable window into the mindset and discipline of early stage entrepreneurship, and I’ve learned a few valuable lessons from this resource. This was actually part of the inspiration for enrolling in their “Startup School” program, a ten-week series including lectures and even group sessions in which you get to network with other entrepreneurs via video chat. It was certainly reassuring to see who my ‘competition’ is for VC funding :). I’m afraid that was my biggest takeaway though. I was not invited for an interview at YC.

I’ve Got a Feeling — Nicholas Teague, music by John Lennon and Paul McCartney

Given all of these constraints, and, let’s be honest, the lack of even the slightest bit of traction in establishing interest from a user base, well, it kind of raises the question “where do we go from here?” I’ll be honest, I’m kind of biased with respect to next steps, because between all of this coding and writing I happen to be having the time of my life. I can sleep in as late as I want, I can work from wherever the whim takes me, whether a coffee shop or a state park or really wherever. I’m reading more, I’m writing more, walking more, playing more piano; really, I’ll take this over wage slavery any day of the week. I mean eventually I’m going to need a business model of course, but that’s coming. I do have some further validation to report: I was recently invited to attend the NeurIPS conference in Vancouver, which I’m really excited about. It’s kind of ground zero for machine learning researchers, and not everybody gets in. So yeah, that’s pretty exciting. Who knows, perhaps I’ll even get like a business card or two or even an essay out of it. Anyhoo, until next time. Ciao.


The flashing colored lights are how you can tell this is a very serious rig for a very serious professional.


Albums that were referenced here or otherwise inspired this post:

Let It Be — The Beatles

As an Amazon Associate I earn from qualifying purchases.



For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

