Learning to Learn with Automunge v2.0

The Ramblings of the Midnight Rambler

Nicholas Teague
Automunge
15 min read · May 22, 2019

--

St Jerome in a Cave, Unknown artist, 16th century

For those who haven’t been following along, I’ve been using this forum in recent months to document the development of Automunge, an open source Python class which can be used to automate and/or simplify the final steps of processing tabular data prior to the application of machine learning, all with a simple push-button operation. The tool performs the numerical encoding and feature engineering of data that is the traditional purview of data scientists, returning a series of sets suitable for a generic machine learning evaluation which can be directly fed into the machine learning framework of your choice. Taking as input a pandas dataframe of tidy data (single column per feature and single row per observation), a user has the choice between automated processing based on inferred properties of the data or column-specific specification of feature engineering methods. In addition to the feature engineering transformations, Automunge also allows automated infill of missing data points in a set with the use of machine learning predictions in a fully generalized and automated fashion, what we call ML infill. Further, Automunge is more than just a static tool; it is intended as a platform allowing a user to build on simple data structures to incorporate their own custom feature engineering transformation functions while still taking advantage of the extremely useful built-in functionalities for feature importance evaluation, infill, dimensionality reduction, oversampling, and most importantly the fully consistent processing of subsequently available data with just a simple function call. In short, we make machine learning easy.
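To give a flavor of that push-button operation, here’s a minimal sketch of an automunge(.) call. This assumes the AutoMunge class has been imported and instantiated as am, and the variable and file names are illustrative; the exact set of returned objects is best confirmed in the documentation.

```python
import pandas as pd

# assumes the AutoMunge class definition has been imported and instantiated:
# am = AutoMunge()

df_train = pd.read_csv('train.csv')  # tidy data: one column per feature,
                                     # one row per observation

# push-button operation, with 'target' as the header of the label column;
# automunge(.) returns a series of sets (processed train data, encoded
# labels, a validation split, a dictionary for consistently processing
# subsequent data, etc.), collected here in a single tuple rather than
# unpacked, since the exact order may vary between versions
returned_sets = am.automunge(df_train, labels_column='target')
```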

After taking a week off of our (traditionally somewhat hectic) publishing schedule to focus on software development, this week’s update to version 2.0 will have a few more new features than usual I suspect, which we’ll use this essay to document. In the interest of keeping the interest of a reader interested, I’ll also use this essay as an opportunity to share a few thoughts in response to the book Robot-Proof by Joseph Aoun, addressing the implications of emerging technologies in machine learning for our higher education system. To be honest the book sort of reminded me of Dave Eggers’s A Heartbreaking Work of Staggering Genius in that the opening chapters were really the standout points and then after that it was kind of uneven, so I’ll probably spend more time on those aspects that kept my attention (hey, writing a book is hard!). The book selection was partly inspired by a neat concept from a kind of BYO book club I recently joined, in which everyone picks their own work to share around some theme. I’m a touch removed from my college days, but I’ve always looked back fondly on the experience and was excited to get to revisit the education system from a different perspective. The questions raised throughout the book are pretty fundamental to the future of higher education. How can or should we conduct higher education in the context of the progressive automation of knowledge work? In what domains will humans retain some advantage over our robot brethren, and how can we strengthen those attributes in our students? This essay will tackle these questions partly by drawing on Aoun’s writing and partly with a few of my own musings. Oh, and probably worth a quick note that I suspect I’m going to be preoccupied for a little while, so I may not be posting with as much frequency in coming weeks. Cool, well, introductions complete, without further ado.

Muddy Waters & The Rolling Stones — Baby Please Don’t Go

The first update of the week was inspired by the workflow of our last demonstration notebook. We have been using a public dataset associated with the Boston housing market, mostly because it’s just an arbitrary, easily downloadable tabular set which can be imported from Keras. The way we had structured our tool was such that automunge takes as input two pandas dataframe sets: a “train” set intended for use to subsequently train a predictive model, and a “test” set intended to generate predictions from that model. It is a quirk of automunge that we require any labels (e.g. the target values for training) to be included in the train set with a specified column header, and we assume that any validation set would be carved out from the passed train set. However, in working with this Keras set we realized that a user may have a specific validation set in mind, and so they may prefer the ability to pass labels in both the train and test sets and have them consistently processed. Well, we found a simple solution with only a mildly complex implementation, and as an update a user now has the ability to consistently process labels included as a column in both the train and test sets, both for the original application of the automunge(.) function as well as the subsequent processing via the postmunge(.) function. The method requires consistent header title conventions between train and test sets, and we simply added a new returned set from the functions associated with the test set labels. This addition of a new returned set is the first time we’ll break backward compatibility, so we’re going to go ahead and call this new update version 2.0. The whole point of this is to keep the software consistent with mainstream workflow; we don’t want to be a solution in search of a problem (e.g. the Segway approach).
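Here’s a rough sketch of the new workflow, assuming the am instance from the earlier example; parameter names are illustrative of the convention demonstrated in the companion notebook.

```python
import pandas as pd

df_train = pd.read_csv('train.csv')  # includes the 'target' label column
df_test = pd.read_csv('test.csv')    # may now also include 'target'

# labels present in both sets are consistently encoded, and the returned
# sets now include an additional entry carrying the processed test labels,
# the backward-compatibility break that motivates the 2.0 designation
returned_sets = am.automunge(df_train, df_test=df_test,
                             labels_column='target')

# subsequent data containing labels can likewise be processed via
# postmunge(.), given the postprocess_dict returned from the original
# automunge(.) call, e.g.:
# postmunged_sets = am.postmunge(postprocess_dict, df_newdata,
#                                labelscolumn=True)
```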

image via Facebook/Segway Polo Club of Barbados

Ok, segway alert: let’s turn focus now to our interesting book Robot-Proof (see what I did there?). If we are to reconsider how we conduct higher education, it’s probably important that we stay grounded in the reality of the obstacles before us; we don’t want to be proposing solutions in search of a problem. After all, while the two-wheel stabilization technology of the Segway may have been totally revolutionary from a locomotion standpoint, it was well overpriced for the very minimal improvement over the alternative of simply walking with the two legs available to most of us, especially considering that scooters can be had (or now even rented) very, very cheaply. In the education domain, consider that college students have always had the option of simply visiting the library, and the chief differentiators of the university system were the signaling aspects associated with college selectivity in admissions, the credentialed gatekeeping reality of the labor market, and the social benefits of immersion with peers. With the advent of the internet, the range of low-cost alternatives is only growing, as students now have access to a host of online MOOCs (massive open online courses), YouTube lectures, and Wikipedia, and heck, if we want to network with our peers we can always say hi on Twitter. The point is that while the cost basis of a college education continues to climb well above inflation, the value proposition is steadily being chipped away. It will be one of the proposals of this essay that for universities to compete going forward they will need to not only demonstrate the ability to prepare students for the new economy, they will need to leverage many of those emerging concepts themselves in their delivery.

The Rolling Stones — Tumbling Dice

Ok, let’s turn back to Automunge, going to go with the traditional alternating paragraph / alternating topic format here in case you haven’t picked up on that yet. (It’s a thing because I say it is.) Cool, well the next update for Automunge 2.0 was building on some functionality we rolled out in our last essay, specifically the options for dimensionality reduction via principal component analysis (PCA). As a quick refresher, we demonstrated methods last time for a user to apply a type of entity embedding to the returned sets such that the number of columns in the output could be reduced, in a way using linear orthogonal projections to pare down the amount of redundant information in the returned columns. With our new update, we’ve implemented an automated PCA application for cases where our amount of training data is below the scale that could be supported by our number of features. We’ve done this currently in a sort of arbitrary fashion for cases where the number of features is more than 15% of the number of observations; I suppose the intent is to conduct some more research on this matter to home in on improved heuristics, but we wanted to get something in place in the meantime (a user can also pass a different ratio if desired). We’ve also introduced a few different methods for our default style of PCA: where previously we defaulted to linear PCA, we now allow for Sparse PCA, with expected memory improvement for dimensionality reduction, and also Kernel PCA, for capturing improved non-linear transformations for all non-negative sets (all via scikit-learn as currently implemented). Note that we’ve also added a new min-max scaling method intended to support Kernel PCA such that a floor is put in place for test data points below the minimum of the train data (which can be assigned with the new category ‘mnm6’). The whole point of this is to allow a user to conduct dimensionality reduction such that our predictive models may be applied in conditions flirting with the curse of dimensionality. Consider that the surface area of a unit hypersphere increases with the number of dimensions only up to a point and then starts to fall with added dimensions. Well, that’s kind of what our machine learning algorithms are having to deal with: when the number of features exceeds what can be supported by the scale of our training data, it’s kind of like our model is trying to extract information from the vanishing surface area of a unit hypersphere with an increased number of dimensions. So we can apply PCA, a kind of entity embedding via unsupervised learning, to reduce the dimensionality of the train and test sets. Simple.
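To make the heuristic concrete, here’s a minimal sketch. The automatic trigger follows the 15% ratio described above; note that the override parameter name shown is purely hypothetical, included only to illustrate the idea of passing a different ratio.

```python
# the default heuristic: automated PCA is applied when the feature count
# exceeds 15% of the observation count
n_rows, n_features = df_train.shape
if n_features > 0.15 * n_rows:
    print('automated dimensionality reduction would be applied')

# a user can pass a different ratio if desired
# ('featureratio' is a hypothetical parameter name for illustration)
returned_sets = am.automunge(df_train, labels_column='target',
                             featureratio=0.25)
```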

Now in turning focus back to the education domain, let’s draw a kind of spurious analogy between the dimensions of training a machine learning model and the dimensions of training a student, which we’ll primarily do for literary effect. Traditionally, the literacies built through the education system included, you know, the three R’s: reading, writing, and arithmetic, so to speak. One of the postulates from Robot-Proof is that these fundamental basics of literacy will be insufficient going forward, and that students will need to build additional core dimensions of capability such as working with data (hey, I wonder if anyone can recommend any good tools for that), understanding technologies (a good rule of thumb is that if you can describe something from first principles, that’s a good start), and then those kinds of cognitive skills that are more the purview of the humanities: things like systems thinking, entrepreneurship, cultural agility, and yes, dare I say, even those liberal arts varieties of critical thinking skills (I seem to recall a Steve Jobs presentation discussing the intersection of technology and liberal arts, so yeah, some companies have done pretty well with these domains). More centrally, reducing these dimensions to their core, a thesis of the work is that the key areas where workers will still have leverage versus the machines are in applied creativity and mental flexibility. Any domain where we’re simply applying equations and rote frameworks is ripe for automation; we’ll need the flexibility to carry concepts and interpolate between domains. And we’ll need a touch of the artist’s creativity to go with it.

The Rolling Stones — Midnight Rambler

Ok, so I’m not sure if this dual alternating paragraph / topic format is really working, but you know what, the die is cast, I’m just going to stick with it. Heck, a life without risk is not really a life well-lived. So let’s talk more about Automunge’s new PCA functionality. Another update this week was an interface for making use of the scikit-learn library within the PCA functionality, keeping with our design philosophy of allowing a user options between automation and specification (if this was a self-driving car we would still have a steering wheel). So a user now has the ability to pass commands to the application of automunge that elect the type of PCA to be applied (PCA, SparsePCA, and KernelPCA) as well as to specify the parameters from scikit-learn associated with each of these methods; I’ll demonstrate the options in the companion Colaboratory notebook linked below. Further, we’ve also added the comparable ability to pass function parameters to the ML infill and feature selection machine learning techniques, which currently are based on random forest methods via scikit-learn. To be honest, I’m not really a full expert on all of the hyperparameters from scikit-learn, so now a user with some more expertise has the ability to build on these methods. Eventually I’m sure we’ll build a car without a steering wheel; I’ll let you know when I’m ready for that leap.
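Here’s a sketch of what that specification might look like, assuming a command dictionary of roughly this shape; the key names follow the general pattern but should be confirmed against the companion notebook.

```python
# select the PCA variant and pass through scikit-learn parameters, plus
# hyperparameters for the random forest models behind ML infill and
# feature selection
ML_cmnd = {'PCA_type': 'KernelPCA',        # or 'PCA' / 'SparsePCA'
           'PCA_cmnd': {'kernel': 'rbf'},  # forwarded to sklearn's KernelPCA
           'MLinfill_cmnd': {'RandomForestClassifier': {'n_estimators': 100},
                             'RandomForestRegressor': {'n_estimators': 100}}}

returned_sets = am.automunge(df_train, labels_column='target',
                             ML_cmnd=ML_cmnd)
```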

One of the challenges faced by our universities is the tremendous gravity of legacy curriculums. Between the semester-length classes, the 4.0 GPA weightings, and the formalized monolithic colleges and degree programs, well, they leave very little flexibility to experiment. As a result, much of the experimentation is taking place outside of the traditional university system, and the pace of these experiments is only increasing. Sooner or later one of these innovators is going to iterate to a working solution that materially disrupts the incumbents, and our public institutions will have to make some hard choices given their fixed cost structures and, let’s be honest, somewhat profane administrative overhead in mainstream practice. Robot-Proof talks about a transition from compartmentalized education experiences to life-long learning programs as the only really viable means for workers to keep up with the accelerating pace of innovation. One proposal I would make is that perhaps our educators can learn something from, you know, the Girl/Boy Scouts, with merit badges demonstrating proof of goal fulfillment in various domains. Heck, the whole concept of lectures and testing has been stretched to its limit; there are other means to facilitate learning, and I suspect part of the dominance of a university lecture schedule is the somewhat tangible tuition-for-time exchange. What if tuition didn’t buy you time, what if it bought you lifelong access to resources for education and exploration at a self-guided pace? I mean, if we’re going to expect students to take on practically life-long debt in pursuit of a degree, we should certainly provide some degree of life-long services. This is a service I would gladly pay for. There’s a huge untapped market of professionals that may benefit from access to a university library system or resources; let’s freaking sell it to them!

The Rolling Stones — Sweet Virginia

Another major undertaking for this week’s Automunge 2.0 rollout was the development of the infrastructure for feature engineering transformations of labels, to go along with our processing of training set columns. I recall an old TWiML podcast interview of Google engineer Jeff Dean in which he described the benefit to a machine learning project of presenting labels to the algorithms alongside adjacent features subject to prediction. I mean, this is one area where advanced machine learning frameworks like TensorFlow have a material advantage over scikit-learn, for instance, in that a model may be prepared with multiple output neurons generating simultaneous predictions. That kind of inspired our development of our generic label concepts, and we have now built in the potential for users to make use of our generational family tree framework to assign a series of transformations to a label column just as was done for feature columns. Of course, since the automunge predictive components that make use of the labels (such as feature importance evaluation) are currently built off of scikit-learn implementations, we have to compromise in that only one label column is used internal to automunge even if multiple columns are returned to a user. This is achieved with the addition of a new ‘labelctgy’ entry in the “processdict” object which a user can pass to automunge for custom label-processing methods; I’ll demonstrate in the companion notebook linked below.
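As a rough sketch of that convention, here’s a processdict along these lines; the entry names reflect the general documented pattern but are illustrative, ‘newt’ is a hypothetical custom category, and the transformation function body is a placeholder.

```python
def process_newt(df_train, df_test, column, postprocess_dict):
    # placeholder for a user-defined transformation applied consistently
    # to the train and test sets
    return df_train, df_test, postprocess_dict

# the 'labelctgy' entry (new in 2.0) tells automunge which derived column
# to use internally for predictive methods such as feature importance
# evaluation, even when multiple label columns are returned to the user
processdict = {'newt': {'dualprocess': process_newt,
                        'singleprocess': None,
                        'postprocess': process_newt,
                        'NArowtype': 'numeric',
                        'MLinfilltype': 'numeric',
                        'labelctgy': 'newt'}}

returned_sets = am.automunge(df_train, labels_column='target',
                             processdict=processdict)
```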

This essay has talked a bit about preparing students for the workforce in the context of transitions in the nature of work, but let’s turn to a second and somewhat symmetrical question: how can universities leverage these same emerging technologies and nature-of-work paradigms for their own use? This is one area that Robot-Proof doesn’t spend a lot of time on; perhaps Aoun is saving that for the sequel. From a high-level standpoint, it occurs to me that some of the more obvious routes for leveraging emerging tech are in domains such as information indexing and retrieval (e.g. search engines), student interactions and coordination (hey, Facebook was founded as a resource for Harvard students, remember?), and perhaps eventually even algorithmically tailoring the pace and content of curriculums to the strengths/weaknesses of individual students. I’d like to offer here a new hypothesis: in the intersection of emerging tech such as machine learning with our education system, there is just as much leverage to be gained in addressing the strengths/weaknesses of the teacher as there is for the students. I suspect the former deserves more attention. Heck, (as an outsider) I speculate it doesn’t help that even in established schools teachers are primarily solo practitioners with only the bare support of a few administrators. It’s just the nature of the job that teachers don’t get to collaborate with their immediate peers very often. This is just as relevant to K-12 as it is for higher education. If we can improve the outcome of a student, we impact that one student; if we can improve the outcome of one professor, well hey, we now have a multiplier. That’s leverage!

The Rolling Stones — Not Fade Away

For our last Automunge 2.0 update, well, let’s just say I saved the best for last. Although most of this functionality had been rolled out several months ago, it occurred to me that we never really gave it the sufficient treatment it deserved in this forum, so yeah, let’s address the elephant. Automunge has invented a method for oversampling data in conditions of uneven distribution of training data, aka the class imbalance problem. Let’s say we have a dataset intended to generate classification predictions for whether a student should be placed in a “gifted” program (purely hypothetical case), and we want to train a model to evaluate our student population. Well, you know, by definition those students considered gifted in some domain will only have a few examples in the training set. A machine learning algorithm may benefit from a higher prevalence of outlier data points in the training operation, and so we’ve created a method whereby label categories with lower prevalence in the training set are duplicated to facilitate a more equal distribution of classes, which can be implemented by activating the ‘LabelFrequencyLevelizer’ argument passed to automunge(.). Further, we’ve also developed a comparable method for numerical set labels, taking advantage of our standard deviation bins. If this oversampling method is activated for numerical labels, such as may be the target of a linear regression problem, the function will round out the dataset with duplicates of outlier points. Voila! I guess the takeaway here is that outlier classes require and deserve both special focus and special treatment.
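Activation is a one-liner; here’s a minimal sketch, assuming the am instance from the earlier examples.

```python
# oversample underrepresented label categories toward a more even class
# distribution; for numerical labels the same idea is applied over
# standard deviation bins, duplicating rows with outlier label values
returned_sets = am.automunge(df_train, labels_column='target',
                             LabelFrequencyLevelizer=True)
```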

Aoun reminds us that the institution of the university has evolved to serve multiple imperatives. In addition to education, universities also are sources of research advancing the nation’s scientific agenda. In his 1945 plan laying out an agenda for university-based research, Dr. Vannevar Bush proposed a system in which government funding would flow into universities for four key purposes: to create new knowledge, educate the next generation of scientists, create new products and industries, and advance the public welfare. Consider the number of startup companies and new economy jobs that have been facilitated by commercialized university research, with applications ranging from sports beverages to search engines and everything in between. Our universities have become one of our nation’s key competitive advantages, and the certain coming transitions in how they are conducted must build on these imperatives as well. I’ll offer in closing that it is one of the great new enabling factors of the paradigms of machine learning that we will be increasingly able to efficiently incorporate a tailored experience to the quirks of each of our students and teachers, playing to their respective strengths and weaknesses. If a student is gifted, we’ll be able to connect them to peers who may be well-suited to their particular type of extraordinary intellect. If a student is deaf or blind, we will be able to build off of the existing curriculum and reach them through other means. Through experiential learning we’ll facilitate cognitive flexibility and creativity, and in parallel we’ll help our students explore a world of new capacities for interacting with data and new types of social contracts. And yes, even with all of this change, we can still save a space for the enthusiastic cheering on of our university football team. (Go Gators!)

Companion Colaboratory Notebook

This notebook demonstrates the new features of version 2.0; check it out!

Books that were referenced here or otherwise inspired this post:

Robot-Proof — Joseph Aoun


A Heartbreaking Work of Staggering Genius — Dave Eggers


(As an Amazon Associate I earn from qualifying purchases.)

The Rolling Stones — Shine a Light
