A New Kind of ML

Shallow Learning and other hypotheses

Nicholas Teague
Automunge
Dec 18, 2019 · 15 min read

--

I recently had the pleasure of attending the 2019 Neural Information Processing Systems conference (aka NeurIPS), by my understanding more or less the top conference for machine learning researchers. I speculate that I have the new ticket distribution lottery system to thank for the privilege, as the conference has some exclusivity of attendance, and the privilege certainly wasn’t due to the submittal of my first academic paper, on the subject of dropout regularization, which was somewhat poorly received to be honest (it wasn’t bland enough, or more importantly, its theory wasn’t sufficiently experimentally validated). It would be hard to overstate just how impressive the experience was: every presentation representing some new frontier of research, every conversation containing seeds of some new channel for exploration, every poster the distillation of months of research, every attendee an exceptional mind in their own right. A hyper-dense stream of inspiration.

For those that haven’t been following along, I’ve been using this forum over the last year or so to document the development of Automunge, an open source platform for the automated preparation of tabular data for machine learning. The project started somewhat haphazardly as a learning and building experiment, but through the practice of building out between the defined boundaries of “tidy” tabular data as input and returned sets suitable for the direct application of machine learning, I believe the project has realized some material improvements to the traditional Python-based data wrangling workflow that precedes the application of machine learning.

In application, a user may defer to automation for minimal feature engineering methods based on properties of the data, or may alternately assign custom feature engineering methods, or even custom sets of feature engineering methods including generations and branches assigned with our “family tree” primitives. Such feature engineering transformations may be sourced from our internal library, which we are continuing to build out, or may be custom defined by the user for incorporation into the platform, making use of just a few simple data structures. I won’t try to list everything here, but some of the novel aspects of value include the segregated application of feature engineering methods between sets intended for training, validation, or “test” (for generating predictions from a model), based on properties derived from the training set, such as for a consistent basis of normalization and to avoid data leakage between training and validation data. Automunge also has some neat stuff to address conditions of class imbalance in labels, preparing training data for oversampling of segments with lower representation, which may be applied to categorical labels or even numerical labels based on binned aggregations. The list goes on, including automated machine-learning-derived infill, automated feature importance evaluation, automated evaluation of drift in distribution properties between the training data and subsequent data, and even automated label smoothing, such as with our new “fitted label smoothing” method (more on that below).

To be honest, I’m much more of a builder than a researcher, and a lot of the hypotheses that have contributed to the Automunge development journey remain to this day mostly that, a bunch of hypotheses. Some recent effort has gone into the beginnings of experimental validation, such as a foray into Kaggle competitions which I documented in the essay “isFraud?”, but it certainly wasn’t a winning entry, so some skepticism is perhaps justified. The reality is that the bulk of inventions that have facilitated the Automunge platform have been in the data wrangling arena, those activities that precede the application of machine learning, and that’s really the intended use case. Automunge is built as a resource for those steps of data preparation that immediately precede the training of or inference from a predictive model, currently focused on tabular data received in tidy form (one feature per column / one row per observation). And well, while I was sitting in on all of these incredible presentations at NeurIPS, some additional coherence of hypothesis sort of started to crystallize, which collectively I think may serve as a useful argument for the merits of the tool. I’d like to use this essay to present these hypotheses for consideration, along with a few NeurIPS-inspired musings along the way. And without further ado.

Hypothesis One — Efficient Learning

One of the highlights of the event was a full day workshop centered around the intersections of climate change and machine learning. (Followers of this blog may remember that I published an essay on the subject a few weeks ago.) The workshop was a collection of talks presenting research across a diverse collection of domains within proximity of the issue, for example including ML applications such as modeling vibrational modes of plasma for fusion energy generation, natural language processing adapted to the wind turbine maintenance domain, improving solar panel production, “subseasonal” time scale weather forecasting, time domain forecasting with exit barriers applied to weather balloon sensor deployments, graph networks applied to forecasting grid scale generation system deployment, applying reinforcement learning to ride-share self-driving car deployment, projecting tree species’ potential for adaptation to climate, and evaluating satellite imagery for various targets such as wind and solar deployment or methane content of polar ice caps, amongst several others.

In addition to the presentations, a notable panel was conducted in which several high-profile machine learning researchers lent their attention to discussions in support of the climatechange.ai community. The discussions were not highly domain specific; the focus was more on generalized prioritization and best practices for researchers looking to make an impact. There was a broad consensus on the need to proactively reach out to domain experts in other fields, whether that be by reading papers in other domains or, perhaps more importantly, by seeking collaboration with people with different and perhaps complementary expertise, maintaining a little humility about our ability to come in and understand the full complexities of a problem space just based on some collection of features and labels. It was noted that traditional researcher metrics of publishing count may not be the best measure of success, as a researcher’s job is not done when some paper is published, only when that paper has successfully had impact. There were some ethical questions raised as to what extent prioritization of support (such as access to educational resources) should consider the carbon intensity of a practitioner’s industry of application, a consideration which may prove thorny without some kind of due process. A highlight of the talk originated from a point raised earlier by Jeff Dean, that the framing of the problem should consider the amount of leverage available at different scales to impact carbon emissions, e.g. the difference between an individual’s average of ~15 tons/year, a Fortune 100 company’s impact on the order of ~50 million tons/year, or a large city’s impact on the order of ~100 million tons/year. The key point though, and probably the biggest action item, is the need for a community of those looking to contribute to the initiative, a need for us to help each other without the expectation of return, perhaps just simply the need to be kind.

With respect to the Automunge platform for automated data wrangling, there have been some recent updates to our methods that originated from a hypothesis related to this question of machine learning and climate change. More specifically, one of the questions posed by the researchers in the climate change workshop spoke to the carbon intensity of computing; there was actually even a python library poster at the workshop (“energyusage”) which generated reports of the energy intensity of passed code, which could definitely be one practical channel of validation for a user interested in this question. A related tutorial at the conference was conducted by Vivienne Sze on the topic of “Efficient Processing of Deep Neural Networks”, where she noted an interesting point: machine learning energy cost (presumably for both training and inference) scales quadratically with bit width, or I believe in other words with the floating point precision of a data set’s number representations. In Automunge we have a parameter “floatprecision” which can be passed to set floating point precision to 16/32/64 bit representations; this currently defaults to 32 bit, so a user interested in a condensed representation can also pass 16 here, trading some numerical precision for a smaller footprint. In theory, since one of the benefits of the tool is consistently scaled normalization, most data sets should be able to accommodate a condensed float precision such as 16 bit. Our boolean activations are defaulted to 8 bit representations. As hardware catches up with the ability to perform operations on sparsely represented sets, the intent is to support those aspects as well.
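To make the float precision idea concrete, here is a minimal sketch of downcasting a normalized numeric feature to a condensed representation, using plain pandas/numpy rather than the Automunge internals (the function and parameter names here are just for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative sketch, not Automunge internals: after z-score normalization a
# feature's values are centered near zero, so a condensed float16
# representation usually retains adequate precision at a quarter of the
# memory footprint of float64.

def normalize_and_downcast(series, floatprecision=16):
    """Z-score normalize a column, then cast to the requested bit width."""
    normalized = (series - series.mean()) / series.std()
    dtype = {16: np.float16, 32: np.float32, 64: np.float64}[floatprecision]
    return normalized.astype(dtype)

raw = pd.Series(np.random.rand(1000) * 1e6, name='feature')
condensed = normalize_and_downcast(raw, floatprecision=16)
print(condensed.dtype, condensed.memory_usage(deep=True))  # float16, ~2 bytes per entry
```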

More novel perhaps, another Automunge feature that we’ve discussed in this blog previously involves the capacity to represent categorical features with binary encoded sets, as an alternative to one-hot encoding: available for two-value sets with the ‘bnry’ category of transform (for a single returned column of 1/0), or for >2 category sets with the ‘1010’ transformation category. The hypothesis here is that by reducing the number of columns in our categoric representations, we are reducing the number of weights in the associated models. I speculate, and this is just speculation, that there is probably a threshold of category set size beyond which the efficiency of the dense representation is offset by the network depth needed to extract properties; some experimentation here would probably be beneficial. Further, we’ve recently extended the options of binary encoding to an even denser representation, in which a set of categoric columns can be collectively fed into a single binary transformation, further reducing the column width of the representation, which may be beneficial for battery constrained edge devices, though this too is mostly a hypothesis. Of course these methods are most beneficial for sets that aren’t being trained with category embeddings, which to be honest I don’t consider myself a full expert on, but from a practicality standpoint, if a user wanted to experiment with these methods in the context of selected category embeddings, they could do so by passing the target categories for embedding to the ‘ord3’ transformation category for instance, which is an ordinal encoding with representations sorted by frequency.
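For intuition on the column savings, here is a generic sketch of the binary encoding idea behind the ‘1010’ transform (an illustration of the technique, not the Automunge implementation; the function and column names are made up for the example):

```python
import numpy as np
import pandas as pd

# A categoric feature with k distinct values needs only ceil(log2(k)) binary
# columns, versus k columns for a one-hot encoding.

def binary_encode(series):
    """Map each distinct category to an integer, then spell out its binary digits."""
    codes, uniques = pd.factorize(series)
    width = max(1, int(np.ceil(np.log2(len(uniques)))))
    bits = (codes[:, None] >> np.arange(width)[::-1]) & 1
    columns = [f'{series.name}_bit{i}' for i in range(width)]
    return pd.DataFrame(bits, columns=columns, index=series.index)

colors = pd.Series(['red', 'green', 'blue', 'green', 'violet'], name='color')
print(binary_encode(colors))  # 4 distinct categories -> 2 columns instead of 4
```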

Hypothesis Two — Shallow Learning

Another workshop I attended was themed around yet another intersection, that of machine learning and information theory. The research in this field was a little more toward the theory end of the spectrum, and in a few cases admittedly over my head, but even so many of the discussions were thought-provoking. I was introduced to the concept of Fisher information in a particularly interesting talk on work by Alessandro Achille & Stefano Soatto, which sought to relate the information content of predictions, measured by Shannon mutual information, with the information content of the weights, measured by Fisher information, and in so doing derive a bound on the information content stored in the weights of a neural network. One of the findings of this work was a generalization that models of low complexity tend to generalize better, which kind of relates to the dropout paper I mentioned earlier. Another entertaining discussion was offered by Alexander Alemi in a debate-style presentation on the merits of compressed representations for neural networks (he was kind of preaching to the choir tbh).

The questions of model complexity, representation compression, and information content are kind of adjacent to another key hypothesis of the Automunge platform, on the benefit of feature engineering. I’ve gotten the impression that “feature engineering” has kind of been discarded as a credible domain of research, as the advent of deep learning has supplanted the task. One credible counter to this philosophy was offered by Julian Zilly, a participant of the information theory workshop: that neural network architecture engineering is in its own way just another kind of feature engineering, such that every new customization of a model’s architecture to fit some particular training domain is just a replacement for a corresponding transformation of feature representations; consider this practice taken to its extreme with weight agnostic neural networks. Zilly presented a poster with the iconic image of Einstein in profile, and adjacent to it the same image broken into a scrambled grid of square partitions. Each representation contained the exact same information, but clearly it is much harder to extract meaning from the scrambled version.

Now consider another talk that was certainly a highlight of the conference, the talk by Celeste Kidd on the subject of human early development learning. Kidd gave an example of offering a child a picture book of ABC’s versus the alternative of providing a scholarly textbook written in a foreign language. Now in the second book there is certainly a great deal more that the child could potentially learn, such as both a foreign language and the scholarly topic, but the child is much more likely to extract some meaningful insight from the alphabet book. This may bear relevance to the domain of feature engineering. One of the hypotheses of the Automunge platform is that there may be benefit to presenting our tabular features to the machine learning algorithms in multiple forms, but unlike invertible representations such as those discussed prior for the scrambled Einstein image, here specifically referring to multiple forms of varying information content. Some examples of this for a numerical feature could be the assembly of bins based on the number of standard deviations from the mean (such as with the Automunge ‘bins’ transform category), or as another example the graining of numerical sets into equal-width bins (such as via the Automunge ‘bnwd’ transform category). In each case the new representation will have lower information content than the original set, but as a result the model may benefit from the ability to quickly extract meaningful insights in earlier epochs of training.
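As a rough illustration of what supplementing a numeric feature with lower-information representations might look like (in the spirit of the ‘bins’ and ‘bnwd’ categories, though this is generic pandas rather than the Automunge implementation, and the column suffixes are just for illustration):

```python
import numpy as np
import pandas as pd

# Supplement a numeric feature with binned representations of lower
# information content alongside a normalized copy of the original.

def supplement_with_bins(series, n_equal_width=5):
    out = pd.DataFrame(index=series.index)
    mean, std = series.mean(), series.std()
    out[series.name + '_nmbr'] = (series - mean) / std     # z-score normalized copy
    # bins by number of standard deviations from the mean
    edges = [-np.inf, mean - 2*std, mean - std, mean, mean + std, mean + 2*std, np.inf]
    out[series.name + '_bins'] = pd.cut(series, bins=edges, labels=False)
    # equal-width bins across the observed range
    out[series.name + '_bnwd'] = pd.cut(series, bins=n_equal_width, labels=False)
    return out

ages = pd.Series(np.random.normal(40, 12, size=8), name='age')
print(supplement_with_bins(ages))
```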

The benefit of these reduced information representations may be less significant in paradigms of big-data-scale training sets and deep networks, but consider Andrew Ng’s point at the Climate Change panel that industry needs more tools for small scale data sets, what I will refer to as “Shallow Learning”. There are many applications where a practitioner may desire to train a predictive model without the availability of big-data-scale training resources, and in these cases the benefit to early epochs of presenting features in multiple representations of different information content may possibly have a material impact on final model accuracy (again, at this point this is still a hypothesis). This is mostly speculation, but presenting tabular features in multiple configurations of varying information content may even have benefit for deep models at big data scale. Consider the poster presented by Gal Kaplun based on the paper “SGD on Neural Networks Learns Functions of Increasing Complexity”, which offered that early epochs of stochastic gradient descent bear resemblance to a linear model, and that even as the training is allowed to progress, elements of this original linear model are retained. Thus there may even be benefits to later stages of training associated with facilitating efficient extraction of properties from data in early epochs, improving these early stage linear model characteristics. As a quick tangent, I would offer as a suggestion of further research to these authors that it may be fruitful to explore whether there is some relevance to the recent OpenAI paper on the double descent phenomenon, such as whether the retained linear model characteristics are the portion that suffers during the phase change that OpenAI characterizes as the transition from Classical Statistics to Modern ML.

Hypothesis Three — Calibrated Learning

Another presentation that was particularly striking, partly for its usefulness but also due to the potential simplicity of implementation, was the concept of label smoothing as presented for the paper “When Does Label Smoothing Help?” by Rafael Müller, Simon Kornblith, and Geoffrey Hinton. Label smoothing refers to the practice of replacing the 1/0 designations of one-hot encoded labels with some lower decimal value for the activation and a correspondingly raised value for the nulls. As one example, smoothing activations to 0.9 would result in the conversion from 1/0 to 0.9/#, where # is a function of the number of categories in the label set such that each row sums to unity; for a boolean label that converts 1/0 to 0.9/0.1, while for the one-hot encoding of a three label set it converts 1/0 to 0.9/0.05. This simple heuristic helps the model account for potential sources of error introduced by label noise which would otherwise hurt the model’s ability to generalize. It turns out this practice is also useful for a model’s probabilistic calibration, potentially serving as an alternative to “temperature scaling” or “Platt scaling” for improved calibration of predicted probabilities, albeit with the tradeoff that there may be some degradation in the ability to distill knowledge from a teacher model trained with label smoothing to a student model.
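As a worked example of the arithmetic (a generic illustration, not the Automunge implementation):

```python
import numpy as np

def smooth_labels(onehot, activation=0.9):
    """Replace 1/0 one-hot labels with activation/#, where each row sums to one."""
    n_categories = onehot.shape[1]
    null = (1.0 - activation) / (n_categories - 1)
    return np.where(onehot == 1, activation, null)

onehot = np.array([[1, 0, 0],
                   [0, 1, 0]])
print(smooth_labels(onehot))  # rows become [0.9, 0.05, 0.05] and [0.05, 0.9, 0.05]
```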

Just seeing this paper on the NeurIPS agenda turned out to be pretty helpful, as it inspired the incorporation of a new push-button label smoothing method into the Automunge library, rolled out in v2.97 and further refined in a few iterations thereafter. In the act of building out this option, we kind of hit upon another new hypothesis, based on the premise that there may be benefit to fitting the smoothed null activations to the distribution of labels found in the training set, sort of an extension of the label smoothing methods presented at NeurIPS. After all, if the purpose of label smoothing is to allow the training operation to account for noisy labels and to facilitate more calibrated models, the simple act of levelizing the null activations may itself serve as another source of noise. This new push-button “fitted label smoothing” method was rolled out in v3.0, and given the simplicity of application in the context of an Automunge call, I think it’s certainly worth consideration as a strong alternative to vanilla label smoothing for improved model calibration.
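To make the intuition concrete, here is one plausible reading of the fitted variation, sketched generically: instead of spreading the residual activation uniformly, distribute it to the null entries in proportion to each category’s frequency in the training labels. (This is an illustration of the idea under that assumption, not the Automunge code.)

```python
import numpy as np

def fitted_smooth_labels(onehot, activation=0.9):
    """Distribute the smoothing mass according to training set label frequencies."""
    frequencies = onehot.mean(axis=0)             # per-category frequency in the training set
    smoothed = np.empty(onehot.shape, dtype=float)
    for i, row in enumerate(onehot):
        null_freqs = frequencies * (1 - row)      # zero out the active category
        weights = null_freqs / null_freqs.sum()   # renormalize over the null entries
        smoothed[i] = row * activation + (1 - activation) * weights
    return smoothed

# 60/30/10 class balance: the nulls for a class-0 row become 0.075 and 0.025
train_labels = np.array([[1, 0, 0]] * 6 + [[0, 1, 0]] * 3 + [[0, 0, 1]] * 1)
print(fitted_smooth_labels(train_labels)[0])  # [0.9, 0.075, 0.025]
```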

This concept of label smoothing falls under a broader umbrella of another hypothesis built into the Automunge platform, that of benefits to training associated with label engineering. Just as discussed prior for the potential advantage of presenting our training features to the network in multiple configurations of varying information content, we expect there may be even more pronounced advantages to preparing label encodings as sets of adjacent properties, such that a model may be trained to predict each in conjunction. This is sort of related to the Jeff Dean comment from the Climate Change panel which I paraphrased as a call for training models on multiple adjacent domains simultaneously. The hypothesis is that a model trained to predict whether a work is a textbook in a foreign language may benefit in accuracy by simultaneously predicting whether the work is a book, whether the work is written in an English alphabet, and whether the work is fiction or nonfiction, for instance. By encoding our labels in multiple forms, we are forcing our algorithms to pay attention to each of these different aspects in evaluation, and perhaps making the bases of predictions more explainable in the process. I suspect this “dense labeling” insight may be important and of broad application to practitioners in various domains of research.
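A toy sketch of what such “dense labeling” might look like for a single numeric label, deriving multiple adjacent targets that a multi-output model could be trained to predict in conjunction (the column names and thresholds here are purely illustrative):

```python
import numpy as np
import pandas as pd

def dense_labels(labels):
    """Derive several adjacent label representations from one source label."""
    out = pd.DataFrame(index=labels.index)
    mean, std = labels.mean(), labels.std()
    out['label_nmbr'] = (labels - mean) / std               # normalized regression target
    out['label_above_mean'] = (labels > mean).astype(int)   # coarse boolean target
    out['label_bins'] = pd.qcut(labels, q=4, labels=False)  # quartile-bin categoric target
    return out

prices = pd.Series(np.random.lognormal(mean=12, sigma=0.5, size=100), name='price')
print(dense_labels(prices).head())
```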

The NeurIPS conference far exceeded my expectations. It was inspirational and invigorating to be in the presence of so many brilliant minds hard at work expanding the frontiers of research. I was especially taken by the organizers’ willingness to incorporate invited speakers from the biological sciences, as certainly we still have so much to learn about what is possible from the frontiers of early stage human learning, genetics and cell development, or the workings of the human brain. I got a laugh out of discussions with an IBM researcher on the subject of the theory of mind, who offered that with every newly realized frontier of science, researchers have a tendency to overfit theories of the human brain to that paradigm: in the early days of thermodynamics researchers sought to represent the brain as a Carnot heat engine, with the advent of string theory researchers sought to find evidence of quantum gravity’s influence, and it appears that now the trend is to seek instances of backpropagation between our neurons. I don’t expect any single phenomenon will be the final answer, but I would like to offer that the mind is not born out of neurons acting in isolation, but out of the emergence of order from the chaos of seemingly random interactions, connections between nodes formed and reinforced, the whole vastly greater than the sum of its parts. Sort of like NeurIPS ;). Cheers.

Merry Christmas!!

Books that were referenced here or otherwise inspired this post:

Rebooting AI — Gary Marcus & Ernest Davis

As an Amazon Associate I earn from qualifying purchases.

Beethoven’s Sonata Op. 53 — Nicholas Teague

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
