Machine Learning and Miscellanea

The explorations of a tourist

Nicholas Teague
From the Diaries of John Henry
17 min read · Feb 23, 2019


“Study hard what interests you the most in the most undisciplined, irreverent and original manner possible.” — Richard Feynman

Bruce Springsteen and the Seeger Sessions Band — John Henry (live)

Introduction — On Tourism

Virgil’s Tomb by Moonlight, with Silius Italicus Declaiming, 1779, Joseph Wright, on display at the MET

“Empty your mind, be formless. Shapeless, like water. If you put water into a cup, it becomes the cup. You put water into a bottle and it becomes the bottle. You put it in a teapot, it becomes the teapot. Now, water can flow or it can crash. Be water, my friend.” — Bruce Lee

Paul Simon — You Can Call Me Al

Part of the reason that machine learning is such a fun discipline to explore is because of a certain openness in the research community. Pretty much all of the mainstream tech companies publish papers or contribute to open source initiatives (yep even that one), and there is a thriving community of writers publishing works in the public domain, ranging from academics on arXiv to amateur works (such as this one) on Medium. Heck, there are even freely available coding libraries that abstract the complexities of an actual machine learning implementation down to fairly manageable levels, even for those with a less advanced coding background (the open source Automunge tool for automated data wrangling comes to mind, for instance). It’s certainly a big city, and there are a lot of resources to choose from. This post will pick a few stops somewhat willy-nilly for exploration. Warning: I do not have a plan for this essay, am just kind of making this up as I go along, the proverbial “winging it”. Let’s see where the tour takes us.

I’m originally from Florida, not just Florida but Orlando — sort of the Florida of Florida, so I think it’s fair to say that I have a decent familiarity with tourism and its trappings. When you’ve been to Disney as many times as I have, you start picking up on a few best practices for exploration. Let your attention wander. Just like when you’re in an art gallery, it sometimes helps to study the same painting up close — like the brush strokes and the technique, and then to look again after taking a few steps back, such as to appreciate the composition or interpretations. Don’t go into a park with a set plan and agenda, let the day take you where it will, and don’t try to see everything in a single visit. Smile and say hello to those with whom you may be lucky enough to cross paths; sometimes the people watching can even be more fun than the attractions. Strike up a few conversations, ideally without an agenda. Take a few moments to pause here and there and let the novelty soak in. In short, try to enjoy yourself.

First Stop — The MET

“The two dominant components that constitute a city, its physical infrastructure and its socioeconomic activity, can both be conceptualized as approximately self-similar fractal-like network structures.” ― Geoffrey West, Scale

Paul Simon — Graceland

One area of machine learning that probably doesn’t get enough attention is the probabilistic distribution of target variables — applicable to both training data and labels, that is. I haven’t found a specific study on this matter, but I suspect there could be ample ground to evaluate the usefulness of different neural network architectures based on their inference efficiency for different categories of distributions. I think there is an unstated assumption in much of mainstream work that data sets are provided with normally distributed features. Just as in financial risk management, where the gaussian assumption can prove to fragilize a theoretically valid trading strategy implementation, I suspect this is a potential weakness for automated implementations. In mainstream theory, deep learning is supposed to progressively negate the need for concerted feature engineering as networks grow deeper. I suspect it may be appropriate for researchers to reconsider this approach in the case of less tame distributions, such as variables exhibiting power law characteristics for instance.
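
To make concrete the sort of feature engineering address I have in mind, here’s a minimal sketch (just an illustration, assuming numpy and scikit-learn are available, with a Pareto draw standing in for a power law distributed feature):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# a hypothetical power law distributed feature, simulated via a Pareto draw
rng = np.random.default_rng(42)
raw = rng.pareto(a=1.5, size=(10_000, 1)) + 1.0

# option 1: a simple log transform tames the heavy tail before training
logged = np.log(raw)

# option 2: map the feature through its empirical quantiles toward the
# gaussian-looking shape that a normality-assuming pipeline implicitly expects
qt = QuantileTransformer(output_distribution='normal', random_state=42)
gaussianized = qt.fit_transform(raw)
```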

Consider the MET museum in New York, a literally priceless collection of relics and art dating back to antiquity. If you consider the distribution of viewings for the pieces in a collection of art and artifacts, there may be a kind of power law distribution such that pieces that make it into this particular collection achieve literally orders of magnitude more cumulative visibility than what might be the case for the vast majority of other works from the same period. One could try to derive an average number of viewings for the pieces based on a randomly selected sample of the original collection, but in practice the influence of outliers on that mean would likely render any such sample mean useless. In fact, a more useful heuristic to evaluate the mean of a power law distribution inferred from a sample could merely be to take the maximum of that sample as opposed to the mean of the sample (HT to the RWRI crowd and I believe Raphael Douady for this observation). In the case of machine learning, it would likely take orders of magnitude more training data for an algorithm to infer this outlier influence than would be the case for gaussian distributed data, hence the potential benefit of data set preparation via a feature engineering address.
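
A toy simulation can illustrate why the sample mean misleads here. This is merely a sketch, with a Pareto draw standing in for viewing counts:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.2  # tail exponent; the theoretical mean exists for a > 1, but barely
true_mean = a / (a - 1)  # = 6.0 for a Pareto distribution with minimum value 1

sample = rng.pareto(a, size=1_000) + 1.0
print(f"true mean:   {true_mean:.1f}")
print(f"sample mean: {sample.mean():.1f}")  # typically well below the true mean
print(f"sample max:  {sample.max():.1f}")   # the outlier that dominates the sum
print(f"max / sum:   {sample.max() / sample.sum():.0%}")  # one point, much of the mass
```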

Second Stop — A Detour into Activation Functions

“You must create a female for me, with whom I can live in the interchange of those sympathies necessary for my being. This alone you can do; and I demand it of you as a right which you must not refuse.” — Mary Shelley, Frankenstein

Van Morrison — Domino (live, from the It’s Too Late to Stop Now… film)

Turning back to machine learning, and again drawing some parallels to financial engineering, let’s consider some minutiae of the algorithmic implementation of neural networks. As an input signal is fed through a vanilla layered dense neural network (whether for prediction inference in a feedforward manner or, in initial training, also transferred in the reverse direction through backpropagated gradients), each neuron transforms the signal by summing the products of each neuron’s output from the preceding layer with the associated weight derived through training. Once that summation is computed, the output is fed into a generic activation function for the neuron, which very simply takes the single numerical float value from that summation and applies a defined method to output a transformed float signal for comparable treatment by the next layer’s weights and neurons. Now activation functions turn out to have some potential variety, and it was actually the specific realization of the RELU (rectified linear unit) activation that helped facilitate the current renaissance in deep learning, as it is robust to problems that plague other functions, such as vanishing or exploding gradients in deeper networks. However, the RELU activation isn’t the only way to skin a cat: some proposed variations on its mechanism have been discussed previously in this blog, and other architectures such as convolutional layers or LSTMs may employ other constituent elements — the point being that in modern practice there are some approaches where the activation may be derived from multiple weights per neuron (such as the various sigmoid and tanh gates within an LSTM).
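
For the uninitiated, that neuron-level mechanism reduces to just a few lines. Here’s a sketch with numpy (an illustration only, not any particular library’s implementation):

```python
import numpy as np

def relu(x):
    # rectified linear unit: zero below the origin, linear above
    return np.maximum(0.0, x)

def dense_layer(inputs, weights, bias):
    # each neuron sums the products of the preceding layer's outputs with its
    # trained weights, then feeds that float through the activation function
    return relu(inputs @ weights + bias)

# toy forward pass: a 4-feature input signal through a 3-neuron layer
rng = np.random.default_rng(1)
signal = rng.normal(size=(1, 4))
weights, bias = rng.normal(size=(4, 3)), np.zeros(3)
print(dense_layer(signal, weights, bias))
```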

The somewhat speculative point I’m driving toward, and I’m going to depart from trying to tie this in to tourism here (this isn’t a poem, I’m allowed to deviate from conventions of form), is that this RELU function itself, although currently a pretty strong contender for mainstream convention in deep network activation functions, may not necessarily be optimal for all cases. Consider that the RELU activation bears some similarities to an option exposure, with a strike price at the zero point and a linearly increasing payoff above. Now when options are priced, they won’t necessarily follow a linear value with change in security price; it’s a little more complicated than that (a good resource on this point will be the upcoming Technical Incerto by Nassim Taleb, hat tip again owed to RWRI discussions). In fact, a way to think about option pricing is that it’s possible to recreate an arbitrary, even nonlinear, function of the underlying security price using some collection of puts and calls at various strike prices. So one way to think about a neural network with RELU activations, then, is that it is merely a collection of call exposures which feed into each other, and our model predictions are like predictions of market value. Now consider what may be possible by adding a put exposure to the call of each neuron’s RELU. If you consider that training an LSTM takes something like, I think, 3–4 times as many weights as a vanilla RNN, it’s not unheard of to increase the number of weights to facilitate increased model performance. What if there’s an opportunity to incorporate a counter activation function with each neuron, let’s call it a “NELU” for negative RELU, perhaps with an additional weight for the “strike price” for instance. Yes the number of weights would go up, but might there be potential for a corresponding reduction in the required network depth for comparable performance? This is just speculation, but there might be some opportunities for an enterprising researcher to run some experiments here. As a brief aside, note too that the RELU function by design progresses linearly unconstrained with increasing input; this is actually counter to the mechanisms realized in biological neurons, in which the activation frequency of a source neuron eventually saturates.
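
To be extra clear, this “NELU” is pure speculation on my part. But for illustration, here’s one of many conceivable parameterizations, sketched with hypothetical strike and weight parameters that in a trainable layer would presumably become additional learned weights per neuron:

```python
import numpy as np

def relu_nelu(x, call_strike=0.0, put_strike=0.0, put_weight=1.0):
    # speculative sketch only: pair the standard RELU "call" payoff with a
    # mirrored "put" payoff below a second strike price
    call = np.maximum(0.0, x - call_strike)             # the familiar RELU leg
    put = put_weight * np.maximum(0.0, put_strike - x)  # the hypothetical NELU leg
    return call + put

# the combined payoff responds to signal on both sides of the strikes,
# where a plain RELU would simply zero out everything below its own
xs = np.linspace(-3.0, 3.0, 7)
print(relu_nelu(xs, call_strike=0.5, put_strike=-0.5))
```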

Third Stop — Let’s Talk About Speech

“The difference between speech and talk is like the difference between entrepreneurship and academia.” — Oh yeah I said that.

Lauryn Hill — Everything is Everything (video)

The art of tourism is one that is best interacted with, such that avoiding simple voyeuristic consumption can make all of the difference between entertainment and adventure. Consider the difference between modern applications of machine learning natural language processing versus the babbling of a toddler — actually, for a helpful demonstration of modern capabilities, the folks at OpenAI recently released some demonstrations, which I’ll link to here. For the OpenAI generated example of the unicorn valley, the coherency of artificially manufactured speech is certainly quite striking; however, it should be remembered that this is actually an example of the speech content following a preset agenda, as a constraint for an optimization problem. Just as with the mainstream media talking heads whose presented arguments are simply a function of the agenda (Socrates probably turning in his grave), there is no reasoning taking place in this algorithm other than (as a simplified way to think about this) the machine trying to answer the simple question: what speech would be most consistent with the form and content of the seeded input. Now the babbling toddler is an interesting counter, for although there is certainly an element of this same mechanism at play in their babbling, in parallel they are also exploring the space of language and thought, in a manner that probably bears some resemblance to reinforcement learning, such that they are experimenting with their speech so as to observe its effect on their surroundings. In other words, they’re not just crafting speech to an agenda, they are crafting speech in a manner to learn an agenda. Sure it’s incoherent, but from an Artificial General Intelligence (AGI) standpoint, this style of communication would be much more impressive, and coupled with the capability demonstrated for the unicorns demonstration, well, perhaps this could be one viable route to AGI itself.
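
As a simplified way to make that “most consistent continuation” framing concrete, here’s a toy sketch (nothing close to OpenAI’s transformer, just the conditional continuation objective boiled down to a crude bigram count over a made-up corpus):

```python
import collections
import random

# the generation objective boiled down: given the text so far, emit a token
# that has followed it in the training text (here a crude bigram tally)
corpus = ("the toddler babbles to learn while the model babbles to match "
          "the form and content of the seed").split()
follows = collections.defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

random.seed(2)
word, generated = "the", ["the"]
for _ in range(8):
    word = random.choice(follows.get(word, corpus))  # sample a consistent continuation
    generated.append(word)
print(" ".join(generated))
```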

Superintelligence — Nick Bostrom

Now I mention the OpenAI example partly because it’s getting some media attention and I’m hoping to piggyback on their work by offering a creative take, with the thought that it may facilitate some small amount of foot traffic for this dusty corner of the web. (Consider this an experiment.) In their post announcing this demonstration they note that the fully trained model (along with the dataset and training code) will not be released to the public, in the interest of avoiding potential for misuse. Consider the potential for fake news if it can be created on demand, tailored to each specific consumer based on their media habits for instance. Now I think part of the reason that this approach has drawn some attention is the source, a nonprofit research institute with the word “open” in its title — no one would fault Google or IBM, for instance, if they kept their research from the public domain. OpenAI is a really interesting player in the machine learning space; one need look no further than their mission statement for an agenda: “OpenAI’s mission is to build safe AGI, and ensure AGI’s benefits are as widely and evenly distributed as possible. We expect AI technologies to be hugely impactful in the short term, but their impact will be outstripped by that of the first AGIs.” Now for those who might question whether this withholding is consistent with that model, consider that the stated mission is to ensure the benefits of AGI are widely available, not necessarily the AGI itself.

Part Four — Crypto

“Meaning lies as much
in the mind of the reader
as in the Haiku.”
― Douglas R. Hofstadter, Gödel, Escher, Bach

Tedeschi Trucks Band — Midnight in Harlem (live)

For a tourist in exploration, one way to think about a large city is that it is a collection of interdependent systems, some of which may be highly constrained with top-down governance, while others may be more decentralized and self-sustaining. Consider the Indian dabbawala industry found in cities like Mumbai, a highly decentralized system of labor organization dating back centuries that evolved to massive scale for the simple task of delivering home-cooked meals to workers during their lunch break. The routing mechanism for individual lunchboxes almost bears some similarity to IP packets on the internet, but with color coding instead of numerical signals and human mechanisms for transport. Now contrast this with modern metropolitan personal transportation infrastructure that has been designed around some similar mechanisms, in recent times upgraded from the randomness of taxi service availability to the planned routing of independent drivers through hybrid central-planning / decentralized-labor services like Uber. Such a transition has benefited the drivers from the standpoint of offering increased utilization density and location intelligence; however, it will certainly prove to be a Faustian bargain, for it has always been the stated agenda of this service to phase from human to artificial drivers. Such a transition will certainly have benefits to the city, such as enhanced traffic coordination, pedestrian safety, and simply the reduced cost of transportation itself. But the displacement of such a large working class will certainly have material impacts on a city’s economy, and the navigation of this type of transition will require some care and intention in coordination between industry and government.

The question of transitions in centralization and governance is certainly relevant to a financial hub like New York City, with the pending paradigm shifts of decentralized proof of stake and digital transactions being realized through a mess of competing platforms in the cryptocurrency space. I’ve written about bitcoin before from an investing standpoint; perhaps here is worth a short address from a utilization standpoint. First, the elephant in the room for bitcoin is certainly the outsized holding of the silent founder, known simply as Satoshi Nakamoto. Any question of downstream adoption must certainly take into account what will happen if or when that account is accessed, which will instantly become known via the public ledger of transactions. Putting myself in this founder’s shoes, I would imagine that anyone active in bitcoin from the early days probably already has a sizable holding through other means such as mining, and my expectation is that the risk of shedding anonymity and destabilizing the currency probably won’t be taken, if at all, until the portion of transaction volume is materially devoted to some use other than speculation. Although designed for some materially different use cases, I consider the most noteworthy transition in the crypto realm to be the planned phased shift for Ethereum from proof of work (mining) to proof of stake. The question of the sustainability of mining is I think one relevant not just to cryptocurrency governance but to public governance as well, and I applaud Ethereum’s proactiveness in adopting more sustainable practices, which I expect will be rewarded by the market in kind once the feasibility has been proven at scale.

Part Five — On Power Laws and Scale

“G, E, B is a great book, but it’s hard: I personally don’t think Hofstadter does enough teaching of the basic concepts to make his riffs and dialogues come alive for people who didn’t have a lot of basic logic and recursion-theory in college… In general, things that sell really well… are usually dreck.” — David Foster Wallace

Hamilton — Not Throwing Away My Shot, Tony Awards performance

The author Nassim Taleb (who FYI will soon be releasing his collection of technical papers serving as a companion book for the Incerto) is known to draw on the example of a Manhattan eatery to illustrate a property of power laws (a category of probability distributions exhibiting ‘fat tails’, in which outliers outweigh the majority of points for deriving statistical properties of the set). The story is that from discussions among a collection of Broadway producers and actors who would congregate at a local cafe known as Lindy’s for burgers and cheesecake, a rule coalesced as a means for estimating the remaining longevity of production runs for Broadway shows. Specifically, the rule was that you could estimate the total run for a show (say in years) by simply doubling the current age of production — you know, Hamilton has been on stage for four years now, so the expectation is it will be performed for four more. (The rule works better in aggregate; I think most people who follow these things would expect Hamilton to be around much longer than that.) The reason the Lindy effect works is because of the power law distribution in the duration of shows in aggregate, and its application to other domains presented in the Incerto, such as durability in paradigms of technologies (say the difference between the telephone and the fax machine), translates because those categories of analysis exhibit such distributions.

It occurs to me that there is no need to tie down this Lindy rule to applications in the time domain alone. Another useful application could be the domain of scale. Consider that the populations of cities follow a power law in distribution. An application of the Lindy rule to this domain could be interpreted to mean simply: if you want to estimate how large a city’s population will reach during its span, double the current population. Or alternatively, assuming that the popularity of cryptocurrencies follows a power law distribution, if you want to estimate the maximum transaction volume that will be reached for a given platform, simply double the current transaction volume. Granted this is a loose estimate, and it’s not exactly statistically rigorous, but in applications with a high degree of uncertainty it can prove to serve as a useful heuristic for guiding decisions. A somewhat ironic point on the origin of the rule’s name is that the Lindy’s cafe serving as its namesake recently went out of business, perhaps reinforcing the point that there is of course a material difference between an estimate of expected durability and what is actually realized in the real world. Perhaps we need to find a new cafe for the rule.
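
Reduced to code, the heuristic is almost embarrassingly simple, which is rather the point. A sketch with made-up inputs, not statistical rigor:

```python
def lindy_estimate(observed_so_far):
    # the Lindy heuristic: expected total = double what has been observed,
    # i.e. expected remaining equals the current age (or scale)
    return 2 * observed_so_far

print(lindy_estimate(4))          # a show 4 years into its run: expect ~8 total
print(lindy_estimate(8_500_000))  # a city of 8.5M people: a loose ceiling of ~17M
print(lindy_estimate(1_000_000))  # a platform at 1M transactions: expect ~2M at peak
```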

Conclusion

squid ink pasta, Manhattan

“An interesting apparent paradox is that … (a type of) longer-term predictions are more reliable than short-term ones, given that one can be quite certain that what is Black Swan-prone will eventually be swallowed by history since time augments the probability of such an event.” — Nassim Taleb, Antifragile

Mozart — Sonata XIV, 2nd

In case this wasn’t obvious by now, this sort of haphazardly organized wandering was partly inspired by discussions and explorations surrounding a recent visit to a small conference held in New York known as the Real World Risk Institute. It’s been a great pleasure getting to pay a visit to this diverse collection of risk-takers from year to year, watching the agenda grow from the foundations of the Incerto to include whole new domains of research and application. There are always some surprises in store, always fertile ground for new friends and familiar faces. The one rule I recommend for these events is don’t go simply as a tourist — after all, a mere tourist has no skin in the game. Take some risks, be prepared to make some investments, always be searching for intersections between the theory and practice (after all, theory without practice is just tawk tawk tawk). The world is an unpredictable place, but for those who know where to look there can always be found the seeds of green lumber waiting for harvest.

Taleb’s book Antifragile was, I suspect by intention, written outside the domain of physics, and as a result mostly left unaddressed a vast domain of scientific theory for harvesting disorder in physical systems. The second law of thermodynamics states pretty succinctly that the entropy of a closed system will trend to increase with time, as highly organized states trend ever more toward randomness. The trend of humanity toward ever increasing organization — toward bigger cities, toward more elaborate means of control and governance — is actually simply a byproduct of our harvesting the driving force of our sun’s gradual dissipation, without which the mechanics of life would never have been realized. All of the energy that drives our civilization, whether fossil, wind, solar, nuclear, or biological for that matter, we have to thank from this relentless engine of fusion that holds us by her gravity. But the second law is interesting because, unlike most other physical laws, its equation is not an equality, nor even in its purest form a hard and fast rule — the second law is probabilistic. And although the vast, vast majority of realities in the multiverse will see this relentless march toward disorder, there could always be at least one reality where a dropped tea cup reassembles, where the non-ergodic returns to a prior form. There is no mistake that can not be unmade, no sin that can not be forgiven, no life that can not be saved.

When He had finished speaking, He said to Simon, “Put out into the deep water and let down your nets for a catch.” — Luke 5:4

Books that were referenced here or otherwise inspired this post:

Scale — Geoffrey West

Frankenstein — Mary Shelley

Gödel Escher Bach — Douglas Hofstadter

Superintelligence — Nick Bostrom

Antifragile — Nassim Taleb

Incerto — Nassim Taleb

(As an Amazon Associate I earn from qualifying purchases.)

Albums that were referenced here or otherwise inspired this post:

Graceland — Paul Simon

His Band and the Street Choir — Van Morrison

The Miseducation of Lauryn Hill — Lauryn Hill

Revelator — Tedeschi Trucks Band

Hamilton — Cast Recording

Requiem in D minor — Mozart

(As an Amazon Associate I earn from qualifying purchases.)

For further reading please check out my Table of Contents, Book Recommendations, and Music Recommendations.
