From the Diaries of John Henry

Explaining machine learning to a toddler

Nicholas Teague
Mar 26, 2017

LCD Soundsystem — All My Friends (Live)

I am glad to learn in order that I may teach. Nothing will ever please me, no matter how excellent or beneficial, if I must retain the knowledge of it to myself. … No good thing is pleasant to possess, without friends to share it.

- Seneca, Moral Letters to Lucilius

Spent a little time in the past few months playing catch-up on the rapid progression in machine learning technologies, which have enjoyed a kind of renaissance in the last decade. Sat a couple of MOOCs, watched a few YouTube lectures, and read a book or few. This post will serve as a survey of several points that I found of interest along the way, especially those that may benefit the comprehension of the less “deeply” versed. It will hardly be comprehensive, after all am still only really getting started with the survey, but the goal will be to at least touch on some of the key foundational algorithms and challenges for the practitioner. Of those resources reviewed I highly recommend the lectures of Andrew Ng as a starting point for a practitioner, or for a deeper dive Deep Learning (a textbook by Goodfellow, Bengio, and Courville). I suspect that other versions of this type of post will already have been done by others on this platform, likely more polished or professional, but as Seth Rogen rightfully concluded when faced with the new realization of a direct competitor to his Hollywood startup in the movie Knocked Up, ‘fuggetaboutit,’ let’s build it anyway.

image via Knocked Up

The coolest thing about machine learning, I mean besides the obvious societal scale transformations that we are only now seeing the beginnings of, is the extent to which cutting edge technology, research, and tools are available even to the general public through resources such as arXiv, Coursera, OpenAI, TensorFlow, GitHub, and countless generous bloggers and practitioners willing to share their experience. That is not to say that all public resources are of equivalent value, for instance have found that many of the popular books on the subject, recent publications from mainstream-targeting authors such as Jerry Kaplan, Pedro Domingos, or Brynjolfsson / McAfee for example, spend entirely too much time addressing the philosophical implications of things like workforce displacement, consumer shifts, etc. while doing less to illuminate exactly what machine learning currently is or how it works — without that background the common understanding required for philosophical discussion and debate is less assured. Thus the hope is that this post will in some small way help to bridge that gap between popular press accounts of machine learning and the more intricate foundational details of the how and the why of modern day practice, for those interested readers willing to grant me some time (this post will be on the longer side but will provide a spot for intermission at the midpoint for those with time constraints). It’s definitely worth noting that I don’t consider myself an expert, and the gaps in my knowledge in both the history and practice are varied and deep, but have picked up a few nuggets of insight here and there, so despite these gaps there may be something to be learned here, at least for the novice.

Use what talents you possess: the woods would be very silent if no birds sang there except those that sang the best.

- Unknown

As a practice I have adopted for previous posts of this nature, where possible will refrain from googling keywords and topics while writing the post, and will instead lean on insights gained over the course of surveying literature and videos over the last few months (with the exception of sourcing images or material that I have previously reviewed). This means I’ll probably get at least a few things wrong, but the goal here is not to generate some definitive research paper, but instead to crystallize my current personal understanding of the subject, sort of a time capsule which any grandchildren can perhaps look back on and laugh at such naïveté in several years’ time, and if it benefits others all the better. If any concerned reader finds some especially egregious misconception, feel free to address it in the comments. Actually any comments are welcome — at least it would prove that some human actually reads these posts.

The study of books is a languishing and feeble motion that heats not, whereas conversation teaches and exercises at once. … When anyone contradicts me, he raises my attention, not my anger: I advance towards him who controverts, who instructs me; the cause of truth ought to be the common cause both of one and the other.

- Michel de Montaigne

PART 1 — The Meaning of Machine Learning (and life)

Philip Glass — Glassworks (album)

Without music, life would be a mistake.

- Friedrich Nietzsche

Before diving into the more algorithmic parts of the discussion, probably worth starting with some fundamental distinctions between the more high level concepts of artificial intelligence and machine learning for clarity. I don’t know if there is complete agreement on definitions of these terms, but the gist I’ve picked up along the way is that the category of machine learning (ML) is considered a subset or prerequisite of Artificial Intelligence (AI), with AI representing what will be achieved once we have reached some computer agent capable of producing behavior indistinguishable from that of humans — perhaps Alan Turing’s imitation game remains a valid metric for such achievement.

AI is commonly grouped into a “strong” version meaning that which achieves actual consciousness and a “weak” version meaning that AI which only simulates achieving consciousness but doesn’t actually do so. (Although I’m having trouble imagining a kind of metric to distinguish between the two — if it’s not possible to distinguish between strong and weak AI then what’s the point of labeling them? But then this is kind of what Turing was getting at, isn’t it.) Machine Learning, on the other hand, has a lower bar to clear. We have a more concrete conception of what machine learning looks like because it has been achieved even today through neural networks, which have the ability to automatically extract and act on underlying properties hidden in some data. Whereas the “expert systems” popularized by IT firms in the 1980s relied on generating a decision tree to replicate expert evaluation in various fields (this was an early approach to computer automation of knowledge worker labor which finally went out of fashion in the early 90s due to limitations in dealing with complexity in the tail), the evaluation internals of modern machine learning are much less transparent for all but the simplest of systems. Their computations are hidden in the countless numerical weightings and interactions of artificial neurons (which I’ll simply refer to here as ‘neurons’) — more on this to follow. Actually the algorithms enabling machine learning neural networks aren’t even new; their origination dates back decades (the full details of that origination are a hole in my knowledge, again likely one of many in this post). The most recent renaissance of machine learning that led to the current boom has been realized mostly in the last decade or so, and was enabled by several new factors such as the exponential growth in big data from the rise of the internet economy and embedded sensors, the advent of the ReLU activation function, and also the successful evolution of Moore’s Law expanding our processing capabilities for deeper networks and the efficiency of training our algorithms — now typically accomplished through graphics processing units (GPUs). That is not to say that all of our current algorithms date back to this original period, actually the rate of change and innovation in the state of the art is quite rapid and accelerating, but the influence of the original neural network architecture that started it all can still be found in most if not all forms of modern machine learning. Just as consumer household appliances have evolved over the years to gradually supplant a homemaker’s labor and chores workload, the foundational tech of neural networks will surely continue to surprise us in the coming decades as it grows into new paradigms of capability.

As for these original neural network algorithms that have since evolved into all of the sub-specialities that we will talk about in this post, when I described them as artificial neurons that was actually a good simplification for conceptualization. The human brain has billions of biological neurons and trillions of synapses (the interface points between neurons). Each biological neuron communicates with thousands of its neighbors primarily through electrical pulses of varying frequency transmitted through these synapses (biological neurons also have more subtle chemical / hormonal influences as well, but won’t go into this or other less widely accepted theories for neuron interactions that I have addressed in previous posts, e.g. Roger Penrose’s quantum gravity hypothesis, since modern ML algorithms do not have a digital equivalent).

A biological neuron — image source Wikipedia

It is in the weighting via frequency of these electrical pulses that biological neurons communicate and interact. When you combine the simple interactions of neurons at the scale of the brain there arises an emergent intelligence and consciousness. The algorithmic artificial equivalent of the biological neuron, a neural network’s neuron, was designed to simulate a version of a biological neuron’s simple interactions. Instead of transmitting varied frequencies of electrical pulses, a neural net simply communicates between nodes with numerical signals. The input to the neural net could be any kind of data. The neural net takes a coded representation of that data distributed across input neurons, and then transforms the input via progression through hidden layers of neurons, leading to some outputted derivative of that data. More neurons in a row allow the net to address a more complex input; the more hidden rows, the “deeper” the network and the more capability to perform analysis of increasing sophistication (but also in both cases the more computationally expensive to “train”). Each of the neurons in the Input and Hidden layers, as shown in the following diagram, interacts with multiple child nodes, comparable to biological neurons’ dendrites reaching synapses of multiple adjacent cells. The learning of a neural network is realized in the weighting of each of those interactions. For example, in the following diagram, for the transmission of input from neuron A to B there is some weighting θA-B applied to the value, either strengthening or weakening the input signal (perhaps even reversing its +/- sign) and thus its resulting influence on the child node B, which will be subject to a combination of the values and weightings of numerical signals from all parent cells pointing in its direction to derive its output value, which is then transformed via an activation function for feeding into the next layer. When we “train” a ML algorithm, we are performing a type of optimization evaluation to derive the combined set of unique weightings between each of the interacting neuron nodes which produce the best prediction or evaluation capabilities in the output for some yet unseen data.

A simple neural network made up of artificial neurons. image source: Wikipedia
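
To ground the arithmetic, here is a minimal sketch of a single neuron's forward pass; a hypothetical illustration where the input signals, weights, bias, and the choice of a ReLU activation are arbitrary values rather than anything trained.

```python
import numpy as np

# numerical signals arriving from three parent neurons (made-up values)
inputs = np.array([0.7, 0.2, 0.9])
# the learned θ weightings on each incoming connection (also made-up here)
weights = np.array([0.5, -1.3, 0.8])
bias = 0.1

# weighted combination of the parent signals, plus the bias term
weighted_sum = inputs @ weights + bias
# activation function (ReLU) producing the signal fed to the next layer
output = max(0.0, weighted_sum)
print(output)  # 0.91
```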

I’ll get back to algorithmic considerations of training shortly, but first will step up a few layers of abstraction to discuss a first potential application that these ML algorithms are meant to accomplish, and in doing so will introduce one type of algorithm. A simple and highly illustrative form of ML is that achieved by a logistic regression evaluation, a type of classifier (not to be confused with a linear regression for continuous outputs). A logistic regression classifier will have a single output cell with a binary result, and could be used to classify a yes or no prediction based on input data. For example, say you wanted to determine if someone would be more likely to vote Republican or Democrat based on some data thought to have predictive capabilities such as, I don’t know, television news viewing habits. You could train a logistic regression algorithm using comparable labeled data of voters of known party affiliation. Once the training is completed and the algorithm is tuned, feeding some new data into the neural net will output say a 0 for Republican and a 1 for Democrat — or more specifically you would set some bar for the output value (such as 0.5) above which you would consider the output a 1 and below a 0. Thus logistic regression probabilistically classifies input data into one of two categories. In this example we would expect someone who watches primarily Fox News to generate a prediction of Republican voting and someone who watches mostly pretty much any other mainstream news channel to at least have a more likely leaning to Democrat. Of course some more elaborate combination of news viewing habits coupled with social media connections or party affiliation of a voter’s parents may be less obvious and thus more useful for ML classification techniques.

image via Wikipedia
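
As a concrete (and entirely hypothetical) miniature of the voting example, here is a sketch of the logistic output step: a weighted sum of made-up news-viewing features squashed through the sigmoid function and then thresholded at the 0.5 bar described above. The feature meanings and weights are invented purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical learned weights for [fox_news_hours, other_news_hours]:
# a negative weight pushes toward output 0, a positive weight toward 1
weights = np.array([-1.2, 0.8])
bias = 0.1

viewer = np.array([0.5, 3.0])          # made-up hours per day for one person
probability = sigmoid(viewer @ weights + bias)
prediction = 1 if probability > 0.5 else 0
print(probability, prediction)         # roughly 0.87, so class 1
```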

An extension of logistic regression is possible through the softmax technique to increase the number of potential classifications from 2 to as many as your model requires — like picking one adjective out of the English dictionary, selecting which of a list of online advertisements a user is most likely to click on, or perhaps some state space much larger — we are only constrained by the availability of training data, the size of our neural net, and its respective computational requirements of training.
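
The softmax extension follows the same weighted-sum idea, but produces one raw score per candidate class and then normalizes those scores into probabilities that sum to one; a minimal sketch, with made-up scores for, say, four candidate advertisements:

```python
import numpy as np

def softmax(scores):
    # subtracting the max first is a standard numerical stability trick
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([2.0, 0.5, -1.0, 0.2])   # hypothetical raw output scores
probs = softmax(scores)
print(probs, "predicted class:", int(np.argmax(probs)))
```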

It’s not an obvious thing how a collection of layers of neurons whose interactions have unique weightings based on some training on labeled data can generate a classification. To clarify this point I think it will help to lift an image off of Wikipedia, so will just help myself.

Collection of some data with border for classification, transformed via progression through a neural network to an output state with improved grouping characteristics — image source Wikipedia

The dots here represent our data points interspersed as vectors throughout some state space. For example, for our political example the x axis may be Fox News viewing habits, the y axis CNN viewing habits. For visualization purposes this type of illustrated example will usually be limited to a two dimensional grid as shown here, but in practice our real applications will usually have more dimensions, often considerably more. What the logistic regression algorithm is doing is determining some grouping for voting preferences, and the barrier shown represents the border along which different classifications are made. You may be wondering what is meant by the θ (theta) labeled center arrow between the two boxes of this illustration. I believe what is being conveyed here is that the grid on the left represents a collection of data state vectors given as input to a trained ML algorithm, the center arrow then represents the application of neuron weightings as that data is fed through the neural network (θ is a common symbol for the weightings between neurons), and the grid on the right represents the collection of transformed data point vectors after working through the weightings to the output row. So after training an algorithm, we would expect the output grouping coherency of the data point state vectors to be improved and the classification task easier. It’s not intuitive how a collection of neural net weightings between neurons enables this type of transformation of each data point vector state, but will offer this point as a hint of what is going on here. It is possible, using just the addition of some +/- weightings on a neuron’s inputs, to recreate a type of logic gate — in the Deep Learning text they demonstrate the XOR gate, one from a universal gate set used by classical computers to run programs at the binary / transistor level of abstraction. Thus as data vectors in a state space are fed into a trained neural net and transformed to some end state, these neurons and their weightings are actually performing a type of superimposed computation to transform the vectors to a more coherent grouping in vector space — not a computation programmed by a human but instead one automatically generated through the process of training our neural net weightings with the help of labeled data.

XOR gate
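
A minimal sketch of that XOR construction (following, as best I recall, the hand-chosen weights demonstrated in the Deep Learning text): a two-neuron hidden layer with ReLU activation and fixed +/- weightings reproduces the XOR logic gate without any training at all.

```python
import numpy as np

W = np.array([[1.0, 1.0],     # weights from the two inputs
              [1.0, 1.0]])    # to the two hidden neurons
c = np.array([0.0, -1.0])     # hidden layer biases
w = np.array([1.0, -2.0])     # weights from hidden layer to output
b = 0.0                       # output bias

def xor_net(x):
    hidden = np.maximum(0.0, x @ W + c)   # ReLU activation
    return hidden @ w + b

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", xor_net(np.array(x, dtype=float)))
# prints 0, 1, 1, 0: the XOR truth table
```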

Reverting back to the simple political classifier example, I’ll try to expand here on what exactly is meant by the “training” of a model. First I’ll use an old engineering thought exercise to describe the training process — will consider the training operation a black box and simply describe the inputs and the outputs. Once through that exercise will crack open the black box to dive deeper into the hows and the whys of what is going on inside the training of our networks. The first input to a training operation’s black box is a set of labeled data. For our simple example, the labeling will be a collection of citizen voting preferences coupled with some corresponding collection of their respective media habits or social circles. The more of this labeled data the better; we can never have too much data (the computational considerations of dealing with the largest data sets I’m not as well versed on, but at least from a ML model accuracy standpoint one of the easiest ways we can improve a model is to increase the amount of labeled data used in training). It’s actually typical to carve the full original labeled data set into three buckets — the largest intended for the initial training run, the other two smaller buckets for post initial training purposes such as validation / tuning of parameters and testing the final model. After the data, the second input to our training operation black box will be the architecture of the ML neural net model used. We’ll have to decide in advance parameter properties of our neural net such as how the neurons are connected, the number of hidden layers, and the size (number of neurons) of each layer — more elaborate models such as those discussed in the second part of this post will have many other features requiring specification, but here we’re sticking to the most basic elements. Given these inputs, we can run our black box trainer, which will then output some collection of weightings for each neuron interaction. We’ll know immediately how this model performs on the original data. We wouldn’t necessarily want 100% accuracy on classifying the original data as this would be a strong indication of overfitting (aka high variance error) — an important concept that we’ll get to a little later, but just to be clear overfitting is something we do not want. Given these weightings, we can then test and tune the predictive capability of our model on new data by inputting some additional labeled data that we had set aside from the original collection for just this purpose (the second, smaller bucket mentioned above). By comparing accuracy on the original data vs. this second set we can test whether our model truly has predictive capabilities for data outside of the original set and also tune our neural net parameters based on the comparison of performance on our original data vs. our test data. Once the tuning of neural network parameters is complete, the third bucket of data is then the final test of our model accuracy. The process of initializing an architecture, gathering and prepping data, training, tinkering with parameters or architectures, and retraining until satisfactory performance is achieved is what the practice of ML engineering is all about.
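
A minimal sketch of that whole black-box workflow, assuming scikit-learn and using a made-up placeholder dataset in place of real voter records; the three-bucket split, the architecture choice, the training run, and the accuracy comparisons are the parts being illustrated, not the particular numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# placeholder labeled data: four made-up media-habit features per citizen
# and a party label derived arbitrarily just so the example runs
X = np.random.rand(1000, 4)
y = (X[:, 0] > X[:, 1]).astype(int)

# carve the labeled data into the three buckets: train, validation, test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5)

# the "architecture" input to the black box: two hidden layers of 8 neurons
model = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=500)
model.fit(X_train, y_train)            # the training run itself

print("train accuracy:", model.score(X_train, y_train))
print("validation accuracy:", model.score(X_val, y_val))  # used for tuning
print("test accuracy:", model.score(X_test, y_test))      # final check only
```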

Life is either a continuous process improvement, or a terminal disease that we will all die from anyways.

- Randy J. Hinrichs

So we made it through our black box exercise, but it seems the more questions we attempt to answer here the more pop up. What’s going on inside that black box of the training operation? What is overfitting and why is that a problem? Where in the world is Carmen Sandiego? What is the meaning of life? I’ll try to run through these questions in order, so will start as promised by prying open the training operation black box to describe what exactly is being accomplished there. I mentioned earlier that the operation of training our neural net is an optimization exercise, to derive the value set of unique weightings between each of the interacting neuron nodes which produce the best prediction or evaluation capabilities in the neural net output for some as yet unseen data. So what exactly are we optimizing? Well basically (try to follow along here) we’re trying to minimize, or at least reduce, for a collection of our labeled training data, a cost function comparing the set of predicted outputs vs the correct training labels — one such cost function could be the sum of the set of deltas between our predicted values (based on data point state vectors run through the neural net’s set of neuron weightings, this set of weightings being our variable for the optimization) and the actual values based on our labeling of the same data, or another more common one is a function derived using logarithms of the two sets known as cross-entropy. A second piece of the optimized formula is for purposes of “regularization”; I’ll discuss shortly what that means. There exists a “fitness landscape” of potential neuron weighting variable states, and our optimization attempt is the process of working our way through this set of potential values to reach a low point in the axis of minimization. I’ll go ahead and grab another image off of Wikipedia to illustrate:

A fitness landscape with illustrative paths of optimization attempts — image source Wikipedia

For this illustrative picture of a fitness landscape optimization, the x and y axes could represent the weightings between two separate pairs of neurons, say θA-B and θB-C, and the z axis the value we are trying to minimize as described above. As was the case for one of our earlier illustrations, the dimensional properties are set here to a bare minimum of 3D for visualization purposes; in practice the number of weighting variables to consider would likely be considerably higher and so then would be the dimensions under consideration. Unlike machines, people struggle to comprehend (much less visualize) multidimensional interactions for dimensions much higher than what we experience in physical space — this is one of the machines’ big advantages over us humans (those who gain practice in linear algebra matrix manipulation may gain some intuition for higher dimensional models, but will still never get close to a machine), thus the need for a simplified representation. Before starting the optimization process, we’ll want to initialize our weightings to some random values, or preferably, if we have another model trained on some comparable application, to that second model’s weightings. As for how we go about reaching the minimum, there are multiple optimization algorithms to choose from, although currently those in use all rely on a form of backpropagation. In backpropagation the data points are initially fed through a randomly initialized set of weightings, then a gradient for the slope of the fitness landscape is derived by working from the output of the model backwards to the first layer. Using this calculated gradient, each of the weightings is updated in the direction of the slope, and then the process is repeated, first working forward through the model to evaluate the current cost function and then backwards through the model to derive weighting gradients for updating the weights. There are several variations on how these gradient updates are applied, such as mini-batch which only evaluates a randomly selected batch of data points in each update step, stochastic which only evaluates a single data point at a time (batch size = 1), or other more elaborate algorithms which incorporate concepts to steer the optimization path such as momentum and root mean square propagation. Some key challenges for any optimization algorithm include saddle points, where the slope of the fitness landscape may be zero along a particular axis even though it is not a minimum point along every axis, which will either cause our optimization path to get stuck or at least slow the path to reaching a satisfactory low point (and thus increase the computational toll of our training), or alternatively an optimization path getting stuck in a local minimum, sort of a valley surrounded by mountains even though some other valleys may have a lower state. Another common consideration for optimization algorithms will be the step size for following a path through the landscape (known as the learning rate): a step size too large may cause us to bypass lower points, a step size too small may slow down the training process. One potential alternative to backpropagation of interest that I’ll just mention briefly (as I am not an expert) is simulated annealing, or its quantum computing equivalent of quantum annealing — I’m working on some slides for a presentation on quantum computing which include some comments on this particular approach and expect to post them in this venue in coming weeks.
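
To make the optimization loop less abstract, here is a minimal sketch of mini-batch gradient descent on a cross-entropy cost with an added L2 regularization term, for the simplest possible "network" of a single logistic neuron (so the gradient can be written in closed form rather than via full backpropagation through hidden layers). The synthetic data and hyperparameters are illustrative assumptions only.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(500, 2)                      # placeholder feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # placeholder labels

w, b = np.zeros(2), 0.0                    # weightings before training begins
learning_rate, lam, batch_size = 0.1, 0.01, 32

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        p = sigmoid(Xb @ w + b)            # forward pass: predicted probabilities
        # gradient of cross-entropy plus the L2 penalty, for this mini-batch
        grad_w = Xb.T @ (p - yb) / len(idx) + lam * w
        grad_b = np.mean(p - yb)
        # step "downhill" on the fitness landscape, scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    # cost being minimized: cross-entropy plus the regularization term
    p_all = sigmoid(X @ w + b)
    cost = (-np.mean(y * np.log(p_all + 1e-9) + (1 - y) * np.log(1 - p_all + 1e-9))
            + 0.5 * lam * np.sum(w ** 2))

print("final cost:", cost, "weights:", w)
```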

One only needs two tools in life: WD-40 to make things go, and duct tape to make them stop.

- G. Weilacher

A core challenge of the training optimization problem goes back to one I brought up earlier — overfitting. Put simply, overfitting is what happens when our neural net’s weighting adheres too closely to the specific properties of the data in our original training set, and thus loses its generalization capabilities for evaluation of as yet unseen data — this is a condition of high variance error (as opposed to bias error, which will show up in both training and test evaluations). As an extreme example, a classification model that is completely overfit is one that only has the capability to recognize data points in the original training set but no predictive capabilities outside of that set. Some ways to address overfit include what is called regularization (this is what I was referring to in the description of the training process), which could be some additional terms added to the cost function that have the effect of handicapping or dampening the fitting to the original training data set by restricting the weight magnitudes — this will be one of the parameters likely requiring tuning after the initial training run. Another approach to regularization is to randomly “dropout” a percentage of the neurons in each training run, which forces the model to train alternate approaches to learning in its superimposed computations. We can also intentionally prevent the optimization algorithm from reaching the global minimum on the fitness landscape, such as by early stopping of the optimization run once evidence of variance error is detected.
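
Dropout in particular is simple enough to sketch directly; a minimal illustration of the "inverted dropout" variant applied to one layer's activations during training, where the 0.5 rate is just an example value:

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=np.random):
    if not training:
        return activations               # no neurons are dropped at prediction time
    # randomly silence a fraction of the neurons, then rescale the survivors
    # so the expected magnitude of the layer output stays roughly the same
    mask = (rng.rand(*activations.shape) >= rate).astype(activations.dtype)
    return activations * mask / (1.0 - rate)

hidden = np.random.rand(4, 10)           # placeholder hidden-layer activations
print(dropout(hidden, rate=0.5))
```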

There is no single correct way to train a neural net. You tinker, you experiment with parameters, hopefully feed it quality data and input, and eventually you might find yourself with a model with capabilities approaching or even exceeding your own. Like parenting, it is more of an art than a science.

Life must be lived as play.

- Plato

Well dear reader, you’ve made it this far and I thank you sincerely for offering me your ear (or your eye). I thought about trying to break this essay into multiple posts, but truth is am having too much fun so am just going to keep on marching forward. If you’re starting to glaze over a bit this is a reasonable point for an intermission; don’t worry, go ahead and like this post so you can find me, and I’ll be here when you get back. Our second half will deal with modern, more specialized neural networks and potential applications.

Part 2 — In Which We Continue Talking About Machine Learning and Stuff (and wrap things up)

Dizzy Gillespie — Swing Low Sweet Cadillac (Muppet Show version)

A complex system that works is invariably found to have evolved from a simple system that worked.

- John Gall

We started this journey by introducing some of the most fundamental ML concepts, algorithms, and practices. Those we address in this section will largely be derivative of these building blocks, but they will extend capabilities of the original neural net to capture more specialized behavior that may prove useful in specific applications.

Will first turn to a prominent application that gets a lot of attention — that of self-driving cars. A large part of the control of a self-driving car is derived from images such as those generated from video cameras or LIDAR (a laser based 3D imaging system). The processing and interpreting of images is a challenge that has evolved an elegant solution, again derived or at least inspired from brain function — convolutional neural networks. When the brain evaluates input from our vision, it does not process the image of the entire range of view simultaneously; its focus at any given instant is on a narrower range, I believe selected based on conditions such as point of attention, movement, or other psychological heuristics. The instantaneous attention range of our artificial convolutional networks is more systematic, with one of the parameters for operation being the size and step range (the stride) of a smaller window swept systematically through the entire image on each frame of a video, our neural net limiting focus to that smaller swept window but evaluating the entire frame through the process of sweeping across the field. Each subsequent hidden row of the corresponding neural network, as the graining of the window grows coarser, will be expected to pick up features of increasing complexity from the image — for example an early row might detect edges, a subsequent row might categorize those edges into types of shapes, a late row might categorize those shapes into classes of roadway features / vehicles / pedestrians, and the output row would generate the inputs to automobile operation. Although to be clear the programming of these neural net hidden rows won’t be done by intention of programmers; all the people are doing is creating the architecture of the network connections and associated parameters, providing labeled data, and then running the system through the training optimization problem — it is through the training that the machine will derive / learn on its own what features to look for, in what order, and the corresponding actions for operation of the vehicle. The labeled data that we feed into a convolutional neural network for the purposes of training our self driving car example would likely be recorded videos from operation and corresponding details of interface points (steering wheel position, speed, acceleration, braking, blinkers, horn, windshield wipers, etc.) from a collection of human drivers (although it is also possible that one could increase the amount of available training data significantly by generating virtual driving experience data through a computer simulation, I am not sure to what extent this approach is currently in use by industry participants). The challenge of extending self driving cars from current use cases such as interstate driving to fully autonomous operation on all terrain stems partly from philosophical questions such as the trolley car problem extended to analogous driving conditions, although probably a bigger consideration is the extended range of complexity and potential outlier conditions outside of a more controlled interstate setting. The rarer a condition, the more likely comparable situations are to be missing from the training data. Full rollout of autonomous vehicles outside of interstate driving may require some improvements to ML algorithms to allow them to be trained and to infer generalizations from less specific corresponding training data, or alternatively perhaps simply increasing the amount of training data available by an order of magnitude or few will be sufficient.
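
A minimal sketch of what such a convolutional architecture might look like, assuming tf.keras; the input resolution, layer sizes, and the single steering-angle output are hypothetical choices for this illustration, nothing resembling a production self-driving stack.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # early layers sweep small windows across the frame, picking up edges
    layers.Conv2D(16, (5, 5), strides=2, activation="relu",
                  input_shape=(120, 160, 3)),
    # later layers combine those edges into shapes and higher level features
    layers.Conv2D(32, (3, 3), strides=2, activation="relu"),
    layers.Conv2D(64, (3, 3), strides=2, activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    # output: a single continuous steering command (a regression target)
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```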

“Yes,” said Deep Thought. “Life, the Universe, and Everything. There is an answer. But,” he added, “I’ll have to think about it.”

- Douglas Adams, The Hitchhiker’s Guide to the Galaxy

Interpreting and interacting with an environment has until now in this discussion required collecting labeled data in advance for training purposes. For cases where we don’t have data in advance but want to develop a model for addressing an environment, there is a tool known as reinforcement learning (RL). The training tool for reinforcement purposes will be some kind of metric or key performance indicator actively updated based on actions directed from the model. The algorithm will operate by exploring initially random potential interactions with the environment, and then those that produce an improvement to the training metric will reinforce the originating behavior. One popular demonstration vehicle for this style of learning is a typical childhood hobby: video games. The points collected in a video game, such as say from collecting coins and mushrooms or jumping on turtles, are well suited to serve as such a metric. When the machine starts the learning process, it’s not even programmed to know that the arrow buttons make a figure move or what button jumps etc (unless say the weightings were initialized from a training of some comparable game, in which case the machine may then have an early clue); all of this is discovered independently by the algorithm by random manipulation of the points of interface, all while watching the point tally metric for confirmation of whether an experiment is worth repeating. Through this learning process a machine may find its way to some fairly elaborate strategies of gameplay, although no word yet on whether a computer has been able to independently discover the cheat code Up Up Down Down Left Right Left Right B A Select Start.
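
A minimal sketch of the simplest version of this idea, tabular Q-learning, using a toy "walk to the end of a corridor for a point" environment invented purely so the loop is runnable; the environment, reward, and hyperparameters are all hypothetical stand-ins for a real game interface.

```python
import numpy as np

class ChainEnv:
    """Toy stand-in environment: step right (action 1) to reach a reward."""
    def __init__(self, length=8):
        self.length = length
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos = min(self.pos + 1, self.length - 1) if action == 1 \
            else max(self.pos - 1, 0)
        done = self.pos == self.length - 1
        return self.pos, (1.0 if done else 0.0), done, {}

env = ChainEnv()
Q = np.zeros((env.length, 2))            # value estimates, learned from scratch
alpha, gamma, epsilon = 0.1, 0.95, 0.2   # learning rate, discount, exploration

for episode in range(500):
    state, done, steps = env.reset(), False, 0
    while not done and steps < 200:
        # explore at random occasionally, otherwise exploit current estimates
        if np.random.rand() < epsilon:
            action = np.random.randint(2)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, _ = env.step(action)
        # reinforce actions in proportion to how they improved the score metric
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state, steps = next_state, steps + 1

print(np.argmax(Q, axis=1))              # learned policy: mostly "step right"
```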

You sort of start thinking anything’s possible if you’ve got enough nerve.

- JK Rowling

Image interpretation or manipulation is a handy illustrative mode of ML application, and we’ll eventually get to a discussion about what we can learn about other analogous applications from what is demonstrated through image examples, but first want to touch briefly on the data source of images as input to a training run. The extraction of general features and properties from image data can prove challenging when you consider real world variations of lighting conditions, shadowing, perspectives, obstructions, and line of sight. One practice that has helped to deal with this variability is to multiply the available training data by applying various transformations to copied training images and adding the results to the labeled training data set. For example, we could potentially rotate images, mirror them, obscure features, change the brightness, or even overlay them with random white noise. By increasing the amount of training data with these transformed images, we expect to improve the resulting accuracy of our model in its extraction and interpretation of key features.
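
A minimal sketch of that kind of augmentation applied to a single image array, assuming images are stored as floating point values in [0, 1]; the particular transforms and their ranges are illustrative choices, not a recommended pipeline.

```python
import numpy as np

def augment(image, rng=np.random):
    """Return a randomly transformed copy of an (H, W, C) image array."""
    out = image.copy()
    if rng.rand() < 0.5:
        out = out[:, ::-1, :]                    # horizontal mirror
    out = out * rng.uniform(0.7, 1.3)            # random brightness scaling
    out = out + rng.normal(0, 0.02, out.shape)   # light white-noise overlay
    return np.clip(out, 0.0, 1.0)

# each original labeled image can yield several augmented training variants
image = np.random.rand(64, 64, 3)                # placeholder image
augmented_batch = np.stack([augment(image) for _ in range(8)])
print(augmented_batch.shape)                     # (8, 64, 64, 3)
```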

The game of life is not so much in holding a good hand as playing a poor hand well.

- HT Leslie

While the convolutional neural network evaluates and updates output based on a frame by frame analysis, one frame at a time, there are some applications where we may want to generate output based not just on the current input state but also as a function of prior conditions and their progression over time. Some examples could include evaluating investment criteria as a function of evolving market conditions, or perhaps providing movie recommendations to a user over the years as they mature from childhood to young adulthood and their tastes change accordingly. Another important example is language interpretation, where we would miss out on the meaning of a passage by looking at every word in isolation — it is also through the grouping, order, or repetition of statements or themes that meaning can be inferred. The appropriate neural network variant for this type of transitory data feed is known as a recurrent neural net. A recurrent net is fed a progression of data, which could be the continuous progression of stock prices or alternatively a discrete collection of intermediate states such as a user rating their favorite PIXAR movies. The defining feature of the recurrent net architecture is that when producing an output for a given input state, the neurons are subject to the influence not just of those preceding them in the current time step’s input but also of their corresponding neuron values from the prior time step. The influence of state from the immediately preceding time step, which itself was influenced by the time step before and so on in a recursive fashion, means that our model will have a kind of memory of the progression of states over time, and is thus capable of acting on such evolution. The architecture I describe does have a limitation worth noting. As the time steps progress, influences of states much prior to the current time step progressively weaken with each iteration — this is known as the vanishing gradient problem (or alternatively can generate an equally obstructive exploding gradient). In order to address this issue, architects have created a simple solution with an oxymoronic name — Long Short Term Memory, or LSTM for short. The idea is that a channel is built into the architecture to ensure features from earlier time steps have the ability to reach and influence current states, a time capsule of sorts. It is also possible, although less popular, to set parameters to include cells of arbitrary duration of memory; say you want the model of investment criteria to have some influence from market behavior in prior decades but more weighting on the most recent months, you might then derive a preferred weighting of various memory nodes (of course given current styles of active investor holding periods the time scales under consideration may be considerably shorter than decades or months — this is why a retail investor is best served by long term holding periods such as with an S&P 500 index fund, they’re then competing less with the algorithms in this approach).
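
A minimal sketch of a recurrent model along these lines, assuming tf.keras; the thirty-step window of a single price series and the random placeholder arrays are hypothetical, included only to show the expected shapes.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

timesteps, features = 30, 1    # e.g. thirty prior observations of one price series
model = models.Sequential([
    # the LSTM carries state forward across the thirty time steps, giving the
    # model a memory of the progression rather than a single snapshot
    layers.LSTM(32, input_shape=(timesteps, features)),
    layers.Dense(1),           # predict the next value in the sequence
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(200, timesteps, features)   # placeholder sequences
y = np.random.rand(200, 1)                     # placeholder next-step targets
model.fit(X, y, epochs=2, verbose=0)
```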

image via WALL·E

Each of the specialized algorithms from this section addresses, on its own, some niche of problem style. The state of the art and expanding horizons of research are taking place not only in the creation of new silos of this nature, but also in the combination of and interactions between multiple of these approaches. One such new amalgamation that has demonstrated some eye-opening results is known as the generative adversarial network (GAN). In this approach the goal is to use a trained network to generate new data that is lifelike and representative of the properties that would be found inside our training set. For example one may wish for the machine to generate pictures of imaginary birds based on a textual description. Or alternatively we may want to generate an image of what some teenager would look like after getting glasses, braces, or a new haircut by extracting properties from multiple images and combining them per some textually defined description. These are all possible with GANs. The generation is achieved by pairing a generative algorithm with a classifier algorithm that serves as a kind of reinforcement teacher. As the generator attempts life-like creations, the classifier reinforces those aspects that can pass for real data and rejects those that do not, so the two have a kind of adversarial back and forth competition. Through the interaction of the two, the generated data comes much closer to the representative training properties than what we could achieve otherwise.
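
A minimal sketch of that adversarial pairing, assuming tf.keras and substituting random arrays for a real image dataset (a placeholder only, so the loop runs); the layer sizes, dimensions, and step counts are arbitrary illustrative choices.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

noise_dim, img_dim = 64, 28 * 28        # flattened grayscale thumbnails, say

# generator: maps random noise vectors to synthetic flattened images
generator = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(noise_dim,)),
    layers.Dense(img_dim, activation="sigmoid"),
])

# discriminator (the classifier "teacher"): judges real versus generated
discriminator = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(img_dim,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# combined model used to train the generator: the discriminator is frozen here
# so only the generator's weights update when we push "fake" toward "real"
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

real_images = np.random.rand(1000, img_dim)   # placeholder for a real dataset
batch = 32
for step in range(100):
    # 1) train the discriminator on a batch of real and a batch of generated images
    idx = np.random.randint(0, len(real_images), batch)
    noise = np.random.normal(size=(batch, noise_dim))
    fakes = generator.predict(noise, verbose=0)
    discriminator.train_on_batch(real_images[idx], np.ones((batch, 1)))
    discriminator.train_on_batch(fakes, np.zeros((batch, 1)))
    # 2) train the generator (through the frozen discriminator) to have its
    #    output labeled as "real"
    noise = np.random.normal(size=(batch, noise_dim))
    gan.train_on_batch(noise, np.ones((batch, 1)))
```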

via “Generative Adversarial Text to Image Synthesis” by Reed, Akata, Yan, Logeswaran, Schiele, and Lee — Link
via “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford, Metz, and Chintala — Link

The demonstration of ML applications and tricks through image or video generation and manipulation is common among researchers; it makes for good virality as the results are easy to post and share on social media. But when you come across these demonstrations I suggest you try to look past the pictures to consider what the capability demonstrates in general, beyond simple images. Just as our earlier charts and graphs in the first half of this post were pared down to the simplest properties for visualization purposes even though the algorithms are capable of operating at a much higher scale of dimensionality, from the demonstration of ML capabilities through images we can infer capabilities in other modalities and at potentially greater sophistication and higher dimensional scales. By modality I mean the categorization of the ML environment of operation, such as the differences between images, video, speech, text, language, sound, music, market actions, or even eventually higher order considerations such as writing style, reasoning, emotion, or personality traits. If there is a noun to describe something, expect that we will be able to classify and depict it. If there is a defining feature or characteristic of some data, expect we will be able to extract it even if we can’t necessarily describe it in our limited vocabulary or three dimensional imagination. If there is an adjective for a trait, expect we will find a way to measure or manipulate along that axis (if not yet, or not yet with very high fidelity, then eventually). If there is a metric to gauge success in an environment against some goal, expect we will be able to generate and improve via reinforcement avenues to reach that objective. It will even be possible to translate between different modalities, as we demonstrated above in converting from a textual description of a bird to generated images, just like translating between languages is done today.

excerpt from Yoshua Bengio NIPS 2015 Deep Learning Tutorial presentation — Link

The key input for all of these deep learning techniques and potential applications has been sufficient volume of labeled data to feed into the training algorithms. Our computers can learn, but for these techniques they require orders of magnitude more data than human children and their developing brains as they grow from a helpless newborn to needy toddler to curious child to rebellious teenager and finally to young adult with ideas of their own. Our brain has the capacity to generalize even from unstructured / poorly labeled data with very few points of training. The internet economy has fed some lucky recipient gatekeepers scores of crowd generated text, pictures, video, and purchasing histories — putting incumbent platform owners at a significant head start against upstarts and new entrants using today’s algorithms. The next paradigm of machine learning, which is already in view, will be able to learn and extract properties from even unlabeled data.

When a person can no longer laugh at himself, it is time for others to laugh at him.

- Thomas Szasz

Peter Drucker, who foretold the advent of the knowledge worker class, wrote in his book Technology, Management, and Society that “computers are morons”, “not capable of making decisions, only carrying out orders” — echoing the first computer programmer Ada Lovelace in that sentiment. But it is foreseeable that ML could one day disrupt this assumption, and approach human level generalization and comprehension capabilities even from sparse data as it crosses the bridge from ML to AI — the fact that our brains have this ability means it is at least possible. From this future generation of algorithms, coupled with scales of neural nets approaching that of human brains, there could be born an emergent intelligence with the capacity to exceed the meta domain capabilities of its parents, us humans. The caged pet that we have been training to perform tricks, do chores, and leap through hoops could someday jump from its pool and, who knows, maybe even give birth to unexpected objectives of its own.

Any life, no matter how long and complex it may be, is made up of a single moment — the moment in which a man finds out, once and for all, who he is.

- Jorge Luis Borges

The Smothers Brothers — The Saga of John Henry

Books that were referenced here or otherwise inspired this post:

Deep Learning — Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Artificial Intelligence: What Everyone Needs to Know — Jerry Kaplan

The Master Algorithm — Pedro Domingos

The Second Machine Age — Erik Brynjolfsson and Andrew McAfee

The Hitchhiker’s Guide to the Galaxy — Douglas Adams

Technology, Management, and Society — Peter Drucker

Superintelligence — Nick Bostrom

(As an Amazon Associate I earn from qualifying purchases.)

Albums that were referenced here or otherwise inspired this post:

LCD Soundsystem — LCD Soundsystem

Glassworks — Philip Glass

Swing Low Sweet Cadillac — Dizzy Gillespie

(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m an amateur blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on LinkedIn for professional inquiries (currently seeking work) or Twitter for personal.

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations.
