Machine Learning — The New Programming Language

Raul Incze
Published in Cognifeed · 12 min read · Jun 16, 2018

Like many of you probably do, as soon as I get a “new” idea I instantly google it. I’ve been burned before, way too many times, getting hyped about a “new” product, a “new” research direction or a “new” socio-philosophical statement, only to find out that it has already been thought of, or implemented. So now the Google search precedes the hype…

But today I am hyped. For a few good reasons.

I’ve been having this seemingly crazy idea lately: that we’re on the verge of a paradigm shift in the way we tell machines (computers) what to do. From programming to machine learning. It feels so obvious and natural to me that I didn’t even bother sharing the thought with others. Yet recently I started to… and every time I tell it to people I get a few raised eyebrows. I took this as a strong hint that I’m onto something, so I googled it… The top results were about the best programming languages to do machine learning in… Time to get hyped!

Machine learning will become the predominant programming language. What I mean by this is that we will no longer imperatively tell machines what to do through lines of code, but rather teach them how to react, respond and behave in certain scenarios. In the next few paragraphs I want to roughly sketch a few arguments and try to convince you of this as well.

From machine code to machine learning

According to Wikipedia a programming language is “[…] a formal language that specifies a set of instructions that can be used to produce various kinds of output”. Which is quite accurate yet extremely generic. Let’s take a (very) short dive into the history of computer programming languages and try to extrapolate from there.

At first, the only way to tell computers what to do (aka programming them) was in machine code (binary) through the dreaded punch cards. Then assembly came along, bringing with it symbols, mnemonics and some very scarce semantics. While still very low-level, it made imperative programming a lot easier.

A picture most of you are probably familiar with: 4.5 megabytes of data in 62,500 punched cards, USA, 1955 (link).

FORTRAN entered the stage in the ’50s as the first high-level programming language that actually worked. It included 32 statements and mathematical notation that abstracted away the low-level operations. During the late ’60s and through the ’70s the programming language shotgun fired, spawning a few of the main fundamental approaches to programming: systems programming (C), the first object-oriented language (Simula) and the grandfathers of functional and logic programming (LISP, Prolog).

During the following consolidation period, two of the main approaches emerged and then merged: systems programming and object-oriented programming (C++, and later on Java, C# and most modern programming languages). The main pillar of their success was the ability to design system architectures that closely map our understanding of the world and its processes.

In the last few years, weakly and dynamically typed programming languages such as Python and JavaScript have seen an important rise in popularity and usage because of their friendliness and flexibility. At the same time, good practices for writing code have shifted towards more of a narrative craft (Scott Mountenay, Writing Code That Reads Like a Story) in order to make code bases easier to understand and maintain. Programmers today are writing less code, and code that is easier to read, than ever before.

A new paradigm is starting to take shape nowadays: differentiable programming. The term was coined by Yann LeCun (original post here), Facebook’s director of AI research, in an attempt to distance the machine learning field from the “outlived buzzwordiness” of “deep learning”. Various languages (like tensorlang), built on top of machine learning libraries, are starting to appear. At the same time, the idea of “software 2.0” (Andrej Karpathy, Software 2.0), where software is written in neural network weights rather than lines of code, is gaining a lot of attention.
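To make the “weights instead of lines of code” idea concrete, here is a deliberately tiny sketch (plain NumPy, with made-up toy data, not any particular framework’s API): instead of hard-coding the function f(x) = 2x + 1, we write a parameterized program and let gradient descent fill in its parameters from examples.

```python
import numpy as np

# "Software 2.0" in miniature: the "source code" of this program is the
# pair of parameters (w, b), and gradient descent writes it for us.
xs = np.linspace(-1, 1, 100)
ys = 2 * xs + 1                 # the behaviour we would normally hand-code

w, b = 0.0, 0.0                 # the learnable "program"
lr = 0.1                        # learning rate

for step in range(500):
    pred = w * xs + b                        # forward pass
    grad_w = np.mean(2 * (pred - ys) * xs)   # d(MSE)/dw
    grad_b = np.mean(2 * (pred - ys))        # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # converges to roughly 2.0 and 1.0
```

A differentiable programming language automates exactly the tedious part above: it derives the gradients for whatever computational graph you choose to write.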

We can view the differentiable programming paradigm as the low-level component of a machine-learning-oriented approach. If we follow the trend of previous languages becoming more and more abstract and closer to our human way of modelling the world and interacting with it (object orientation, narratives), the high-level component of such an approach would be the act of machine teaching. Through machine teaching we would “program” machines by showing or telling them what to do.

We could achieve this by automating what programmers do in a low level differentiable programming language: the creation of computational graphs. There are already many approaches that are looking to address this problem, either by finding one machine learning model to rule them all, or by trying to find the appropriate model architecture through reinforcement learning.
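As a very rough illustration of what searching for an architecture means (nothing like the actual reinforcement-learning-based systems, just the idea in miniature, with a handful of arbitrary candidate layouts and scikit-learn as the stand-in training loop):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Instead of a human picking the computational graph, enumerate a few
# candidate layer layouts, train each briefly and keep the best one.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

candidates = [(32,), (64,), (32, 32), (64, 32), (128, 64)]
best_arch, best_score = None, 0.0
for arch in candidates:
    model = MLPClassifier(hidden_layer_sizes=arch, max_iter=300, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_arch, best_score = arch, score

print(best_arch, best_score)
```

Real neural architecture search replaces this blind enumeration with a learned controller (or an evolutionary process) that proposes ever better graphs.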

The building blocks of general intelligence

It’s the thing post-humanists and researchers are dreaming about. An idea that marketers and con artists exploit to raise capital and media attention. A prophecy that various outlets turn into a fear-mongering clickbait machine. It’s artificial general intelligence, or AGI for short, also known as “strong AI”.

The current state of AI

Okay, okay… but let’s slow down for a moment and talk about the current state of AI. According to Andrew Ng, former chief scientist at Baidu and founder of Coursera, a good rule of thumb for where AI can be applied today is automating tasks that a typical person can complete with less than a second of thinking. Of course, there are exceptions to this rule, but it is sound in most cases.

Current AI is extremely narrow and very specialized. It is great at solving “A→B” (input to output) mapping types of problems (note how close this is to the definition of programming). The issue is that most trained models only know how to solve an instance, or at best a subset, of such tasks. We could, in theory, break down most of our intelligence into such granular tasks, but we would still be left with the problem of modelling the interactions between them.

Yet another problem is the amount of data needed to train any deep learning algorithm. Most of the algorithms applied and studied nowadays are various forms of learning statistical models and fitting curves to approximate functions. Therefore, the more training data we have, the more accurate our model will be (though the law of diminishing returns applies here as well, of course). This takes a tremendous amount of effort: both to gather and preprocess the data, and then to muster the compute power required to train models on these huge data sets.

To address this, machine learning engineers and scientists use a method called “transfer learning” (Jason Brownlee, A Gentle Introduction to Transfer Learning for Deep Learning). This means that they (or somebody else before them) train a model on a task with a vast and comprehensive data set and then take some (or all) of the parameters learned by the model to use as a starting point for further training on their own, smaller set. Usually the task we fine-tune for is a subset of the task the model was initially trained on, but that is not always the case.
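A minimal sketch of what this looks like in a Keras-style workflow; the base network, input size, number of classes and the (commented-out) data set below are placeholders, not a recipe:

```python
import tensorflow as tf

# Reuse the features a network learned on ImageNet and train only a
# small new "head" on our own, much smaller data set.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the transferred parameters

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. 5 new classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(our_small_dataset, epochs=5)  # hypothetical fine-tuning data
```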

A few approaches to machine learning

What we talked about up to this point is called supervised learning, where we train an algorithm on (A, B) pairs, making it learn a function f that maps A to B, where A is the actual data and B is the label, or annotation, for that entry. Unfortunately, in real-life scenarios we have a lot of As and not that many Bs… Fret not, as unlabeled data can still be useful. We can use unsupervised learning algorithms to find underlying structure in the data or to extract various features that we can then use to train supervised algorithms faster and on less data.
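Here is a minimal sketch of both flavours on a toy data set, using scikit-learn; the particular models and the number of components are arbitrary choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)  # A = images, B = digit labels

# Supervised learning: fit f so that f(A) ≈ B from labeled (A, B) pairs.
clf = LogisticRegression(max_iter=2000).fit(X, y)

# Unsupervised learning: no labels needed; find structure in A alone.
# The extracted features can then feed a (smaller) supervised problem.
features = PCA(n_components=16).fit_transform(X)
clf_on_features = LogisticRegression(max_iter=2000).fit(features, y)
```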

Somewhere between these two main approaches we find a strange yet fascinating take on ML called active learning. Here we start with a very small number of labeled instances and a much bigger data set of entries with unknown labels. The algorithm is trained in a supervised fashion on the labeled data and then tries to predict the labels for the rest. It then sends a query to a human “oracle” for the entries whose labels the trained model is most uncertain of. This way, instances that matter for the training process get labeled early on and the model converges to a solution more quickly (Burr Settles, From Theories to Queries: Active Learning in Practice). The “teaching” process is quite similar to the adorable teapot anecdote by Eric Jang.

The process of active learning
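Here is a minimal sketch of that query loop, using plain uncertainty sampling; the initial label count, batch size and model below are arbitrary, and the human oracle is simulated by revealing labels we already have:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
labeled = list(range(20))              # pretend only 20 labels exist
unlabeled = list(range(20, len(X)))

for round_ in range(5):
    clf = LogisticRegression(max_iter=2000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[unlabeled])
    uncertainty = 1 - probs.max(axis=1)          # low confidence = unsure
    query = [unlabeled[i] for i in np.argsort(-uncertainty)[:10]]
    labeled += query                             # "oracle" reveals labels
    unlabeled = [i for i in unlabeled if i not in query]
```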

But what if there is no human expert to answer these AI-generated queries? And what if there is little to no data from which an algorithm or agent can learn? Well, that’s fine too. We can train algorithms that learn from past experiences, from their mistakes and successes, guided by some human-designed reward policy. This is called reinforcement learning. But it is no silver bullet. For models to converge and learn a task they require a significant number of experiences (or samples), while also being quite inefficient at learning from them. A good article describing the biggest pains of this field is Deep Reinforcement Learning Doesn’t Work Yet by Alexander Irpan. I encourage the more ML-savvy of you to read it.
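For the curious, here is what “learning from experience, guided by a reward” looks like in its simplest tabular form; the five-cell corridor environment and all hyperparameters below are made up purely for illustration:

```python
import numpy as np

# Q-learning on a toy corridor: the agent starts on the left and is
# rewarded only for reaching the rightmost cell.
n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.3

for episode in range(500):
    s = 0
    while s != n_states - 1:
        a = np.random.randint(2) if np.random.rand() < epsilon else Q[s].argmax()
        s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[:-1].argmax(axis=1))  # learned policy: move right in every state
```

Even on this trivial task the agent wanders around for quite a while before it first stumbles on the reward, which hints at the sample inefficiency the article above describes.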

Data on its own might be too dumb

Yet not even solving reinforcement learning guarantees the spawning of AGI. A few of the issues with RL might be addressed by a shift away from treating models as a “tabula rasa” before training occurs (note how transfer learning is a shy step away from this) and towards defining innate machinery for machine learning. You can find a fascinating debate on this subject between Gary Marcus and Yann LeCun here.

At the core of everything we focused on in the last few paragraphs reside various statistical models that find correlations in the data and express a result based on those correlations. This has been the main focus of the literature and the industry for the past few decades as well. Some expect that scaling up current methods and models, fueled by significant future advances in compute power, will emergently lead to some form of true intelligence, and it is hard to argue the contrary. Yet to reach that point, I personally believe that we need a breakthrough in the way we process data, and we are currently quite far from that.

There might be a shortcut to AGI: mimicking our intuitive model of the world through causal inference. A causal reasoning model is a departure from the classical, purely probabilistic ones. A good example that illustrates the difference is the one given by Judea Pearl, one of the grandfathers of AI and the main preacher of causal reasoning, in his latest book, The Book of Why: The New Science of Cause and Effect. A statistical model would learn that there is a correlation between the barometer reading and an upcoming storm: if the barometer reads a low value, there is a good chance of a storm. But such a model would never know that the storm is not caused by the barometer needle going down. A causal inference model, on the other hand, would be aware that the upcoming storm causes the barometer reading and not the other way around.
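A tiny simulation (with made-up numbers) makes the distinction tangible: conditioning on a low barometer reading makes a storm look very likely, but intervening on the barometer changes nothing, because the reading does not cause the weather.

```python
import numpy as np

# Pearl's barometer story in toy form: atmospheric pressure is the
# common cause, driving both the barometer reading and the storm.
rng = np.random.default_rng(0)
pressure = rng.normal(size=100_000)
barometer = pressure + 0.1 * rng.normal(size=100_000)
storm = pressure < -1.0

# A purely statistical model happily conditions on the reading:
print(storm[barometer < -1.5].mean())  # storm looks almost certain

# But forcing the needle down (an intervention, do(barometer = low))
# leaves the storm untouched, because the reading causes nothing:
print(storm.mean())                    # baseline storm probability
```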

XKCD seems to have a comic strip for everything.

From my point of view, none of these directions on its own will manage to achieve true intelligence, but they are all necessary building blocks for designing a human-like general AI. We could go on for tens of other paragraphs on how all of these blocks fit together, but I will leave that for a future blog post. But why this obsession with AGI, and what does it have to do with my whole “ML is the new programming” argument? Well, once we reach such a level of intelligence, most programming will become obsolete. We will simply communicate with machines as we do today with our human peers.

But until all of that happens, the technological playing field and our society might shift so much that this discussion becomes irrelevant. So how about the near future? Why do we need to shift from programming to machine teaching in the short term? Because automation is already here. We need to not only interact with machines but also collaborate with them, on a large scale, in order to tap into their true potential.

The need for human-computer collaboration

Automation is inevitable and well on its way. You probably already “know” exactly in how many years you are going to lose your job to robots, thanks to a plethora of websites predicting it. And while these predictions might not be accurate, sooner or later we will definitely get rid of any work we don’t actively want to perform.

Want is a key word here. Even if there’s nothing stopping future machine intelligence from automating all tasks related to most careers, we might still find certain activities intrinsically valuable, even after all extrinsic reward for having a job perishes. We will probably continue to find a greater purpose and meaning in our work. Of course, labour will be abolished almost entirely and we will see an interesting shift from profession to vocation. Most of the people still working will do so in creative fields (including science).

Yet if we want to work, there is a strong chance that we will also need to collaborate with other entities performing the same type of work, or adjacent work. And the bulk of these entities will be machines. We do have some tools today to interact with them. We have the entire field of Human-Computer Interaction (or HCI for short), a sub-field of computer science, studying this vast topic. But most of our interaction with machines is severely limited. We can use visual interfaces (web, apps), natural language (personal assistants, chatbots) or even brain-computer interfaces as inputs. Despite the apparent variety and depth of these interactions, they are all sandboxed and restricted to triggering routines pre-programmed into the system we’re interacting with.

Everything changes if we know programming. Programmers are sometimes referred to as “the wizards of the future”, and for good reason. They control the “magic” needed to tap into the true potential of silicon machines. Unfortunately, not everybody has the resources needed to pick up programming from scratch, and even if everybody did, writing code in a programming language might distract us from the goal and the reason we’re writing that code to begin with. Yet nearly everybody speaks a language or can give instructions by showing. And we already have the natural language processing toolkit and the ability to build interfaces that would enable this kind of machine teaching.

Machine learning could enable anyone to be a programmer by allowing them to modify the behaviour and functionality of systems according to their own will. Facilitating this meaningful and deeper interaction with silicon systems will pave the way for future human-machine collaboration. Under this scenario, the democratization of ML becomes imperative. We can’t allow a privileged few (that know how to code, design and train models) to control the whole automated workforce of the future.

CogniFeed and ML democratization

I’ve thought long and hard before using the term “democratization”. My reluctance stems from the fact that the term has become a buzzword as of late, almost devoid of meaning because of its overuse. But I feel that this word best describes what CogniFeed sets out to do.

I started working on CogniFeed more than a year ago, trying to build on what various ML libraries — both lower-level ones like TensorFlow, Caffe and Torch, and high-level ones such as Keras — are doing in terms of democratizing the field. CogniFeed is a platform that removes most of the friction from training and deploying a machine learning solution. It enables non-technical people to train ML models as easily as they would teach a small child.

At its core, CogniFeed combines unsupervised and supervised learning, gluing them together in a real-time active learning scenario that enables you to train models in a matter of minutes, with very little data and without writing a line of code. Right now we are focusing on classification and regression for computer vision tasks, but we plan to expand its capabilities in the near future. We’ll go into more detail about the algorithms and the core ensemble that enables CogniFeed’s capabilities in a series of future blog posts. Until then, you can find more info on our ML platform by heading to our website.

Closing remarks

We’ve finally reached the end of our lengthy first Medium post. I’ve tried to argue for my initial statement from two different points of view: a programming perspective and an ML/AI one. We’ve seen how the trends in programming might extrapolate into a generic “show or tell” approach for instructing machines how to perform tasks. On the other hand, we’ve noted how, in the context of machine intelligence, programming would become almost obsolete. Both of these trends, fueled by the imperative need to democratize control over machines, will spawn a new paradigm of extending and enhancing the software capabilities of silicon: a transition in popularity from programming to machine learning.

At CogniFeed, we want to be at the forefront of this transition, helping to put ML at your fingertips.

Signing out until next time, ML freak and CogniFeed founder,

Raul
