What is Machine Learning: A Point of View

André Sardao
Published in Nama Blog
12 min read · Jan 8, 2018


1. The size of the problem

Imagine the following scene: at the center of a football pitch sits an asymmetrical object composed of different parts. You may think of this object as a set of irregularly shaped sculptures. Suppose that, scattered in a disorderly fashion around the edges of the field, a few people observe this object, and that each of these observers describes what they see. It is only natural to expect these accounts to differ from one another, to the point that they may not even seem to be about the same object. This disparity is caused by the divergent viewing angles. No single account is sufficient to give a complete description of the observed entity. If we wanted complete knowledge of the object, we would have to reconcile all the standpoints we have.

Our goal here is to answer an interesting but capricious question: “What is Machine Learning?”. First, it is necessary to keep in mind that there are different types of, and approaches to, Machine Learning. Moreover, since Machine Learning is the outcome of efforts from distinct areas of human knowledge, we are bound to find a considerable number of viewpoints on the subject. The point of view of the computer scientist may differ from that of the statistician, and the electrical engineer’s may be very different from the philosopher’s. However, all of them must describe the same thing, and none of these points of view alone can provide a complete picture of the whole situation. For example, the more technical and applied view of the engineer will lack the broader, more theoretical considerations of which a philosopher is capable. We have a situation analogous to the parable of the football pitch described in the preceding paragraph.

For those reasons we could never dream of giving a definitive answer to the question posed. Instead, we can attempt one that touches lightly on a little of the science, within the limits of fidelity to the subject.

It is common to define Machine Learning as the recognition of patterns supposedly present in large amounts of data, in order to arrive at a rule that allows us to “make predictions” about new examples cropping up in the same context. We will not throw away this convention: in addition to illustrating it, we will argue that this definition is a natural complement to the scientific method. To exemplify what this scientific method would be, I see here the opportunity to speak a little about Physics, which is perhaps, quantitatively and conceptually, the most successful example of science.

In general, “learning” can be described as the process of transforming a set of information, obtained in some way, into knowledge. This definition of learning conforms to the above definition of Machine Learning in the following terms:

(i) Some “information” is extracted from a set of data.
It is crucial that the components of this dataset follow some underlying logic, meaning that they all originate from the same population. For example, in the context of image recognition, the dataset would consist of thousands of pictures of cats, each properly labeled “cat,” plus thousands of dog pictures labeled “dog,” plus thousands of photos of cars with the proper labels, and so on.

How we teach computers to understand pictures | Fei-Fei Li

(ii) The “process” is called training.
This is where, through the use of one of the so-called “Machine Learning algorithms,” the patterns in the data are recognized. Often such algorithms are versions of procedures already well established in Statistics. In Machine Learning, the programming paradigm differs from that of traditional programming: the logical flow is controlled more by goals set through a mathematical optimization process than by strict rules imposed by the programmer. Continuing with the example of (i): the model learns through a kind of “trial and error” process. The optimization algorithm “punishes” the model whenever a mistaken classification takes place, e.g., when it says that it “sees” an image of a dog but the image is actually of a cat. Training is considered complete when the model finally “learns” to classify the images in the dataset as well as it can, which is reached when the model ceases to improve its performance (a minimal sketch of the whole pipeline appears after item (iii)).

A.I. Experiments: Teachable Machine by Google.

(iii) “Knowledge” is obtained in the following way: once we have a trained model, we can hope that future cases will present similar patterns. Similar, but not necessarily the same, because they will most likely exhibit variations within the underlying “logic” of the examples already seen in training. From “understanding” these possible variations, in a statistical sense, derives the ability to make predictions about future situations. Following the image-recognition example, suppose the trained model is presented with a picture of a cat that was not in the original dataset. If the training was performed properly, we can expect the model to correctly classify this new photo as a cat.
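To make items (i)–(iii) concrete, here is a minimal sketch, not from the original post, assuming scikit-learn is available; since real image data would be unwieldy here, two made-up numeric features stand in for the “pictures”:

```python
# A minimal sketch of steps (i)-(iii), assuming scikit-learn is installed.
# Two hypothetical numeric features stand in for "pictures"; a real image
# classifier would use far richer features and models.
from sklearn.linear_model import LogisticRegression

# (i) a labeled dataset whose examples come from the same population
X = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]  # the "pictures"
y = ["cat", "cat", "dog", "dog"]                      # their labels

# (ii) training: the algorithm adjusts internal parameters to fit the data
model = LogisticRegression().fit(X, y)

# (iii) knowledge: classify an example that was not seen during training
print(model.predict([[0.85, 0.2]]))  # expected output: ['cat']
```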

2. A case of success

For millennia humankind looked up at the sky, compiling astronomical maps. In modern terms, one can say that these observations sought to collect data in order to recognize the patterns of movement of the celestial bodies. Based on these primitive observations, the Babylonians were able to predict lunar eclipses with considerable accuracy. Centuries later, the Greek mathematician and philosopher Thales of Miletus would be credited as the first man to predict a solar eclipse. Here we point out that both the lunar and the solar feats were achieved long before any knowledge of even the most basic principles of Physics had been established.

However, the ability to predict lunar eclipses proved fruitless for predicting solar eclipses. Otherwise, we would not have had to wait hundreds of years for Thales, the first to predict a solar eclipse: the Babylonians would have taken the knowledge they had of the lunar case and extrapolated it to the solar one. That is to say: if these two phenomena follow the same physical laws, why did the solution of one case not immediately allow the possibility of solving the other? Because, as noted in the previous paragraph, at that time no fundamental law explaining these phenomena had been discovered (the Earth was still believed to be flat).

The power to predict eclipses in the Babylonian era derived purely from inferences based on the periodicity spotted in the observations (the “Big Data” of the time). For example, lunar eclipses repeated every 18 years, ten days, and eight hours (i.e., the Saros cycle). This type of information applied only to particular cases, without being useful in general ones. That is only natural: if you bought a two-pound salmon today, it does not mean that all salmon will always have the same weight. That is, recognizing the period with which one specific manifestation of a phenomenon repeats itself is not sufficient to describe the periods of all the different possible expressions of the same type of phenomenon. In such cases, something else has to be learned. In Machine Learning, the worst thing that can happen is a model that fails to generalize beyond the particular cases it has seen. In the salmon case, you would need to buy far more fish, over a much longer period of time, to be able to infer a distribution for salmon size.

Physics seeks to understand the nature of “objects” such as matter, time, space, and energy and, essentially, how the interactions between such “objects” work. To these things, and to the relations between them, we give the name “physical phenomena.” In other words, physics aims to discover the fundamental principles that govern these phenomena. These descriptions should be expressed in the form of basic laws, usually as mathematical formulas. Typically such principles are induced from observations made in the past, which correspond to the “data” (Newton called this inductive reasoning), and must have their accuracy tested against future observations; that is, the knowledge contained in these principles and formulas should suffice to describe what will happen in future occurrences of the same phenomenon. It is crucial, in other words, that such laws be consistent with both past and future facts. In the Machine Learning world, the ability to predict the outcome of future observations is called the “power of generalization.” In the image-recognition task, a good power of generalization is illustrated by the capacity of our model to correctly classify unseen images, i.e., pictures of cars, dogs, cats, etc. that were not part of the set used for training.
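A common way to estimate this power of generalization, sketched below under the assumption that scikit-learn is available, is to hold out part of the data during training and score the model only on those unseen examples:

```python
# A hedged sketch of measuring generalization with held-out data
# (scikit-learn's small built-in digit images serve as a stand-in).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# accuracy on examples never seen in training estimates generalization
print("held-out accuracy:", model.score(X_test, y_test))
```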

Physics has been very successful in this endeavor of discovering fundamental principles. For example, Newton’s theory of universal gravitation (as well as, more generally, Einstein’s general relativity) can describe the trajectory of the Earth around the Sun (i.e., Earth’s orbit). The physical objects involved here are the Sun, the planet Earth, their respective masses, the distance that separates the center of one body from the center of the other, and finally the interaction between them. The orbit of the Earth follows a clear pattern: it describes an ellipse around the Sun that takes one year to complete. Based on data from astronomical observations made over millennia and following the efforts of Copernicus, Kepler, Galileo, Hooke and others, Newton understood, at least to a first approximation, the general principles governing gravity: the force is proportional to the product of the masses and inversely proportional to the square of the distance between their centers. This made it possible to describe the patterns present both in the “fall of an apple” and in the motion of all the planets of the solar system (except Mercury, which had to wait more than two centuries for Einstein to present a theory that “explained” its orbit), as well as their moons, the comets that approach us, and so on. This is a case of successful generalization.
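Written out as a formula, the principle reads

$$ F = \frac{G\, m_1 m_2}{r^2}, $$

where F is the attractive force, m_1 and m_2 are the two masses, r is the distance between their centers, and G is the gravitational constant.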

In Machine Learning, the quality of an algorithm must be measured by its capacity for generalization.

Machine Learning attempts a third way: through extensive data analysis, it tries to acquire generalization powers similar to those of physics, even when the fundamental theoretical principles are unknown.

3. If only everything was that simple…

Have you ever thought about how an electron, as a structure, is much simpler than a human being? Or how predicting the total falling time of an apple from the top of a three-meter-tall tree is a much easier problem than predicting today the outcome of next year’s elections?
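For the record, the apple problem really is that simple. Ignoring air resistance, a body falling from rest covers h = ½gt² in time t, so

$$ t = \sqrt{\frac{2h}{g}} \approx \sqrt{\frac{2 \times 3\,\mathrm{m}}{9.8\,\mathrm{m/s^2}}} \approx 0.78\,\mathrm{s}. $$

No comparably short formula exists for next year’s elections.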

In part, the success of physics can be explained by the simplicity of the phenomena it deals with. In other areas, such simplicity is often absent. While physics studies relatively primitive objects, such as the motion of massive bodies and elementary particles, fields that study relations between more complex entities, such as those between human beings, tend to fail when it comes to determining the fundamental principles governing such interactions. Our intellect usually disappoints us when the data present great complexity, whether because of their sheer quantity or because of the intricacy of their supposed patterns.

When we cannot understand all the basic principles of the phenomenon about which we want to make predictions, Machine Learning offers hope.

Someone may point out: “But this is already one of the roles of Statistics (more precisely, Inferential Statistics): to provide scientific treatment of situations where we do not know the principles governing the phenomenon we are interested in.” Yes, that is correct. So would Machine Learning be a competitor of Statistics? Not really: Machine Learning and Statistics are allies in the endeavor to extract from datasets information that may serve as the basis for good decision making. Formally, the two are brought together by the field known as “statistical learning theory.”

The point is that once the principles governing a phenomenon are known, we immediately know the patterns generated by this phenomenon. On the other hand, recognizing patterns does not necessarily imply immediately knowing the fundamental principles behind them. And most of the time, recognizing patterns is the best we can aspire to.

Jason Silva | To Understand is to Perceive Patterns

And here, when the amount of data is too large and its nature too complicated, Machine Learning can be of great help. That is to say, Machine Learning is an extension of the scientific method in the sense that it hopes to assist us in obtaining knowledge even when we are unable to fully understand the phenomenon we are studying.

For example, it is indeed very complicated to explain what makes a dog a dog, or a cat a cat. We humans differentiate the two animals more synthetically than analytically. That is, when we recognize a dog, we do it automatically, without resorting to any complex analytical process. Actually, even a dog can make such a distinction. In the same vein, when you meet someone you know, how do you know who it is? How can you tell that John is not Joseph? Do you measure the distance between the eyes or the thickness of the person’s lips? No, you simply recognize him or her. We have learned to repeat a classification process without knowing how to describe it precisely. In other words, it is a piece of knowledge obtained empirically.

For an image recognition Machine Learning model to succeed in the task of classifying whether a given image is of a cat or a dog, it needs only to be able to replicate the classification process after being exposed to zillions of examples, i.e., adequately labeled photos of these animals.

In other words, when it is difficult to find the basic principles governing a given phenomenon, our next natural move is to try to explain the patterns found in our observations as being generated by some well-known “statistical rule,” usually called a “probability distribution.”
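As a minimal sketch, assuming NumPy and SciPy are available and returning to the earlier salmon analogy, fitting such a distribution and “predicting” with it might look like this (the weights below are simulated, purely for illustration):

```python
# A minimal sketch: fit a probability distribution to observations,
# then use it to reason about future cases (assumes NumPy and SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weights = rng.normal(loc=2.0, scale=0.4, size=500)  # simulated salmon weights (lb)

# estimate the distribution's parameters from the data
mu, sigma = stats.norm.fit(weights)
print(f"estimated mean {mu:.2f} lb, std {sigma:.2f} lb")

# a "prediction": the probability that the next salmon exceeds 3 pounds
print("P(weight > 3 lb) =", 1 - stats.norm.cdf(3.0, mu, sigma))
```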

Long before the birth of Machine Learning, Statistics was a fully developed and successful area of Mathematics, already widely used in scientific research across a wide range of fields. This means that these probability distributions had already been extensively tested in applications.

4. A Certain Personal Disappointment

I must confess that I was absolutely fascinated when I first heard the term Machine Learning.

What is the point of saying that a machine learns? Does the machine get happier when it learns something? Does it understand what it has learned to the point of being able to extrapolate such knowledge into entirely different contexts? Do the people doing this Machine Learning thing know something that members of other scientific communities have yet to find out?

I must admit that, when I began to understand what it is about, I felt a bit blasé: classical “machine learning” is often relatively old statistical techniques allied to the processing power of today’s computers.

And deep down, I expected something a bit more magical. But it is also a relief that things turn out to be as they are: Machine Learning is nothing more than a natural evolution of the search for knowledge that mankind has been pursuing for thousands of years.

Boston Dynamics

And why do we say that the machine learns? It learns in the sense that it does not need hard-coded programming, i.e., the programmer does not need to anticipate every situation that can happen in the given context, for example: “if A happens, then do B.” Everything is controlled by a mathematical optimization process chosen by the programmer. The type of statistical model to be learned is also posited in advance by the programmer, but the machine, with the aid of the optimization algorithm, learns to adjust the necessary parameters.

So, essential to (supervised) Machine Learning is the introduction of an “error function.” The objective of the optimization process is to make the value of this function as small as possible. If the model “sees” the picture of a dog and “guesses” that it is a car, the value of the error function increases, which runs contrary to the objective. The optimization algorithm (the most widely used one has its origins more than two hundred years ago) then forces the model to adjust its parameters so that the next time the same picture shows up, the guess is not “it’s a car.” Normally, the parameters exist in colossal quantity and must be adjusted, at each step, so that the overall error decreases, not just the error attributed to that particular photo. Conversely, whenever the model gets the label right, the error decreases. When the error gets as small as possible, the training is complete.
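As a toy sketch, not the original post’s code, here is that loop in miniature: an error function shrunk step by step by gradient descent, with a two-parameter linear model (plain NumPy) standing in for the huge models of practice:

```python
# A toy training loop: an error function shrunk by gradient descent.
# Two parameters stand in for the "colossal quantity" of real models.
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0             # the "true" pattern the model must discover

w, b = 0.0, 0.0               # parameters, initially wrong
lr = 0.05                     # learning rate: size of each adjustment

for step in range(2000):
    pred = w * X + b
    error = np.mean((pred - y) ** 2)      # the error function (mean squared error)
    grad_w = np.mean(2 * (pred - y) * X)  # how the error changes with w
    grad_b = np.mean(2 * (pred - y))      # how the error changes with b
    w -= lr * grad_w                      # adjust parameters so the
    b -= lr * grad_b                      # overall error decreases

print(f"learned w={w:.3f}, b={b:.3f}, final error={error:.8f}")
```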

In an ideal world, we will have learned the probability distribution to which our data belong, and the model will then be ready to assist us in our tasks.

Perhaps, at the end of the day, we are the ones who learn through the machine. Or it may be that everything I wrote above is bullshit: it may be that indeed the machines learn the fundamental principles behind the phenomena they examine and that they have decided, at least for the time being, not to tell us their secret!

