[Deep Learning] Intro: Where did it come from?
Motivation — or the early beginnings
If you enjoy following tech trends, as well as the real-world applications built on them, chances are you have already been exposed to Deep Learning, Machine Learning, or the more broadly defined Artificial Intelligence. If you haven’t heard anything about those concepts before, I strongly recommend going after AI resources (like this one!), because… well, the world is about to change.
Although statistical learning techniques have been around for quite a long time (arguably since Legendre’s least squares method in the early 1800s), it was only between 2012 and 2014 that the so-called AI Winter came to an end. Academics had grown frustrated with the generalization and scaling abilities of algorithms such as ANNs (Artificial Neural Networks) and other learning methods, mostly because of their unsuccessful takes on high-dimensional data such as images, video, language, and robotics (i.e., sensing and planning).
What happened in 2012, you may ask? Well, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton published a paper demonstrating how they used Convolutional Neural Networks to win the ImageNet ILSVRC-2012 contest, surpassing the other contenders by large margins.
Consider these excerpts from the paper:
The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets
And this one:
Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
It might not be obvious why these excerpts are so important if you haven’t been closely following the advances in Machine Learning and data science in general, so let me briefly explain.
They used Neural Networks
Hey, I understand that nowadays (again, only if you are one of those deep-for-the-hype-learning folks) everyone is using them, and all the cool kids are all about ConvNets, LSTMs, dropout rates, and novel architectures, but back then nobody was using them.
OK, the big and noteworthy names like Andrew Ng, Yann LeCun, and others were already diving into the field, but neural networks had NOT reached widespread usage, nor earned much respect within the community, that’s for sure, and that’s why it is “ok” to state the stylized fact that nobody was really using them.
But don’t take my word for it; look at the competition data about the teams’ algorithms. Were they alone in the competition? Not at all! There were plenty of other top-notch scientific groups competing for the state of the art in image processing. Let’s have a look at their approaches (in the classification task):
Team — University — Abstract
ISI — University of Tokyo — “We use multi-class online learning and late fusion techniques with multiple image features. We extract conventional Fisher Vectors (FV) [Sanchez et al., CVPR 2011] and streamlined version of Graphical Gaussian Vectors (GGV) [Harada, NIPS 2012]. For extraction, we use not only common SIFT and CSIFT, but also LBP and GIST in a dense-sampling manner.”
LEAR-XRCE — INRIA/Xerox Research — “In our submission we evaluate the performance of the Nearest Mean Classifier (NCM) in the ILSVRC 2012 Challenge. The idea of the NCM classifier is to classify an image to the class with the nearest class-mean. To obtain competitive performance we learn a low rank Mahalanobis distance function, M = W’ W, by maximizing the log-likelihood of correct prediction [Mensink et al., ECCV’12].”
OXFORD_VGG — Oxford University — “In this submission, image classification was performed using a conventional pipeline based on Fisher vector image representation and one-vs-rest linear SVM classifiers.”
University of Amsterdam — University of Amsterdam — “We extend the fine grained coding approach of last years LSVRC classification winners [1] by fine grained color descriptors and a calibrated SVM with a cutting plane solver. We provide the cutting plane solver 10% of the negative examples per class to train an exact model (not stochastic).”
XRCE/INRIA — XRCE/INRIA — “Our low-level descriptors are the SIFT of [Lowe, IJCV 2004] and the color features of [Perronnin et al., ECCV 2010]. They are aggregated into image-level features using the Fisher Vector (FV) of [Perronnin et al, ECCV 2010].”
And…finally:
SuperVision — University of Toronto — “Our model is a large, deep convolutional neural network trained on raw RGB pixel values. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three globally-connected layers with a final 1000-way softmax. It was trained on two NVIDIA GPUs for about a week. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the globally-connected layers we employed hidden-unit “dropout”, a recently-developed regularization method that proved to be very effective. “
Look, even if you don’t understand what is written up there in the abstracts, just by looking at the keywords, don’t you notice something? SuperVision was the only team that mentioned neural networks, GPUs, softmax, or neurons, whereas the establishment of the time was all about SVMs (Support Vector Machines), Fisher Vectors, and other feature-extraction techniques.
So the use of neural networks itself was VERY novel.
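Just to make that novelty concrete: the SuperVision abstract above contains enough detail (five convolutional layers, max-pooling, three fully-connected layers, dropout, a final 1000-way softmax, non-saturating neurons) to sketch the architecture in a handful of lines of modern code. The sketch below is my own rough reconstruction in tf.keras, with approximate layer sizes of my choosing; it illustrates the shape of the network, not the authors’ original two-GPU CUDA implementation.

```python
# Rough AlexNet-style sketch following the numbers in the SuperVision abstract.
# Layer sizes are approximations chosen for illustration, not the exact
# published configuration.
import tensorflow as tf
from tensorflow.keras import layers

def alexnet_like(input_shape=(227, 227, 3), num_classes=1000):
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(96, 11, strides=4, activation="relu"),        # conv 1
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, padding="same", activation="relu"),   # conv 2
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding="same", activation="relu"),   # conv 3
        layers.Conv2D(384, 3, padding="same", activation="relu"),   # conv 4
        layers.Conv2D(256, 3, padding="same", activation="relu"),   # conv 5
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),   # the "hidden-unit dropout" mentioned in the abstract
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),  # 1000-way softmax
    ])

model = alexnet_like()
model.summary()  # roughly 60 million parameters, in line with the abstract
```

The remarkable part is that this handful of layers, trained end to end on raw pixels, is the entire model; there is no separate feature-extraction stage.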
They ran their computations on GPUs.
Wow. Graphics Processing Units running neural nets? That was something really new and interesting. GPGPU, which stands for General-Purpose GPU Computing, had been around since the late 2000s, although in a very low-level, raw form, mostly through NVIDIA’s CUDA extensions to C/C++.
To be frank, even in the early days of GPU computing you could feel there was something big in it, as papers started reporting crazy speedups of GPU-parallelized implementations over their serial CPU counterparts. Niches such as CFD (Computational Fluid Dynamics) and finance (option pricing, anyone?!) were the first to reap the benefits of the massive parallelism, but a real killer application was yet to come.
So in 2012, when the AlexNet paper was published, the community knew it had something special. Who would have guessed?! It turns out that the little gaming monsters created by NVIDIA were also remarkably well suited to running Convolutional Networks, and in particular to several layers of stacked convolutions.
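If you want to feel where those speedups come from, a toy comparison is enough: the same large matrix multiplication (the workhorse operation inside neural networks) run on the CPU and on the GPU. This is a sketch I wrote with PyTorch (which, to be clear, did not exist in 2012); the matrix size and repeat count are arbitrary, and the absolute numbers depend entirely on your hardware.

```python
# Toy CPU-vs-GPU timing of a large matrix multiplication.
# Not a rigorous benchmark, just an illustration of massive parallelism.
import time
import torch

def time_matmul(device, n=4096, repeats=10):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)                     # warm-up (avoids timing CUDA init)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()           # wait for the GPU to actually finish
    return (time.time() - start) / repeats

print("CPU seconds per matmul:", time_matmul("cpu"))
if torch.cuda.is_available():
    print("GPU seconds per matmul:", time_matmul("cuda"))
```

On typical hardware the GPU run comes out one to two orders of magnitude faster, which is exactly the kind of gap those early papers were reporting.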
Now, that’s important for a lot of reasons. You see, what is really encoded in those abstracts about which methods were used is an utter defiance of the visual computing and image processing community’s methods and core beliefs.
Everyone was talking about extracting features, curating them, processing them, and then feeding them to a classifier. There is even one team that specifically handles the special case of dog breeds and their body shape:
Since bodies of dogs are highly deformable, the parts being most reliably detectable are their heads. (…) Therefore, we use a simple head detector by applying a rough circle transform to find eyes and noses and then search for 3 circles that compose a triangle.
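To make the contrast concrete, here is what the “extract hand-crafted features, then feed them to a classifier” recipe looks like in miniature. This is a toy stand-in I put together with HOG descriptors and a linear SVM on scikit-learn’s tiny digits dataset; it is not a reproduction of the Fisher Vector pipelines quoted above, just the same philosophy at small scale.

```python
# Classical two-stage pipeline: hand-crafted features + shallow classifier.
import numpy as np
from skimage.feature import hog
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

digits = load_digits()  # 1797 grayscale images of size 8x8

# Stage 1: a human decides which features matter (here, gradient orientations).
features = np.array([
    hog(img, orientations=8, pixels_per_cell=(4, 4), cells_per_block=(1, 1))
    for img in digits.images
])

# Stage 2: a shallow classifier learns on top of those fixed features.
X_train, X_test, y_train, y_test = train_test_split(
    features, digits.target, random_state=0)
clf = LinearSVC(dual=False).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Notice who does the hard work: the human picks the descriptors, and the classifier only learns how to weigh them.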
And what about our friends at the University of Toronto? Well, I will quote them again, for the sake of contextualized comparison:
All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
Again, you need no expert to notice there is quite a different approach here. On one side we have people painstakingly building efficient feature extractors to recognize dogs (which were only part of the problem, by the way), and on the other side we have people who essentially say: “Hey, just throw more data and more computing power at me. I can handle this.”
This data-and-compute approach was yet to become the de facto standard in the community, which would head toward an ever-growing number of layers, thus becoming more… long? Nah. Deep! We then witnessed the emergence of deeper and deeper neural networks, and the field somehow ended up renaming itself Deep Learning.
The start of an era
In case you wondered: since the 2012 competition, GPU-based Convolutional Neural Networks have completely dominated the ImageNet competitions, as well as real-world imaging applications such as the facial recognition systems used by Baidu, Google, Microsoft, and Facebook.
To be fair, the Google Brain project had already started in 2011, with Andrew Ng working alongside Jeff Dean to study and develop brain-like intelligence on top of Google’s infrastructure. Google hired the whole AlexNet team after they published their paper and their results with convolutional networks. Google Brain, we must remember, by the time it published its foundational paper, popularly known for “discovering cats in YouTube videos without supervision”, was using neither GPUs nor convolutional neural nets.
Inspired by deep learning’s accomplishments in the visual computing community, other fields that had long been stuck in the same ballpark of error or quality measures started to carefully study whether they too could benefit from stacking layers of linear and non-linear processing units (called neurons) and training them with the backpropagation algorithm.
It turned out that yes, a lot of them could indeed benefit, and they have benefited a lot since then. In particular, the NLP community is now witnessing a revolution in its own field of work.
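If “stacking layers of linear and non-linear units and training them with backpropagation” sounds abstract, here it is in its most minimal form: a two-layer network learning the XOR function, written in plain NumPy so that the backward pass is fully visible. The architecture, initialization, and learning rate are arbitrary choices of mine for illustration.

```python
# Minimal two-layer network trained with hand-written backpropagation (XOR).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))   # layer 1 weights and biases
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))   # layer 2 weights and biases
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                  # the non-linearity

for step in range(10000):
    # Forward pass: linear -> non-linear -> linear -> non-linear.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: chain rule applied layer by layer (squared-error loss).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(3))  # should approach [0, 1, 1, 0]
```

The frameworks discussed later in this article do exactly this, only with millions of parameters and automatic differentiation instead of hand-derived gradients.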
One of NLP’s central problems, Machine Translation (MT), had long been addressed by statistical learners built on top of hand-engineered feature extractors (just as was extensively done in the visual computing community).
In November 2016, Google Brain’s latest creation, GNMT (the Google Neural Machine Translation system), started being used in the Google Translate application, in preference to the previous statistical translator.
Deep learning also delivered the biggest improvements in speech recognition of the last 20 years, both inside and outside of Google. Baidu is currently the benchmark in speech recognition and analysis, and has reportedly transformed itself into an AI-first, all-in-on-AI company. Everyone should pay attention to Baidu.
So, what now?
We live in great times of unprecedented scientific and business opportunity. Deep learning tools and algorithms are still newborns, and there is, for sure, a lot yet to come. We shall see entirely new businesses built by means of deep learning algorithms, and new decision-support systems created to enhance human abilities and create more value.
I find Andrew Ng’s idea particularly interesting:
Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years.
AI is the new Electricity.
But who is to possess and control the wealth generated by this new, rising economy? It is tempting to think of the algorithm-educated people. But as we move forward, we see that they are no different from the engineers trained during the Industrial Revolution: of crucial importance, but not the richest nor the most powerful.
The engineers needed the machines. The data scientists need the data.
That’s why Google open-sourced (read: gave away for free) its framework for creating deep learning models: TensorFlow.
Microsoft did the same with CNTK.
Facebook has a lot of open projects on GitHub, including ParlAI (for conversational bots) and fairseq (for Machine Translation and other sequence-to-sequence applications).
Amazon too: Deep Scalable Sparse Tensor Network Engine (DSSTNE)
Baidu: PaddlePaddle
Is that clear enough? Above are millions of dollars’ worth of research and development, all open for anyone to download, use, and modify (and also to suggest improvements to!).
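To show just how low the barrier has become, here is a minimal sketch (my own toy example, not an official tutorial) that uses TensorFlow, the first framework listed above, to train a small image classifier on the classic MNIST digits dataset in a few lines:

```python
# Minimal TensorFlow/Keras example: a small classifier on MNIST.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0    # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)
print(model.evaluate(x_test, y_test))                # [test loss, test accuracy]
```

A few years before 2012, getting this far would have required writing most of the machinery yourself.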
Concluding remarks
Business owners need to understand that they now have a data business, and that is very different from “going digital, going to the cloud, or just installing new ERP/CRM/BI software across the enterprise”. AI is now the core business. It should be run by the CEO, not by the IT guy who just fixed your e-mail.
My core belief is that executives themselves must become data-aware and AI-aware, instead of just delegating everything to a Head of AI. I do not mean that executives will have to learn to code, but they will have to learn to understand the models: their capabilities, their limitations, and how they fit (or don’t fit) into their business models.
Jeff Bezos, Mark Zuckerberg, Sundar Pichai… none of them codes, but I am certain they are fully aware of how their roadmaps integrate with AI.
Next!
In the next articles we will explore practical approaches (Hooray! With code, figures and examples!) to each one of the core applications of Deep Learning so far.
Join me in this nice adventure of captioning images, generating lawyer-ish articles, predicting the stock market, and other fun AI applications.
If you read this whole article, I am deeply honored and thankful for your time. I understand it is scarce, and I hope I have provided as much value as possible.
Please let me know if you have any comments or questions. I will do my best to answer everyone in the best possible way.
