Nope, Deep Learning is not enough

Melvin Varughese
Published in Data at Atlassian · Nov 18, 2020

Despite BERT, AlphaGo and ImageNet domination…

Photo by Nadir sYzYgY on Unsplash

The 2010s have been a transformative decade for Machine Learning, much of it driven by innovations within Deep Learning. Theoretical advances, starting in the late 80s and running right through the aughts, have combined with ever-larger datasets and increasing computational power to give Deep Learning a profound impact in areas such as image recognition, gaming, natural language processing and speech recognition. Some of the most prominent milestones include ImageNet-scale image classification, AlphaGo's victories at Go and BERT's advances in language understanding.

With such high-profile milestones, it shouldn’t be surprising that many people who are eager to expand their skillset choose to prioritize the mastery of Deep Learning — often, unfortunately, at the expense of training in other more fundamental areas. For both dilettante and practitioner, there can be a bias that implicitly assumes that any solution that is based on Deep Learning is state-of-the-art. This bias can constrain innovation, preventing the investigation of alternative solutions.

It is important to understand Deep Learning's limitations and the right context for its use. This first article of a two-part series focuses on Deep Learning's strengths and shortcomings; a second article will cover some undervalued skills that are more important than Deep Learning.

Where Deep Learning excels

The application of Deep Learning has been particularly fruitful within the field of perception. From transcribing speech to classifying images to understanding natural language, the best performing machine learning algorithms all use deep neural networks. All these areas of application have two things in common:

  1. the ground-truth is extremely complex, defying description with a simple set of features or rules
  2. the level of noise present in the data does not obscure the underlying structure of the ground truth

When the above two conditions are coupled with a very large dataset, Deep Learning is in its element. The large number of parameters within a deep neural network together with a large set of labelled data enables the model to learn extremely complex relationships.

Relative Performance of Deep Learning when the ground-truth is complex (source: Sumo Logic)

However, many datasets do not meet the above conditions. Indeed, in Machine Learning competitions it is not uncommon for other algorithms, particularly gradient-boosted trees such as XGBoost, to outperform deep neural networks.

Where Deep Learning is unlikely to work

In many scenarios, Deep Learning should not be the first… nor the second model that you train. That said, it is impossible to give a definitive rule for when Deep Learning will fall short. What follows is a range of scenarios where it is unlikely to be the best choice.

1. When the dataset is small

A typical deep neural network will have millions of parameters. The more parameters you have, the more data you will need to control the risk of overfitting. If your dataset is small, you tend to be better off starting with a simpler model with far fewer parameters.
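
To make this concrete, here is a minimal sketch comparing a heavily parameterised neural network with a plain logistic regression on a small synthetic dataset; the dataset and hyperparameters are arbitrary illustrations, but on samples of this size the simpler model is frequently just as accurate, if not more so.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# A small, noisy dataset: 300 observations, 20 features, only 5 informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

simple = LogisticRegression(C=1.0, max_iter=1000)        # few parameters
flexible = MLPClassifier(hidden_layer_sizes=(256, 256),  # many parameters
                         max_iter=2000, random_state=0)

for name, model in [("logistic regression", simple), ("neural network", flexible)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```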

There are some cases where Deep Learning remains feasible despite a small dataset. For instance, if your problem space overlaps with a well-established area for Deep Learning, it may be possible to take a pre-trained model (whose weights were learnt on a massive dataset) and simply update its weights using your more modest dataset. This is an example of transfer learning. The resulting model leverages the information from the much larger dataset but is still tailored to the peculiarities of your problem space. However, outside human perception problems, you are unlikely to find a pre-trained Deep Learning model that is suitable for your needs.
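
As a rough sketch of what transfer learning looks like in code, assuming an image-classification task with binary labels and only a few thousand images (the architecture and hyperparameters below are illustrative choices, not a recipe):

```python
import tensorflow as tf

# Load a network pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False,
    weights="imagenet", pooling="avg")
base.trainable = False  # freeze the weights learnt on the massive dataset

# Add a small head that will be trained on the modest, task-specific dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary labels assumed
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])

# small_train_ds / small_val_ds are hypothetical tf.data datasets of your images:
# model.fit(small_train_ds, validation_data=small_val_ds, epochs=10)
```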

Also, some neural network architectures are surprisingly data-efficient despite their large number of parameters, providing reasonable results with just a couple of thousand observations; Convolutional Neural Networks, for example, are known for this (source: Chollet, F. 2018. Deep Learning with Python, p. 130).

In an era of ever-increasing volumes of data, it may seem reasonable to assume that small datasets will become increasingly rare. However, this is not necessarily the case, for two main reasons. Firstly, labelling data often requires human judgement, which is expensive: in many scenarios, obtaining new data is cheap, but labelling each observation is prohibitively expensive. Secondly, the volume of data available depends on the industry generating it. The largest B2C companies (enterprises whose end users are consumers), such as Google and Facebook, serve billions of customers, each of whom has data recorded for them. Consequently, such companies have accumulated enormous datasets that enable Deep Learning models to deliver superior results. Contrast this with B2B companies (enterprises whose end users are other businesses), where even the very largest will have tens of thousands of customers. In such cases, alternative models are likely to be more appropriate.

2. When the training set is not representative

The greatest strength of deep neural networks is their ability to automatically learn the best features (or representations) for a particular challenge. However, this strength can sometimes be a weakness: the resulting models can be brittle since the learnt representations do not generalize well outside the dataset. This is particularly problematic when the training data is not representative of the data over which we wish to apply the trained model.

There are at least two reasons why the training set may not be representative of the test set:

  1. In the case of time series data, the data could be non-stationary. The trained Deep Learning model may produce exemplary results shortly after the model is trained, but as conditions deviate further from those under which the model was trained, the model’s performance rapidly deteriorates.
  2. It may be easier to provide labels for a subset of the data. In medical studies, some patients may be easier to track than others. In astronomy, it is easier to study nearby or brighter objects. Consequently, the labelled data can be biased towards those data points that are easier to label. In such cases, the cross-validated performance on the training set may grossly over-estimate the performance on the test set.

When the training set is not representative, it will often be better to hand-craft features that are more robust to the biases in the training set. In the case of a non-stationary time series, adaptive models that can be quickly trained on recent data and whose primary goal is short-term forecasting will be more appropriate.
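
A minimal sketch of such an adaptive approach, assuming a univariate series and an arbitrary choice of window length and lag structure:

```python
import numpy as np
from sklearn.linear_model import Ridge

def rolling_one_step_forecasts(y, window=104, n_lags=4):
    """Refit a small lag-based model on only the most recent `window`
    observations before forecasting each next point."""
    y = np.asarray(y, dtype=float)
    preds, actuals = [], []
    for t in range(window, len(y)):
        history = y[t - window:t]  # recent data only; older regimes are ignored
        # Lag matrix: each row of n_lags consecutive values predicts the next one.
        X = np.array([history[i:i + n_lags] for i in range(len(history) - n_lags)])
        target = history[n_lags:]
        model = Ridge(alpha=1.0).fit(X, target)
        next_lags = history[-n_lags:].reshape(1, -1)  # the most recent lags
        preds.append(model.predict(next_lags)[0])
        actuals.append(y[t])  # what actually happened, for monitoring drift
    return np.array(preds), np.array(actuals)

# e.g. preds, actuals = rolling_one_step_forecasts(weekly_sales)  # hypothetical series
```

The point is not the particular model (a ridge regression on lags is just a placeholder) but the structure: retrain cheaply and often, and only trust the forecasts closest to the training window.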

3. When model interpretability is important

The reality is that incremental bumps in model performance are unlikely to have a transformative impact on most businesses. What matters more is first using analytics to uncover actionable insights and then convincing stakeholders within the business to make model-informed decisions. To foster trust in your findings, your audience must understand the dynamics driving the current state of the business and how appropriate changes can drive better outcomes. Most such stakeholders will have a limited understanding of data science, which makes effective communication difficult. Unfortunately, neural networks are not amenable to interpretation, often making them a poor choice when communication matters.

Advocates of Deep Learning may point to recent work on model interpretability. With CNNs, for example, it is easy to visualize the filters the model learns, or the relative importance of different parts of an image in classifying it a certain way. However, unless you are in the business of automating human perception, little of this work is relevant to communication in a business context.

In contrast, competing models such as XGBoost will use features that are easy to understand. Many alternatives to Deep Learning also provide outputs that are useful for dissemination. These include:

  • variable importances: identify which features are the primary drivers of the variable you are predicting. In the case of Deep Learning, the learnt representations are uninterpretable, so knowing their relative importance is not helpful.
  • partial dependence plots (or similar): help with understanding the relationships between individual features and the prediction, giving a deeper intuition of the underlying dynamics driving business outcomes.
  • quantifying model uncertainty: needed to understand the risk associated with any decision. Traditionally, Deep Learning methods have not represented model uncertainty. (A sketch of the first two outputs follows this list.)
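
As a minimal sketch of the first two outputs, here is how they might be produced with scikit-learn's gradient boosting (standing in for XGBoost) on a synthetic dataset; the data and settings are purely illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

X, y = make_regression(n_samples=2000, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Variable importances: which features drive the prediction.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature {i}: importance {score:.3f}")

# Partial dependence: how the prediction changes as a single feature varies.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()
```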

Any model output that helps the user to develop their intuition of the problem area is more likely to be trusted. Trust with stakeholders is key for a Data Scientist to maximise their impact.

4. When data is noisy or messy

When the data is blighted with noise, missing records or biases, the complexities within the ground truth are obscured. In such conditions a deep neural network is more likely to overfit, learning the noise or artefacts of the data problems rather than the underlying ground truth.

Beyond supervised learning

Deep Learning has established itself as the preferred approach for both perception problems in Supervised Machine Learning and gaming challenges in Reinforcement Learning. In cases where Deep Learning is suboptimal, XGBoost often delivers the best results. However, the hierarchy is unclear outside these areas, so less popular methods are well worth considering. In the case of unsupervised learning, manifold learning approaches are especially promising. These include some deep neural network architectures, such as Variational Autoencoders and Generative Adversarial Networks, but these are certainly not the only options. Spectral clustering, for instance, has achieved compelling results on many problems.
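
As a quick sketch of the kind of result spectral clustering can deliver, consider the classic two-moons toy dataset, where a distance-based method such as k-means splits the clusters incorrectly while a graph-based method recovers them; the dataset and parameters below are purely illustrative.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-circles: not separable by straight-line boundaries.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering builds a nearest-neighbour graph and clusters its
# spectral embedding, so it can follow the curved manifold structure.
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors",
    n_neighbors=10, random_state=0).fit_predict(X)
```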

Some other areas of Machine Learning that often find application in business include:

  • Anomaly Detection: can be used to detect fraudulent transactions or malicious activity on a company network. There is a rich array of approaches, with the best choice being highly context-dependent (a minimal sketch follows this list).
  • Active Learning: given that labelling data is often very costly, these algorithms attempt to be strategic in their requests for labels, prioritising the data points that will yield the most information. This is an important branch of Machine Learning in its own right.
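
A minimal anomaly-detection sketch, assuming (purely for illustration) transaction amount and frequency as the only two features and an Isolation Forest as the detector, one of many possible approaches:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly "normal" transactions, plus a handful of extreme ones.
normal = rng.normal(loc=[50.0, 5.0], scale=[20.0, 2.0], size=(1000, 2))
extreme = rng.normal(loc=[5000.0, 50.0], scale=[500.0, 5.0], size=(10, 2))
X = np.vstack([normal, extreme])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)  # -1 marks a suspected anomaly, +1 marks normal
print(f"flagged {np.sum(flags == -1)} of {len(X)} transactions")
```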

Conclusion

Deep Learning has enjoyed remarkable success over the past decade, generating much of the excitement that Machine Learning currently enjoys. In practice though, the prevalence of Deep Learning in Data Science solutions is much less than its media exposure might imply.

It is important for Data Scientists to understand Deep Learning’s limitations and when other methods may be more appropriate to use. Even when you wish to use Deep Learning, you’ll need to have mastery of a variety of techniques to clean the data and control for overfitting.

Learn more about Data at Atlassian

Stay in touch with the Analytics & Data Science team and learn more about careers at Atlassian here.


Melvin Varughese
Data at Atlassian

Helping to drive growth @ Atlassian. Visiting Academic at University of Western Australia & University of Cape Town. https://www.linkedin.com/in/melvinvarughese