Tabular Data and Deep Learning: Where Do We Stand?

Published in Codon Consulting · May 7, 2020

by Mikael Huss

During the 2010s, deep learning revolutionized computer vision and natural language processing, but plain old tabular datasets have proved a tougher nut to crack. Are we on the verge of a breakthrough in tabular data processing using neural networks?

This post introduces a mini-series on tabular data with deep learning. During the past year, I have spent some time benchmarking different machine learning algorithms, including methods based on deep learning, on tabular data. This post sets the stage for three more detailed posts: one on entity embeddings with FastAI, one on TabNet, and one on NODE.

Tabular data

Tabular data are the type of data you might see in a spreadsheet or a CSV file. They are usually arranged in rows (examples, instances) and columns (features, attributes). Many of the datasets that companies want to extract value from are of this type (e.g. sensor readings, clickstreams, purchase histories, customer management databases) rather than images or text.
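
To make the format concrete, here is a toy example in pandas (the column names and values are invented for illustration):

```python
import pandas as pd

# A tiny tabular dataset: each row is an example, each column a feature.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],         # identifier
    "n_purchases": [5, 0, 12],              # numeric feature
    "region": ["north", "south", "north"],  # categorical feature
    "churned": [0, 1, 0],                   # a typical binary target column
})
print(df.head())
```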

Conventional wisdom has it that tabular data problems are still best attacked with non-deep learning techniques, in particular tree ensemble methods like gradient boosting and random forests. However, the past few years have seen an increase in the usage of deep learning (DL) techniques. For instance, the book “Deep Learning with Structured Data” came out as an advance publication in 2019 and is due to be published in full this spring.

(As an aside, I think it is a bit unfortunate that this book calls tabular data “structured data”, which is bound to lead to confusion. For many people, including talented ML folks I’ve worked with, image, sound and text data are “structured” whereas tabular data are “unstructured”. It depends on where you see the structure, I suppose. The people who think images are structured data probably do so because they have local structure (nearby pixel values tend to be highly correlated), so that for example convolution operations can model them well. By contrast, tabular data do not benefit from convolution because there is typically no spatial structure there. On the other hand, the people who think of tabular data as “structured” probably do so because for tabular data, a human has decided what should go into each column, and each column thus has a meaning.)

This TWiML AI episode is a discussion about the book.

Strengths of deep learning

In general, deep learning (neural networks stacked in many layers, sometimes hundreds or thousands of them) is effective because it can learn deep hierarchical representations of data. Language and the visual world both have structure that can be analysed at an atomic level (words, phrases, edges, corners) and at a higher level (sentences, grammar, objects, relationships between objects). Before deep learning started to be effective in the 2010s, language processing and image analysis relied on hand-crafted features that reflected certain properties of the data, but today, models like BERT (for language) and DenseNets (for image analysis) are able to learn very informative representations of the data, removing the need for feature engineering.

In addition, image and language data have local structure that lends itself well to certain types of operations, such as convolutions, which are implemented in all standard neural network libraries.

For tabular data there is, generally speaking, no local or hierarchical structure (although there can be in specific cases). For this reason, many people think that deep learning is irrelevant to tabular data. Instead, past experience seems to indicate that decision tree ensembles (random forests, gradient boosting etc.) are the most reliable methods for tabular data.

Why explore deep learning for tabular data?

So if experience points to decision tree ensembles having the best performance, why not just use those?

Researchers have proposed a few potential benefits of using DL:

  • It might turn out to work better, especially for very large datasets. (The very interesting Applying Deep Learning to Airbnb Search paper mentions that Airbnb uses gradient boosting for small to medium-sized problems and DL for large ones.)
  • Deep learning unlocks the possibility to train systems end-to-end with gradient descent, so that image and text data can be plugged in without changing the whole pipeline.
  • It is easier to use deep learning models in an online mode (with data points arriving one by one, “streaming” rather than all at once), because most tree-based algorithms need global access to data to determine split points.

However, I can see at least one drawback:

  • Deep learning models are often complex and rely on extensive hyperparameter optimization, which is much less of a problem in random forests and gradient boosting — they often perform quite well without any parameter tuning!
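
To illustrate the last point: a tree ensemble with out-of-the-box settings is often a surprisingly strong baseline. A minimal sketch using scikit-learn (the built-in dataset is just a stand-in for any tabular dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A built-in dataset stands in for whatever tabular data you have.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default hyperparameters, no tuning at all.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```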

Let’s now look briefly at some different approaches to tabular data with neural networks: entity embeddings, attention mechanisms, and mimicking decision trees.

Entity embeddings

The idea behind entity embeddings is to learn to encode each value of a categorical variable as a numerical vector, often of low dimensionality. The embeddings are learned during training, as a “side effect” of trying to solve e.g. a classification problem. As this blog post describes, this technique has been used to good effect by companies like Instacart and Pinterest for a couple of years already.

Embedding functionality is already quite standard in deep learning libraries like Keras, PyTorch and FastAI. In this blog post, I go into more detail on how to use embeddings with FastAI (v1 and v2) when analyzing tabular data.
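
To make the idea concrete, here is a minimal PyTorch sketch (the cardinality, embedding size and network shape are arbitrary choices for illustration, not taken from any of the libraries above):

```python
import torch
import torch.nn as nn

# Suppose a categorical column has 10 distinct values, integer-encoded
# as 0..9, and we learn a 4-dimensional vector for each value.
N_CATEGORIES, EMB_DIM = 10, 4

class TabularNet(nn.Module):
    def __init__(self, n_numeric):
        super().__init__()
        self.emb = nn.Embedding(N_CATEGORIES, EMB_DIM)  # trained along with the rest
        self.head = nn.Sequential(
            nn.Linear(EMB_DIM + n_numeric, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, cat_idx, numeric):
        # Look up the embedding vectors and concatenate with the numeric features.
        x = torch.cat([self.emb(cat_idx), numeric], dim=1)
        return self.head(x)

model = TabularNet(n_numeric=3)
cat_idx = torch.tensor([2, 7, 0])     # a batch of 3 category indices
numeric = torch.randn(3, 3)           # the corresponding numeric features
print(model(cat_idx, numeric).shape)  # torch.Size([3, 1])
```

After training, model.emb.weight holds the learned vector for each category value, and those vectors can also be extracted and reused in other models.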

Attention mechanisms

Neural attention mechanisms have become all the rage in the past five years, in particular for language models. There are different kinds of attention mechanisms, and perhaps the most well-known NLP model right now, BERT, uses a version called “self-attention”. To put it simply, attention mechanisms are a way for a neural network to learn which parts of the input it should focus on. They allow the network to pay attention (!) just to the input it needs to worry about at any given moment.
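
As a rough illustration, here is scaled dot-product self-attention in a bare-bones form (no multiple heads, no learned layers around it; the dimensions are arbitrary):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a batch of sequences.

    x: (batch, seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    # Each row of `weights` says how much one position attends to the others.
    weights = F.softmax(scores, dim=-1)
    return weights @ v

d_model, d_k = 8, 8
x = torch.randn(2, 5, d_model)  # 2 sequences of length 5
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([2, 5, 8])
```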

Google Research has published an interesting manuscript called TabNet: Attentive Interpretable Tabular Learning, where they introduce an attention-mechanism based neural network architecture which they claim outperforms the best previous methods on several datasets. They have also released some code for trying TabNet, but the code is currently hard-coded for a specific dataset and takes a bit of work to customize. In this blog post, I go into more depth on TabNet and show you some code that makes it a bit easier to use on an arbitrary dataset. However, the easiest way to use TabNet currently is probably this PyTorch implementation, which has a nice scikit-learn-type interface.
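
At the time of writing, using that PyTorch implementation (the pytorch-tabnet package) looks roughly like this; the data below are random placeholders, and the exact arguments may differ between versions:

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# Random placeholder data: 1000 rows, 10 numeric features, binary target.
X = np.random.randn(1000, 10).astype(np.float32)
y = np.random.randint(0, 2, size=1000)
X_train, y_train = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]

# scikit-learn-style interface: construct, fit, predict.
clf = TabNetClassifier()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print("Test accuracy:", (preds == y_test).mean())
```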

Mimicking decision trees with neural networks

Another interesting idea is to try to generalize decision trees and tree ensembles with neural networks by using “soft” versions of their splits. There are many papers with slightly different approaches to this. One that I found interesting is Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data, which presents an architecture that is a kind of differentiable, hierarchical version of CatBoost (which is one of my favorite methods). The hierarchical part is that you can stack several CatBoost-like modules on top of each other (and use skip connections with them, so you get something like a CatBoost-DenseNet!). The way to make it differentiable is to use the “entmax” transformation, which performs a “soft split” on feature values.
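
To give a feel for the soft split idea, here is a sketch (mine, not the actual NODE code) using the entmax15 function from the entmax package that the NODE authors build on:

```python
import torch
from entmax import entmax15  # pip install entmax

def soft_split(x, threshold, temperature=1.0):
    """Differentiable stand-in for the hard split (x > threshold).

    Returns a value in [0, 1]: the "probability" of going right.
    Unlike a sigmoid, entmax15 saturates to exactly 0 or 1 for inputs
    far from the threshold, mimicking a hard decision tree split.
    """
    logits = torch.stack(
        [(x - threshold) / temperature, torch.zeros_like(x)], dim=-1
    )
    return entmax15(logits, dim=-1)[..., 0]

x = torch.linspace(-3, 3, 7, requires_grad=True)
p_right = soft_split(x, threshold=0.0)
print(p_right)            # ~0 left of the threshold, ~1 right of it
p_right.sum().backward()  # gradients flow through the split
print(x.grad)
```

Because the split is differentiable, the thresholds (and the weighting of which features to split on) can be learned with gradient descent instead of the greedy search used in ordinary tree building.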

Explaining this paper takes some unpacking, so I do that in a separate blog post, which also takes a closer look at how CatBoost is different from other gradient boosting methods.

Hybrid methods

Apart from the “pure” ideas described above (of course, there are no pure ideas!), there are many hybrid techniques that combine elements of DL and traditional ML. One straightforward idea is to learn entity embeddings with a DL model and then use those in a gradient boosting model. Or you could flip that around and use gradient boosting information as input to DL embeddings. Say what? The Airbnb paper mentioned above describes how they used, for each input example, the IDs of the leaf nodes activated in each tree as categorical features to embed (see Figure 3 in that paper).
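
Here is a minimal sketch of that leaf-index trick (the model choices and sizes are mine, not Airbnb's). In scikit-learn, apply() returns, for each example, the index of the leaf it lands in for every tree in the ensemble, and those indices can then be fed to an embedding layer as categorical features:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import GradientBoostingClassifier

# Random placeholder data: 500 rows, 8 features, binary target.
X = np.random.randn(500, 8)
y = np.random.randint(0, 2, size=500)

# Step 1: fit a gradient boosting model.
gbm = GradientBoostingClassifier(n_estimators=20).fit(X, y)

# Step 2: for each example, get the leaf it falls into in each tree.
# For binary classification, apply() returns shape (n_samples, n_trees, 1).
leaves = gbm.apply(X)[:, :, 0].astype(np.int64)  # (500, 20)

# Step 3: treat each tree's leaf index as a categorical feature and embed it
# (one shared table here for brevity; per-tree tables would also work).
n_ids = int(leaves.max()) + 1  # node ids, so a generous upper bound
emb = nn.Embedding(n_ids, 4)
leaf_vectors = emb(torch.from_numpy(leaves))  # (500, 20, 4)
dl_input = leaf_vectors.flatten(start_dim=1)  # (500, 80), ready for a DL model
print(dl_input.shape)
```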

Conclusions?

After having explored the various techniques described above, I am still not convinced that there is currently a deep learning based method that performs better, on average, on tabular data than gradient boosting. NODE and TabNet require careful hyperparameter optimization, while FastAI Tabular is easier to use. The various benchmarks found in the respective papers and GitHub repos, and those I have run myself, still give conflicting results, so I am not prepared to proclaim a winner yet!

The upcoming FastAI book seems to agree (see the Beyond Deep Learning section in this chapter). In other words, we are still waiting for a deep learning-based tabular data modelling method that reliably beats decision tree ensembles.

What do you think? Please comment with any promising methods you may have seen!

About the author, Mikael Huss

Mikael Huss is a senior data scientist and co-founder of Codon Consulting. He holds a PhD in computational neuroscience and an associate professorship in bioinformatics. Mikael works with, and likes to blog about, machine learning and deep learning. Apart from his 15+ years of academic research, he has wide experience of applying machine learning in industries such as retail, manufacturing, and medical imaging. Before joining Codon Consulting, he was a data scientist at Peltarion.

Codon Consulting

We provide consulting services in machine learning and artificial intelligence. Visit codon.se for more info.