Deep Learning with Tabular Data

Introduction
Deep Neural Networks (DNNs) have produced incredible results in recent years across artificial intelligence, from computer vision, audio, and video to natural language processing.
But their use with tabular data, the most common type of data that business decisions are based on, has failed to match the predictive capability and explainability of ‘classical’ machine learning models such as decision trees. This article explains some of the challenges facing DNNs on tabular data and goes over current innovations for overcoming those challenges.
This article borrows heavily from the wonderful paper Deep Neural Networks and Tabular Data: A Survey by Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci.
Challenges
The issues with using DNNs with tabular data can be summarized as follows:
1. Data Quality Issues
· Real-world tabular data often contains missing values, outliers, and inconsistent or erroneous entries.
· Tabular data is often high-dimensional with relatively small sample sizes.
· Tabular data is often expensive to obtain and hard to come by, and the datasets are often class-imbalanced.
2. Lack of Spatial Dependencies
· Unlike images, audio, and videos where neighboring pixels and bits provide spatial context, there are no such relationships available in tabular data.
· Research has hypothesized that even where spatial correlations between variables exist in tabular data, they are complex, irregular, and difficult to determine.
3. Difficulty in Preprocessing
· Tabular data is hard to preprocess. One of the main challenges is converting categorical features into numerical representations without creating sparse matrices.
· Another issue to watch out for is inadvertently encoding an alignment or ordering where none exists (a small illustration follows this list).
· Some implementations attempt to resolve this issue by encoding the categorical features in an embedding space, but this is a training-time task that cannot be done beforehand.
4. Model Sensitivity
· Unlike classical machine learning models such as decision trees, DNNs are very sensitive to small changes in the input data.
· Tabular data often varies widely from one sample to the next.
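To make the encoding pitfall from challenge 3 concrete, here is a minimal sketch (on a hypothetical toy column) of how ordinal encoding silently imposes an order on a nominal feature, while one-hot encoding avoids it at the cost of one mostly-zero column per category:

```python
import pandas as pd

# Hypothetical nominal feature: there is no natural order among these colors.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Ordinal encoding imposes an order: blue(0) < green(1) < red(2).
# A model may now "learn" that red is somehow greater than blue.
df["color_ordinal"] = df["color"].astype("category").cat.codes

# One-hot encoding avoids the spurious order, but adds one (mostly zero)
# column per category, which is the sparsity problem mentioned above.
one_hot = pd.get_dummies(df["color"], prefix="color")
print(df.join(one_hot))
```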
Current State
Some of the methods that current state-of-the-art research and architectures use to address these challenges can be summarized as follows:

Data Transformation
Data transformation is a set of techniques that convert the data into a format that can be used as input to a neural network. These techniques can be further separated into single-dimensional and multi-dimensional approaches.
Single dimensional
Deterministic techniques that can be applied before training. These can be as simple as ordinal encoding, binary encoding, leave-one-out encoding, or hash-based encoding.
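As a hedged illustration on a small hypothetical dataset, two of these deterministic encodings might look like this with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy dataset: one categorical feature, one binary target.
df = pd.DataFrame({
    "city": ["NY", "SF", "NY", "LA", "SF", "LA"],
    "y":    [1,    0,    1,    0,    1,    0],
})

# Ordinal encoding: deterministic, can be applied before training.
df["city_ordinal"] = OrdinalEncoder().fit_transform(df[["city"]]).ravel()

# Leave-one-out encoding: each row receives the mean target of its
# category, excluding the row itself to avoid leaking its own label.
grp = df.groupby("city")["y"]
sums, counts = grp.transform("sum"), grp.transform("count")
df["city_loo"] = (sums - df["y"]) / (counts - 1)

print(df)
```

Because these encodings need no gradient information, they can run entirely in a preprocessing step before the network ever sees the data.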
Multi-dimensional
Self- or semi-supervised techniques that encode the categorical values into a dense embedding space.
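A minimal PyTorch sketch of this idea (all sizes hypothetical): category indices are mapped to dense vectors that are trained jointly with the rest of the network, rather than being fixed before training:

```python
import torch
import torch.nn as nn

# A learned embedding maps each category index to a dense vector that is
# optimized jointly with the downstream network.
n_categories, emb_dim = 10, 4                  # hypothetical sizes
embedding = nn.Embedding(n_categories, emb_dim)
head = nn.Sequential(nn.Linear(emb_dim, 8), nn.ReLU(), nn.Linear(8, 1))

cats = torch.tensor([0, 3, 7])                 # a batch of category indices
out = head(embedding(cats))                    # dense vectors feed the MLP
print(out.shape)                               # torch.Size([3, 1])
```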
Hybrid Models
Hybrid models seek to combine neural networks with traditional machine learning architectures such as decision trees to achieve the best of both worlds. These can be further separated into fully differentiable and partially differentiable models.
Fully differentiable
Fully differentiable models permit end-to-end optimization using gradient descent and are highly efficient on GPUs.
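As a sketch of the idea (not any specific paper's architecture), a "soft" decision stump replaces a tree's hard threshold with a sigmoid, so the split itself receives gradients and the whole model trains end to end:

```python
import torch
import torch.nn as nn

class SoftDecisionStump(nn.Module):
    """A differentiable 'split': a sigmoid replaces the hard threshold,
    so the routing probability and both leaf values receive gradients."""
    def __init__(self, n_features):
        super().__init__()
        self.split = nn.Linear(n_features, 1)       # learned split direction
        self.leaves = nn.Parameter(torch.randn(2))  # learned leaf values

    def forward(self, x):
        p = torch.sigmoid(self.split(x)).squeeze(-1)   # P(route right)
        return p * self.leaves[1] + (1 - p) * self.leaves[0]

stump = SoftDecisionStump(n_features=5)
y_hat = stump(torch.randn(8, 5))  # usable inside any gradient-descent loop
```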
Partly differentiable
Partly differentiable models combine non-differentiable models such as decision trees with deep neural networks, often using different models to handle numerical and categorical features.
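One classic recipe in this spirit (a sketch; a linear model stands in for the neural network here) trains a non-differentiable tree ensemble first, then feeds its leaf assignments to a differentiable model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the (non-differentiable) tree ensemble handles the
# raw features; its leaf assignments become inputs to a trainable model.
X, y = np.random.randn(200, 6), np.random.randint(0, 2, 200)

gbdt = GradientBoostingClassifier(n_estimators=20).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]          # (n_samples, n_estimators) leaf ids
leaf_features = OneHotEncoder().fit_transform(leaves)

# Any differentiable model can sit on top; logistic regression for brevity.
clf = LogisticRegression(max_iter=1000).fit(leaf_features, y)
```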
Transformers
The transformer architecture, driven by its success in natural language processing and computer vision, has been the most actively researched direction lately. There have been many advancements in this area, including TabNet, TabTransformer, ARM-net, and SAINT. These models use multiple subnetworks and self-attention mechanisms to handle categorical features, and incorporate varying techniques such as decision trees, k-nearest neighbors, and feature crosses.
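A minimal sketch in the spirit of TabTransformer (hypothetical sizes, not the paper's exact architecture): each categorical column becomes a token, self-attention lets the columns contextualize each other, and numerical features are concatenated before the prediction head:

```python
import torch
import torch.nn as nn

class TinyTabTransformer(nn.Module):
    def __init__(self, cardinalities, n_numeric, d=16):
        super().__init__()
        # One embedding table per categorical column.
        self.embeds = nn.ModuleList(nn.Embedding(c, d) for c in cardinalities)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d * len(cardinalities) + n_numeric, 1)

    def forward(self, x_cat, x_num):
        # Each categorical column becomes one token of dimension d.
        tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)], dim=1)
        ctx = self.encoder(tokens).flatten(1)   # contextualized embeddings
        return self.head(torch.cat([ctx, x_num], dim=1))

model = TinyTabTransformer(cardinalities=[5, 12], n_numeric=3)
out = model(torch.randint(0, 5, (8, 2)), torch.randn(8, 3))  # shape (8, 1)
```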
Regularization
The idea behind regularization is that the wild swings in prediction caused by minor variances in tabular input can be tamed with the proper regularization techniques. One proposal, the Regularization Learning Network, applies trainable regularization coefficients to lower the overall model sensitivity. Another, a sort of “regularization cocktail”, applies multiple regularization techniques together: a 2021 paper combined 13 of them and claimed performance exceeding tree-based models with just a typical feed-forward network.
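As a small illustration of the cocktail idea (three standard regularizers here, far fewer than the paper's 13), a plain feed-forward network can combine dropout, weight decay, and label smoothing in PyTorch:

```python
import torch
import torch.nn as nn

# A plain feed-forward network with a small "cocktail" of regularizers.
net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.3),  # dropout
    nn.Linear(64, 2),
)
opt = torch.optim.AdamW(net.parameters(), weight_decay=1e-2)  # weight decay
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)            # label smoothing

x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))  # stand-in batch
loss = loss_fn(net(x), y)
loss.backward()
opt.step()
```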
Data Generation
Beyond classification and regression, another area of research on tabular data with neural networks is data generation. There are many reasons to generate synthetic data; for tabular data specifically, they include:
· Tabular data is difficult and expensive to come by, so training data is usually limited.
· Data augmentation and imputation (filling in missing values)
· Rebalancing imbalanced classes
· Ensuring privacy
So what are some ways we can generate synthetic tabular data?
· Generative Adversarial Networks, or GANs (a minimal sketch follows this list)
· MedGAN for domain-specific generation
· Variational Autoencoders (VAEs)
· Various VAE variants can outperform GANs, though both approaches are considered state of the art.
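Here is a minimal GAN sketch for continuous tabular features (hypothetical sizes; production tabular GANs such as CTGAN add mode-specific normalization and conditional sampling on top of this skeleton):

```python
import torch
import torch.nn as nn

n_features, z_dim = 8, 16  # hypothetical sizes
G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, n_features))
D = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, n_features)        # stand-in for a real data batch
fake = G(torch.randn(64, z_dim))          # generated rows

# Discriminator step: distinguish real rows from generated rows.
d_loss = (bce(D(real), torch.ones(64, 1))
          + bce(D(fake.detach()), torch.zeros(64, 1)))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: fool the discriminator.
g_loss = bce(D(fake), torch.ones(64, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```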
Now that we can generate synthetic data, how can we tell if its quality is good?
· Typically by using a proxy classification task that is trained on the generated tabular data.
· The prediction is then done on real data to assess the quality of the generated data (see the sketch after this list).
· Another approach relies on statistical methods, comparing the generated data against the original data’s distribution.
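A hedged sketch of the proxy-task idea, with random arrays standing in for the synthetic and real datasets: train a classifier purely on generated rows, then score it on real rows ("train on synthetic, test on real"):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Stand-in arrays; in practice X_synth comes from the generator and
# X_real is held-out real data.
X_synth, y_synth = np.random.randn(500, 8), np.random.randint(0, 2, 500)
X_real,  y_real  = np.random.randn(200, 8), np.random.randint(0, 2, 200)

# If a model trained only on synthetic rows still predicts real rows
# well, the generator has captured useful structure.
proxy = RandomForestClassifier().fit(X_synth, y_synth)
print("accuracy on real data:", accuracy_score(y_real, proxy.predict(X_real)))
```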
Explainability
Explainable AI is an important topic, now more than ever. The ability to understand what predictions are based on is hugely important in the real world, especially when it comes to fairness and equity. One important way this is achieved is feature highlighting; another is constructing models that are explainable by design. In cases where the model parameters are not available, surrogate models or benchmarking libraries can be used instead.
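As one concrete, model-agnostic example of feature highlighting (a sketch on hypothetical data), permutation importance needs only a model's predictions, which makes it usable even when the model's internals are inaccessible:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance

# Hypothetical data and an opaque model; only predictions are needed.
X, y = np.random.randn(300, 5), np.random.randint(0, 2, 300)
model = MLPClassifier(max_iter=500).fit(X, y)

# Shuffle one feature at a time and measure how much the score drops:
# a large drop means the prediction leaned heavily on that feature.
result = permutation_importance(model, X, y, n_repeats=10)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")
```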
Future Directions
Deep learning with tabular data is an actively researched topic. Tabular data is the most used type of data in business and, as such, has the potential to produce the most impact.
In the area of data processing, the trend is to continue transforming inputs into homogeneous representations such as embeddings.
With model architectures, transformers have taken the lead and offer multiple advantages, such as the ability of self-attention to cover both categorical and numerical features.
With regularization, as recent research has shown great success with this technique, there is no doubt that further research will continue in this field.
With data generation, the task is very difficult and remains unsolved; the possible space is effectively infinite, so more research is needed in this area.
Lastly, explainable AI is the cornerstone of being able to fully utilize AI in business decisions. Neural networks still have a long way to go in this area to match the explainability of classical techniques such as decision trees.
