Deep Learning with Tabular Data

Introduction

Deep neural networks (DNNs) have produced remarkable results over the past few years in areas of artificial intelligence such as computer vision, audio, video, and natural language processing.

But their use with tabular data, the type of data most business decisions are based on, has failed to match the predictive capability and explainability of ‘classical’ machine learning models such as decision trees. This article explains some of the challenges DNNs face with tabular data and goes over current innovations for overcoming them.

This article borrows heavily from the wonderful paper “Deep Neural Networks and Tabular Data: A Survey” by Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci.

Challenges

The issues with using DNNs with tabular data can be summarized as follows:

1. Data Quality Issues

· Real-world tabular data often contains missing values, outliers, and inconsistent or erroneous entries.

· Tabular data is often high-dimensional, with relatively small sample sizes.

· Tabular data is often expensive to obtain and hard to come by, and datasets are often class-imbalanced.

2. Lack of Spatial Dependencies

· Unlike images, audio, and video, where neighboring pixels or samples provide spatial context, tabular data has no such built-in relationships.

· Researchers have hypothesized that even where spatial correlations between variables in tabular data exist, they are complex, irregular, and difficult to determine.

3. Difficulty in Preprocessing

· Tabular data is hard to preprocess. One of the main challenges is converting categorical features into numerical representations without creating sparse matrices.

· Another issue to watch out for is inadvertently encoding an order or alignment where none exists.

· Some implementations attempt to resolve this by encoding the categorical features in an embedding space, but the embedding is learned at training time and cannot be computed beforehand.

4. Model Sensitivity

· Unlike classical machine learning models such as decision trees, DNNs are very sensitive to small changes in the input data.

· Tabular data is often highly variable from one sample to the next.

Current State

Some of the methods that current state-of-the-art research and architectures use to address the challenges can be summarized as:

Data Transformation

Data transformation is a set of techniques that convert the data into a format that can be used as input to a neural network. These techniques can be further separated into single-dimensional and multi-dimensional techniques.

Single dimensional

These encodings are deterministic and can be applied before training. They can be as simple as ordinal encoding, binary encoding, leave-one-out encoding, or hash-based encoding.
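
To make these four encodings concrete, here is a minimal sketch using only the Python standard library. The `colors` column, the `target` labels, the bucket count, and the singleton fallback to the global mean are all assumptions made up for illustration; production code would more likely use an encoding library.

```python
import hashlib
from collections import defaultdict

def ordinal_encode(values):
    """Map each category to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def binary_encode(codes, width):
    """Write each ordinal code as a fixed-width list of bits (most significant first)."""
    return [[(c >> bit) & 1 for bit in reversed(range(width))] for c in codes]

def hash_encode(values, n_buckets):
    """Map each category to a bucket with a stable hash (SHA-256 chosen arbitrarily)."""
    return [
        int(hashlib.sha256(v.encode("utf-8")).hexdigest(), 16) % n_buckets
        for v in values
    ]

def leave_one_out_encode(values, y):
    """Replace each category with the mean target of the other rows in that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, y):
        sums[v] += t
        counts[v] += 1
    global_mean = sum(y) / len(y)
    encoded = []
    for v, t in zip(values, y):
        if counts[v] > 1:
            encoded.append((sums[v] - t) / (counts[v] - 1))
        else:
            encoded.append(global_mean)  # assumed fallback for one-row categories
    return encoded

# Hypothetical toy column and binary target.
colors = ["red", "green", "blue", "green", "red", "red"]
target = [1, 0, 1, 1, 0, 1]

codes, mapping = ordinal_encode(colors)
bits = binary_encode(codes, width=2)
buckets = hash_encode(colors, n_buckets=8)
loo = leave_one_out_encode(colors, target)
```

Note that the ordinal codes imply red < green < blue, which is exactly the accidental ordering the preprocessing challenge warns about. Hash encoding avoids storing a mapping, at the cost of possible collisions between categories.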

Multi-dimensional

These encodings use self-supervised or semi-supervised techniques to map the categorical values into a dense embedding space.
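
At its core, a dense embedding is a lookup table of trainable vectors, one per category. The sketch below shows only that mechanism; the `cities` column, the dimension, and the gradient values are hypothetical, and the self-supervised or semi-supervised objective that would produce real gradients is not shown.

```python
import random

def make_embedding_table(categories, dim, seed=0):
    """Create one small random vector per category (the trainable parameters)."""
    rng = random.Random(seed)
    return {c: [rng.gauss(0.0, 0.1) for _ in range(dim)] for c in categories}

def embed(table, values):
    """Look up the dense vector for every categorical value in a column."""
    return [table[v] for v in values]

def sgd_update(table, category, grad, lr=0.1):
    """Apply one gradient step to the vector of a single category."""
    table[category] = [w - lr * g for w, g in zip(table[category], grad)]

# Hypothetical categorical column.
cities = ["paris", "tokyo", "lima", "tokyo"]
table = make_embedding_table(sorted(set(cities)), dim=4)
vectors = embed(table, cities)
```

Because the vectors only become meaningful once `sgd_update` has been driven by a training loss, this kind of encoding cannot be precomputed the way the single-dimensional encodings can.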

Hybrid Models

Hybrid models seek to combine neural networks with traditional machine learning architectures such as decision trees to achieve the best of both worlds. They can be further separated into fully differentiable and partly differentiable models.

Fully differentiable

These models permit end-to-end optimization using gradient descent and are highly efficient on GPUs.
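
One way to see why differentiability matters is to compare a classic decision-tree split with a smooth relaxation of it. The sketch below uses a sigmoid as the relaxation, which is an illustrative assumption and not necessarily the function any particular published model uses. The threshold, temperature, and leaf values are made up.

```python
import math

def hard_split(x, threshold):
    """Classic decision-tree test: a step function, flat almost everywhere."""
    return 1.0 if x > threshold else 0.0

def soft_split(x, threshold, temperature=1.0):
    """Sigmoid relaxation: probability of routing the sample to the right branch."""
    return 1.0 / (1.0 + math.exp(-(x - threshold) / temperature))

def soft_stump(x, threshold, left_value, right_value, temperature=1.0):
    """Blend the two leaf values by the routing probability."""
    p_right = soft_split(x, threshold, temperature)
    return (1.0 - p_right) * left_value + p_right * right_value

def numeric_grad(f, t, eps=1e-5):
    """Central-difference estimate of df/dt."""
    return (f(t + eps) - f(t - eps)) / (2 * eps)

x = 2.0
# Gradient of each output with respect to the split threshold, evaluated at 1.0.
grad_hard = numeric_grad(lambda t: hard_split(x, t), 1.0)
grad_soft = numeric_grad(lambda t: soft_stump(x, t, 0.0, 1.0), 1.0)
```

The hard split gives gradient descent nothing to work with, since its gradient is zero away from the threshold, while the soft version yields a usable signal for tuning the threshold end to end.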

Partly differentiable

These models combine non-differentiable models such as decision trees with deep neural networks, utilizing different models to handle numerical and categorical features.
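
As a rough sketch of the partly differentiable idea, the example below feeds the leaf index of a fixed, hand-written tree into a small logistic model trained by gradient descent. Only the second stage receives gradients. The tree rules, the (age, income) rows, and the labels are hypothetical, and a single logistic layer stands in for a deep network to keep the example short.

```python
import math

def leaf_index(row):
    """Hand-written, non-differentiable tree with three leaves (hypothetical rules)."""
    age, income = row
    if age < 30:
        return 0
    return 1 if income < 50 else 2

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=200):
    """Differentiable stage: logistic regression fitted with plain SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y  # gradient of log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Hypothetical (age, income) rows and binary labels.
rows = [(25, 40), (22, 80), (45, 30), (50, 45), (41, 90), (60, 70)]
labels = [0, 0, 1, 1, 0, 0]

features = [one_hot(leaf_index(r), 3) for r in rows]
weights, bias = train_logistic(features, labels)
preds = [round(predict(weights, bias, f)) for f in features]
```

The tree itself is never updated during training, which is the trade-off of this family: it is simple to assemble from existing parts, but the two stages cannot be optimized jointly.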

Transformers

The Transformer architecture, driven by its success in natural language processing and computer vision, has been the most actively researched topic lately. There have been many advancements in this area, including TabNet, TabTransformer, ARM-net, SAINT, and others. These models utilize multiple subnetworks and self-attention mechanisms to handle categorical features, and incorporate varying techniques such as decision trees, k-nearest neighbors, and feature crosses.
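
The self-attention step can be sketched by treating each feature of a row as a token. The version below is deliberately simplified: it uses identity query, key, and value projections and a single head, whereas real models learn those projections. The embedding numbers are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in tokens
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Each output is the attention-weighted average of all token vectors.
        outputs.append(
            [sum(a * v[j] for a, v in zip(attn, tokens)) for j in range(dim)]
        )
    return outputs, weights

# One table row with three features, each already mapped to a 2-d vector (made-up numbers).
feature_embeddings = [
    [1.0, 0.0],  # e.g. embedding of a categorical feature
    [0.9, 0.1],  # e.g. projection of a numerical feature
    [0.0, 1.0],  # e.g. embedding of another categorical feature
]
contextual, attention_weights = self_attention(feature_embeddings)
```

Each output vector mixes information from every other feature in the row, weighted by similarity, so feature interactions are learned rather than assumed from column position.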

Regularization

The idea behind regularization is that the wild swings in predictions caused by minor variations in tabular input data can be tamed with the proper regularization techniques. One proposal, the Regularization Learning Network, applies trainable regularization coefficients to lower the overall model sensitivity. Another, a sort of “regularization cocktail”, applies multiple regularization techniques together. A 2021 paper combined 13 regularization techniques and claimed better performance than tree-based models using just a typical feed-forward network.
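
To give a feel for what combining regularizers looks like in code, here is a tiny two-ingredient sketch: L2 weight decay plus input dropout, applied together to a linear model trained with SGD. It is not the 13-technique cocktail from the paper, and it is not the Regularization Learning Network. The data and hyperparameters are made up for illustration.

```python
import random

def train_linear(xs, ys, weight_decay=0.0, dropout=0.0, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w.x with SGD, optionally combining L2 weight decay and input dropout."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if dropout > 0.0:
                # Inverted dropout: zero inputs at random and rescale the survivors.
                x = [
                    xi / (1.0 - dropout) if rng.random() >= dropout else 0.0
                    for xi in x
                ]
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Squared-error gradient plus the L2 penalty term.
            w = [wi - lr * (err * xi + weight_decay * wi) for wi, xi in zip(w, x)]
    return w

def l2_norm(w):
    return sum(wi * wi for wi in w) ** 0.5

# Toy data generated by y = 2*x1 - x2 (hypothetical).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2.0, -1.0, 1.0, 3.0]

plain = train_linear(xs, ys)
cocktail = train_linear(xs, ys, weight_decay=0.1, dropout=0.2)
```

On this toy problem the regularized weights end up with a smaller norm than the unregularized ones, which is the mechanism by which regularization dampens the model's reaction to small input changes.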

Data Generation

Beyond classification and regression, another area of research on tabular data with neural networks is data generation. There are many reasons to generate synthetic data. For tabular data specifically, they include:

· Tabular data is difficult and expensive to come by. Training data is usually limited.

· Data augmentation and imputation (filling in missing values)

· Rebalancing imbalanced classes

· Ensuring privacy

So what are some ways we can generate synthetic tabular data?

· Generative Adversarial Networks (GANs), for example MedGAN for domain-specific generation.

· Variational Autoencoders (VAEs). Various VAEs can outperform GANs, but both are considered state of the art.

Now that we have some synthetic data, how can we tell if the quality is good?

· Typically, a proxy classification task is used: a model is trained on the generated tabular data.

· Its predictions are then evaluated on real data to assess the quality of the generated data.

· Another approach is to use statistical methods to compare the generated data against the original data’s distribution.
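
Both evaluation ideas above can be sketched in a few lines. In the example below, a per-class Gaussian sampler stands in for a real GAN or VAE, a nearest-centroid classifier plays the proxy model, and a per-column mean gap serves as a very crude statistical check. All three choices, along with the toy data, are assumptions made for illustration.

```python
import random
import statistics

def nearest_centroid_fit(rows, labels):
    """Proxy model: store the mean vector of each class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [statistics.mean(col) for col in zip(*members)]
    return centroids

def nearest_centroid_predict(centroids, row):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, rows, labels):
    hits = sum(nearest_centroid_predict(centroids, r) == l for r, l in zip(rows, labels))
    return hits / len(rows)

def fit_and_sample(rows, n, rng):
    """Stand-in generator: sample each column from a Gaussian fitted to the real column."""
    params = [(statistics.mean(c), statistics.stdev(c)) for c in zip(*rows)]
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

def column_mean_gap(a_rows, b_rows):
    """Absolute difference of column means between two tables."""
    return [
        abs(statistics.mean(a) - statistics.mean(b))
        for a, b in zip(zip(*a_rows), zip(*b_rows))
    ]

rng = random.Random(0)
# "Real" data: two well-separated classes (made-up numbers).
real_rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
real_rows += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(100)]
real_labels = [0] * 100 + [1] * 100

synthetic_rows = fit_and_sample(real_rows[:100], 100, rng)
synthetic_rows += fit_and_sample(real_rows[100:], 100, rng)
synthetic_labels = [0] * 100 + [1] * 100

# Train on synthetic data, then test on real data.
model = nearest_centroid_fit(synthetic_rows, synthetic_labels)
tstr_accuracy = accuracy(model, real_rows, real_labels)
mean_gaps = column_mean_gap(real_rows, synthetic_rows)
```

A high score on real data suggests the synthetic rows preserved the signal the proxy task needs. Comparing only column means is a weak test, and a fuller check would compare whole distributions and the correlations between columns.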

Explainability

Explainable AI is an important topic, now more than ever. The ability to understand what predictions are based on is hugely important in the real world, especially when it comes to fairness and equity. One important way to achieve this is feature highlighting, which shows which features a prediction relies on. Another is constructing models that are explainable by design. In cases where the model parameters are not available, surrogate models or a benchmarking library can be used.
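
One model-agnostic way to highlight features without access to model parameters is permutation importance: shuffle one column at a time and measure how much the error grows. The article does not name a specific method, so this choice is an assumption, and `black_box` below is a hypothetical stand-in for a trained model.

```python
import random

def black_box(row):
    """Stand-in for a trained model whose parameters we cannot inspect (hypothetical)."""
    return 3.0 * row[0] + 0.0 * row[1] - 1.0 * row[2]

def mean_squared_error(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, seed=0):
    """Increase in error after shuffling each column; larger means more important."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the table with only column j replaced by its shuffled copy.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mean_squared_error(model, shuffled, targets) - baseline)
    return importances

# Made-up data whose targets come from the black box itself.
data_rng = random.Random(1)
rows = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
targets = [black_box(r) for r in rows]

scores = permutation_importance(black_box, rows, targets)
```

Here the first feature should score highest, the third lower, and the second essentially zero, matching the coefficients hidden inside `black_box`. The method only needs predictions, so it works when the model can be queried but not opened up.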

Future Directions

Deep learning with tabular data is an actively researched topic. Tabular data is the most used type of data in business and, as such, has the potential to produce the most impact.

In data processing, the trend is to continue transforming data into homogeneous representations such as embeddings.

In model architectures, transformers have taken the lead and offer multiple advantages, such as the ability of self-attention to cover both categorical and numerical features.

In regularization, recent research has shown great success, so further research will no doubt continue in this field.

Data generation is a very difficult task that has not been solved yet. The possible space is effectively infinite, so more research is needed in this area.

Lastly, explainable AI is the cornerstone of being able to fully utilize AI in business decisions. Neural networks still have a long way to go in this area before they can match the explainability of classical techniques such as decision trees.

Jimmy Liang