Convolutional Neural Networks on Tabular Datasets (Part 1)

Martín Villanueva
Published in spikelab
6 min read · Apr 29, 2021


In this series of articles, we will dig into how to use one of the most successful neural network architectures, namely Convolutional Neural Networks (CNN), for solving supervised problems on tabular datasets.

We will use PyTorch Lightning to quickly prototype a novel CNN model for tabular problems, and compare it with a standard fully connected neural network and with gradient boosting machines.

Motivation

A couple of months ago I participated in the Mechanisms of Action (MoA) competition hosted by Kaggle. The objective of this competition was to predict the mechanism of action, i.e. the biological activity of a given molecule, based on the genetic and cellular responses observed across different samples.

The final standings were very tight: teams in the top 10 positions were separated by differences of 0.0001 in the evaluation metric! However, there was a common pattern among these teams: they used CNNs as a key part of their solutions.

In this first article, we will review the CNN model used by the 2nd place solution. When I first read the write-up and understood it, I was amazed by the simplicity and effectiveness of the model architecture, which its author named 1D-CNN.

Their final submission was an ensemble of 1D-CNN and TabNet; however, the 1D-CNN by itself would have obtained 5th position, making it the best performing single model in the competition. To be clear, the top 10 submissions are all ensembles of many models, so a single model ranking in the top 5 is an impressive result.

Convolutional Networks

It is well known that CNNs are the de facto model architecture for solving computer vision problems: all the state-of-the-art algorithms for CV use CNNs as building blocks. The effectiveness of CNNs on tasks involving image/video processing comes from the fact that they take into account the spatial structure of the data, capturing spatially local input patterns.

Convolutional kernels are great feature extractors that exploit two properties of the input images: local connectivity and spatial locality. Local connectivity means that each kernel is connected to only a small region of the input image when performing the convolution. Spatial locality means that the pixels/voxels where the convolutional kernel is applied are highly correlated, and processing them jointly usually makes it possible to extract meaningful feature representations. For example, a single convolutional kernel can learn to extract edges, textures, shapes, gradients, and so on.
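To make this concrete, here is a minimal toy example (my own, not from the competition code) of a hand-crafted edge-detection kernel acting as a local feature extractor:

```python
import torch
import torch.nn.functional as F

# A fixed 3x3 Sobel kernel: one example of the spatially local patterns
# a convolutional layer can learn to detect (here, vertical edges).
sobel = torch.tensor([[-1., 0., 1.],
                      [-2., 0., 2.],
                      [-1., 0., 1.]]).view(1, 1, 3, 3)

# A toy 8x8 "image" with a sharp vertical boundary in the middle.
image = torch.zeros(1, 1, 8, 8)
image[..., 4:] = 1.0

# Local connectivity: each output value only sees a 3x3 window.
edges = F.conv2d(image, sobel)
print(edges.shape)     # torch.Size([1, 1, 6, 6])
print(edges[0, 0, 0])  # tensor([0., 0., 4., 4., 0., 0.])
```

The kernel fires exactly where neighboring pixels change, which is only meaningful because neighboring pixels are correlated in the first place.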

Example of convolution operation on a 2-dimensional input image.

What happens when we try to apply a CNN to a tabular dataset? We could use a 1-dimensional convolutional layer; however, this layer expects spatial locality in the features. In other words, the convolutional kernel assumes that contiguous columns are spatially correlated (the relative position of the columns means something), and this is not true for almost any tabular dataset.

The convolutional kernel expects that contiguous columns are spatially correlated.

Soft-Ordering 1-dimensional CNN

So, here comes the wonderful idea. We can’t feed a tabular dataset straight into a convolutional layer because tabular features are not spatially correlated… but what if we re-order the tabular features so that they are?

And this is what the mysterious user tmp did, as shown in the write-up. As we will see, the idea of ordering here is a little different from a literal re-ordering. The diagram below is a summary of the model architecture.

1d-cnn presented here: https://www.kaggle.com/c/lish-moa/discussion/202256

In his/her own words:

As shown above, feature dimension is increased through a FC layer firstly. The role of this layer includes providing enough pixels for the image by increasing the dimension, and making the generated image meaningful by features sorting.

The network first increases the size of the input from 937 (the original features) to 4096 through a standard fully connected layer. This expanded representation is then reshaped into 256 channels containing signals of size 16 (or images of size 16x1). In simple words, each of these signals corresponds to a group of 16 ordered features, and we have 256 groups with different orderings 💥.
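In code, this expansion and reshaping step can be sketched as follows (variable names are mine, and the real solution also adds batch normalization and dropout around this layer):

```python
import torch
import torch.nn as nn

batch, input_dim = 8, 937        # 937 original features, as in MoA
cha_input, sign_size = 256, 16   # 256 channels x 16-length signals = 4096

# The "soft-ordering" layer: a plain fully connected expansion.
expand = nn.Linear(input_dim, cha_input * sign_size)  # 937 -> 4096

x = torch.randn(batch, input_dim)
x = expand(x)                            # (8, 4096)
x = x.view(batch, cha_input, sign_size)  # (8, 256, 16): 256 learned "signals"
print(x.shape)                           # torch.Size([8, 256, 16])
```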

However, the values we observe in the 16-size signals are not the same ones we observe in the original features, but some kind of non-linear combination of them. This is why I call this network a soft-ordering 1-dimensional CNN.

After the reshaping, features are extracted by several 1-dimensional convolutional layers with a skip-like connection. The extracted features are then flattened and used to predict the targets through a final fully connected layer.
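The skip-like connection can be sketched like this (channel count and kernel sizes here are illustrative, not the exact competition configuration):

```python
import torch
import torch.nn as nn

# Two 1-D convolutions over the reshaped signals, joined by a
# skip-like (additive) connection; padding preserves the signal length.
conv_a = nn.Conv1d(64, 64, kernel_size=3, padding=1)
conv_b = nn.Conv1d(64, 64, kernel_size=3, padding=1)

x = torch.randn(8, 64, 16)     # (batch, channels, signal length)
h = torch.relu(conv_a(x))
h = h + torch.relu(conv_b(h))  # skip-like connection
out = h.flatten(1)             # (8, 1024): flat features for the final FC layer
print(out.shape)
```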

Soft-Ordering 1-dimensional CNN: coding it

Below is a working implementation of this network, written in PyTorch and framed to run with PyTorch Lightning. It is strongly based on the implementation tmp shared after the competition ended: github link

Here is the description of the parameters:

  • input_dim: the number of features at input.
  • output_dim: the number of target values to fit.
  • sign_size: the size of the signals to feed the first convolutional layer.
  • cha_input: number of channels to feed the first convolutional layer.
  • cha_hidden: number of channels computed by the hidden convolutional layers.
  • K: channel increase rate for the first convolutional layer. The first layer will increase channels from cha_input to K*cha_input.
  • dropout_input: dropout rate to apply at the input.
  • dropout_hidden: dropout rate to apply at the hidden convolutional layers.
  • dropout_output: dropout rate to apply in the last fully connected layer.
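Putting the pieces together, here is a simplified, self-contained sketch of the architecture using the parameters above. This is my own condensed version: the original implementation also uses batch normalization, weight normalization, and pooling layers, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

class SoftOrdering1DCNN(nn.Module):
    """Simplified sketch of the soft-ordering 1-D CNN."""

    def __init__(self, input_dim, output_dim, sign_size=16, cha_input=256,
                 cha_hidden=64, K=2, dropout_input=0.2, dropout_hidden=0.2,
                 dropout_output=0.2):
        super().__init__()
        self.sign_size = sign_size
        self.cha_input = cha_input

        # Soft-ordering layer: expand the features so they can be
        # reshaped into cha_input signals of length sign_size.
        self.expand = nn.Sequential(
            nn.Dropout(dropout_input),
            nn.Linear(input_dim, sign_size * cha_input),
            nn.CELU(),
        )

        # Convolutional feature extractor: first increase channels by K,
        # then reduce to the hidden channel count.
        self.conv1 = nn.Sequential(
            nn.Conv1d(cha_input, cha_input * K, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Dropout(dropout_hidden),
            nn.Conv1d(cha_input * K, cha_hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.conv2 = nn.Sequential(
            nn.Dropout(dropout_hidden),
            nn.Conv1d(cha_hidden, cha_hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )

        # Flatten, then predict the targets with a fully connected layer.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout_output),
            nn.Linear(cha_hidden * sign_size, output_dim),
        )

    def forward(self, x):
        x = self.expand(x)
        x = x.view(x.size(0), self.cha_input, self.sign_size)
        h = self.conv1(x)
        h = h + self.conv2(h)  # skip-like connection
        return self.head(h)

# Quick shape check with the MoA dimensions (937 features, 206 targets).
model = SoftOrdering1DCNN(input_dim=937, output_dim=206)
out = model(torch.randn(4, 937))
print(out.shape)  # torch.Size([4, 206])
```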

Hands-on

Let’s get our hands dirty and see the potential of this novel neural network. For this experiment, we use the dataset from the Santander Customer Transaction Prediction Kaggle competition. This is a binary classification problem: predicting whether a customer will make a transaction in the future, given an anonymized set of features describing customer behavior.

For the benchmark we test 3 models:

  • MLP: a standard 3-layer fully connected neural network.
  • SoftOrdering1DCNN: our novel method.
  • LightGBM: one of the best-performing models in the competition.

These are the AUC scores on the validation and the hold-out test datasets:

SoftOrdering1DCNN does better than the fully connected MLP, and is also very close in performance to LightGBM. With further parameter tuning, we would expect even better results!

You can check the details of this benchmark on this Kaggle notebook.

Conclusions

The efficient feature-extraction capacity of convolutional neural networks makes them the undisputed winners of almost any computer vision problem. In this article we have reviewed how we can take advantage of CNNs on tabular datasets through the novel Soft-Ordering 1-dimensional CNN.

The idea of introducing order before the convolutional layers is simple and effective. I think there is a lot of room for improvement if we refine the ordering layer. One idea that comes to mind is to use a SparseMax activation in the first fully connected layer, which would allow us to obtain some kind of hard ordering.
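For illustration, here is a small stand-alone implementation of the sparsemax activation (Martins & Astudillo, 2016); the code is my own sketch of the published algorithm. Unlike softmax, sparsemax projects onto the probability simplex and produces exact zeros, which is what would push the ordering layer toward a harder selection of features:

```python
import torch

def sparsemax(z, dim=-1):
    """Sparsemax: Euclidean projection of z onto the probability simplex."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.size(dim) + 1, dtype=z.dtype, device=z.device)
    z_cumsum = z_sorted.cumsum(dim) - 1
    support = (k * z_sorted) > z_cumsum        # entries that stay nonzero
    k_z = support.sum(dim=dim, keepdim=True)   # size of the support
    tau = z_cumsum.gather(dim, k_z - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

p = sparsemax(torch.tensor([2.0, 1.0, -1.0]))
print(p)  # tensor([1., 0., 0.]) -- exact zeros, unlike softmax
```

Applied to the first fully connected layer, such an activation would make each generated signal depend on only a sparse subset of the original features.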

In the next part of this series we will review another amazing convolutional neural network architecture: DeepInsight. This is the one that obtained 1st place in MoA, and it blew the minds of many participants. Stay tuned!
