How to Incorporate Tabular Data with HuggingFace Transformers
By Ken Gu
Transformer-based models are a game-changer when it comes to using unstructured text data. As of September 2020, the top-performing models on the General Language Understanding Evaluation (GLUE) benchmark are all BERT transformer-based models. At Georgian, we often find ourselves working with tabular features alongside unstructured text data. We found that by using the tabular data in our models, we could further improve performance, so we set out to build a toolkit that makes it easier for others to do the same.
Building on Top of Transformers
The main benefits of using transformers are that they can learn long-range dependencies between text tokens and can be trained in parallel (as opposed to recurrent sequence-to-sequence models), meaning they can be pretrained on large amounts of data.
Given these advantages, BERT is now a staple model in many real-world applications. Likewise, with libraries such as HuggingFace Transformers, it’s easy to build high-performance transformer models on common NLP problems.
Transformer models using unstructured text data are well understood. In the real world, however, text data is often supported by rich structured data or other unstructured data such as audio or visual information, each of which may provide signals that one modality alone would not. We call these different ways of experiencing data (audio, visual, or text) modalities.
Take e-commerce reviews as an example. In addition to the review text itself, we also have information about the seller, buyer, and product available as numerical and categorical features.
We set out to explore how we could use text and tabular data together to provide stronger signals in our projects. We started by exploring the field known as multimodal learning, which focuses on how to process different modalities in machine learning.
Multimodal Literature Review
Current models for multimodal learning mainly focus on learning from sensory modalities such as audio, visual, and text.
Within multimodal learning, there are several branches of research. The MultiComp Lab at Carnegie Mellon University provides an excellent taxonomy. Our problem falls under what is known as Multimodal Fusion — joining information from two or more modalities to make a prediction.
As text data is our primary modality, our review focused on literature that treats text as the main modality and introduces models that leverage the transformer architecture.
Trivial Solution to Structured Data
Before we dive into the literature, it's worth mentioning a simple solution: treat the structured data as regular text and append it to the standard text inputs. Taking the e-commerce reviews example, the input can be structured as follows: Review. Buyer Info. Seller Info. Numbers/Labels. Etc. One caveat with this approach, however, is that it is limited by the maximum token length that a transformer can handle.
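As a rough illustration of this baseline (the column names, field formatting, and use of the tokenizer's separator token are our own choices for the example, not part of any library API), the tabular fields can simply be stringified and concatenated with the review text:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def row_to_text(row):
    # Stringify the tabular fields and append them to the review,
    # separated by the tokenizer's separator token.
    parts = [
        row["review_text"],
        f"rating: {row['rating']}",
        f"division: {row['division_name']}",
        f"department: {row['department_name']}",
    ]
    return f" {tokenizer.sep_token} ".join(parts)

example = {
    "review_text": "Love this dress! Fits perfectly.",
    "rating": 5,
    "division_name": "General",
    "department_name": "Dresses",
}

encoded = tokenizer(
    row_to_text(example),
    truncation=True,   # inputs longer than the model's maximum token
    max_length=512,    # length get cut off, which is the caveat above
    return_tensors="pt",
)
```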
Transformer on Images and Text
In the last couple of years, transformer extensions for image and text have really advanced. Supervised Multimodal Bitransformers for Classifying Images and Text by Kiela et al. (2019) uses pretrained ResNet features for images and pretrained BERT features for text, and feeds these into a bidirectional transformer. The key innovation is adapting the image features as additional input tokens to the transformer model.
Additionally, there are models, ViLBERT (Lu et al. 2019) and VL-BERT (Su et al. 2020), which define pretraining tasks for images and text. Both models pretrain on the Conceptual Captions dataset, which contains roughly 3.3 million image-caption pairs (web images with captions from alt text). In both cases, for any given image, a pretrained object detection model such as Faster R-CNN obtains vector representations for regions of the image, which serve as input token embeddings to the transformer model.
As an example, ViLBERT pretrains with the following objectives:
- Masked multimodal modeling: Mask input image regions and word tokens. For a masked image region, the model tries to predict a vector capturing its image features; for masked text, it predicts the masked tokens based on the textual and visual clues.
- Multimodal alignment: Predict whether the image and text actually come from the same image-caption pair.
All these models use the bidirectional transformer model that is the backbone of BERT. The differences are the pretraining tasks the models are trained on and slight additions to the transformer. In the case of ViLBERT, the authors also introduce a co-attention transformer layer to define the attention mechanism between the modalities explicitly.
Finally, there’s also LXMERT (Tan and Bansal 2019), another pretrained transformer model that, as of Transformers version 3.1.0, is implemented as part of the library. The input to LXMERT is the same as for ViLBERT and VL-BERT. However, LXMERT pretrains on aggregated datasets that also include visual question answering datasets. In total, LXMERT pretrains on 9.18 million image-text pairs.
Transformers on Aligning Audio, Visual, and Text
Beyond transformers for combining image and text, there are multimodal models for audio, video, and text modalities in which there is a natural ground truth temporal alignment. Papers for this approach include MulT, Multimodal Transformer for Unaligned Multimodal Language Sequences (Tsai et al. 2019), and the Multimodal Adaptation Gate (MAG) from Integrating Multimodal Information in Large Pretrained Transformers (Rahman et al. 2020).
MulT is similar to ViLBERT in that it uses co-attention between pairs of modalities. MAG, meanwhile, injects information from the other modalities at certain transformer layers via a gating mechanism.
Transformers with Text and Knowledge Graph Embeddings
Some works have also identified knowledge graphs as a vital piece of information in addition to text data. Enriching BERT with Knowledge Graph Embeddings for Document Classification (Ostendorff et al. 2019) uses features of the author entities in the Wikidata knowledge graph in addition to metadata features for book category classification. In this case, the model simply concatenates these features with the BERT output text features of the book title and description before some final classification layers.
On the other hand, ERNIE (Zhang et al. 2019) matches the tokens in the input text with entities in the knowledge graph and uses this matching to fuse the two, producing entity-aware text embeddings and text-aware entity embeddings.
Key Takeaway
The main takeaway for adapting transformers to multimodal data is to ensure that there is an attention or weighting mechanism between the different modalities. These attention mechanisms can occur at different points of the transformer architecture: as encoded input embeddings, injected in the middle layers, or combined after the transformer has encoded the text data. A minimal sketch of one such gating-style combination follows.
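This is an illustration only, not the implementation of any of the papers above; the module name and dimensions are made up for the example. It weights projected tabular features by a gate computed from the text representation and adds them to the text features:

```python
import torch
import torch.nn as nn

class GatedTabularCombiner(nn.Module):
    """Illustrative sketch: gate tabular features using the text
    representation, then add them to the text features."""

    def __init__(self, text_dim, tabular_dim):
        super().__init__()
        self.tabular_proj = nn.Linear(tabular_dim, text_dim)
        self.gate = nn.Linear(text_dim + tabular_dim, text_dim)

    def forward(self, text_feats, tabular_feats):
        # text_feats: (batch, text_dim), e.g. the [CLS] output of BERT
        # tabular_feats: (batch, tabular_dim), numerical + encoded categorical
        g = torch.sigmoid(self.gate(torch.cat([text_feats, tabular_feats], dim=-1)))
        return text_feats + g * self.tabular_proj(tabular_feats)

combiner = GatedTabularCombiner(text_dim=768, tabular_dim=10)
combined = combiner(torch.randn(4, 768), torch.randn(4, 10))  # shape (4, 768)
```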
Multimodal Transformers Toolkit
Using what we’ve learned from the literature review and the comprehensive HuggingFace library of state-of-the-art transformers, we’ve developed a toolkit. The multimodal-transformers package extends any HuggingFace transformer for tabular data. To see the code, documentation, and working examples, check out the project repo.
At a high level, the outputs of a transformer model on text data and the tabular features containing categorical and numerical data are combined in a combining module. Since there is no alignment in our data, we choose to combine the tabular features with the text features after the transformer’s output. The combining module implements several methods for integrating the modalities, including attention and gating methods inspired by the literature survey. More details of these methods are available here.
Walkthrough
Let’s work through an example where we classify clothing review recommendations, using a simplified version of the example included in the Colab notebook. We will use the Women’s E-Commerce Clothing Reviews dataset from Kaggle, which contains around 23,000 customer reviews.
In this dataset, we have text data in the Title and Review Text columns. We also have categorical features from the Clothing ID, Division Name, Department Name and Class Name columns, and numerical features from the Rating and Positive Feedback Count columns.
Loading The Dataset
We first load our data into a TorchTabularTextDataset, which works with PyTorch’s data loaders and bundles the text inputs for the HuggingFace transformer with our specified categorical and numerical feature columns. For this, we also need to load our HuggingFace tokenizer.
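A sketch of this step is below. It follows the toolkit's load_data helper as described in its documentation, but the exact argument names, the CSV file name, and the label column are written from memory here and should be checked against the docs and the Kaggle dataset:

```python
import pandas as pd
from transformers import AutoTokenizer
from multimodal_transformers.data import load_data

# Load the Kaggle CSV (the file name may differ depending on your download)
data_df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

text_cols = ["Title", "Review Text"]
categorical_cols = ["Clothing ID", "Division Name", "Department Name", "Class Name"]
numerical_cols = ["Rating", "Positive Feedback Count"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Build a TorchTabularTextDataset that bundles the tokenized text with
# the categorical and numerical features.
torch_dataset = load_data(
    data_df,
    text_cols=text_cols,
    tokenizer=tokenizer,
    label_col="Recommended IND",          # binary recommendation label
    categorical_cols=categorical_cols,
    numerical_cols=numerical_cols,
    sep_text_token_str=tokenizer.sep_token,
)
```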
Loading Transformer with Tabular Model
Now we load our transformer with a tabular model. First, we specify our tabular configuration in a TabularConfig object. This config is then set as the tabular_config member variable of a HuggingFace transformer config object. Here, we also specify how we want to combine the tabular features with the text features; in this example, we will use a weighted sum method.
Once the tabular_config is set, we can load the model using the same API as for any other HuggingFace model. See the documentation for the list of currently supported transformer models that include the tabular combination module.
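A sketch of these two steps is below. The feature-dimension attributes and the combine method string are written from memory and should be verified against the toolkit documentation:

```python
from transformers import AutoConfig
from multimodal_transformers.model import TabularConfig, AutoModelWithTabular

# Tabular configuration; the categorical and numerical feature dimensions
# should come from the dataset built in the previous step (attribute names
# below are assumed, check the toolkit docs).
tabular_config = TabularConfig(
    num_labels=2,  # recommended vs. not recommended
    cat_feat_dim=torch_dataset.cat_feats.shape[1],
    numerical_feat_dim=torch_dataset.numerical_feats.shape[1],
    combine_feat_method="weighted_feature_sum_on_transformer_cat_and_numerical_feats",
)

# Attach the tabular config to the HuggingFace config, then load the model
# with the same from_pretrained API as any other HuggingFace model.
hf_config = AutoConfig.from_pretrained("bert-base-uncased")
hf_config.tabular_config = tabular_config
model = AutoModelWithTabular.from_pretrained("bert-base-uncased", config=hf_config)
```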
Training
For training, we can use HuggingFace’s Trainer class. We also need to specify the training arguments; in this case, we will use the defaults.
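A minimal sketch using the standard HuggingFace Trainer looks like the following. The hyperparameters are illustrative, and in a real run the dataset should be split into training and validation sets rather than trained on in full:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./clothing_review_model",  # where checkpoints and logs go
    num_train_epochs=3,
    per_device_train_batch_size=16,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torch_dataset,  # ideally a train split, not the full dataset
)
trainer.train()
```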
Let’s take a look at our models in training!
Results
Using this toolkit, we also ran our experiments on the Women’s E-Commerce Clothing Reviews dataset for recommendation prediction and the Melbourne Airbnb Open Data dataset for price prediction. The former is a classification task, while the latter is a regression task. Our results are in the table below. The text_only combine method is a baseline that uses only the transformer and is essentially the same as a HuggingFace ForSequenceClassification model.
We can see that incorporating tabular features improves performance over the text_only method. The performance gains depend on how strong the training signals from the tabular data are. For example, in the review recommendation case, the text_only model is already a strong baseline.
Next Steps
We’ve already used the toolkit successfully in our projects. Feel free to try it out on your next machine learning project!
Check out the documentation and the included main script for how to do evaluation and inference. If you want support for your favorite transformer, feel free to add transformer support here.
Appendix
Readers should check out The Illustrated Transformer and The Illustrated BERT for an excellent overview of transformers and BERT.
Below, you’ll find a quick taxonomy of papers we reviewed.
Transformer on Image and Text
- Supervised Multimodal Bitransformers for Classifying Images and Text (Kiela et al. 2019)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (Lu et al. 2019)
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations (Su et al. ICLR 2020)
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Tan and Bansal, EMNLP 2019)
Transformers on Aligning Audio, Visual, and Text
- Multimodal Transformer for Unaligned Multimodal Language Sequences (Tsai et al. ACL 2019)
- Integrating Multimodal Information in Large Pretrained Transformers (Rahman et al. ACL 2020)
Transformers with Knowledge Graph Embeddings
- Enriching BERT with Knowledge Graph Embeddings for Document Classification (Ostendorff et al. 2019)
- ERNIE: Enhanced Language Representation with Informative Entities (Zhang et al. 2019)