WIDeText: A Multimodal Deep Learning Framework

Wayne Zhang
The Airbnb Tech Blog
Dec 8, 2020


How we designed a multimodal deep learning framework for quick product development, and how the Room Type Classification models built upon it helped us better understand the homes on our platform.

By Wayne Zhang, Mia Zhao, Yuanpei Cao

Deep learning (DL) is helping us at Airbnb serve our stakeholders better and enhance belonging. For example, we use it for search ranking models, fraud detection models, issue prediction models in customer support, content understanding on listings, and many other areas. A broad class of DL problems is classification, which uses a set of features to predict labels or categories within a taxonomy, such as predicting room types from listing images.

Real-world classification problems are highly complex because production systems offer many different signals and potentially useful features. When building classification models, data scientists and machine learning engineers aim to improve model performance (e.g., precision and recall) as much as possible by incorporating the rich signals available in the problem domain. Literature across academia and industry has shown this to be helpful.

This post introduces WIDeText, a multimodal deep learning framework built by Airbnb that makes it easy to develop and productionize classification systems, and walks you through an example of using WIDeText to build a state-of-the-art room type classifier.

Overview of WIDeText based model architecture having Text, Wide, Image and Dense channels

Background of Multimodal Classification Tasks

Typically, ML engineers and data scientists start with a simple classification problem: there is a core feature, such as text or an image, and the goal is to train a DL model that encodes it and predicts the best-matching category. For example, given the room images on Airbnb listings, the model predicts the room type of each image.

Single model design for classification task

However, we realized that many other features are also of great value for understanding the room types in the Airbnb listings. For instance, an image caption saying “living space w/ dining area, TV and Electric Fireplace”, which is a text feature, provides strong signals to classify the room as a living room.

As we incorporate more features, the model becomes a lot more complicated. First, more types of features are added, along with the “experts” (encoding or embedding models) in charge of understanding each of them, to boost performance. Second, a more sophisticated “decision maker” (classifier) is needed to summarize all of the voices and make predictions. This is where engineering overhead becomes significant.

More complicated multimodal model design for classification task

In most cases, one has to build an ad-hoc model architecture and feature processing pipeline, write training and evaluation scripts, and deploy the model to a pipeline or endpoint. Meanwhile, considerable effort goes into tracking every detail so that the work can be properly reviewed. These tasks, shown as steps 3 to 7 in the diagram, make up a major portion of the machine learning development process, which is neither time-efficient nor scalable in the long run.

Brief steps for developing and deploying ML models at Airbnb

We propose a unified framework to simplify, expedite, and streamline the development and deployment process for this type of multimodal classification task.

WIDeText: Multimodal Deep Learning Framework

By taking a look at several multimodal classification tasks and the features they used, it’s not hard to identify that the features fall into several common buckets and can be tackled by specific “experts” (model architectures).

Image channel

  • Examples: Listing images, amenities images, etc.
  • MobileNet, ResNet, etc. are experts on this. We cover more details in a later section.

Text channel

  • Examples: Image captions, reviews, descriptions, etc.
  • There are many NLP models such as CNN, LSTM, transformers, etc. that are experts on it.

Dense channel

  • Examples: Categorical features, numerical features, such as amenity types, image quality scores, location information, etc.
  • GBDT is one of the experts.

Wide channel

  • Existing embeddings which are generated by experts somewhere else, and can be directly leveraged by our decision-maker (classifier).

Thus, at Airbnb, we developed an in-house PyTorch-based multimodal deep learning framework, named WIDeText: Wide, Image, Deep, and Text, to make it easy to develop and productionize classification systems.

The core concept here is model fusion. We can leverage state-of-the-art model architectures for different types of features and assemble their embeddings to boost the final classifier’s ability. This makes building deep learning models feel like assembling building blocks: one can easily plug channels in or out and adjust their architecture to suit the objective.
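To make the idea of model fusion concrete, here is a minimal sketch in plain Python. It is not WIDeText code: the function names (`fuse`, `classify`), the toy "experts", and the hand-picked weights are all assumptions made for illustration. Each channel turns its own input into an embedding, the embeddings are concatenated, and a single linear layer with softmax stands in for the classifier.

```python
import math
from typing import Callable, Dict, List, Sequence

Embedding = List[float]


def fuse(channels: Dict[str, Callable[[object], Embedding]],
         inputs: Dict[str, object]) -> Embedding:
    """Run every plugged-in channel on its own input and concatenate the embeddings."""
    fused: Embedding = []
    for name in sorted(channels):  # sorted order keeps the fused layout stable
        fused.extend(channels[name](inputs[name]))
    return fused


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused: Embedding, weights: List[List[float]]) -> List[float]:
    """One linear layer plus softmax, a stand-in for a real MLP classifier."""
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return softmax(scores)


# Hypothetical toy "experts"; real channels would be CNNs, ResNets, and so on.
channels = {
    "text": lambda caption: [float(len(caption.split())),
                             float("tv" in caption.lower())],
    "dense": lambda feats: [feats["n_bed"], feats["n_couch"]],
}
inputs = {
    "text": "Living Space w/ TV",
    "dense": {"n_bed": 0.0, "n_couch": 2.0},
}

fused = fuse(channels, inputs)  # layout: [n_bed, n_couch, n_words, mentions_tv]

# Hand-picked weights (an assumption, not learned): one row per class.
weights = [
    [1.0, 0.0, 0.0, 0.0],  # "bedroom" score is driven by n_bed
    [0.0, 1.0, 0.0, 1.0],  # "living room" score: n_couch plus a TV mention
]
probs = classify(fused, weights)

# Unplugging a channel only requires removing it from the dictionary.
text_only = fuse({"text": channels["text"]}, inputs)
```

Because the classifier only sees the concatenated vector, adding or removing a channel changes the input width of the classifier but leaves the other channels untouched.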

Let’s take a closer look at how WIDeText covers:

  1. Model prototyping and development (configure channels and architecture)
  2. Training and deployment (build pipeline in production)

Model Development: JSON-Based Model Configuration

WIDeText supports JSON-like model configuration: every channel in the framework is pluggable and configurable in terms of its architecture and training hyper-parameters.

Hyper-parameters are required for the WIDeText classifier; those of the other channels are optional and can be set up as needed for different use cases.

The snippet below shows a dummy example of setting up a multimodal classification model.

A dummy example of setting up a multimodal classification model using WIDeText

To help visualize it, the WIDeText-based model that the snippet builds includes:

  • a VGG-based image channel
  • a CNN-based text channel
  • a GBDT-based dense channel
  • a wide channel
  • an MLP-based classifier
  • their training and evaluation hyper-parameters

Visualization of the WIDeText based model having Text, Wide, Image and Dense channels
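As a rough illustration of what a JSON-like configuration with these components could look like, the sketch below parses one with Python's standard `json` module. Every key name and value in it (`channels`, `filter_sizes`, `hidden_units`, `learning_rate`, and so on) is hypothetical and not taken from WIDeText's actual schema; the only rules it encodes are the ones stated above, namely that the classifier section is required and the channels are optional plug-ins.

```python
import json

# Hypothetical configuration; key names and values are illustrative assumptions.
CONFIG_JSON = """
{
  "channels": {
    "image": {"architecture": "vgg", "pretrained": true},
    "text":  {"architecture": "cnn", "filter_sizes": [1, 2, 3], "num_filters": 64},
    "dense": {"architecture": "gbdt"},
    "wide":  {}
  },
  "classifier": {"architecture": "mlp", "hidden_units": [128, 64], "num_classes": 10},
  "training": {"batch_size": 32, "learning_rate": 0.001, "epochs": 5}
}
"""

KNOWN_CHANNELS = {"wide", "image", "dense", "text"}


def load_config(raw: str) -> dict:
    """Parse the configuration and check the few rules this sketch assumes."""
    config = json.loads(raw)
    if "classifier" not in config:
        raise ValueError("a classifier section is required")
    channels = config.get("channels", {})
    unknown = set(channels) - KNOWN_CHANNELS
    if unknown:
        raise ValueError(f"unknown channels: {sorted(unknown)}")
    if not channels:
        raise ValueError("at least one channel must be plugged in")
    return config


config = load_config(CONFIG_JSON)
enabled = sorted(config["channels"])  # names of the channels that are plugged in
```

Dropping a channel from such a configuration would amount to deleting its entry under `channels`, which is the "building blocks" experience described earlier.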

Training and Deployment: Integrated with Airbnb’s Infrastructure

The integration with Airbnb’s machine learning infrastructure makes model development and deployment easy.

For context, Airbnb’s Bighead machine learning infrastructure provides users with a composable, consistent, and versatile interface for creating a self-contained model with minimal “glue” code. The Bighead transformer interface (not to be confused with the well-known Transformer architecture in deep learning) provides a way to define stateful or stateless functions that transform one collection of named feature tensors into another. Each transformer can be configured for your use case, fitted on a data source, and later applied to new data for inference. More importantly, a group of transformers can be connected into a directed acyclic graph (DAG), called an ML pipeline.
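The fit-then-transform pattern described here can be sketched in a few lines of plain Python. This is not the Bighead API: the class names (`Transformer`, `MinMaxScaler`, `Rename`, `Pipeline`) and method signatures are assumptions chosen for illustration, plain lists stand in for tensors, and a linear chain stands in for a general DAG.

```python
from typing import Dict, List

Features = Dict[str, List[float]]  # named features; lists stand in for tensors


class Transformer:
    """Maps one collection of named features to another."""

    def fit(self, data: Features) -> "Transformer":
        return self  # stateless transformers have nothing to learn

    def transform(self, data: Features) -> Features:
        raise NotImplementedError


class MinMaxScaler(Transformer):
    """Stateful: learns the min and max of one named feature during fit."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.lo = 0.0
        self.hi = 1.0

    def fit(self, data: Features) -> "MinMaxScaler":
        values = data[self.name]
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, data: Features) -> Features:
        span = (self.hi - self.lo) or 1.0  # avoid dividing by zero
        out = dict(data)
        out[self.name] = [(v - self.lo) / span for v in data[self.name]]
        return out


class Rename(Transformer):
    """Stateless: renames one feature and leaves its values unchanged."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def transform(self, data: Features) -> Features:
        out = dict(data)
        out[self.new] = out.pop(self.old)
        return out


class Pipeline(Transformer):
    """A linear chain of transformers, the simplest possible DAG."""

    def __init__(self, steps: List[Transformer]) -> None:
        self.steps = steps

    def fit(self, data: Features) -> "Pipeline":
        for step in self.steps:
            data = step.fit(data).transform(data)  # later steps see transformed data
        return self

    def transform(self, data: Features) -> Features:
        for step in self.steps:
            data = step.transform(data)
        return data


train = {"image_width": [100.0, 300.0, 500.0]}
pipeline = Pipeline([MinMaxScaler("image_width"),
                     Rename("image_width", "width_scaled")]).fit(train)
scored = pipeline.transform({"image_width": [200.0]})
```

The state learned during `fit` (here, the min and max of the training data) is reused unchanged at inference time, which is what makes a fitted pipeline self-contained.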

We provide a wrapper that turns any WIDeText-based model into a Bighead transformer, which can then be combined with existing preprocessors and other transformers to build and deploy an end-to-end machine learning pipeline.

For example, in the pipeline shown below, we added several preprocessors for different types of input before feeding them into the WIDeText transformer: JPEGResizeDecoder for image data, and a one-hot encoder, a scaler, and a feature combiner for wide and dense features.

An example of the integrated machine learning pipeline having multiple transformers, such as WIDeText transformer, several preprocessors, etc.

With the Bighead pipeline constructed, one can use its unified APIs to feed in data for training and evaluation, deploy the pipeline to production, and expose it as an online endpoint.

The WIDeText framework has been widely adopted by production teams. Multiple production models have been built and shipped in Airbnb’s products, such as issue prediction in customer support, and experience tagging to better understand experience listings on Airbnb. In the next section, we will describe an example application of WIDeText, Airbnb’s room type classification model.

Application to Room Type Classification

As of June 2020, Airbnb has more than 390M active listing photos. As the saying goes, “a picture is worth a thousand words”. Listing photos are key decision factors when guests make reservations, and classifying photos by room type (e.g., bedroom, kitchen) is an important step in providing the best search experience to guests. For example, the photo gallery for Airbnb Plus listings distinguishes between room categories to provide a better search experience.

At Airbnb, we have been using a room classification model based on convolutional neural networks (CNNs) applied to room images. However, as our platform evolves, we have accumulated a diverse set of features across multiple modalities, coming from the different input sources that describe an Airbnb home photo. For example, Figure 1 shows a home photo as typically presented on Airbnb, together with the image caption written by the host and the listing geo-location (city and country). This multimodality led us to create a joint representation of the image, the text description, and the geo-location category for room classification.

An example of a home photo with its image caption and home location available from Airbnb. The features selected for the room type classifier are: 1) image thumbnail; 2) image caption text displayed on Airbnb: “Living Space w/ TV”; 3) image technical features: image size, image width, image height, and image quality; 4) computer vision features: amenity detection results (e.g., n_couch = 2, n_tv = 1, n_bed = 0); 5) listing geo-location features: country = ‘US’ and region = ‘north america’.

Inspired by these multimodal data sources, we leverage WIDeText to develop and productionize room classification systems with ease.

Architecture of WIDeText-based Room Type Classifier

Channel choice for room type classification

Text Channel

The image caption text uploaded by Airbnb hosts is used in the text channel. Captions in the room type classification data average 4 words in length, which is relatively short. Since a CNN-based text architecture can effectively and efficiently capture local relationships in short phrases, we plug in a CNN as the text channel. The text channel allows transfer learning from word vectors pre-trained on a larger domain-specific corpus. All room-related descriptions serve as the domain corpus, and multilingual word embeddings are pre-trained by first applying the skip-gram model to generate monolingual embeddings and then aligning them in a zero-shot learning fashion.

We apply multiple filter sizes to capture different region sizes, and multiple filters per region size to learn complementary features from the same region. 1-max pooling extracts a scalar from each feature map; optional dropout followed by a fully connected layer can then be used to further shrink the dimension.
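The convolution and pooling step can be illustrated with a small plain-Python sketch. It is not the production text channel: the function names, the two-dimensional toy embeddings, the hand-written kernels, and the choice of ReLU as the activation are all assumptions. Each kernel spans a fixed number of consecutive words (its region size), slides over the caption to produce one feature map, and 1-max pooling keeps only the largest value of that map.

```python
from typing import List

Matrix = List[List[float]]  # one row per word, one column per embedding dimension


def conv1d_feature_map(embeddings: Matrix, kernel: Matrix) -> List[float]:
    """Slide a kernel covering len(kernel) consecutive words over the caption."""
    region = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - region + 1):
        window = embeddings[start:start + region]
        activation = sum(w * x
                         for kernel_row, word in zip(kernel, window)
                         for w, x in zip(kernel_row, word))
        feature_map.append(max(0.0, activation))  # ReLU, assumed for this sketch
    return feature_map


def text_channel(embeddings: Matrix, kernels: List[Matrix]) -> List[float]:
    """1-max pooling: keep the single largest activation of every feature map."""
    # default=0.0 covers captions shorter than a kernel's region size
    return [max(conv1d_feature_map(embeddings, k), default=0.0) for k in kernels]


# A 4-word caption with toy 2-dimensional word vectors.
caption = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

kernels = [
    [[1.0, 1.0]],               # region size 1
    [[1.0, 0.0], [0.0, 1.0]],   # region size 2
    [[-1.0, 0.0], [0.0, 1.0]],  # region size 2, a second filter for the same size
]

pooled = text_channel(caption, kernels)  # one scalar per kernel
```

The pooled vector has one entry per filter regardless of caption length, which is why captions of different lengths can feed the same fully connected layer.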

Dense Channel

Categorical features like listing geo-location (i.e., country and region) are essential signals in room type classification. As a concrete example, “house” as a listing type is widespread in suburbs but far less common in cities, so entrances, as a room type, tend to look very different between the two.

Entrance to Home in suburb (left) vs. city (right)

In the dense channel, these categorical features are one-hot encoded and then passed through fully connected layers whose weights are learned by backpropagation.

Numerical features such as the image size, width, and height and the number of detected amenities (beds, pillows, microwaves, etc.) can also help in predicting the room type. Scaled numerical features are fed into the dense channel and passed through feed-forward layers along with the categorical features.
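A minimal sketch of this preprocessing, in plain Python, is shown below. The helper names, the vocabularies, and the mean and standard deviation values are made up for illustration, and standardization is only one possible choice of scaling; the post does not say which scaler is used. The sketch one-hot encodes the categorical features, scales the numerical ones, and concatenates both into the vector that the fully connected layers would consume.

```python
from typing import Dict, List, Sequence, Tuple


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """1.0 at the position of the value, 0.0 elsewhere; unseen values give all zeros."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def standardize(value: float, mean: float, std: float) -> float:
    """Zero-mean, unit-variance scaling (an assumed choice of scaler)."""
    return (value - mean) / std if std else 0.0


def dense_input(row: Dict[str, object],
                vocabularies: Dict[str, Sequence[str]],
                stats: Dict[str, Tuple[float, float]]) -> List[float]:
    """Concatenate one-hot categorical features and scaled numerical features."""
    vector: List[float] = []
    for name, vocabulary in vocabularies.items():
        vector.extend(one_hot(str(row[name]), vocabulary))
    for name, (mean, std) in stats.items():
        vector.append(standardize(float(row[name]), mean, std))
    return vector


# Hypothetical vocabularies and training-set statistics (mean, std).
vocabularies = {
    "country": ["US", "FR"],
    "region": ["north america", "europe"],
}
stats = {
    "n_couch": (1.0, 1.0),
    "n_bed": (1.0, 0.5),
}

row = {"country": "US", "region": "north america",
       "n_couch": 2, "n_tv": 1, "n_bed": 0}
vector = dense_input(row, vocabularies, stats)
```

Features that are not listed in `vocabularies` or `stats` (here `n_tv`) are simply ignored, so the width of the dense input is fixed by the configuration rather than by the incoming row.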

Image Channel

We evaluated multiple models on the test set and compared their performance with a baseline model built on pre-trained MobileNet that uses only image features. The results show that incorporating non-image features using WIDeText significantly improved overall performance across different room type categories. In the end, we launched a WIDeText model using ResNet 50, based on the trade-off between accuracy and computation time.

Table: performance comparison between the baseline model, trained on a pre-trained MobileNet architecture without non-image features, and the proposed room type classifiers, trained on the WIDeText architecture with pre-trained MobileNet, fine-tuned ResNet 50, and EfficientNet B4 image channels.


In this post, we reviewed how we designed a multimodal deep learning framework for quick product development and demonstrated that the models built upon it greatly improved prediction accuracy in the room classification task.

Here are a few key take-aways from ML practitioners who have been using the WIDeText framework to train the multimodal classifiers:

First, the WIDeText framework helps speed up model development and deployment from weeks to days. This end-to-end training and deployment framework empowers modelers to use as many raw features as possible and makes model debugging easier.

Second, it is common practice to compare architecture choices for each channel separately, and we confirmed that this is effective in our multimodal framework. We started by experimenting with different image channel choices for room type classification without changing the other channels. As a result, we could independently select the best architecture for each channel.

Third, distillation from GBDT to a neural network is recommended for better performance if numerical features play an important role among all features, as it tolerates missing or unscaled input values. If categorical features are essential, choosing embedding-based dense layers yields better performance.


Thanks to Bo Zeng and Peggy Shao for contributing to the WIDeText framework, adopting it in their work, and providing valuable feedback. We would also like to thank the contributors of open source libraries such as PyTorch and the original inventors of MobileNet, ResNet, and EfficientNet. We benefit tremendously from this friendly open source community. Finally, we appreciate Ari Balogh’s support, and thank Joy Zhang, Hao Wang, and Do-kyum Kim for their kind help in proofreading.

Further Reading

Bighead: A Framework-Agnostic, End-to-End Machine Learning Platform goes into the details of the Airbnb Machine Learning Infrastructure. DSAA’2019

Categorizing Listing Photos at Airbnb describes deep learning models applied to Airbnb photo categorization.

We always welcome ideas from our readers. For those interested in contributing to AI/ML work at Airbnb, please check out our open positions.