Research Card: Learning from Multiple Sources for Data-to-Text and Text-to-Data

How to generate fluent text from structured data and vice versa by leveraging heterogeneous data sources.

Criteo R&D
Criteo Tech Blog
3 min read · May 8, 2023


Paper: https://arxiv.org/abs/2302.11269
Category: Natural Language Processing
Venue: AISTATS 2023


Why did we work on this topic (the problem we want to solve)?

Data-to-text (D2T) and text-to-data (T2D) are dual tasks that convert structured data such as graphs or tables into fluent text, and vice versa. For instance, D2T would turn a list of product specifications into a human-readable description, while T2D would take a product description and extract the specifications. Training machine learning models for these two tasks is both cumbersome and built on unrealistic assumptions. It is cumbersome because, despite the symmetry of the tasks, D2T and T2D are trained as two separate models. It is unrealistic because training requires the data to be aligned (i.e., presented as (product specifications, corresponding product description) pairs) and to come from a single source (e.g., product specifications and descriptions exclusively from Amazon) to guarantee some homogeneity across examples, while in reality most data is non-aligned and comes from multiple sources (Criteo, Amazon, and so on). We get rid of this limiting paradigm with a joint D2T-T2D model that can be trained on non-aligned pairs coming from multiple sources, a scenario much closer to what we see in the real world.
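To make the two directions concrete, here is a tiny, purely illustrative example (the specification keys and the sentence are our own invention, not from the paper's datasets):

```python
# A structured record and the free-text description it corresponds to.
# D2T maps specs -> description; T2D maps description -> specs.

specs = {
    "brand": "Acme",
    "type": "headphones",
    "color": "black",
    "wireless": True,
}

description = "The Acme wireless headphones come in a sleek black finish."

# D2T: generate `description` from `specs`.
# T2D: extract `specs` back from `description`.
```

In an aligned, single-source dataset, every training example is such a (specs, description) pair; the point of the paper is to drop both requirements.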

What did we find? What did we achieve?

We evaluate our method on several D2T datasets. By learning from multiple sources, our unsupervised model achieves results similar to, and sometimes even better than, its supervised single-source counterparts, even when those counterparts are trained under their favorable (but unrealistic) conditions. This demonstrates the value of learning from multiple heterogeneous sources in a unified way.

How did we proceed?

Depending on the source, descriptions can have different styles and product specifications can be represented in different formats, so we designed a model that, when facing multiple sources of data, is able to distinguish the different styles and formats. One challenge, given a product description, is to disentangle the content (what is said) from the style (how it is said). We solved this by designing a variant of a family of models called variational auto-encoders. The other major challenge is training a model on non-aligned data. This is like learning to translate from French to English using a French novel and an English novel that are not translations of each other! We used a technique borrowed from machine translation called back-translation: for instance, given a text in English, translate it into French, then back into English, and check whether the result matches the original.
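The back-translation idea can be sketched in a few lines. This is a toy illustration only: `d2t` and `t2d` below are trivial string-based stand-ins for the two directions of the joint model (the names and the round-trip check are ours, not the paper's actual networks or loss):

```python
# Back-translation (cycle consistency) with non-aligned data:
# map data -> text -> data and check that the input is recovered.

def d2t(specs):
    # Toy data-to-text: render key-value pairs as a sentence.
    return "This product has " + ", ".join(f"{k} {v}" for k, v in specs.items()) + "."

def t2d(text):
    # Toy text-to-data: parse the sentence back into key-value pairs.
    body = text.removeprefix("This product has ").rstrip(".")
    return dict(pair.split(" ", 1) for pair in body.split(", "))

def cycle_errors(specs):
    # Back-translation check: count fields lost in the round trip.
    # In the real model, a differentiable reconstruction loss plays this role.
    reconstructed = t2d(d2t(specs))
    return sum(reconstructed.get(k) != str(v) for k, v in specs.items())

example = {"color": "black", "weight": "250g"}
assert cycle_errors(example) == 0
```

The same check runs in the other direction (text -> data -> text), which is what lets both directions be trained jointly without any aligned pairs.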

What is the originality here?

We are the first to design a single model that tackles both D2T and T2D in the presence of multiple sources.


Criteo R&D

The R&D team building the Commerce Media Platform for the Open Internet.