Fine-tuning BERT for a regression task: is a description enough to predict a property’s list price?

Anthony Galtier
ILB Labs publications
13 min read · Sep 13, 2021

The purpose of this article is to provide a practical example of fine-tuning BERT for a regression task. In our case, we will be predicting prices for real-estate listings in France.

In a previous post, we constructed a supervised regression model based on an LGBM to predict the list price from a tabular set of numerical and categorical variables. Today, we will compare BERT’s performance at predicting the same list prices based solely on the textual description of the properties.

The data set

The data set we are using today is the same as in the previous article. It includes data from 788K listings scraped from French real-estate websites. Alongside the numerical and categorical features used in the benchmark LGBM model, we collected a textual description of the property for sale.

These descriptions tend to repeat at least part of the information contained in the other features, but in the form of natural language. The information is often presented in a subjective manner, putting forward the most attractive features of the property. These descriptions can also provide details that are specific to the property and that do not fit into any set of common features, such as “renovated by an architect”.

All in all, we are led to believe that the textual feature alone should be enough to achieve good price prediction performance. But how do we process it?

Luckily for us, Natural Language Processing is a very active field of research. The Transformer architecture and attention mechanisms were first presented in the 2017 paper “Attention Is All You Need”, and in 2018 Google presented a breakthrough transformer-based language model: BERT.

This post will not detail the inner workings of Transformers, attention mechanisms and BERT, but if you want to learn more about them, here are a few references that will help you understand the key concepts:

Blog posts, articles and tutorials often explain how to use BERT for supervised classification problems such as document classification. In this article, we will see how to use it for a regression task.

BERT and CamemBERT

As you have probably noticed by now, the textual data we collected is in French… Luckily, researchers at Facebook and Inria teamed up to produce CamemBERT. No, not the cheese, but a “state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French sub-corpus of OSCAR”, as stated on their website.

The transformers library

For our use-case, we will use the transformers library implementation of CamemBERT. The transformers library, developed by HuggingFace, seems to have become the go-to library for pre-trained transformer architecture models like BERT in PyTorch.

HuggingFace manages to keep the same, or at least a very similar, interface from one model to another in the transformers library. This article’s code can thus easily be adapted to fit your needs if you are working on text in another language or just want to try another model out.
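For instance, switching to another pre-trained model is mostly a matter of changing the checkpoint name. Here is a minimal sketch using the library’s generic Auto classes (the alternative checkpoint names in the comment are only examples, not models we tested):

from transformers import AutoTokenizer, AutoModel

# 'camembert-base' could be swapped for another hub checkpoint,
# e.g. 'bert-base-uncased' for English text or 'roberta-base'.
checkpoint = 'camembert-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)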

Pre-processing data for BERT

We will be using the same training and validation data as in our previous post, which also explains how the data is pre-processed. Let’s import and pre-process our data sets again, and this time focus on the “description” variable.

import pandas as pd
from preprocessing import preprocessing_pipeline
train_path = "train_data.csv"
val_path = "test_data.csv"
train_data = pd.read_csv(train_path, sep=',', index_col=0)
val_data = pd.read_csv(val_path, sep=',', index_col=0)
train_data = preprocessing_pipeline.fit_transform(train_data)
val_data = preprocessing_pipeline.transform(val_data)
df = train_data[['id_annonce', 'description', 'prix']]

The descriptions average a length of 154 words and 928 characters. Ignoring stop words, the most common words are clearly in the lexical field of real estate: house, bedroom, kitchen, land…

Description length distribution based on the whole training set (left) and most common word lemmas in a random sample of 20K descriptions excluding stop words (right)

Common embedding approaches in NLP such as Word2Vec or FastText are usually preceded by lemmatization and stop-word removal, but this is not the case for BERT. The general structure of the text should actually not be modified, since BERT relies on it to learn and interpret context.

Our pre-processing will thus be limited to:

  • Converting text to lower-case
  • Standardising representations of the same entity, such as “€”, “euro” and “euros”, or “m2” and “m²”:
import re

def treat_euro(text):
    text = re.sub(r'(euro[^s])|(euros)|(€)', ' euros', text)
    return text

def treat_m2(text):
    text = re.sub(r'(m2)|(m²)', ' m²', text)
    return text
  • Cleaning out certain patterns that are unlikely to be meaningful, such as URLs, phone numbers, emails and bank account references:
def filter_ibans(text):
    pattern = r'fr\d{2}[ ]\d{4}[ ]\d{4}[ ]\d{4}[ ]\d{4}[ ]\d{2}|fr\d{20}|fr[ ]\d{2}[ ]\d{3}[ ]\d{3}[ ]\d{3}[ ]\d{5}'
    text = re.sub(pattern, '', text)
    return text

def remove_space_between_numbers(text):
    text = re.sub(r'(\d)\s+(\d)', r'\1\2', text)
    return text

def filter_emails(text):
    pattern = r'(?:(?!.*?[.]{2})[a-zA-Z0-9](?:[a-zA-Z0-9.+!%-]{1,64}|)|\"[a-zA-Z0-9.+!% -]{1,64}\")@[a-zA-Z0-9][a-zA-Z0-9.-]+(.[a-z]{2,}|.[0-9]{1,})'
    text = re.sub(pattern, '', text)
    return text

def filter_ref(text):
    pattern = r'(\(*)(ref|réf)(\.|[ ])\d+(\)*)'
    text = re.sub(pattern, '', text)
    return text

def filter_websites(text):
    pattern = r'(http\:\/\/|https\:\/\/)?([a-z0-9][a-z0-9\-]*\.)+[a-z][a-z\-]*'
    text = re.sub(pattern, '', text)
    return text

def filter_phone_numbers(text):
    pattern = r'(?:(?:\+|00)33[\s.-]{0,3}(?:\(0\)[\s.-]{0,3})?|0)[1-9](?:(?:[\s.-]?\d{2}){4}|\d{2}(?:[\s.-]?\d{3}){2})|(\d{2}[ ]\d{2}[ ]\d{3}[ ]\d{3})'
    text = re.sub(pattern, '', text)
    return text

Surely more text cleaning could be done, but let’s stick with this for now.

def clean_text(text):
    text = text.lower()
    text = text.replace(u'\xa0', u' ')
    text = treat_m2(text)
    text = treat_euro(text)
    text = filter_phone_numbers(text)
    text = filter_emails(text)
    text = filter_ibans(text)
    text = filter_ref(text)
    text = filter_websites(text)
    text = remove_space_between_numbers(text)
    return text

df['cleaned_description'] = df.description.apply(clean_text)

Tokenization

BERT takes sequences of equal length as input. Input sequences must thus be padded or truncated to a given length, with special tokens explicitly indicating the actual start and end of the sequence. The special start token is referred to as the “[CLS]” token. We’ll talk about it again later.

These input sequences are then split into two:

  • A sequence of “input ids”, mapping the words to tokens from the vocabulary the model was pre-trained with,
  • A binary sequence of “attention masks” that indicates whether the “input id” at a given index is an actual token or padding.

The transformers library provides a tokenizer that does it all for us:

from transformers import CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained('camembert-base')

encoded_corpus = tokenizer(text=df.cleaned_description.tolist(),
                           add_special_tokens=True,
                           padding='max_length',
                           truncation='longest_first',
                           max_length=300,
                           return_attention_mask=True)
input_ids = encoded_corpus['input_ids']
attention_mask = encoded_corpus['attention_mask']
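As a quick sanity check (not part of the training pipeline), we can convert one encoded description back into tokens. Note that CamemBERT, following RoBERTa, uses “<s>” and “</s>” as its start and end tokens, which play the role of BERT’s “[CLS]” and “[SEP]”:

# Inspect the first encoded description
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
print(tokens[:10])             # begins with '<s>', CamemBERT's equivalent of "[CLS]"
print(sum(attention_mask[0]))  # number of non-padding positions in the sequence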

The maximum input sequence length supported by BERT is 512 tokens, but shorter input sequences tend to improve and speed up training. We’ll take a maximum input sequence length of 300. To avoid any description being truncated, and because it concerns only a small proportion of the observations, we chose to put aside the descriptions that would require truncation at this maximum length.

import numpy as np

def filter_long_descriptions(tokenizer, descriptions, max_len):
    indices = []
    lengths = tokenizer(descriptions, padding=False,
                        truncation=False, return_length=True)['length']
    for i in range(len(descriptions)):
        if lengths[i] <= max_len - 2:
            indices.append(i)
    return indices

short_descriptions = filter_long_descriptions(tokenizer,
                                              df.cleaned_description.tolist(),
                                              300)
input_ids = np.array(input_ids)[short_descriptions]
attention_mask = np.array(attention_mask)[short_descriptions]
labels = df.prix.to_numpy()[short_descriptions]
Tokenization of an input sequence for BERT

Input formatting

Our data is already split into a training and a validation set. The validation set will be used exclusively to evaluate the performance of the final trained model. The training set is the same 300K observation set we used in the previous post. To monitor the performance of the model during training, another 10% of the training data is put aside into a separate test set.

It is important to use a separate test and validation set since the performance metrics observed during training could influence certain decisions regarding the training process such as the number of epochs and thus bias the final evaluation of the model.

from sklearn.model_selection import train_test_split

test_size = 0.1
seed = 42
train_inputs, test_inputs, train_labels, test_labels = \
    train_test_split(input_ids, labels, test_size=test_size,
                     random_state=seed)
train_masks, test_masks, _, _ = train_test_split(attention_mask, labels,
                                                 test_size=test_size,
                                                 random_state=seed)

In a deep learning regression task, scaling the target variable can arguably help stabilise the training process:

from sklearn.preprocessing import StandardScaler

price_scaler = StandardScaler()
price_scaler.fit(train_labels.reshape(-1, 1))
train_labels = price_scaler.transform(train_labels.reshape(-1, 1))
test_labels = price_scaler.transform(test_labels.reshape(-1, 1))

Let’s now convert the training and test sets to PyTorch-friendly input formats.

For each set, the input ids, masks and labels are each converted to tensors and then put together into a TensorDataset. The TensorDataset is then packed into a DataLoader object, an iterator that will present the inputs to our model in batches of 32 observations during training.

import torch
from torch.utils.data import TensorDataset, DataLoader

batch_size = 32

def create_dataloaders(inputs, masks, labels, batch_size):
    input_tensor = torch.tensor(inputs)
    mask_tensor = torch.tensor(masks)
    labels_tensor = torch.tensor(labels)
    dataset = TensorDataset(input_tensor, mask_tensor, labels_tensor)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return dataloader

train_dataloader = create_dataloaders(train_inputs, train_masks,
                                      train_labels, batch_size)
test_dataloader = create_dataloaders(test_inputs, test_masks,
                                     test_labels, batch_size)

Model architecture

Now for the actual architecture of our model. BERT, in its base version, is composed of an embedding layer followed by 12 transformer encoder layers stacked one after the other.

For each input sequence, BERT outputs a sequence of vectors of the same length: the final hidden states representing each input token. Each of these vectors is made up of 768 floats.

The BERT paper indicates that for classification tasks, only the final hidden state of the first token of the output sequence should be used. In other words, only the vector representing the “[CLS]” token mentioned earlier should be used.

For our regression task, we will do the same thing, but rather than adding a classification head on top of it, we will add a dense linear layer with dropout that will serve as our final regression layer.

Model architecture

Implementing the model in PyTorch

The PyTorch code to implement this model is actually quite straightforward. Our CamembertRegressor is a PyTorch nn.Module with two new attributes:

  • an instance of the pretrained CamembertModel from the transformers library,
  • a single-layer regression head taking a 768-dimensional input and producing a single output value.

The forward method passes the tokenized input through the CamembertModel and collects the 768-dimensional vector corresponding to the “[CLS]” output token. It then passes that vector through the regression layer, which outputs the predicted value.

import torch.nn as nn
from transformers import CamembertModel

class CamembertRegressor(nn.Module):

    def __init__(self, drop_rate=0.2, freeze_camembert=False):
        super(CamembertRegressor, self).__init__()
        D_in, D_out = 768, 1
        self.camembert = CamembertModel.from_pretrained('camembert-base')
        self.regressor = nn.Sequential(
            nn.Dropout(drop_rate),
            nn.Linear(D_in, D_out))

    def forward(self, input_ids, attention_masks):
        outputs = self.camembert(input_ids, attention_masks)
        class_label_output = outputs[1]
        outputs = self.regressor(class_label_output)
        return outputs

model = CamembertRegressor(drop_rate=0.2)

Setting up the training environment

If a GPU is available, it should be used to accelerate the training process. The following code lets PyTorch use a GPU if one is available; otherwise, the model will be trained on the CPU.

import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU.")
else:
    print("No GPU available, using the CPU instead.")
    device = torch.device("cpu")

model.to(device)

Optimizer, scheduler and loss function

We will define the optimizer and the learning rate scheduler for our training process. We will use the AdamW optimizer with a 5e-5 learning rate, one of the values recommended for fine-tuning in the official BERT paper.

from transformers import AdamW

optimizer = AdamW(model.parameters(),
                  lr=5e-5,
                  eps=1e-8)

To define our scheduler, we must calculate the total number of training steps, which is simply the number of training batches multiplied by the number of epochs. We will fine-tune our model for 5 epochs.

from transformers import get_linear_schedule_with_warmup

epochs = 5
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

Our loss function will be the Mean Squared Error loss, the most common loss function for regression problems.

loss_function = nn.MSELoss()

Training loop

Training our model consists of repeating the following actions for each batch of input data at each epoch:

  • Unpack the input ids, attention masks and corresponding target prices,
  • Load these onto the GPU or CPU device,
  • Reset the gradients of the previous training step,
  • Compute the prediction (forward pass),
  • Compute the gradients (backpropagation),
  • Clip gradients to prevent exploding gradient issues,
  • Update the model parameters,
  • Adjust the learning rate.

The following code on its own is thus enough to train our model:

from torch.nn.utils import clip_grad_norm_

def train(model, optimizer, scheduler, loss_function, epochs,
          train_dataloader, device, clip_value=2):
    for epoch in range(epochs):
        print(epoch)
        print("-----")
        model.train()
        for step, batch in enumerate(train_dataloader):
            batch_inputs, batch_masks, batch_labels = \
                tuple(b.to(device) for b in batch)
            model.zero_grad()
            outputs = model(batch_inputs, batch_masks)
            loss = loss_function(outputs.squeeze(),
                                 batch_labels.squeeze())
            loss.backward()
            clip_grad_norm_(model.parameters(), clip_value)
            optimizer.step()
            scheduler.step()
    return model

model = train(model, optimizer, scheduler, loss_function, epochs,
              train_dataloader, device, clip_value=2)

However, it is important to regularly compute, store and log the training loss to monitor the learning process. For the sake of brevity, we did not include such code in the post, but you can check out Chris McCormick’s tutorial for inspiration. We computed the MSE loss as well as the R2 score for every 20 batches of training data.
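As a rough illustration, such logging can be done with a small helper called inside the batch loop of the train function above. The helper below is our own sketch (the 20-batch window matches the figure, the rest is an arbitrary choice), and it reuses the r2_score helper defined further down:

def log_every_n_batches(step, loss_value, r2_value, history, n=20):
    """Store batch metrics and print their rolling average every n batches."""
    history.setdefault("loss", []).append(loss_value)
    history.setdefault("r2", []).append(r2_value)
    if (step + 1) % n == 0:
        mean_loss = sum(history["loss"][-n:]) / n
        mean_r2 = sum(history["r2"][-n:]) / n
        print(f"step {step + 1}: MSE = {mean_loss:.4f}, R2 = {mean_r2:.4f}")

Inside the batch loop, after the backward pass, one would call log_every_n_batches(step, loss.item(), r2_score(outputs.squeeze(), batch_labels.squeeze()).item(), history), with history = {} initialised before the loop.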

MSE loss and R2 score per training step

It is also essential to measure the loss on the separate test set periodically during training to ensure the model is not overfitting. We computed the MSE loss and the R2 score on the separate test set at the end of each epoch by calling the following function.

def evaluate(model, loss_function, test_dataloader, device):
    model.eval()
    test_loss, test_r2 = [], []
    for batch in test_dataloader:
        batch_inputs, batch_masks, batch_labels = \
            tuple(b.to(device) for b in batch)
        with torch.no_grad():
            outputs = model(batch_inputs, batch_masks)
            loss = loss_function(outputs, batch_labels)
        test_loss.append(loss.item())
        r2 = r2_score(outputs, batch_labels)
        test_r2.append(r2.item())
    return test_loss, test_r2

def r2_score(outputs, labels):
    labels_mean = torch.mean(labels)
    ss_tot = torch.sum((labels - labels_mean) ** 2)
    ss_res = torch.sum((labels - outputs) ** 2)
    r2 = 1 - ss_res / ss_tot
    return r2
MSE and R2 score on training versus validation sets at each epoch

It also comes in handy to measure and log the duration of the training steps. We trained our model with 150K training observations on a GPU-accelerated Google Colab notebook; it took about an hour and a half per epoch.
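For example, the whole training call, or each individual epoch, can simply be wrapped with time measurements. The helper below is an illustrative sketch, not code from our actual notebook:

import time

def timed(function, *args, **kwargs):
    """Run a function (e.g. the train function above) and report its duration."""
    start = time.time()
    result = function(*args, **kwargs)
    print(f"Done in {(time.time() - start) / 60:.1f} minutes")
    return result

# model = timed(train, model, optimizer, scheduler, loss_function,
#               epochs, train_dataloader, device)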

Performance

Now that our model is trained, let’s implement a function to collect some predictions. We will collect our model’s predictions on the validation set and use these to compute the same performance metrics as for our benchmark model.

def predict(model, dataloader, device):
    model.eval()
    output = []
    for batch in dataloader:
        batch_inputs, batch_masks, _ = \
            tuple(b.to(device) for b in batch)
        with torch.no_grad():
            output += model(batch_inputs,
                            batch_masks).view(1, -1).tolist()[0]
    return output

We will use the validation set we put aside at the very start to evaluate the final performance of our model. The descriptions of the validation set have to go through the same pre-processing steps as the training data before we can make predictions.

val_set = val_data[['id_annonce', 'description', 'prix']]
val_set['cleaned_description'] = val_set.description.apply(clean_text)
encoded_val_corpus = tokenizer(text=val_set.cleaned_description.tolist(),
                               add_special_tokens=True,
                               padding='max_length',
                               truncation='longest_first',
                               max_length=300,
                               return_attention_mask=True)
val_input_ids = np.array(encoded_val_corpus['input_ids'])
val_attention_mask = np.array(encoded_val_corpus['attention_mask'])
val_labels = val_set.prix.to_numpy()
val_labels = price_scaler.transform(val_labels.reshape(-1, 1))
val_dataloader = create_dataloaders(val_input_ids, val_attention_mask,
                                    val_labels, batch_size)
y_pred_scaled = predict(model, val_dataloader, device)

Our CamemBERT’s outputs are scaled, so we will use the same price scaler we fitted in the pre-processing step to convert them back to euros.

y_test = val_set.prix.to_numpy()
# the scaler expects a 2D array; flatten back to 1D for the metrics below
y_pred = price_scaler.inverse_transform(
    np.array(y_pred_scaled).reshape(-1, 1)).flatten()

One of the downsides of BERT is that it is relatively slow at prediction time. Nevertheless, it manages to challenge the performance of our benchmark model. It actually outperforms it in terms of mean squared error, comes very close in terms of mean and median absolute error, and reaches a similar R2 score. In terms of absolute percentage error, it is also very close.

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import r2_score

mae = mean_absolute_error(y_test, y_pred)
mdae = median_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
mdape = ((pd.Series(y_test) - pd.Series(y_pred))
         / pd.Series(y_test)).abs().median()
r_squared = r2_score(y_test, y_pred)
Fine-tuned BERT price performance metrics

Recommended next steps: combining numerical, categorical and textual features

An interesting next step would be to construct a model that exploits both the tabular set of features used in the previous post and the textual description used in this one. With more information at hand, we would expect such a model to achieve better performance. That said, a good deal of the information is redundant. The challenge thus lies in extracting information from the features and text that is actually complementary and pertinent for the prediction task.

One approach would be to use text embedding techniques to extract meaningful quantitative features from the text. The embedding features can then be appended to the other numerical and categorical features and used with any supervised regression model.
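As a minimal sketch of this first approach, one could reuse the pre-trained (or fine-tuned) CamembertModel to produce one fixed-size embedding per description, here the pooled “[CLS]” output, and concatenate it with the tabular features. The tabular_features variable below is illustrative, standing in for the encoded numerical and categorical features of the previous post:

import numpy as np
import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
encoder = CamembertModel.from_pretrained('camembert-base')
encoder.eval()

def embed_descriptions(descriptions, batch_size=32, max_length=300):
    """Return one 768-dimensional pooled CamemBERT vector per description."""
    embeddings = []
    for i in range(0, len(descriptions), batch_size):
        batch = descriptions[i:i + batch_size]
        encoded = tokenizer(batch, padding='max_length', truncation=True,
                            max_length=max_length, return_tensors='pt')
        with torch.no_grad():
            outputs = encoder(encoded['input_ids'], encoded['attention_mask'])
        embeddings.append(outputs[1].numpy())  # pooled "[CLS]" output
    return np.vstack(embeddings)

# embeddings = embed_descriptions(df.cleaned_description.tolist())
# X = np.hstack([tabular_features, embeddings])  # tabular_features: illustrative
# ...then fit any supervised regressor (e.g. the benchmark LGBM) on X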

Another approach would be to fine-tune a similar BERT-based network that also takes the numerical and categorical features as input. These features could, for example, be appended to the final hidden states before being sent through the regression layer, as sketched below.
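A rough sketch of what such a network could look like, extending the CamembertRegressor defined earlier; n_tabular_features is an illustrative parameter for the number of additional numerical and (encoded) categorical inputs:

import torch
import torch.nn as nn
from transformers import CamembertModel

class HybridCamembertRegressor(nn.Module):
    """Illustrative sketch: concatenate the pooled text vector with tabular features."""

    def __init__(self, n_tabular_features, drop_rate=0.2):
        super().__init__()
        self.camembert = CamembertModel.from_pretrained('camembert-base')
        self.regressor = nn.Sequential(
            nn.Dropout(drop_rate),
            nn.Linear(768 + n_tabular_features, 1))

    def forward(self, input_ids, attention_masks, tabular_features):
        outputs = self.camembert(input_ids, attention_masks)
        text_vector = outputs[1]  # pooled "[CLS]" output, as in CamembertRegressor
        combined = torch.cat([text_vector, tabular_features], dim=1)
        return self.regressor(combined)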

For both approaches, great care will have to be taken to manage the curse of dimensionality. Indeed, we can expect the dimension of the text embedding features, or the dimension of the final hidden states, to be significantly larger than the dimension of the numerical and categorical features. Dimensionality reduction or weighting methods could thus come in handy.

Conclusion

In our previous article, we used a tabular set of numerical and categorical features to predict list prices. Today, we showed that using only the textual feature, we are capable of reaching very similar performance. These results are satisfactory, far better than random, but can surely be improved.

Most of the posts and tutorials we come across about fine-tuning BERT focus on supervised classification tasks. Through this price prediction use-case, we showed that it can effectively be used for supervised regression tasks too.

Acknowledgements

I would like to give a special thanks to Kawtar Zaher, Anne-Marie Heng, Lorraine Hickson and Louis Boulanger for their great contribution to this article.
