Leveraging Language Models for Data Enrichment

Alessandro Paticchio
Casavo
Jun 16, 2023 · 10 min read

If you’ve been reading our latest AI blog posts (and in case you haven’t, I strongly encourage you to catch up here, here, and here), you likely understand the vital importance of data at Casavo. Data serve as the cornerstone for every decision we make, whether it’s determining which houses to acquire, estimating their prices, or devising optimal divestment plans. Needless to say, in order to make informed decisions and take the most effective actions, we must have access to the highest quality data available and, if possible, refine them even further.

Eva, our Automatic Valuation Model, depends on reliable data more than any other component. Eva is a Machine Learning model designed to quickly estimate the Fair Market Value of an asset. It has been trained on an extensive dataset of Real Estate transactions, enabling it to capture the market dynamics specific to each city where we operate. To perform at its best, Eva needs comprehensive property features, such as location, size, floor level, and renovation status.

In particular, the renovation status of a property greatly influences its appraisal, with a noticeable price difference between a recently renovated property and one requiring significant work. This is especially true for brand-new, never-occupied apartments, which tend to command high prices. However, valuable as it is, this feature is not available for every property. And this is where Language Models come into play.

Although property data may not explicitly include the “renovation status” feature, they often come with textual descriptions provided by brokers on listing platforms. These descriptions contain a wealth of information about the property, including the renovation status, albeit in unstructured text format.

Details of an Italian property description that indicate the renovation status of a property

Fortunately, there are various Machine Learning models available that can assist in extracting information from such text. However, to train these models, it was necessary to label each listing description, indicating whether the described apartment was “not renovated,” “renovated,” “habitable,” or whether the renovation status could not be determined from the text. So, guess who we asked for data labeling?

GPT for Data Annotation

If you’ve been even remotely aware of recent developments in the field of natural language processing, you’re undoubtedly familiar with ChatGPT and its underlying model, GPT. GPT, short for Generative Pretrained Transformer, is a highly advanced Large Language Model created by OpenAI. It has showcased astonishing performance across various natural language processing tasks, including question-answering, reasoning, and text classification. OpenAI has made the power of GPT accessible to developers through a dedicated API that simulates interactions with a chatbot.

In light of this, we devised a specific prompt that instructs GPT to classify listing descriptions into four main categories: not-renovated, habitable, renovated, and new. Additionally, we included a bonus class, not-available, for cases where the renovation status is not explicitly mentioned in the text. The development of our prompt involved a systematic process of trial and error. We began by manually labeling a set of examples and then posed the same queries to GPT. By thoroughly comparing the model’s responses with our own labels, we fine-tuned the prompt to align GPT’s answers as closely as possible with our intended classifications. We repeated this iterative process 2–3 times, until we were satisfied enough to proceed with extensive labeling across all available data.

Asking GPT to label a Spanish description

For instance, we designed a prompt tailored for French listings, along the lines of:

You are a helpful assistant that is extremely skilled 
at extracting information from French listing descriptions in a structured way.
For each description, I need you to accurately extract
the renovation status of the property.

In addition, you should always provide the reason for labeling
a property with the specific renovation status by explicitly
mentioning the relevant passage in the description.
If the information is not available, indicate it accordingly.

Here is the desired output format:
Reason: {text}
Renovation status: {text}

You can choose from the following five labels:

* new: if the property is new, under-construction, or will be built in the future

* renovated: if the property has been recently renovated, but it's not new

* habitable: if the property is in good condition and/or doesn't require further renovation work

* not-renovated: if it's explicitly stated that the property is not
renovated and/or needs renovation work

* not-available: if the renovation status is not explicitly mentioned or precisely specified

This is the description:
{text}
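To give a concrete idea, here is a minimal sketch of how such a labeling call might look with OpenAI’s Python client, using the pre-1.0 ChatCompletion interface that was current at the time of writing. The model name, temperature, and helper function are our illustrative assumptions, not necessarily the exact setup:

import openai

openai.api_key = "YOUR_API_KEY"  # in practice, read from an environment variable

# The instructions shown above go into the system message;
# the listing description is sent as the user message.
SYSTEM_PROMPT = """You are a helpful assistant that is extremely skilled
at extracting information from French listing descriptions in a structured way.
[... rest of the prompt shown above ...]"""

def label_description(description: str) -> str:
    """Ask GPT to classify the renovation status of a single listing."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumption: the post does not name the exact model
        temperature=0,          # deterministic answers make labels more consistent
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"This is the description:\n{description}"},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(label_description("Bel appartement entièrement rénové en 2021, proche du métro."))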

Model building

Once GPT has labeled a subset of the data, one might be tempted to keep using the API for inference across the whole dataset. However, given the ever-increasing size of the dataset, we deemed this route unfeasible, from both a time and a cost perspective.

Moreover, textual descriptions of properties are pretty uniform: a simpler and lighter model with access to high-quality labeling should be capable of predicting renovation status with reasonable accuracy.

Once we had collected a reasonable amount of labeled data, we thus built our own Language Model. We actually trained three different models on the labeled data and then combined their predictions with the Stacking method. Let’s go through each step.

DistilBERT

In our resource-constrained environment, where we face the daunting task of processing tens of thousands of data points on a daily basis, we sought a solution that would strike a balance between performance and efficiency: DistilBERT, a compressed version of BERT obtained through knowledge distillation and a carefully designed architecture. Its ability to retain much of BERT’s power while being far more lightweight made it an ideal candidate for our needs.

Thankfully, Hugging Face’s API came to our rescue, providing us with a seamless experience in fine-tuning DistilBERT on our labeled data. The beauty of Hugging Face’s API lies in its simplicity. We effortlessly attached a Classification Head to the backbone of DistilBERT, leveraging its powerful text embeddings to tackle our specific classification task. This meant that we could bypass arduous and time-consuming preprocessing steps, as the tokenizer and network were adept at understanding and extracting the relevant information for our objective. This not only saved us precious time but also reduced the risk of inadvertently discarding crucial contextual clues in the data.
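For the curious, fine-tuning might look roughly like the following sketch built on the transformers and datasets libraries. The checkpoint name, training arguments, and toy data are assumptions for illustration, not our production configuration:

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

LABELS = ["new", "renovated", "habitable", "not-renovated", "not-available"]

# The checkpoint is an assumption: a multilingual backbone suits listings
# written in Italian, French, and Spanish
CHECKPOINT = "distilbert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
# Passing num_labels makes the API attach a fresh classification head
# on top of the DistilBERT backbone
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=len(LABELS)
)

# Toy rows standing in for the GPT-labeled descriptions
train_ds = Dataset.from_dict({
    "text": [
        "Appartamento completamente ristrutturato nel 2022.",
        "Immobile da ristrutturare, impianti da rifare.",
    ],
    "label": [LABELS.index("renovated"), LABELS.index("not-renovated")],
})

def tokenize(batch):
    # Truncate long descriptions to the model's maximum input length
    return tokenizer(batch["text"], truncation=True, padding="max_length")

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-renovation", num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()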

One of the remarkable strengths of BERT-like models, including DistilBERT, is their consistently strong performance across a wide range of NLP tasks. Their comprehensive language understanding and contextual embeddings enable them to capture nuanced patterns and dependencies in text, resulting in superior results on tasks such as sentiment analysis, named entity recognition, and question answering.

However, it is essential to acknowledge that these models come with a trade-off. Due to their large number of parameters, deploying these behemoths in a production environment can be a challenge. The sheer size of the model poses constraints in terms of memory requirements, computational resources, and inference speed. Thus, it becomes crucial to carefully consider the specific infrastructure and deployment constraints before incorporating such models into a real-time production system.

WordModel

Since we were aiming for something reasonably fast, we also turned to classic NLP methods. In particular, we resorted to a Bag of Words representation of our corpus, training a TF-IDF vectorizer to transform descriptions into vectors of numbers.

The scikit-learn implementation offers two ways to break text into meaningful elements, a procedure called Tokenization: by words and by n-grams of characters. Let’s first talk about how we employed the former.

Given a piece of text, our model must be able to understand where a word ends and another begins. This can be done using Regular Expressions (as in the scikit-learn default) or with methods crafted for the language at hand, such as those offered by the spaCy and NLTK libraries.

Moreover, it is often convenient to map words to a base form, so as not to end up with an exploding number of features that carry the same meaning. Two popular ways to do so are Lemmatization and Stemming: we won’t go into the details of these procedures; suffice it to say that, here too, there are several pre-built methods to choose from in NLTK and spaCy, plus language-specific extensions.

One should not, however, blindly apply Lemmatization or Stemming, since the original word could carry meaning that would get lost: think, for instance, of “rénové” vs. “à rénover” in French. We thus also considered the possibility of not reducing words to their root form at all.

Once we have converted text into a vector of numbers, it can be digested by a Machine Learning model. To find a sweet spot between speed and predictive power, we turned to Gradient Boosting, and more specifically to the library LightGBM.
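Concretely, the word-level pipeline can be sketched along these lines; the toy data and hyperparameter values are placeholders, not our tuned configuration:

from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy data standing in for the GPT-labeled corpus
train_texts = [
    "Appartement entièrement rénové, cuisine équipée neuve.",
    "Prévoir des travaux de rénovation importants.",
]
train_labels = ["renovated", "not-renovated"]

# Word-level TF-IDF features feeding a Gradient Boosting classifier
word_model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", sublinear_tf=True)),
    ("clf", LGBMClassifier(n_estimators=300, learning_rate=0.05)),
])
word_model.fit(train_texts, train_labels)
print(word_model.predict(["Magnifique appartement refait à neuf"]))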

Both the TF-IDF vectorizer and the LightGBM classifier have a host of hyperparameters, whose effect on prediction quality can hardly be overstated. To find the optimal set of hyperparameters we relied on the popular library Optuna. Feature extraction and classification were chained into a Machine Learning pipeline, and all the hyperparameters were optimized together in the same run.

Rather than choosing a Tokenization method and a Lemmatization or Stemming method (or none at all), we considered the different choices as hyperparameters, on the same footing as the others.
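A sketch of how such a joint optimization could look with Optuna follows. The search spaces are illustrative, and in the real pipeline the tokenization and lemmatization choices would map to callables passed to the vectorizer:

import optuna
from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def objective(trial):
    # Preprocessing choices sit on the same footing as numeric hyperparameters.
    # A categorical choice like trial.suggest_categorical("tokenizer",
    # ["regex", "spacy", "nltk"]) would select a tokenizer callable to pass
    # to TfidfVectorizer(tokenizer=...); omitted here for brevity.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, trial.suggest_int("ngram_max", 1, 3)),
            min_df=trial.suggest_int("min_df", 1, 20),
            sublinear_tf=trial.suggest_categorical("sublinear_tf", [True, False]),
        )),
        ("clf", LGBMClassifier(
            n_estimators=trial.suggest_int("n_estimators", 100, 1000),
            learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            num_leaves=trial.suggest_int("num_leaves", 15, 127),
        )),
    ])
    # train_texts / train_labels: the full labeled corpus
    return cross_val_score(pipeline, train_texts, train_labels, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)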

CharModel

Scikit-learn’s TfidfVectorizer also offers a conceptually simpler way to split text into tokens, as an alternative to words: n-grams of letters. While intuition may tell us they should be less expressive than words in carrying meaning, they offer several advantages:

  • They don’t depend on rules for word separation;
  • They are less sensitive to misspelled words;
  • They can easily capture interactions between consecutive words.

For these reasons, we also built a model based on this feature extraction method.

On the practical side, it parallels the WordModel, apart from the hyperparameters regarding Tokenization, Lemmatization, and Stemming.
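In scikit-learn this amounts to a different analyzer; a tiny sketch, where the n-gram range is illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

# "char_wb" builds character n-grams only inside word boundaries,
# which keeps the feature space smaller than plain "char"
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))
X = char_vectorizer.fit_transform(["Bel appartement rénové, très lumineux"])
print(char_vectorizer.get_feature_names_out()[:10])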

Stacking

Having developed and trained three different models, each with its own advantages and peculiarities, rather than picking one we opted to combine them: combining model predictions often reduces noise and improves accuracy.

In particular, we chose the Stacking method. When stacking, one meta-model is trained on the predictions of the single base models. In a classification problem like ours, the predicted probabilities of the different classes are usually used as features of the meta-model.

As for the meta-model, we employed a standard Logistic regression: this effectively calculates a weighted average of the predicted probabilities and then applies a logistic function on top.

Note that both Stacking and its cousin technique, Blending, involve using the base models to predict hold-out sets. A simpler, naive approach would be to train the base models on the whole training set and use their predictions (again on the training set) as features for the meta-model. However, if the base models overfit, residuals will be underestimated, and the meta-model won’t be accurate. We don’t run this risk if the meta-model is not trainable, for instance when we take simple averages or apply majority voting: this approach, however, carries another name (Voting).

It is worth noting that, in order for Stacking to be useful, base models should be sufficiently different (i.e. their predictions should be somewhat uncorrelated), even if paradoxically this means that they have lower quality on average. This explains why it is common to see Stacking applied to neural networks and tree-based models, as in our case.
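Scikit-learn’s StackingClassifier implements exactly this hold-out scheme via internal cross-validation. A sketch with the two TF-IDF models is below; the DistilBERT model would need a scikit-learn-compatible wrapper exposing predict_proba, which we omit to keep the example self-contained:

from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

word_model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word")),
    ("clf", LGBMClassifier()),
])
char_model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ("clf", LGBMClassifier()),
])

stack = StackingClassifier(
    estimators=[("word", word_model), ("char", char_model)],
    final_estimator=LogisticRegression(),  # the meta-model
    stack_method="predict_proba",  # feed class probabilities to the meta-model
    cv=5,  # out-of-fold predictions, so base-model overfitting doesn't leak
)
# train_texts / train_labels: the full labeled corpus
stack.fit(train_texts, train_labels)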

Our final model to classify property descriptions

Results

Although each individual model performed well on its own, we were convinced that combining them would further enhance performance, considering their distinct approaches to processing input.

Let’s examine the final Confusion Matrix obtained on the test set by Brunelleschi, the name we gave to our final stacked model:

Confusion matrix of Brunelleschi

To start, a Confusion Matrix is considered good when the majority of samples align along the diagonal, indicating that the predicted labels mostly match the ground truth. This holds true for our matrix.
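For reference, computing such a matrix is a one-liner with scikit-learn; here y_true and y_pred are toy stand-ins for the test-set labels and Brunelleschi’s predictions:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

LABELS = ["new", "renovated", "habitable", "not-renovated", "not-available"]

# Toy stand-ins for the test-set ground truth and model predictions
y_true = ["renovated", "habitable", "not-available", "not-available"]
y_pred = ["renovated", "not-available", "not-available", "habitable"]

cm = confusion_matrix(y_true, y_pred, labels=LABELS)  # rows: truth, columns: prediction
ConfusionMatrixDisplay(cm, display_labels=LABELS).plot()
plt.show()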

It is noteworthy that the model faces the most difficulty in the not-available class, which is also the most frequent. This is unsurprising since GPT, responsible for labeling these samples, could only analyze one data point at a time. In contrast, our models optimized their loss function by considering multiple samples simultaneously and likely observed that many samples labeled as not-available shared similarities with other classes. In fact, the most common error made by Brunelleschi is confusing not-available with habitable, two classes that are easily interchangeable.

Nevertheless, we conducted a manual analysis to identify instances where our model struggled, and it turned out that even when errors occurred, they were acceptable because the text itself could be interpreted in various ways.

Conclusion

At Casavo, we are dedicated to harnessing the power of innovation to optimize our operations. By utilizing Language Models like GPT for data annotation, we have successfully extracted valuable information from unstructured text, revolutionizing our ability to classify renovation statuses accurately. Looking ahead, we are excited about exploring additional frontiers, such as extracting information from images, to further enhance our data-driven decision-making processes. With a steadfast commitment to innovation, we are poised to continuously refine our methods and provide the highest quality service to our customers in the ever-evolving real estate market.

This blog post is co-authored by Alessandro Paticchio and Marco Gigli, Data Scientists in the Pricing team of Casavo.
