Embedding categorical variables with a pre-trained language model shows enhanced deep neural net predictive power / speed
For the sake of demonstration, I’ve prepared two different versions of a model meant to predict the daily price of AirBnB properties.
This project explores the differences in model performance in terms of speed and accuracy achieved when all categorical variables in a dataset are replaced with a numerical representation derived from a pre-trained language model.
The resulting model is then compared to a version trained on data where the categorical variables are left in their original form. For the purposes of this demonstration I will be using the Boston AirBnB dataset available on Kaggle: https://www.kaggle.com/airbnb/boston/data
Categorical vars transformed based on language embedding:
Each categorical value can be replaced by a vector representing its set of words. Note that in the case of FastText, embeddings are calculated by first parsing words into subwords and then taking the average of all subword embeddings to produce a single representation of the full “phrase”.
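To make the averaging idea concrete, here is a toy sketch in plain Python. It does not load real FastText vectors: the hash-based `toy_vector` function, the 8-dimensional size, the 3-to-5 character n-gram range, and the example category strings are all assumptions made for illustration. Only the shape of the computation (split into subwords, embed each, average) mirrors the description above.

```python
import hashlib

DIM = 8  # toy size chosen for readability; an assumption, not FastText's real dimensionality


def char_ngrams(word, n_min=3, n_max=5):
    """Split a word (wrapped in boundary markers) into character n-grams."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def toy_vector(token, dim=DIM):
    """Deterministic stand-in for a learned embedding: hash bytes scaled to [-1, 1]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(b / 255.0) * 2.0 - 1.0 for b in digest[:dim]]


def mean_vectors(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed_word(word):
    """Average the subword vectors of a single word."""
    return mean_vectors([toy_vector(gram) for gram in char_ngrams(word)])


def embed_phrase(phrase):
    """Average the word vectors to get one vector for the whole category label."""
    return mean_vectors([embed_word(word) for word in phrase.lower().split()])


# Hypothetical category label, used only as an example input.
vec = embed_phrase("Entire home/apt")
print(len(vec))
```

With a real pre-trained model, the hash function would be replaced by lookups into the learned subword vectors; the averaging steps stay the same.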
As you will see, replacing our categorical variables with continuous variables results in little to no loss in accuracy, increases model speed drastically, and leaves a lot of room for improvement.
But first, some housekeeping: data preprocessing
Before diving deeper into the primary question, let’s first do some housekeeping to prep our datasets by loading the relevant tables and then exploring the data to answer some basic questions.
Some preprocessing needs to be done to take care of the null data points in the listings dataset.
All null values in this analysis are replaced by placeholder values to enable the model to “recognize” when nulls are present.
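One common way to do this is sketched below, under stated assumptions: categorical nulls get a placeholder label, while continuous nulls are filled with the column median and flagged in an extra `_na` column. The column names, the `"missing"` label, and the median choice are illustrative and not necessarily what the notebooks use.

```python
from statistics import median

# Made-up rows standing in for listings data; None marks a null.
rows = [
    {"bedrooms": 2.0, "property_type": "Apartment"},
    {"bedrooms": None, "property_type": "House"},
    {"bedrooms": 1.0, "property_type": None},
]


def fill_nulls(rows, cat_cols, cont_cols, cat_placeholder="missing"):
    """Return copies of rows with nulls replaced and a null-flag per continuous column."""
    medians = {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in cont_cols
    }
    filled = []
    for r in rows:
        new = dict(r)
        for col in cat_cols:
            if new[col] is None:
                new[col] = cat_placeholder  # the model sees nulls as their own category
        for col in cont_cols:
            new[f"{col}_na"] = new[col] is None  # flag lets the model "recognize" nulls
            if new[col] is None:
                new[col] = medians[col]
        filled.append(new)
    return filled


clean = fill_nulls(rows, cat_cols=["property_type"], cont_cols=["bedrooms"])
print(clean[1])
```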
As part of this process, the data was pre-processed following the CRISP-DM framework to answer some basic questions about the data.
Since AirBnB is a crowdsourced platform for renting out real estate, one can imagine that being able to price efficiently would be a useful skill.
In order to model price effectively, it’s useful to understand our data structure and how the different datasets come together.
For this exercise we have 3 different datasets.
- listings.csv: contains property and host specific information from a recent web scrape.
- reviews.csv: contains reviews for the various properties in listings.csv. This can be joined using the listing_id key.
- cal.csv: contains dates of stays and prices for each stay. This is where our time series and pricing information comes from, and it serves as our source for the dependent variable. Both listings.csv and reviews.csv can be joined to this table using the listing_id field.
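The way the tables come together can be sketched as a left join on listing_id, with the calendar rows as the base table. The rows and the non-key column names below are made up for illustration; with DataFrames, the same step is typically a `merge` with `how="left"`.

```python
# Made-up rows standing in for cal.csv and listings.csv.
calendar_rows = [
    {"listing_id": 1, "date": "2017-01-01", "price": 100.0},
    {"listing_id": 1, "date": "2017-01-02", "price": 110.0},
    {"listing_id": 2, "date": "2017-01-01", "price": 80.0},
]
listing_rows = [
    {"listing_id": 1, "bedrooms": 2},
    {"listing_id": 2, "bedrooms": 1},
]


def left_join(left, right, key):
    """Attach the matching right-hand row (if any) to every left-hand row."""
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key], {})
        joined.append({**match, **row})  # left-hand values win on name clashes
    return joined


joined = left_join(calendar_rows, listing_rows, key="listing_id")
print(joined[0])
```

Keeping the calendar table on the left preserves one row per listing per date, which is the grain the price model is trained on.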
After the tables are joined, let’s look at the distribution of pricing across our properties to get an idea of the business. Additionally, let’s look for any obvious correlations with pricing.
The correlation matrix shows that host-listing-count, number of people accommodated by a property, bathrooms, bedrooms, and beds have the strongest correlation with price. These variables in isolation, however, don’t tell the whole story, as we will see when we look at predicting price. Like most things in life, pricing is the product of multiple features.
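For readers who want to see what sits behind one cell of such a matrix, here is a small sketch of the Pearson correlation coefficient. The numbers are invented purely to exercise the function and are not taken from the Boston dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented values: guests accommodated vs. nightly price.
accommodates = [1, 2, 2, 4, 6]
price = [50.0, 80.0, 75.0, 150.0, 220.0]

r = pearson(accommodates, price)
print(round(r, 3))
```

A value near 1 or -1 signals a strong linear relationship with price, but as noted above, it says nothing about how features interact.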
Integrating Review Data
Taking advantage of review data is often tough. However, if you happen to have a sentiment analysis model handy, you can quantify the number of positive and negative reviews on a per-property basis. Here I used a previously trained model based on one of the FastAI classes I took.
For information on how to train an IMDB-based sentiment analysis model, look here:
To stay focused on the topic at hand, I am just loading in a csv file with listing_ids and positive/negative sentiment counts in separate columns, where sentiment was inferred with the IMDB model. There is probably more fine-tuning I could do here, but since the goal is to demonstrate the use of embeddings in time series analysis, I’m not worrying about that.
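The per-listing counts in that csv file can be produced with a simple aggregation once each review has a predicted label. The sketch below assumes the sentiment model has already been run; the `(listing_id, label)` pairs and the output column names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical model output: one (listing_id, predicted label) pair per review.
predictions = [
    (1, "positive"),
    (1, "negative"),
    (1, "positive"),
    (2, "positive"),
]


def count_sentiment(predictions):
    """Aggregate review-level labels into one row of counts per listing."""
    counts = defaultdict(Counter)
    for listing_id, label in predictions:
        counts[listing_id][label] += 1
    return [
        {
            "listing_id": listing_id,
            "positive_reviews": c["positive"],
            "negative_reviews": c["negative"],  # Counter returns 0 for unseen labels
        }
        for listing_id, c in sorted(counts.items())
    ]


sentiment_rows = count_sentiment(predictions)
print(sentiment_rows)
```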
Prepping the dataset for Time-Series
The final steps needed in order to run this through a neural network include the following:
- Parse dates further so that the model has an easy time recognizing things such as seasonality and other time-of-the-year dependent factors. FastAI tabular has a convenient API that allows us to easily parse dates using the function add_datepart()
- Normalize all continuous variables to have a mean of 0 and a standard deviation of 1. FastAI tabular has a convenient API for this as well that we will touch on in a moment.
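Both steps can be sketched without any framework. The code below is a plain-Python approximation of what these steps do, not FastAI's implementation: `date_parts` emits a few fields of the kind add_datepart() produces (the field names here are my own), and `standardize` rescales a column to mean 0 and standard deviation 1.

```python
from datetime import date
from statistics import mean, pstdev


def date_parts(d):
    """Expand a date into simple features a model can use to pick up seasonality."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "day_of_week": d.weekday(),          # Monday = 0
        "day_of_year": d.timetuple().tm_yday,
        "is_month_start": d.day == 1,
    }


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


parts = date_parts(date(2017, 7, 4))
scaled = standardize([1.0, 2.0, 3.0, 4.0])
print(parts["day_of_week"], parts["day_of_year"])
```

In practice the mean and standard deviation should be computed on the training rows only and then reused for validation and test data.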
Deep Networks for Price Prediction
Luckily, with neural networks we don’t have to manually engineer interaction features, as the architecture takes care of that for us. To keep things easy, I’m once again leveraging FastAI:
Finalize the Data:
Notice that I’m including the listing_id as a plain categorical variable — this allows the model to “know” that it’s looking at a different instance of the same listing if appropriate.
One other thing to notice is that our dependent variable has been converted to log price. Prices are typically right-skewed, and taking the log compresses the long tail of expensive stays, which makes the target easier for the model to predict.
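As a small sketch of the transform and its inverse (the prices are invented and the helper names are my own): the model is trained on the log of the price, and predictions are mapped back to dollars with the exponential.

```python
import math

# Invented nightly prices; the log transform requires strictly positive values.
prices = [50.0, 100.0, 400.0]


def to_log_price(price):
    """Target used for training: natural log of the price."""
    return math.log(price)


def to_price(log_prediction):
    """Map a prediction on the log scale back to a price."""
    return math.exp(log_prediction)


log_prices = [to_log_price(p) for p in prices]
recovered = [to_price(lp) for lp in log_prices]
print([round(lp, 3) for lp in log_prices])
```

On the log scale, an error of a given size corresponds to roughly the same percentage error whether the listing is cheap or expensive.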
Next Define the Model:
After defining the network, we are ready to train the model. Let’s compare the results with those from the version of the model trained using the embeddings from FastText in place of the original categorical variables:
From this experiment I was able to create a highly accurate price prediction model that beats most scores that I’ve seen on Kaggle. The second point to notice is that, by including a scalar representation of the categorical variables, training time per epoch is a whopping 60% faster.
Conclusions / Why is this so cool?
This experiment shows (at a high level) that language models can be useful for representing categorical variables, since embedded representations convey meaning and relative relatedness across logically named categories. Expressing these differences as numbers instead of words allows the model to distinguish between categories that are similar to one another and those that are inherently different.
Further work would need to be done, but I suspect that using full representations (rather than converting the embedding to a scalar like I did here) would result in large increases in predictive performance. Additionally, using state-of-the-art architectures would likely further improve the model’s predictive power as well.
Either way, the increase in processing speed alone is a good reason to try this out on a dataset of your own!
— — — — — — — — — — — — — — — — — — — — — -
Any thoughts on the use of transfer learning to represent categorical variables in structured data?
Let me know what you think regarding the use of transfer learning and Embeddings in time series analysis. I haven’t seen much if anything written on this so it would be great to get the community’s feedback on this methodology.
Note that detailed notebooks are available on my GitHub here: https://github.com/BradEvanDavis/time-series