How I lost a silver medal in Kaggle’s Mercari Price Suggestion Challenge
using CNNs and Tensorflow
In January 2018, I entered a Kaggle competition called the Mercari Price Suggestion. The competition lasted three months and ended a few weeks ago. The challenge was to build an algorithm that automatically suggests product prices to online sellers, based on free-text descriptions, product categories and a few other attributes.
I joined the competition a month before it ended, eager to explore how to use Deep Natural Language Processing (NLP) techniques for this problem.Those four weeks competing were a sequence of curiosity, engagement, and excitement. I ended up with a single end-to-end Deep Learning model (no ensembles) implemented in Tensorflow, which landed me in the the 35th position (out of 2,384 teams) in the Private Leaderboard. Then came the deception. And I will tell you how I lost my silver medal in that competition.
Mercari, Japan’s biggest community-powered shopping app, wanted to offer pricing suggestions to sellers, based solely on descriptions of the products and some additional categorical attributes. Sellers offered a variety of brand new and used products of different brands, from sweaters to smartphones, which added an interesting complexity to the challenge at hand.
It was a kernel competition, which means that the complete Machine Learning (ML) pipeline runtime was limited to only 60 minutes and computation resources were constrained to a kernel with 4 CPU cores and 16 GB RAM (no GPUs).
In regular Kaggle competitions, competitors download data and train/evaluate their models in their own infrastructure. Usually, winners build a pipeline that runs for hours or days, training dozens of models with different algorithms, which are then ensembled in crazy ways. For large datasets, competitors with more computational resources (either on cloud or on-premises), like Spark clusters or servers with large GPUs, have an obvious advantage over others.
This was not the case for this kernel competition, which somewhat leveled the competitors chances in terms of computing power, and encouraged us to use creativity to optimize our ML pipeline using parallelism, caching and smart algorithms and data structures. Awesome learning so far! You can see a discussion about kernel competitions here.
The competition data structure was pretty simple–a TSV file with a list of product announcements composed by the following attributes: name, item_description, item_condition_id, category_name, brand_name, shipping (1 if the shipping fee is paid by the seller and 0 if it’s by buyer). The target field was price, in US Dollars.
The train set had over 1.4 million products listed, and the test set for Phase 1 had about 700,000 products (no price provided). In Phase 2, the test set size was increased 5x, making it important to prepare the ML pipelines to run within the time and hardware constraints.
Exploratory Data Analysis
An initial analysis revealed that price followed a long-tailed distribution, where most products had lower prices. The median price was $17 and 75% of the products (Q3) cost less than $29. On the other end, there were products priced above $2,000.
I was interested in analyzing the prices distribution by category, brand and shipping (whether or not it was included in the vendor’s price), to understand how prices for similar products can vary. Coefficient of Variation (CV) statistic was used to provide a standardized measure of dispersion for frequency distribution, which is the standard deviation divided by the mean:
Here is a list with the top 10 combinations of categories, brands and shipping, for which larger variation in prices was observed (high Coefficient of Variation)
If we drill down the investigation of the top line (Mary Kay’s face makeup products) we see that whilst the median price is $10 and average price is $17.60, there are products over $100, which are in fact sets with many products, like the examples below. Such information is only available in the textual attributes, which made this a very interesting NLP problem to solve.
Here is a gorgeous Exploratory Data Analysis (EDA) interactive kernel made available by one of the competitors, exploring product descriptions’ main keywords (TF-IDF clustering) and topics (LDA).
Machine Learning Pipeline
I made my solution code for the competition publicly available as a Kaggle kernel. It is an end-to-end Deep Learning model implemented in Tensorflow, with Convolutional Neural Networks (CNNs) layers to process textual data, and Feed-Forward layers to merge with other features.
Feature engineering is crucial for any Machine Learning model. Companies and Kaggle competitors usually reveal machine learning algorithms used in their solutions, but rarely describe features used to build their models.
I’ve been curating a catalog of feature engineering techniques, that you can find on these slides. You can also take a look at this post series, where I describe how feature engineering can help you be among the top 20 in another Kaggle competition.
Generally, the Root Mean Squared Error (RMSE) metric is used for regression tasks. But as price followed a long-tailed distribution (50% of the products were under $10), in order to make errors on low price product more relevant than for higher prices, the metric chosen for competition evaluation was Root Mean Squared Logarithmic Error (RMSLE). Thus, I applied the log transformation to the price target variable, to make this assumption available for model training.
Dealing with missing information is usually important to maximize results in ML. In this training dataset, more than 632,000 products have no brand information. But, for many of them, the brand name could be found in the product name or description. Thus, a heuristic algorithm was developed in order to guess the “null” brand names from product names description, filtering found brands’ common categories by the actual product category. 87,000 brand names were found by this algorithm!
Product categories were parsed to be incremental. For example, for category “Women/Jewelry/Rings”, three categorical features were generated: “Women”, “Women/Jewelry”, and “Women/Jewelry/Rings”. With those category levels, the model could learn to generalize patterns from top categories and apply it for very specific subcategories, with fewer examples.
Some features were generated based on price statistics grouped by category, brand name and shipping like median, average, coefficient of variation and also expected minimum and maximum value for a given price (average +/- two standard deviations (sigmas)).
Additionally, some features were generated from textual content statistics, like description size, counts of exclamation punctuation, uppercase words, hashtags, and others. It was not clear whether those features did actually correlate with product prices, but they were included so that neural networks could guess whether those features had some predictive power combined with others.
The textual attributes name and item_description were important predictors of product pricing. It may be hard to find patterns in free-text attributes, as users have different writing styles, and may highlight specific characteristics of their products in many ways, using different word sequences, upper-case words and punctuation. Announcements of product packs or sets, as exemplified before, also pose additional challenges.
As usual, competitors’ public kernels provided a great learning resource. This one, especially, leverages the ELI5 framework to analyze a model’s most important textual features, making it possible to debug and improve text tokenization.
The first step in any NLP solution for Machine Learning (ML), is to decide how to structure the text.
A common approach is to use a vectorization technique called TF-IDF, which generates a vector for each text, whose dimension is defined by the vocabulary size of the corpus (e.g., number of words in English language) and larger values means relevant words.
To add some semantics to this representation with small phrases, its common to use bi-grams (e.g., “star wars”, “machine learning”) in the vectorization. This usually increases exponentially the vocabulary size (vectors dimension). To reduce the memory requirements, especially when modeling n-grams, a Hashing Vectorizer can also be used to set a fixed dimension for the vectors, introducing some collision, but reducing memory and computing requirements.
Another alternative is to use a Deep NLP architecture, like Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN), e.g., LSTM or GRU, to learn how to best represent the text in a dense fixed-length vector.
Words are naturally a sparse representation, as in a textual document you usually find only a small portion of the words available in a given language vocabulary. For its usage by neural networks, it is important to have a dense representation for words, named word embeddings. Word embeddings can be pre-trained on millions of documents using unsupervised learning. Techniques like Word2Vec and GloVE have been used to transfer learning of pre-trained word embeddings for smaller dataset models. Such embeddings carry semantics of the words, so that similar words have similar representations (clustered together in the embeddings’ vector space model).
Model Architecture and Training
Due to the runtime constraints for this competition (and the lack of GPUs), I opted for a CNN architecture, for its faster training compared to RNNs. Deep Learning frameworks, like Tensorflow (and lower level NVIDIA cuDNN), have efficient parallelized implementations for CNNs. Here, I will skip explaining how CNNs works for NLP, pointing out to this very didactic post, which inspired my CNN architecture (as well as this kernel).
As text length is an important factor for a CNN architecture processing time, the product name was fixed to the first 20 words, and item description to a maximum of 70 words. For shorter texts, sequences were padded with a special token (<PAD>).Rare words (with less than 30 occurrences in the corpus) were replaced by a <UNK> special token to reduce the vocabulary size.
A GloVE dataset with 400,000 50-dimensional word embeddings, pre-trained in Wikipedia, was used to initialize model embeddings, like described in this post. For Out-Of-Vocabulary (OOV) words, not available in that GloVE dataset, I picked the strategy of random initialization of word embeddings, following GloVE vectors distribution for compatibility.
For categorical features with low cardinality (few unique values), like item_condition and shipping, one-hot encoding transformation was used, whilst for categorical features with high cardinality, like product category and brand_name, random dense embedding vectors were generated.
Finally, for numerical features like item_description statistics and price statistics, standardization feature scaling was used to provide a better representation for neural networks.
The final neural network architecture is presented below.
Numerical features and low cardinality categorical features (condition and shipping) are fed directly in the network. High-cardinality categorical features as passed through an embeddings’ lookup layer, which manages training and retrieval of categorical embeddings.
Textual features (name and description) have their words represented by a shared embeddings’ lookup layer. That sequence of trainable word embeddings, initialized with GloVE embeddings, is processed by a 1D convolution layer (window size of 3, to model tri-grams) with max pooling operation, and generates a fixed-size vector, whose dimensionality if the filter size.
The words embeddings’ Average Pooling layer was an interesting find for this architecture, which reduced model error by 0.003! This approach was devised under the hypothesis that the first words of a product title and description are the most relevant. Thus, this average pooling layer simply averages word embeddings features of the first 5 words of product name, and the first 20 words of its description.
Another useful trick was the usage of skip connections of numerical features and low cardinality categorical features before the last fully connected layer, so that model could use those predictive features directly in the last layer.
Much of the effort in this competition was to empirically tune training hyperparameters (e.g., batch size, learning rate decay, Adam optimizer settings), and evaluate model architectures, combining layers differently and varying embedding and layer sizes.
Layer normalization, dropout, and L2 normalization were experimented in many layers, trying to combat overfitting. Although, I empirically observed that letting the model overfit provided the best evaluation results, since I could train the model only for three epochs under the competition’s 60 minutes runtime constraint.
This full ML pipeline takes 48 minutes in a Kaggle kernel instance. I was possible to reduce the runtime to 38 minutes by reducing the maximum description length (from 70 to 50) and its convolution filter size from (96 to 64), with a very small decrease in model accuracy. That would provide some spare time to ensemble with a quicker ML algorithm, like Ridge Regression (see this awesome example with scikit-learn), and eventually improve the score. But, for this competition, I opted to explore further the possibilities of a single end-to-end Deep Learning model.
After 26 submissions, I was excited to get a single Deep Learning model which, according to my cross-validation score, would place me among the top 2% competitors.
Ps. In fact my model actually scored 0.41117 (computed with Late Submission) and would have landed me in the 35th position (out of 2,384 teams) of the private leaderboard .
Not bad for a single neural network solution, with no ensembling!
But, overnight, the end came…
As that was the first Kernel competition, the Kaggle platform had some issues that made competition deadlines inconsistent across the website and its communications. There was a submission deadline for Phase 1, by which competitors had to choose their best Kernels for Phase 2 (with a larger test set). Many other competitors, myself included, missed that submission date due to the inconsistent progress bar of the competition (see below) and also based on an email sent by Kaggle on the last submission day, informing us that we still had 7 days till the end of the competition.
And that’s how I lost my silver medal after one month of hard work.
Lesson learned: Pay attention to deadlines, specially when inconsistent. :)
The Mercari Price Suggestion Challenge provided tons of learning on both classical and Deep NLP approaches for text representation learning. I was excited to see how Deep Learning could help this solution to do well in the competition, even with just basic feature engineering and no model’s ensembling.
Time and hardware constraints for this Kernel competition leveled competitors’ computing resources and lead us to investigate how to build a more efficient code (faster and with lower memory usage), by using parallelism, smart algorithms and data structure.
I am eager for the next Kernel competition. I believe that soon Kaggle/Google, will provide GPUs to Kernel, as they’re already freely available in Google Colaboratory.
Deep NLP rocks!