Predicting package dimensions based on a similarity model at Mercado Libre

Kevin Clemoveki
Published in Mercado Libre Tech
Jun 2, 2020 · 6 min read

Introduction

Mercado Libre is the leading e-commerce platform in Latin America, reaching millions of users who sell and buy millions of different items every day. From a shipping perspective, some of the most important information about an item is its dimensions (weight, length, width and height), which we use to predict costs and forecast the occupancy of our fulfillment centers. Since ours is a user-driven marketplace, this information is not always available, and we have found that it is possible to optimize our flow by predicting it ahead of time.

Since the shipping cost of a package sold on Mercado Libre is calculated from its dimensions, we need one or more models that can predict the weight, length, width and height of each product to be shipped. The more accurate these estimates, the better the shipping cost forecast, which in turn creates a better experience for sellers and buyers. It also reduces the cost that Mercado Envíos (our shipping service) has to absorb when estimates turn out to be wrong.

In this article, we will present one of the various models developed at the company to solve this problem (predicting package weight and measurements), based on a nearest-neighbor search, i.e. using shipment data from similar products.

On what grounds do we determine product similarity?

Based on the metadata that defines an item, we consider similarity in title, category, brand and model.

Data and Features

The feature selection/extraction process and the normalization of the corresponding dimensions (the targets) were built around an ETL process defined specifically for this problem.

Step 1

Since item dimensions and item metadata live in different sources, the first step is to match the metadata of each item with the data (dimensions) corresponding to all of its shipments.
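As a rough illustration, and assuming both sources can be exported as tables keyed by an item id (the file and column names below are hypothetical, not our actual schema), this match is essentially a join:

import pandas as pd

# Hypothetical exports: one row per item vs. one row per shipment.
items = pd.read_parquet("items_metadata.parquet")           # item_id, title, category, brand, model
shipments = pd.read_parquet("shipment_dimensions.parquet")  # item_id, weight, height, width, length

# Attach the metadata of each item to every one of its shipments.
raw = shipments.merge(items, on="item_id", how="inner")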

Step 2

Next, this large volume of data becomes the source of the training, development and test datasets, based on a random split of the items, making sure that no item from the development or test datasets appears in the training one.
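A minimal sketch of such an item-level split, so that every shipment of a given item lands in exactly one dataset (continuing from the joined table above; the 80/10/10 proportions are illustrative):

import numpy as np

rng = np.random.default_rng(42)
item_ids = raw["item_id"].unique()
rng.shuffle(item_ids)

# Split over items, not over shipments, so no item leaks across datasets.
n = len(item_ids)
train_ids = set(item_ids[: int(0.8 * n)])
dev_ids = set(item_ids[int(0.8 * n): int(0.9 * n)])
test_ids = set(item_ids[int(0.9 * n):])

train_raw = raw[raw["item_id"].isin(train_ids)]
dev_raw = raw[raw["item_id"].isin(dev_ids)]
test_raw = raw[raw["item_id"].isin(test_ids)]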

Step 3

Once these datasets are obtained, the same set of transformations is applied to each of them: the text features (title, category, brand and model) are normalized and sanitized, and the dimensions are normalized as well. Before these transformations, an item can have a different set of dimensions (weight, height, length, width) for each of its shipments, as reported per shipment. It is therefore necessary to assign each item a single set of dimensions; we do this by taking the median of each dimension, which keeps outliers from skewing the result.
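The dimension-normalization part of this step can be sketched as a per-item median aggregation, again using the hypothetical column names from the sketches above:

dims = ["weight", "height", "width", "length"]

# Collapse the per-shipment dimensions of each item into a single,
# outlier-robust set of dimensions: the per-item median.
train = train_raw.groupby("item_id").agg(
    {**{d: "median" for d in dims},
     "title": "first", "category": "first", "brand": "first", "model": "first"}
).reset_index()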

Why do we split datasets before applying the transformations common to each of them?

From a business perspective, it is important to know what share of shipments is covered by the estimates our model generates. We therefore need these raw datasets to compute that coverage: once the transformations are applied, every metric the model is evaluated against, coverage included, belongs to the universe of items and not to the universe of shipments per item.

Model

The model follows a pipeline structure in which different NLP techniques are applied for text preprocessing and vectorization, followed by a new regression approach based on K approximate nearest neighbors.
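At a high level, the pipeline can be thought of as three chained stages. The sketch below is only a schematic composition of those stages (the class and method names are ours, for illustration, not the production code):

class DimensionsPipeline:
    """Schematic composition of the three stages described below."""

    def __init__(self, preprocessor, vectorizer, knn_regressor):
        self.preprocessor = preprocessor    # text normalization / sanitization
        self.vectorizer = vectorizer        # FastText embeddings
        self.knn_regressor = knn_regressor  # approximate KNN over embeddings

    def fit(self, X, y):
        texts = [self.preprocessor(x) for x in X]
        self.vectorizer.fit(texts)
        vectors = [self.vectorizer.transform(t) for t in texts]
        self.knn_regressor.fit(vectors, y)
        return self

    def predict(self, X):
        texts = [self.preprocessor(x) for x in X]
        vectors = [self.vectorizer.transform(t) for t in texts]
        return [self.knn_regressor.predict(v) for v in vectors]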

The following figure shows the architecture and the technological stack used:

Input

To better understand the two most important model flows, training and prediction, we can split the input into two parts.

  1. X: Set of features per item.

In this case, there is only a single text feature per item that contains the concatenation of the following features:

  • title
  • category
  • brand
  • model

Then, X is as follows:

  2. y: Set of dimensions per item.

  • weight
  • height
  • width
  • length

Then, y is as follows:
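As a purely illustrative example (the item text and values below are made up, not taken from actual catalog or shipment data), a single record could look like this:

# Hypothetical record, for illustration only.
x = "zapatillas running hombre calzado deportivo acme airfoam 2"      # title + category + brand + model
y = {"weight": 780.0, "height": 12.0, "width": 22.0, "length": 33.0}  # e.g. grams and centimeters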

Text Preprocessor

The objective of this step is to normalize and sanitize the text generated for each item through small textual transformations such as:

  • Strip Diacritical Marks
  • Tokenizer
  • Spell Checker
  • Lower Case

with the aim of generating a noise-free language model in the next step.
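A minimal sketch of such a preprocessing step (leaving out the spell checker and assuming plain whitespace tokenization) could look like this:

import unicodedata

def preprocess(text: str) -> str:
    # Lower case.
    text = text.lower()
    # Strip diacritical marks (e.g. "cámara" -> "camara").
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Tokenize on whitespace and re-join; a spell checker would plug in here.
    tokens = text.split()
    return " ".join(tokens)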

Vectorizer

In this step, we build a language model with FastText using all the previously preprocessed items. This later enables us to generate a vector representation of each input, i.e. a representative embedding of the items in question, in both the training and prediction flows.
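Using the fasttext Python package as an example (the corpus file name and hyperparameters are illustrative, not the ones used in production), this step boils down to training an unsupervised model on the preprocessed corpus and then querying sentence vectors:

import fasttext

# One preprocessed item per line in this hypothetical corpus file.
model = fasttext.train_unsupervised("preprocessed_items.txt", model="skipgram", dim=100)

# Embedding used both at training time (to index items) and at prediction time.
vector = model.get_sentence_vector("zapatillas running hombre calzado deportivo acme airfoam 2")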

KNN Regressor

The KNN Regressor is the last step of the model. It is in charge of storing the dimensions of each item used for training and of providing an estimate of the dimensions for any item, whether new or pre-existing, based on its similarity to the stored items.

How is the training process at this stage?

First, all the items, which at this stage are represented by a vector resulting from the transformation in the previous step, are stored in an Annoy index in order to speed up the approximate search for nearest neighbors and thus find similar items.

In addition, the respective dimensions of each item are saved in another storage, which will allow us to later retrieve the dimensions of the similar items and generate a new estimate.
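A minimal training sketch with Annoy, continuing with the hypothetical objects from the previous sketches (the dimensionality, distance metric, number of trees and the training_items iterable are all illustrative):

from annoy import AnnoyIndex

EMBEDDING_DIM = 100
index = AnnoyIndex(EMBEDDING_DIM, "angular")
dimension_storage = {}  # simple in-memory stand-in for the dimension storage

# training_items is a hypothetical iterable of (item_id, text, dims) tuples.
for i, (item_id, text, dims) in enumerate(training_items):
    vector = model.get_sentence_vector(preprocess(text))
    index.add_item(i, vector)
    dimension_storage[i] = {"item_id": item_id, **dims}

index.build(50)  # number of trees
index.save("items.ann")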

How is the prediction process at this stage?

At prediction time, the vector representation of the item is used to query the previously populated Annoy index for the K nearest neighbors and their respective distances. This set of candidates is then reduced by a distance filter with a given threshold T: every candidate whose distance exceeds the threshold is discarded.

The candidates that survive the distance filter are taken as items similar to the query item. Their dimensions, retrieved from the dimension storage, are combined into a single estimate by taking the median of each dimension.
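A sketch of that prediction step, still using the hypothetical objects defined in the training sketch (the values of K and T are illustrative):

import statistics

K, T = 10, 0.35  # number of neighbors and distance threshold (illustrative)

def predict_dimensions(item_text: str):
    vector = model.get_sentence_vector(preprocess(item_text))
    ids, distances = index.get_nns_by_vector(vector, K, include_distances=True)

    # Keep only the neighbors closer than the threshold T.
    similar = [i for i, d in zip(ids, distances) if d <= T]
    if not similar:
        return None  # no confident estimate for this item

    # Median of each dimension over the similar items.
    return {
        dim: statistics.median(dimension_storage[i][dim] for i in similar)
        for dim in ("weight", "height", "width", "length")
    }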

Output

This is the set of predictions estimated by the model: a list of dimensions for each item (see the definition of y above).

Example

To illustrate how the model works, the prediction process performed for item MLA699827356 is shown below:

First, the model looks for similar items. The following are the items returned by the model after applying the similarity distance filter:

Then, the model retrieves the dimensions already known for each similar item from the dimension storage:

Finally, the median of each dimension is calculated to obtain the final estimate for the target item MLA699827356:

Serving model

The model is currently used on the product publication page and in the fulfillment-center packaging calculator, to forecast the occupancy of the shelves and to pick the right envelope at the final product packaging stage.

The API currently serves 20k predictions per minute, around 29M a day. The mean inference time of the model is 5 ms, and we serve it on an instance with 32 GB of RAM and 8 cores.

Here is the JSON response for item MLA699827356:

{
  "dimensions": {
    "height": 29.3,
    "length": 10,
    "weight": 2802.5,
    "width": 23.25
  },
  "source": {
    "identifier": "MLA699827356",
    "origin": "similarity"
  }
}

Acknowledgments
