Data Cleaning and Preprocessing for Recommender Systems based on NVIDIA’s Use Case

Benedikt Schifferer
NVIDIA Merlin
Dec 7, 2022

Have you ever heard that 80% of a data scientist’s work is data cleaning? That is probably correct, and data cleaning was really important for our use-case: if you have low-quality data as input, your model (output) will be low quality as well. After we covered the project goal and challenges in our first blog post, we want to share our learnings for cleaning and preprocessing the data in this one, including how to split the dataset into training and validation sets.

This blog is part of a series of posts in which we want to share our approach, challenges, decisions and learnings for developing a recommender system to personalize E-Mail campaigns. The content is based on our own internal use-case and we will provide a step-by-step guide on how to build recommender systems from scratch. Readers should be able to take the approach and apply it to their own problems. (Of course, we will not share any user information :) ).

Data Privacy: Anonymization

The first step was to anonymize all data sources by removing any personally identifiable information. Users are represented as unique IDs, which can be mapped back to real users only in the E-Mail campaign tool. This process guarantees that user information is secured.
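
A minimal sketch of this kind of anonymization is shown below, assuming the raw data lives in a pandas DataFrame with an illustrative email column; the lookup table that maps surrogate IDs back to users would stay inside the E-Mail campaign tool.

```python
import pandas as pd

def anonymize(df: pd.DataFrame, pii_col: str = "email"):
    """Replace a PII column with a surrogate integer ID.

    Returns the anonymized frame plus a lookup table that should only
    live inside the E-Mail campaign tool, never next to the training data.
    """
    lookup = (
        df[[pii_col]]
        .drop_duplicates()
        .reset_index(drop=True)
        .rename_axis("attendee_id")
        .reset_index()
    )
    anonymized = df.merge(lookup, on=pii_col).drop(columns=[pii_col])
    return anonymized, lookup
```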

How to evaluate a model: The Train-Validation Split

The next step is to decide how to split the dataset into training and validation datasets. There are many ways to split the dataset, such as randomly, randomly by user, or randomly by item. This is an important decision because it determines the evaluation process of the model. In the previous blog post, we explained that our use-case has an extreme cold-start problem: every item is new in the prediction period, and a significant number of users are new as well.

We want a realistic evaluation process which mirrors our production environment. We decided to split the data by time as visualized below: the training datasets are GTC Spring 2020 through GTC Fall 2021, and we evaluate the model on GTC Spring 2022. The trained model will generate predictions for GTC Fall 2022.

This evaluation process is closest to the production environment, where all items are new. The ratio of new users will be consistent for GTC Spring 2022 as well. It is important to split the dataset in a way that avoids data leakage: if we split the full dataset randomly instead, the training data would contain user behavior for items that also appear in the validation dataset.
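
A minimal sketch of such a time-based split, assuming the interactions are in a pandas DataFrame with an illustrative gtc_event column (the exact list of training events is also illustrative):

```python
import pandas as pd

# Events used for training and validation (the trained model later
# generates predictions for GTC Fall 2022).
TRAIN_EVENTS = ["GTC Spring 2020", "GTC Fall 2020", "GTC Spring 2021", "GTC Fall 2021"]
VALID_EVENT = "GTC Spring 2022"

def time_based_split(interactions: pd.DataFrame):
    """Split by event so the validation event is strictly later than all
    training events, which avoids leaking future behavior into training."""
    train = interactions[interactions["gtc_event"].isin(TRAIN_EVENTS)]
    valid = interactions[interactions["gtc_event"] == VALID_EVENT]
    return train, valid
```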

Let’s take a look at how data scientists spend most of their time: Data Cleaning

Normally, the data is not organized in a well-structured, single-table format. In reality, the data is split across multiple tables and sometimes across multiple systems. These systems are often designed for another purpose (e.g. enabling users to log in, see their watch history, or checkpoint the timestamps of watched videos) and are not designed to capture data for machine learning projects. Therefore, we as data scientists are grateful for having data and need to be patient with its quality :).

In our case, the data was organized in multiple tables as visualized below. One attendee can register for multiple GTCs and can attend multiple talks. Each talk can be attended by multiple attendees and is associated with only one GTC.

The first step is to explore the dataset and validate the correctness. We generated some statistics, such as:

  • Number of unique keys per table (e.g. unique attendees in attendee)
  • Number of duplicated keys per table
  • % of matches between two tables (e.g. can we link all gtc_talks_catalog keys in attendee_attendance to gtc_talks_catalog?)
  • Number of gtc_talks_catalog per GTC Event
  • Number of attendee_attendance per GTC Event
  • Number of watched GTC talks per attendee etc.

We collected the information and validated it with the related departments to ensure that the data is complete and correct. Otherwise, we could have spent months modeling based on wrong data. In these cross-checks, we noticed that we had to deduplicate some tables.
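
A minimal sketch of such cross-checks, assuming the tables are loaded as pandas DataFrames (the key and column names below are illustrative):

```python
import pandas as pd

def key_stats(df: pd.DataFrame, key: str) -> dict:
    """Number of rows, unique keys and duplicated keys for one table."""
    return {
        "rows": len(df),
        "unique_keys": df[key].nunique(),
        "duplicated_keys": int(df[key].duplicated().sum()),
    }

def match_rate(left: pd.DataFrame, right: pd.DataFrame, key: str) -> float:
    """% of keys in `left` that can be linked to `right`."""
    return left[key].isin(right[key]).mean() * 100

# Example usage (table and column names are assumptions):
# key_stats(attendee, "attendee_id")
# match_rate(attendee_attendance, gtc_talks_catalog, "talk_id")
# attendee_attendance.groupby("gtc_event")["talk_id"].nunique()  # watched talks per GTC event
```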

Feature Validation and Selection

So far, we have validated the associations between the tables. Now we can take a look at the actual attributes (features) in each table. The goal is to evaluate whether a feature can be used for modeling. For example, there were around 30 different attributes for an attendee, but we used only about 10 attributes in our model. Our process was (a code sketch follows the list):

  • Calculate the % of NaN per feature (how many rows are missing the feature)
  • Calculate the % of NaN per feature per GTC event
  • Is the feature a categorical attribute (e.g. country) or a numeric attribute (e.g. video length)?
  • For each categorical attribute:
      • What is the distribution of categories?
      • How many unique values are there?
  • For each numerical attribute:
      • What is the distribution?
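
As referenced above, a small sketch of these per-feature checks, assuming a pandas DataFrame with an illustrative gtc_event column:

```python
import pandas as pd

def feature_report(df: pd.DataFrame, event_col: str = "gtc_event") -> None:
    """Print NaN ratios and simple distributions for every feature."""
    # % of NaN per feature, overall and per GTC event
    print(df.isna().mean().mul(100).round(1).sort_values(ascending=False))
    print(df.isna().groupby(df[event_col]).mean().mul(100).round(1))

    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numerical attribute: inspect the distribution
            print(col, df[col].describe(), sep="\n")
        else:
            # Categorical attribute: unique values and their distribution
            print(col, df[col].nunique(), "unique values")
            print(df[col].value_counts(normalize=True).head(10))
```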

We can share some of our real-world examples:

  • Many attributes were empty (high % of NaN values)
  • Some attributes were collected in the past (e.g. in 2020) but are not collected anymore, as the registration forms were updated
  • Some attributes were duplicated: there were multiple columns for an attendee’s country

In this way, we could reduce the dataset size by exporting only the relevant attributes per table and could focus on the important information.

More Data Cleaning

There was a third loop of data validation. Even though an attribute seemed valid given the checks above, it could still require additional manual preprocessing. When we deep-dived into the attributes per GTC, we noticed that the categories changed between different GTC events: the registration forms were updated and the naming conventions were not consistent. The table below shows an example for job roles. In some cases, the values differ only by a space character; in others, a different word was used.

The job role example was easy to manage, as we use only ~20 different descriptions. It was more difficult for countries, with around 200 values. We calculated the frequency per value and sorted them to prioritize the most important ones.
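
A minimal sketch of how such inconsistent categories can be harmonized; the raw job-role values in the mapping are illustrative examples, not the real registration-form entries:

```python
import pandas as pd

# Hand-curated mapping from raw registration-form values to canonical ones.
# The raw values below are made-up examples of the two failure modes we saw:
# values that differ only by a space, and values that use different wording.
JOB_ROLE_MAP = {
    "Data Scientist ": "Data Scientist",    # extra trailing space
    "Data Scientist": "Data Scientist",
    "Developer / Engineer": "Developer",    # different wording across events
    "Software Developer": "Developer",
}

def harmonize(series: pd.Series, mapping: dict) -> pd.Series:
    """Map raw values to canonical categories; unmapped values are kept
    as-is so they can be reviewed later."""
    return series.map(mapping).fillna(series)

# Frequency per value helps prioritize which categories to fix first,
# e.g. for the ~200 country values:
# attendee["country"].value_counts()
```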

Don’t Forget Production Data

We cleaned all the data and developed our training pipeline ahead of our production event, GTC Fall 2022, and therefore did not have access to its data structure. Although we saw a trend of changing data between events, we focused on the modeling part and revisited the data validation process late in the project. We faced multiple issues:

  1. Some data is no longer captured in our prediction time period
  2. Some categorical values changed in the prediction time period

In the first case, there were only a few features we had to remove. This did not impact the model accuracy, but it required re-training and re-evaluating the model. The second case, however, can have a tremendous impact. Our data processing pipeline mapped all unknown categorical values to the value “Other”. Although the new values were relevant and meaningful, we lost that information due to the wrong assignment. Because the wrong assignment happened only in the test period, our evaluation metrics were not affected by it. The error did not break our pipeline and could have been deployed undetected to production. Luckily, we calculated the feature distributions for the training, validation and test datasets and saw a shift towards more “Other” values. After spotting the error, it was easy to add the correct mapping and fix it.
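
A small sketch of the kind of check that caught this issue: comparing the share of the fallback “Other” value between the training and test datasets (the function names and threshold are illustrative):

```python
import pandas as pd

def other_share(series: pd.Series, other_value: str = "Other") -> float:
    """Fraction of rows mapped to the fallback 'Other' category."""
    return float((series == other_value).mean())

def check_other_shift(train: pd.DataFrame, test: pd.DataFrame,
                      col: str, max_increase: float = 0.05) -> None:
    """Warn if the share of 'Other' grows noticeably from train to test,
    which hints at unknown categories being swallowed by the fallback."""
    train_share = other_share(train[col])
    test_share = other_share(test[col])
    if test_share - train_share > max_increase:
        print(f"WARNING: '{col}' maps {test_share:.1%} of test rows to 'Other' "
              f"vs. {train_share:.1%} in training; check the category mapping.")
```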

Key Learnings and Summary

In the second blog post in the series, we shared our learnings from data processing. A quick summary:

1. Train-Validation Split is important and should reflect the problem. A split by time is often a good starting point.

2. Validate your data: If you have low quality data as an input, your model (output) will be low quality, as well.

2a) Validate how to merge data by calculating cross checks

2b) Validate each feature by calculating cross checks and distributions

2c) Don’t forget your production data

Next, we will share the details of the model training with a two-tower architecture. If you want to stay up to date, follow us on medium.com or github.com.

In the meantime, you can check out our repositories for recommender systems. NVTabular is a library for GPU-accelerated data processing, Merlin Models is a library for common recommender system models/architectures, and Transformers4Rec is a dedicated library for transformer-based architectures for recommender systems.

Team

Thanks to the great team developing the in-house use-case: Angel Martinez, Pavel Klemenkov, Benedikt Schifferer



Benedikt Schifferer is a Deep Learning Engineer at NVIDIA working on recommender systems. Prior to that, he graduated with an MSc in Data Science from Columbia University.