Dealing with large vocabularies in Recommender Systems

Dinesh Ramasamy
3 min read · Aug 13, 2023

Most models use embedding tables to represent categorical features as vectors, and recommender systems are no different: they need these tables to convert categorical features into vectors. But recommender systems must operate on categorical features with very large vocabularies, such as user and item IDs. One of the best-known ways to handle such large vocabularies is to train models with equally large embedding tables. This inflates model size and demands significant effort to keep training and serving infrastructure working.

In this post, we describe how one can avoid large RecSys models without compromising the resulting model quality. We start by hashing IDs in an almost collision-free manner. Note that we may want all features to live in the same space, so we use the feature name as a seed to distinguish between two different features that take the same value:

import xxhash

def hash_feature(feature_value: str, feature_name: str) -> int:
    # The feature name seeds the hash so identical values from different features map to different IDs
    seed = xxhash.xxh32(feature_name, 0).intdigest()
    # Shift the unsigned 64-bit digest into signed int64 range
    return xxhash.xxh64(feature_value, seed).intdigest() - 2 ** 63
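
For example, the same raw value hashed under two different feature names lands on two different IDs (the feature names and value below are purely illustrative):

user_hash = hash_feature("12345", "user_id")
item_hash = hash_feature("12345", "item_id")
assert user_hash != item_hash  # same value, different features => different IDs (with overwhelming probability)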

We can then map these hashes to embeddings using separate embedding tables, or even a single shared embedding table (since the feature names have been used as seeds in the hash function), via techniques like QR, ROBE, etc. We prefer the following generalization of QR that uses bit-shift operations, loosely inspired by this follow-up paper from the authors of ROBE, which we call K-Shift embedding:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class KShiftEmbedding(nn.Module):
    def __init__(
        self,
        num_embeddings: int,
        emb_dim: int,
        num_shifts: int = 8,
        normalize_output: bool = True,
        sparse: bool = False,
    ):
        """
        :param num_embeddings: Number of rows in the embedding table
        :param emb_dim: dimensionality of the embedding table
        :param num_shifts: Number of random access ops (k-shift ops / bit shift ops) => 2 shifts is comparable to
            QR embedding
        :param normalize_output: whether to normalize the output embeddings
        :param sparse: whether to use sparse gradients or not
        """
        super().__init__()
        self.emb = nn.Embedding(num_embeddings, emb_dim, sparse=sparse)
        self._num_embeddings = num_embeddings
        self._num_shifts = num_shifts
        self._num_bits = 64
        self._normalize_output = normalize_output

    def forward(self, id_: torch.LongTensor) -> torch.Tensor:
        """
        :param id_: any long in -2 ^ 63 to 2 ^ 63 - 1
        :return: embedding corresponding to the long
        """
        tensors = []
        for col_idx in range(self._num_shifts):
            idx = self.get_row_idx(id_, col_idx)
            tensors.append(self.emb(idx))
        x = torch.stack(tensors, dim=-1).sum(dim=-1)
        if self._normalize_output:
            x = F.normalize(x, p=2.0, dim=-1)
        else:
            x = x / math.sqrt(self._num_shifts)
        return x

    # Some op to give pseudo-random multiple access to the embedding table
    def get_row_idx(self, x: torch.LongTensor, col_idx: int):
        if col_idx != 0:
            # closely mirrors bit rotation
            x = (x << col_idx) | (x >> (self._num_bits - col_idx))
        return torch.remainder(x, self._num_embeddings)

The embeddings derived this way can then be consumed by the rest of the model in the same form as before.
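
For illustration, here is how the two pieces above might fit together; the table size, dimension, and feature values are arbitrary choices for this sketch:

# One shared table for all ID features (the feature names already act as hash seeds)
shared_table = KShiftEmbedding(num_embeddings=1_000_000, emb_dim=64, num_shifts=8)

user_ids = torch.tensor([hash_feature(u, "user_id") for u in ["u_1", "u_2"]])
item_ids = torch.tensor([hash_feature(i, "item_id") for i in ["i_9", "i_7"]])

user_emb = shared_table(user_ids)  # shape (2, 64)
item_emb = shared_table(item_ids)  # shape (2, 64)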

We can also use this module to memorize hashed-ID-to-content-embedding mappings. This lets us keep an ID-to-content-embedding mapping inside each model, so that feature pipelines are not overloaded with content embeddings (which might change). More precisely, we learn a model in the following way:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# content_embedding_df, columns hashed_id, embedding
expansion_factor = 1.15
num_shifts = 16
batch_size = 2 ** 18
emb_dim = 32
epochs = 500

x = torch.from_numpy(content_embedding_df['hashed_id'].values).unsqueeze(1)
# float() ensures the targets match the model's float32 parameters
y = F.normalize(torch.from_numpy(np.stack(content_embedding_df['embedding'].values, axis=0)).float(), p=2.0, dim=-1)
dataset = torch.utils.data.TensorDataset(x, y)
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
)

model = nn.Sequential(
    KShiftEmbedding(
        int(expansion_factor * x.size(0)),
        emb_dim,
        num_shifts=num_shifts,
        normalize_output=True,
    ),
    nn.Linear(emb_dim, y.size(1), bias=False),
)
optim = torch.optim.Adagrad(
    model.parameters(),
    lr=5e-1,
)
criterion = nn.MSELoss()

for epoch in range(epochs):
    for input_this, target_this in dataloader:
        pred_this = F.normalize(model(input_this).squeeze(1), p=2.0, dim=-1)
        loss = criterion(pred_this, target_this)
        optim.zero_grad()
        loss.backward()
        optim.step()
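
At serving time, the frozen model then acts as a lookup table: feeding a hashed ID through it recovers an approximation of that ID's content embedding. A minimal sketch, assuming the hashed_id column was produced by hash_feature with a feature name of, say, "content_id":

model.eval()
with torch.no_grad():
    hashed = torch.tensor([[hash_feature("some_content_id", "content_id")]])  # placeholder ID
    recovered = F.normalize(model(hashed).squeeze(1), p=2.0, dim=-1)  # approximate content embedding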

It's worth noting that we can “compress” embedding tables by setting expansion_factor < 1, though we have not studied the space-savings-versus-accuracy tradeoff carefully. We can also compress in the PCA sense by setting emb_dim < y.size(1), which we also leave for further study.
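
As a sketch, both knobs are just different arguments to the same construction (the specific numbers below are arbitrary):

compressed_model = nn.Sequential(
    KShiftEmbedding(
        int(0.5 * x.size(0)),  # expansion_factor < 1: fewer table rows than IDs
        16,                    # emb_dim < y.size(1): compression in the PCA sense
        num_shifts=num_shifts,
        normalize_output=True,
    ),
    nn.Linear(16, y.size(1), bias=False),
)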

We can also cascade an MLP on top of a lower-dimensional KShiftEmbedding (say 4-dim) to learn out-of-vocabulary (OOV) mappings, using random IDs as negatives.
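
One way to realize this (our reading of the idea; the hidden sizes, reuse of x / expansion_factor / num_shifts from above, and the negative-sampling range are assumptions) is to train a small binary head that separates known hashed IDs from randomly drawn IDs:

# Hypothetical sketch: a 4-dim K-Shift embedding feeding a small MLP that scores
# whether an ID looks "known" (positive) or random / OOV (negative)
oov_head = nn.Sequential(
    KShiftEmbedding(int(expansion_factor * x.size(0)), 4, num_shifts=num_shifts),
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
bce = nn.BCEWithLogitsLoss()

known_ids = x  # hashed IDs seen during training (positives)
random_ids = torch.randint(-2 ** 62, 2 ** 62 - 1, known_ids.shape)  # random IDs as negatives

logits = torch.cat([oov_head(known_ids), oov_head(random_ids)], dim=0).squeeze(-1)
labels = torch.cat([torch.ones(known_ids.size(0), 1), torch.zeros(random_ids.size(0), 1)], dim=0)
loss = bce(logits, labels)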
