Group-by data augmentation for e-commerce datasets

Introducing a new training strategy for learning user preferences to multi-attribute offers based on a multi-task approach

Artem Kozhevnikov
Tinyclues Vision
15 min read · Mar 24, 2023

--

Making sense of various groups of offer attributes

Code is available at https://github.com/tinyclues/group-by-augmentations-model

Motivation

In tinyclues' SaaS business context, we want to model a user's propensity to buy a given offer in the presence of its multiple offer attributes. In our previous Medium post, we explained that a core functionality of the tinyclues platform is to provide marketers with an extremely flexible ML tool for creating targeted marketing campaigns.

In particular, the definition of an “offer” is very broad and may include different product categories, attributes, and transaction contexts. It can even work in a case where no product catalog exists at all.

To solve this challenge, we considered several machine-learning approaches. In this post, we'll dive deep into a complete implementation of one of the previously discussed ML designs. Mainly, we'll talk about the offer aggregation strategy internal to the model. We'll also share some insights about model architecture and discuss various results. So, let's start!

Main idea

Our guiding intuition is to represent an offer as a bag of features composing it. What does this mean concretely? Take an offer defined as a SQL filter over a transactions (user-offer interaction) table. So, when we apply that filter, we get all corresponding offer rows (instances). Next, for a given set of offer features, we capture the distribution of each offer feature (independently of others).

In e-commerce datasets, offer features are typically categorical and encoded as 1-hot labels. Thus, the resulting bags will be weighted multi-hot features encoded as sparse vectors. Note that we may optionally modify the relative row weights by giving more importance to recent events. At inference time, a trained model receives those multi-hot bags-of-attributes features as input (along with other features) to generate a prediction for a selected offer.
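To make this concrete, here is a minimal sketch (with a hypothetical pandas transactions table and made-up attribute values) of how an offer filter is turned into weighted bags, one per attribute:

import pandas as pd

# Hypothetical transactions table with a few offer attributes
transactions = pd.DataFrame({
    "pid":   ["942", "661", "942", "017", "661"],
    "brand": ["acme", "acme", "acme", "zeta", "acme"],
    "size":  ["M", "L", "M", "S", "L"],
})

# Offer defined as a SQL-like filter, e.g. pid = "942" OR pid = "661"
offer_rows = transactions[transactions["pid"].isin(["942", "661"])]

# Bag representation: one (weighted) value distribution per offer attribute
bags = {
    col: offer_rows[col].value_counts(normalize=True).to_dict()
    for col in ["pid", "brand", "size"]
}
# {'pid': {'942': 0.5, '661': 0.5}, 'brand': {'acme': 1.0}, 'size': {'M': 0.5, 'L': 0.5}}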

However, it turns out that this is not enough! We need to modify the training scheme as well to get accurate predictions when using this offer aggregation (otherwise, model performance may drop a lot). Now we'll explain precisely how to do that!

left: A dummy example of offer attributes in a transactional table || right: Bag representations of some offers generated from that table. (Offer definition attributes are shown in red. Here, “pid” is short for “product_id”.)

Data augmentation ingredients

With the main ideas in mind, our logic becomes quite straightforward: we want to mimic the inference-time feature transformation during model training as well. Namely, we should replace the original non-aggregated offer features with their bag-aggregated analogs in all training batches. In machine learning literature, this process is called data augmentation, and we'll now take a closer look at it.

Results of Group-By (with “key” column) on categorical features (”brand” and “size”, which are label-encoded) and dense features (”emb”), followed by mean aggregation

Mini-batch approximation of offer aggregation

Note that at inference time, we can compute offer bag statistics from the full (or sufficiently well-sampled) transaction data. Yet, for neural network training, a mini-batch-based approximation is convenient for building augmentations on the fly.

Fortunately, for the training of our models, we already used a batch size that is large enough (~10k), and we did not notice any major drawbacks of using the mini-batch stats in our experiments (even if, in theory, such an approximation adds more randomness and may not be accurate for rare offers).

OK, how do we generate a mini-batch augmentation? Let's pick some random offer attributes (one or many), say brand and size. The obtained tuple (brand, size) is called a key, and the component performing this choice is called KeyGenerator. Next, for any row i ∈ MiniBatch, we replace the value of an offer feature F with the bag of values taken from all mini-batch rows sharing the same key. The resulting feature Grp(F) will be of variable length if F is categorical. Since this is similar to a common SQL operation, we'll also call it group-by:

For a given mini-batch, we generate nb_augmentations (typically 5–10) group-by augmented batches.

# input: batch (dict: feature_name -> tensor)
# params: offer_features, average_key_length, nb_augmentations
# output: [batch_a1, batch_a2, ..., batch_an] (augmented batches)

import tensorflow as tf

def KeyGenerator(batch, offer_features, average_key_length):
    """
    Generate group-by key from a random subset of offer attributes
    """
    ...

def GroupBy(feature_tensor, group_by_key):  # in TF 2.*
    unique_values, unique_idx = tf.unique(group_by_key)  # unique keys and their positions
    grouped_ragged = tf.ragged.stack_dynamic_partitions(
        feature_tensor, unique_idx, tf.size(unique_values))  # one ragged row per group
    return tf.gather(grouped_ragged, unique_idx)  # broadcast to batch size -> a ragged tensor of shape bs x None

augmented_batches = []
for i in range(nb_augmentations):
    augmented_batch = batch.copy()
    group_key = KeyGenerator(batch, offer_features, average_key_length)  # shape (bs,)
    for feature in offer_features:  # we don't modify other features
        augmented_batch[f'{feature}_grp'] = GroupBy(batch[feature], group_key)
    augmented_batches.append(augmented_batch)

KeyGenerator

So, what about the behavior of KeyGenerator? Many possible key exploration strategies could be used. For instance, it is reasonable to select with higher probability the offer keys that are more frequently used by our clients on the platform. Yet, we went with a very simple and usage-agnostic approach: each offer feature is selected randomly and independently of the others with a fixed probability proba = average_key_length / number_of_offer_features, which makes the key length binomially distributed. We typically set average_key_length ~ 2. Once the key features are selected, we define a “key” column as the tuple of the key feature values (or a hash of them), and then we perform the group-by of offer features based on it. If an offer feature is multi-hot, we apply MinHash to it (with a random seed for each batch) to get back to the 1-hot case.
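For illustration, here is one possible way to fill in the KeyGenerator stub from the snippet above; a sketch assuming each offer feature is a label-encoded 1-D integer tensor of shape (bs,), with the MinHash step for multi-hot features left out:

import random
import tensorflow as tf

def KeyGenerator(batch, offer_features, average_key_length):
    """Pick a random subset of offer features and hash their values into a group-by key."""
    proba = average_key_length / len(offer_features)
    # binomial selection: each feature is kept independently with probability `proba`
    kept = [f for f in offer_features if random.random() < proba] or [random.choice(offer_features)]
    # combine the selected columns into one string key per row, then hash it
    parts = [tf.strings.as_string(batch[f]) for f in kept]
    key_str = tf.strings.reduce_join(tf.stack(parts, axis=1), axis=1, separator="|")
    return tf.strings.to_hash_bucket_fast(key_str, num_buckets=2**31 - 1)  # shape (bs,)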

Offer mixture

A basic KeyGenerator as above corresponds to the AND conjunction of several attributes in the offer SQL filter. To support the OR disjunction case (as in pid=”942” OR pid=”661” in the example above), we need to perform an extra offer mixture. There are many ways to do so, but we consider here a simple method that preserves the mini-batch logic. In particular, we split the biggest groups obtained above into smaller sub-groups, which we then collide randomly to create a new key column. We do it in such a way that collided groups contain from 2 to 6 “pure” sub-groups. That simulates sparse mixtures of various offers with random weights. In the process, we apply those collisions only to a random fraction (~50%) of the generated keys.

Offer mixture illustration: original groups are randomly split into smaller sub-groups, and then randomly merged to form new collided groups (colors match different groups)

Note that it is very similar to generating a random linear combination of sample representations in classical semi-supervised methods such as MixUp.
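As an illustration, here is a rough numpy sketch of such a collision step; the 2–6 range and the ~50% collision fraction follow the description above, while everything else (function name, integer key encoding, the max_group_size default) is ours:

import numpy as np

def mix_offers(group_key, max_group_size=64, collide_frac=0.5, rng=np.random):
    """Split large groups into sub-groups, then randomly merge sub-groups into mixed groups."""
    group_key = np.asarray(group_key)
    sub_key = np.empty(len(group_key), dtype=np.int64)
    next_id = 0
    # 1) split: assign each row of a big group to one of its random sub-groups
    for g in np.unique(group_key):
        idx = np.where(group_key == g)[0]
        n_chunks = max(1, len(idx) // max_group_size)
        sub_key[idx] = next_id + rng.randint(0, n_chunks, size=len(idx))
        next_id += n_chunks
    # 2) collide: merge a random ~50% of the sub-groups into groups of 2 to 6 "pure" sub-groups
    sub_ids = np.arange(next_id)
    rng.shuffle(sub_ids)
    mapping = {s: s for s in sub_ids}
    i, n_collided = 0, int(collide_frac * next_id)
    while i < n_collided:
        size = rng.randint(2, 7)
        for s in sub_ids[i:i + size]:
            mapping[s] = -(i + 1)  # negative ids mark collided (mixed) groups
        i += size
    return np.array([mapping[s] for s in sub_key])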

Multi-task

Now that we know how to modify batches to represent any offer, we feed those batches into the original model. Of course, one should also ensure that the model handles ragged inputs correctly. The loss per step can be defined simply as the sum of the respective losses. In fact, we may think of this as a multi-task learning process where each choice of grouping key corresponds to a different task. That's an important point since it opens the door to various multi-task optimization methods.

loss = 0
for i in range(nb_augmentations):
    output_i = Model(augmented_batches[i])  # forward pass on the i-th augmented batch
    loss += LossFn(output_i, response)      # sum of the per-task losses

Note that it is worth creating several augmentations from one batch of data. It allows averaging the gradients of the different tasks at each learning step, making training more stable. It also speeds up the training itself. Indeed, grouped-by batches can be processed in parallel by the model, while user feature embeddings (or other model nodes) can be shared across all augmentations.

Group-by with fused aggregation

In practice, we perform the group-by aggregation in a more optimized way. We first apply an embedding layer on the offer feature inputs (only once per batch) and then perform the data augmentation fused with a vectorial aggregation (like the mean of the “emb” column in the example above) in a single operation on the offer embeddings. By doing so, we can share the offer embedding computation across different batch augmentations and thus be more memory efficient.
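A minimal sketch of this fused operation (the function name is ours; the embeddings are assumed to have been computed once for the whole batch):

import tensorflow as tf

def grouped_mean(embeddings, group_key):
    """Fused group-by + mean: average embeddings within each group and broadcast
    the result back to every row of the batch (output shape: bs x dim)."""
    unique_keys, segment_ids = tf.unique(group_key)
    group_means = tf.math.unsorted_segment_mean(
        embeddings, segment_ids, num_segments=tf.size(unique_keys))
    return tf.gather(group_means, segment_ids)

# usage (names are illustrative): the embedding lookup is done once per batch,
# then grouped_mean is applied for each of the nb_augmentations group keys
# brand_emb = embedding_layer(batch["brand"])                  # bs x dim
# augmented_batch["brand_grp"] = grouped_mean(brand_emb, group_key)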

Of course, this trick may not work for more complex aggregations. Technically speaking, thanks to the on-the-fly group-by with fused aggregation, we can implement everything inside the model itself without modifying our training data loaders at all. So, let's now look at what happens to the model!

Elements of model design

Group-by data augmentation can be combined with many possible ML/DL models, and the most noticeable performance improvement will come from the group-by itself. However, choosing the right model can also be very important. We thus want to share some principles we considered for our model architecture choice. From our experience, the latter is typically responsible for around 2%–6% of relative metric improvement compared to a simple two-tower model (the notebooks provide some ablation tests).

Mean and variance extraction for internal feature importance

To capture the main offer semantics, it is natural to extract a mean μ of embeddings within groups. Yet besides μ, we also extract the embeddings’ intra-group variance σ. On many datasets, we observed clear benefits from using σ in our models, and the intuition behind it is the following: higher variance σ indicates the potential presence of noisy components among offer embeddings.

Therefore, thanks to σ, the model can reduce their relative importance (one may think of a signal-over-noise formula like μ / √(1 + σ)), whereas with only μ, the model would have to learn to sum useless offer components to zero (a neutral vector), which is much harder and makes the model less robust when generalizing to unseen offers.

For instance, to represent a given brand, a model would rather rely on the brand feature embedding (whose σ is zero) and may completely ignore the component of size embeddings (whose σ is typically high because there is a variety of sizes within any brand).

Instead of fixing a precise formula for how σ acts on μ, we decided to learn it. Hence, we introduced an additional sub-network, MaskNet, which acts on μ as a pointwise multiplication, i.e., μ ⊙ MaskNet(μ, σ).

We parametrize MaskNet as a sequence of non-linear dense layers (DenseNetwork or DN) with some sparse (k-WTA, i.e., k-winners-take-all) connections suitable for multi-task learning.
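Here is a simplified sketch of these two pieces together, the grouped μ/σ extraction and the learned mask; the layer sizes, the sigmoid gate, and the plain dense layers (instead of sparse k-WTA connections) are our simplifications:

import tensorflow as tf

def grouped_mean_and_variance(emb, group_key):
    """Per-group mean (mu) and variance (sigma) of offer embeddings, broadcast back to the batch."""
    unique_keys, segment_ids = tf.unique(group_key)
    n = tf.size(unique_keys)
    mu = tf.math.unsorted_segment_mean(emb, segment_ids, n)
    var = tf.math.unsorted_segment_mean(tf.square(emb), segment_ids, n) - tf.square(mu)
    return tf.gather(mu, segment_ids), tf.gather(var, segment_ids)

class MaskNet(tf.keras.layers.Layer):
    """Learned gate conditioned on (mu, sigma), acting on mu as a pointwise multiplication."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden, activation="gelu"),
            tf.keras.layers.Dense(dim, activation="sigmoid"),
        ])

    def call(self, mu, sigma):
        mask = self.net(tf.concat([mu, sigma], axis=-1))
        return mu * mask  # pointwise multiplication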

Note that one can also extract other statistics from the group embeddings, as long as they are consistent, i.e., converge to the statistics over the entire dataset as the batch size grows. And, to follow a more modern idea, one could use a self-attention-type model as an aggregator instead of deterministically extracted statistics.

Compressed feature-wise interactions

Another important principle we identified in our work: each offer (and user) feature should possess its own single embedding space. This is important for sharing the maximum amount of information while reducing the number of model parameters.

However, this comes at the price of a more complex interaction module. For instance, it can be harmful to go with a simple sum of different feature embeddings (like size and brand). Indeed, suppose the size feature provides a very strong interaction. In that case, the mixed embedding space will be completely driven by size semantics and can be too constraining for a brand representation.

This reasoning naturally brings us to a feature-wise bi-linear interaction model which captures all individual user-offer feature interactions and allows the coexistence of different embedding spaces. Mainly, for each pair of a user feature embedding u_i and an offer feature embedding o_j, we extract a scalar interaction u_iᵀ K_ij o_j, where the kernels K_ij are learned (one per user-offer feature pair).

This is an adequate choice when there are few features (and thus few pairs). But it becomes impractical (due to the higher number of parameters and expensive computation) when the number of offer and user features grows. To solve this issue, we suggest the following tradeoff: before computing the feature-wise interactions, we combine some offer features into “meta”-features.

Thus, this piece of the model reads (in einsum tensor notation) as meta = einsum('bom,bod->bmd', ℂ, μ), where b stands for batch, o for offer features, m for meta offer features, and d for embedding dimensions, respectively. The feature compression matrix ℂ := ℂ(μ, σ) is (instance-wise) parametrized by a DN, and MaskNet now acts directly on the meta-features.

We perform a similar meta-compression on user features as well.

We argue that this feature compression is a valid choice for groups of hierarchical features (like ‘cat1’, ‘cat2’, and ‘pid’, which share compatible semantics), and since the usual number of hierarchical groups within offer features is rather low, we keep the number of meta-features rather low as well (≤ 5). In some situations, it could be useful to require a sparser structure from ℂ (for a stronger feature selection effect).
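For concreteness, here is a hedged sketch of the compression step; the einsum indices follow the b/o/m/d convention above, while the DN that parametrizes ℂ (its layers and the softmax over offer features) is an assumption of ours:

import tensorflow as tf

class OfferCompression(tf.keras.layers.Layer):
    """Compress per-feature offer statistics (b, o, d) into meta-features (b, m, d)."""
    def __init__(self, n_offer_features, n_meta=5, hidden=64):
        super().__init__()
        # DN producing the instance-wise compression matrix C(mu, sigma) of shape (b, o, m)
        self.compression_dn = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden, activation="gelu"),
            tf.keras.layers.Dense(n_offer_features * n_meta),
            tf.keras.layers.Reshape((n_offer_features, n_meta)),
            tf.keras.layers.Softmax(axis=1),  # softly assign each offer feature to meta-features
        ])

    def call(self, mu, sigma):
        # mu, sigma: (b, o, d) grouped statistics of the offer feature embeddings
        b = tf.shape(mu)[0]
        stats = tf.concat([tf.reshape(mu, (b, -1)), tf.reshape(sigma, (b, -1))], axis=-1)
        C = self.compression_dn(stats)           # (b, o, m)
        return tf.einsum('bom,bod->bmd', C, mu)  # meta offer features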

We finally apply a DN again (with gelu or tanh non-linear activation) over the extracted meta-feature interactions to get a scalar output. For more details (technical implementation, model hyper-parameter choices, evaluations, and so on), we invite the reader to look at these notebooks, where we reproduce our models on the publicly available Movielens and Rees datasets.

Training time

Thanks to fused aggregation and a suitable choice of training parameters (we typically double the number of epochs for group-by training), we found that the overall training time increase due to group-by augmentation is rather acceptable (≤ 3x in the worst-case scenario), given the large number of offer attribute combinations the model learns.

Results

We'll present several results of our model with group-by data augmentation on internally available datasets. Group-by models are now used in production on the majority of our datasets (≥ 100 of them), but due to lack of space, we'll only report the most typical and illustrative cases here. We also provide the notebooks that allowed us to reproduce these results on two public datasets.

Evaluation

Our cross-validation protocol consists of training the models on historical events (around one year of data) and evaluating performance over the two weeks following the training period. Let's now define the metric we look at. For a given grouping key (say ”brand” above), we focus first on the corresponding most popular offers (i.e., queries like (”brand” = X)). For each such offer, we evaluate the model's AUC on the offer binary classification task (1 if the event belongs to that offer and 0 otherwise).

Finally, we average the computed AUCs (weighted by offer frequency) to get wAUC, the metric we'll follow. We report wAUC for a subset of available offer keys. While looking at the top offers is more appropriate for measuring the group-by effect (since all groups will contain many elements), for cold-start benchmarks we take less frequent offers whose number of occurrences is ~20–200 in the evaluation window.
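For reference, a small sketch of the wAUC computation (the data layout, a score array per offer and an array of true offer ids per event, is an assumption; sklearn is used for the per-offer AUC):

import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_auc(scores_per_offer, event_offer_ids, top_offers):
    """Average per-offer AUCs (offer vs. rest), weighted by offer frequency."""
    aucs, weights = [], []
    for offer in top_offers:
        y_true = (event_offer_ids == offer).astype(int)    # 1 if the event belongs to that offer
        aucs.append(roc_auc_score(y_true, scores_per_offer[offer]))
        weights.append(y_true.sum())                       # weight = offer frequency
    return np.average(aucs, weights=weights)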

Multi-task model vs. specialized mono-task models

To challenge our multi-task approach, for each chosen offer attribute we train a mono-task model restricted to the corresponding offer feature. These specialized models are trained with no augmentation but share the same architecture (except for σ, which is zero in that case).

Here we report the models' wAUC performance for various offer attributes (taking offers with ≥ 200 occurrences):

left: Retail A (Classical retail with atypical size feature) || right: Retail B (Classical retail with strong geographical shop_id feature)
left: Travel (Travel industry standard case) || right: public Rees retail dataset

Let's take a closer look at the results. We note several important observations:

  • First of all, as expected, mono-task models show strong results when evaluated on their corresponding task but typically underperform on the others.

  • The GROUP_BY model provides decent results across ALL tasks, with only a few cases where the metric is slightly lower than that of the specialized counterpart. Even if it is not the winner everywhere, the group-by model shows clear multi-task capabilities!

  • A model that applies the offer group-by only at inference time, after classical training (without augmentations), shows a significant performance drop on most tasks.

Further analysis and future work

We see that group-by trained models perform very well compared with specialized models, and that’s precisely the main effect we wanted to create! But beyond that, we’ll show you below some other interesting results that came along the way.

Cold start scenario

When looking at the wAUC of rare offers (with between 20 and 200 occurrences), we see that our multi-task group-by model starts to show even better results than the specialized mono-task models, which shows that group-by models can effectively leverage the information of all available offer features in the cold-start scenario. Note also that the group-by model leverages it much better than the model trained without augmentations, which has access to all offer features as well.

Cold start results. left : Retail A || center : Retail B || right : public Rees retail dataset

Offer mixture experiments

To validate the importance of offer mixture in group-by data augmentation, we generate artificial “mixed” offers by randomly sampling a λ fraction of rows from one offer and a (1 − λ) fraction from a second offer (λ = 0.5 corresponds to an OR query between the two offers).
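A small sketch of how such a λ-mixed offer can be built (pandas event tables per offer and the sampling details are our assumptions):

import pandas as pd

def mixed_offer(events_a, events_b, lam, n_events=1000, seed=0):
    """Sample a lam fraction of rows from offer A and a (1 - lam) fraction from offer B."""
    n_a = int(lam * n_events)
    sample_a = events_a.sample(n=n_a, replace=True, random_state=seed)
    sample_b = events_b.sample(n=n_events - n_a, replace=True, random_state=seed)
    return pd.concat([sample_a, sample_b], ignore_index=True)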

We vary λ ∈ [0, 1] and follow the AUC on the λ-mixed offer for two models: one trained with only pure groups and another with the extra group mixture. We note that for the extreme λ values, both models perform similarly. However, the collision-trained model shows much better results in the middle; and the more “dissimilar” the two offers are, the bigger the gap one should expect.

On the plot below, we show some typical behavior:

left: AUC on the mixture (with parameter lambda) of two offers for two models: one trained with key collisions (AND_OR) and the other without (only_AND) || right: Same for another choice of offers

Negative transfer

We see that sometimes learning some offer attributes may result in a performance drop for others (we then speak of conflicting attributes, like shop_id and product_id on dataset B). This phenomenon, called negative transfer, is well known and actively studied in the multi-task world.

To mitigate it, a variety of techniques can be applied. Here is a list of several ideas for future work :

  1. A guided (aka curriculum learning) keys exploration;
  2. Exploration that avoids taking the same or similar keys but rather tries to create complementary offer keys (based on their mutual information) for each batch;
  3. Focal loss (we applied it with success on datasets with a large disparity in offer feature interaction strength);
  4. Gradient surgery for conflicting tasks and more multi-task-friendly model architectures.

To illustrate the idea of curriculum learning, we observe in the example below that during training the “shop_id” attribute (which is easy to learn thanks to the “zipcode” user feature) reaches its top wAUC quickly, whilst the item-dependent offer attributes take more epochs. This is a rather common pattern, and therefore it is worth making KeyGenerator focus more on the “hard” keys after the first epochs.

wAUC evolution during training epochs on dataset B
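As a hint of what such guided exploration could look like, here is a hypothetical re-weighting of the key sampling probabilities based on per-key validation wAUC (an illustration of the idea only; the function and its parameters are ours):

import numpy as np

def key_sampling_weights(wauc_per_key, temperature=0.1):
    """Sample "hard" keys (low wAUC) more often once the easy ones have converged."""
    keys = list(wauc_per_key)
    hardness = 1.0 - np.array([wauc_per_key[k] for k in keys])  # lower wAUC -> harder key
    weights = np.exp(hardness / temperature)                    # softmax over hardness
    return dict(zip(keys, weights / weights.sum()))

# e.g. key_sampling_weights({"shop_id": 0.93, "brand": 0.78, "pid": 0.71})
# -> "pid" and "brand" get sampled more often than "shop_id" in later epochs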

Feature representation quality

Another noticeable effect of group-by training is the improved quality of both user and offer embeddings. We observed better performance when user embeddings were transferred to other downstream tasks. Thus, group-by data augmentation can be used as an efficient pre-training strategy or as an auxiliary task for semi-supervised training.

Group-by trained offer embeddings (more precisely, offer meta-features) used for (cosine-)similarity search also showed more relevant results. Moreover, the overall model diversity (meaning the capacity to select different audiences for different offers while keeping the same prediction accuracy) increased as well (especially for lower-entropy categories such as cat1, cat2, or brand).

Offer meta features CosSimilarity for top-40 directors on Movielens dataset. Comparison of two models. left : group-by model || right : model without group-by

The main intuition is as follows: group-by training forces the model to make sense of all offer embeddings individually and in combination, whereas with no data augmentation, the lower-entropy features serve mainly as compensation for the poorly sampled higher-entropy ones. Note that, to some extent, a similar effect is reached with classical feature masking/dropping data augmentation (and group-by can be successfully combined with those methods!).

Conclusion

To wrap up, we've suggested a new data augmentation strategy called group-by. This strategy allows us to learn user preferences in a multi-task fashion from all combinations of offer attributes, producing high-capacity models. And that's with acceptable negative transfer and a limited increase in training time! Beyond that, group-by-trained models prove to be powerful for improving embedding quality and tackling the cold-start problem.

Last but not least, we believe that group-by data augmentation can find other applications for tabular datasets and is not limited to learning only offer-user interactions.
