My main takeaways from ICLR 2024

Alain Rakotomamonjy · Published in Criteo Tech Blog · May 23, 2024

I recently had the opportunity to attend the International Conference on Learning Representations (ICLR) in Vienna, Austria. ICLR is the premier gathering of researchers dedicated to the advancement of representation learning, also known as deep learning. The conference, hosted in the beautiful city of Vienna, showcased the latest breakthroughs in many exciting and rapidly evolving subfields, including Large Language Models (LLMs), diffusion models, federated learning, and reinforcement learning.

The first trend I noted is the large effort devoted to LLMs: evaluating them, fine-tuning them, and improving their memory consumption and time complexity. Another very hot topic is diffusion models, which are the state-of-the-art methods for generative AI, especially for text-to-image generation. Several works focused on extending their applicability to other data structures and other types of problems, and, more interestingly, several theoretical works tried to improve our understanding of these models.

Main Highlights

Among all the paper presentations I attended, I would like to highlight the following ones, two of which received the Outstanding Paper and Honorable Mention best-paper awards:

Model Tells You What to Discard: Adaptive Key-Value Cache Compression for LLMs

https://openreview.net/forum?id=uNrFpDPMyo

The paper introduces adaptive KV cache compression, a method that reduces the memory footprint of generative inference for Large Language Models (LLMs). The conventional KV cache retains key and value vectors for all context tokens; this method instead runs targeted profiling to discern the intrinsic structure of the attention modules. Because the lightweight attention profiling guides the construction of the adaptive KV cache, the method can be deployed without resource-intensive fine-tuning or re-training. In experiments, the method, called FastGen, demonstrates a substantial reduction in GPU memory consumption with negligible loss in generation quality.
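To make the idea concrete, here is a minimal PyTorch sketch of one of the eviction policies FastGen profiles for: keep the most recent tokens plus the historical "heavy hitters" that accumulate the most attention, and drop the rest. The function name and the single-policy setup are my own simplifications, not the authors' code, which selects a different policy per attention head.

```python
import torch

def compress_kv_cache(keys, values, attn_scores, recent_window=32, keep_ratio=0.3):
    """Sketch of adaptive KV cache compression in the spirit of FastGen.

    For one attention head, keep (a) the most recent tokens and
    (b) the older tokens that received the most attention mass,
    discarding the rest. This illustrates only the
    'recent + heavy-hitter' policy; the paper profiles each head
    and picks a head-specific policy.
    """
    seq_len = keys.shape[0]
    if seq_len <= recent_window:
        return keys, values  # nothing to compress yet

    # Always keep the most recent tokens (local attention structure).
    recent_idx = torch.arange(seq_len - recent_window, seq_len)

    # Among older tokens, keep the heaviest hitters by cumulative attention.
    older_scores = attn_scores[: seq_len - recent_window]
    n_keep = max(1, int(keep_ratio * older_scores.numel()))
    heavy_idx = torch.topk(older_scores, n_keep).indices

    keep_idx = torch.cat([heavy_idx.sort().values, recent_idx])
    return keys[keep_idx], values[keep_idx]

# Toy usage: 128 cached tokens with 64-dimensional heads.
keys, values = torch.randn(128, 64), torch.randn(128, 64)
attn = torch.rand(128)  # cumulative attention each cached token received
k_small, v_small = compress_kv_cache(keys, values, attn)
print(k_small.shape)  # far fewer than 128 entries retained
```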

This paper can be of strong interest to Criteo when serving LLMs on premises, as reducing the memory footprint of generative inference directly cuts GPU memory consumption, at a negligible cost in generation quality.

LabelDP-Pro: Learning with Label Differential Privacy via Projections

https://openreview.net/forum?id=JnYaF3vv3G

Label differential privacy is one of the key mechanisms envisioned by the Google Chrome team for introducing privacy into computational advertising systems. It will require new approaches to training our models, which is precisely the subject of this interesting paper. For that reason, and because the paper comes from Google Research, it is well worth a look.

The paper introduces a new approach to label differentially private (label DP) algorithms, which aim to protect the privacy of the labels in a training dataset. Unlike previous label DP algorithms that rely on label randomization, this algorithm takes advantage of the central model of differential privacy. Privacy is achieved by adding noise to the gradient of the loss function, and the key contribution is to denoise this gradient by projecting it onto a low-dimensional subspace that provably contains the true gradient. The algorithm interleaves these gradient projections with private stochastic gradient descent steps, improving the utility of the trained model while keeping the labels private. Overall, this research presents a promising new direction for label DP training algorithms.
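Below is a minimal NumPy sketch of my reading of the denoising idea. Since only the labels are private, per-example gradients computed under every candidate label are public; the true gradient lies in their span, so projecting the noisy DP-SGD gradient onto that subspace strips away much of the injected noise. All names and details here are illustrative, not the authors' implementation.

```python
import numpy as np

def labeldp_projected_gradient(per_example_grad, x_batch, y_batch,
                               num_classes, clip=1.0, noise_mult=1.0,
                               rng=None):
    """Sketch of a LabelDP-Pro-style denoised gradient step.

    `per_example_grad(x, y)` returns the gradient of the loss on one
    example; it is a placeholder for whatever model is being trained.
    """
    rng = rng or np.random.default_rng(0)

    # Standard DP-SGD: clip each per-example gradient, sum, add noise.
    g = np.zeros_like(per_example_grad(x_batch[0], y_batch[0]))
    for x, y in zip(x_batch, y_batch):
        gi = per_example_grad(x, y)
        g += gi * min(1.0, clip / (np.linalg.norm(gi) + 1e-12))
    g_noisy = g + rng.normal(0.0, noise_mult * clip, size=g.shape)

    # Public subspace: per-example gradients under all candidate labels.
    # The true (clipped) gradient is a sum of such vectors, so it lies
    # in their span even though the true labels stay private.
    basis = np.stack([per_example_grad(x, c)
                      for x in x_batch for c in range(num_classes)])

    # Least-squares projection of the noisy gradient onto span(basis).
    coeffs, *_ = np.linalg.lstsq(basis.T, g_noisy, rcond=None)
    return basis.T @ coeffs  # denoised gradient for the SGD update
```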

Generalization in diffusion models arises from geometry-adaptive harmonic representations

https://openreview.net/forum?id=ANvmVS2Yr0

The last paper I would like to highlight is about diffusion models, and more specifically about understanding how they work and what their inductive biases are. What I really like about this work is that it is reminiscent of the good old theoretical papers from the nineties on wavelets and denoising.
The paper first shows that when the number of training images is large enough, two DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function and, thus, the same density. In this regime of strong generalization, diffusion-generated images are distinct from the training set. They are of high visual quality, suggesting that the inductive biases of the DNNs are well-aligned with the data density. The interesting point is that the authors analyze the learned denoising functions and show that the inductive biases give rise to a shrinkage operation on a basis adapted to the underlying image. Examination of these bases reveals oscillating harmonic structures along contours and in homogeneous regions. The paper concludes that the inductive biases of diffusion models are geometry-adaptive harmonic representations, which are well-suited to capturing the structure of natural images.
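For readers less familiar with the nineties literature this work echoes, the shrinkage operation in question is the classic wavelet shrinkage of Donoho and Johnstone: transform, soft-threshold the coefficients, transform back. Here is a toy NumPy illustration on a one-level Haar basis; the paper's point is that trained diffusion denoisers implement an analogous shrinkage, but on harmonic bases adapted to each image's geometry.

```python
import numpy as np

def haar_1d(x):
    """One level of the orthonormal Haar transform (averages, details)."""
    avg = (x[0::2] + x[1::2]) / np.sqrt(2)
    det = (x[0::2] - x[1::2]) / np.sqrt(2)
    return avg, det

def inv_haar_1d(avg, det):
    out = np.empty(2 * avg.size)
    out[0::2] = (avg + det) / np.sqrt(2)
    out[1::2] = (avg - det) / np.sqrt(2)
    return out

def shrinkage_denoise(signal, threshold):
    """Classic wavelet shrinkage: soft-threshold the detail
    coefficients in the Haar basis, then invert the transform."""
    avg, det = haar_1d(signal)
    det = np.sign(det) * np.maximum(np.abs(det) - threshold, 0.0)
    return inv_haar_1d(avg, det)

# Toy usage: a piecewise-constant signal plus Gaussian noise.
rng = np.random.default_rng(0)
clean = np.repeat([0.0, 1.0, -0.5, 2.0], 16)
noisy = clean + 0.3 * rng.normal(size=clean.size)
denoised = shrinkage_denoise(noisy, threshold=0.3)
print(np.linalg.norm(noisy - clean), np.linalg.norm(denoised - clean))
```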
While this paper has few direct applications to Criteo projects in the near future, it provides very interesting theoretical insights into how diffusion models work, and into how they could eventually change the way we do generative AI.

My paper presentation

From my side, I was pleased to present a paper I co-wrote with Liva Ralaivola and Kimia Nadjahi. In this work, we address the problem of computing the similarity between two sets of embeddings stored on two different clients, without the embeddings ever being exchanged between the clients. For instance, under some privacy constraints, a company may want to measure how similar the user embeddings of one of its clients are to those of another client; solving this problem makes that measurement possible. It also opens the way to improving federated learning algorithms, since dataset similarities can be identified without sharing data with a global server.

Federated Wasserstein Distance
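For illustration, here is the quantity we federate, computed in the centralized setting with the POT library; in our Federated Wasserstein Distance, a server approximates this value through interpolating measures, without the clients ever exchanging their embeddings. The data below is synthetic.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Two sets of user embeddings, held by two different clients. Here both
# sit in one process for illustration only.
rng = np.random.default_rng(0)
emb_client_a = rng.normal(0.0, 1.0, size=(200, 16))  # client A's embeddings
emb_client_b = rng.normal(0.5, 1.0, size=(300, 16))  # client B's embeddings

# Uniform weights over the samples of each client.
a = np.full(len(emb_client_a), 1.0 / len(emb_client_a))
b = np.full(len(emb_client_b), 1.0 / len(emb_client_b))

# Squared-Euclidean cost matrix and exact squared 2-Wasserstein distance.
M = ot.dist(emb_client_a, emb_client_b, metric='sqeuclidean')
w2_squared = ot.emd2(a, b, M)
print(f"W2^2 between the two embedding sets: {w2_squared:.3f}")
```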

Our work was well received by the community, and we got very positive feedback from the people who attended our poster session.

We are now ready to focus on our next works addressing label differential privacy, and to prepare further paper submissions, including NeurIPS or ICLR 2025, which will take place in Singapore.
