Diffusion Policy Explained

Fotios (Fotis) Lygerakis
8 min read · Aug 19, 2024


This is a (hopefully) detailed breakdown of the paper (and its subsequent journal version) “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”. [1][2]

TL;DR

Diffusion Policy is the new kid on the block when it comes to Imitation Learning (IL), built on Denoising Diffusion Probabilistic Models (DDPMs). Modeling the policy as a DDPM allows it to capture multiple modes in the action space (i.e., it can account for the different ways a task can be performed, as demonstrated by various users). This is a capability that most state-of-the-art IL algorithms lack. Diffusion Policy achieves an average improvement of 46.9% in success rate across a variety of simulated and real-world benchmarks, often with a smaller number of demonstrations.

A bit of Background on Diffusion Models

Diffusion Policy is based on the (lately very famous) diffusion model. A diffusion model is a generative model, i.e., a model that learns the distribution of a dataset and can therefore create new data points from that distribution. For example, you can train a diffusion model to generate high-quality, complex data like images, audio, and text. Famous models that vibe with this method include DALL-E and Stable Diffusion for text-to-image generation.

Wondering why diffusion models are getting all the hype? Well, it’s because they have left Autoencoders and GANs in the rearview mirror.

Diff- What?

Diffusion models are inspired by the physical process of diffusion, which describes how particles spread out over time due to random motion. In machine learning, this concept is abstracted into a model that describes a process of gradually adding noise to data until it becomes indistinguishable from random noise.

(Image: physical diffusion of particles, source: https://en.wikipedia.org/wiki/Diffusion)

The main idea here is that we split the noise addition into many small steps, each adding just a little bit of noise, and at the same time we learn how to remove that little bit of noise. Learning to undo one of these small corruption steps is a lot easier than undoing a big one, which is why we put in the effort of working in small chunks.

Feeling lost? Let me break it down for you.

How does a diffusion model work in general?

Have you heard of Hidden Markov Models (HMM)? Yes? Great! No? Do not worry!

The only thing you need to know is that an HMM is a chain of probabilistic events, each depicted as a node, and that each node (probabilistic event) depends ONLY on the previous event, not on the entire past. And why is this good? Simply because we do not have to account for everything that happened before the previous node.

Diffusion probabilistic models are parameterized Markov chains trained to gradually denoise data. [3]

A diffusion model has two stages: the Forward diffusion process and the Reverse diffusion process.

During the Forward process, the original image from our dataset is sprinkled with just a pinch of random noise, for many steps, until the image looks like the screen of an old TV with no signal, i.e. random noise.

Now, by splitting the Forward process into many tiny steps of noise addition, we have at hand a series of small and manageable diffusions (noise injections) that are easier to learn how to reverse.

This happens during the Reverse process, also called the denoising process, in which we train a neural network (typically a U-Net) to undo the small forward noising steps.

By doing that, the network essentially learns how to convert noise (or any kind of initial signal or embedding) into a sample from the learned data distribution. For more details about diffusion models, you can read this great article by Lilian Weng.
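To make the forward/reverse story concrete, here is a minimal, toy PyTorch sketch (my own illustration, not the paper's code): a fixed noise schedule, a closed-form forward step that corrupts clean samples, and a small network trained to predict the noise that was added. All names, sizes, and the fake 2-D dataset are made up for illustration.

```python
import torch
import torch.nn as nn

K = 100                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, K)      # noise schedule (a common toy choice)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products give a closed-form forward step

def forward_diffuse(x0, k):
    """Forward process: jump straight to step k with x_k = sqrt(ab_k)*x0 + sqrt(1-ab_k)*eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bars[k].view(-1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

# Tiny stand-in for the denoiser; real diffusion models use a U-Net or transformer.
denoiser = nn.Sequential(nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(1000):
    x0 = 0.1 * torch.randn(256, 2) + 1.0         # pretend dataset: 2-D points near (1, 1)
    k = torch.randint(0, K, (256,))              # a random diffusion step per sample
    xk, eps = forward_diffuse(x0, k)
    k_feature = (k.float() / K).view(-1, 1)      # crude timestep "embedding"
    eps_hat = denoiser(torch.cat([xk, k_feature], dim=-1))
    loss = nn.functional.mse_loss(eps_hat, eps)  # reverse process: learn to predict the injected noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, the denoiser can be run backwards step by step to turn pure noise into samples that look like the training data.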

Why Diffusion Policy?

Diffusion models dodge the mode collapse drama by taking it slow and steady, through their gradual, stochastic approach and reliance on Markov chains. By progressively adding and then removing noise across many steps, they learn to represent the entire data distribution of the collected dataset, not just its dominant modes. This exposure to diverse states, together with stochastic sampling, enables the policy to learn distinct modes instead of averaging out the demonstrations collected for Behavior Cloning (BC). Most prominent BC methods suffer from mode collapse or mode preference, caused by the different ways humans demonstrate the solution to a task.

Diffusion Policy learns multi-modal behavior and commits to only one mode within each rollout. LSTM-GMM and IBC are biased toward one mode, while BET failed to commit. [1][2]

What’s more, Diffusion Policy has solid mathematical foundations due to its basis in the principles of denoising diffusion probabilistic models and the use of the score function during training. Stable learning is another plus, ensured by the gradual, controlled approach to adding and reversing noise, which minimizes the risk of abrupt changes that could destabilize the learning process.

Finally, Diffusion Policy is also killing it in predicting a sequence of future actions, leading to better temporal consistency. That’s due to Diffusion Models’ dope scalability to high-dimensional output spaces.

How do Diffusion Models apply to policy learning?

So, if we train a Diffusion Model to predict actions, here’s how it would roll: a forward process gradually adds noise to a sampled action sequence, and a reverse process learns how to remove that noise. We could then sample an action from the learned distribution by feeding random noise into the model and gradually denoising it until a noise-free action remains.

But wait, there’s a plot twist! If we keep it basic, the policy would be out here doing its own thing, ignoring the whole vibe of the environment. Thus, we want Diffusion Policy to also take into consideration an observation of the environment. The way to do that is by conditioning the denoising (reverse diffusion) process on the observations. In short, instead of training a network ε with parameters θ to predict the noise at the k-th denoising step of an action sequence Aₜ at timestep t:

“Simplistic”, unconditioned network: ε_θ(Aₜᵏ, k)

We shall train a network to do the same thing, but conditioned on the observation sequence Oₜ at timestep t:

Conditioned network: ε_θ(Oₜ, Aₜᵏ, k)
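To give a feel for what that conditioning looks like in code, here is a hedged sketch of a noise-prediction network ε_θ(Oₜ, Aₜᵏ, k). The MLP architecture, names, and dimensions are my own simplification; the paper itself uses CNN- and transformer-based backbones with FiLM conditioning or cross-attention.

```python
import torch
import torch.nn as nn

class ConditionedNoisePredictor(nn.Module):
    """Hypothetical stand-in for eps_theta(O_t, A_t^k, k): predicts the noise in a
    noisy action sequence, conditioned on an observation vector and the step index k."""
    def __init__(self, obs_dim, action_dim, horizon, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = obs_dim + horizon * action_dim + 1      # observation + flattened actions + k
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, obs, noisy_actions, k):
        # obs: (B, obs_dim), noisy_actions: (B, horizon, action_dim), k: (B,)
        x = torch.cat([obs, noisy_actions.flatten(1), k.float().unsqueeze(-1)], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

# Example shapes: a 10-dim observation, an 8-step action horizon, 2-D actions.
model = ConditionedNoisePredictor(obs_dim=10, action_dim=2, horizon=8)
eps_hat = model(torch.randn(4, 10), torch.randn(4, 8, 2), torch.randint(0, 100, (4,)))
```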

In fact, the neural network ε_θ learns a score function for the actions. Essentially, the score function acts as a guide that quantifies the most probable direction in which to adjust data points during the generative process. It computes the gradient of the log probability density of the data with respect to the data point itself. This means the score function indicates how to subtly shift a data point to increase its likelihood under the model’s learned distribution.

Score function of Diffusion Policy: the gradient of log p(Aₜ | Oₜ), taken with respect to Aₜ

By following the direction suggested by the score function during the reverse diffusion process, the model effectively sculpts the noisy data back into coherent, structured forms, in this case, actions. This process gradually refines randomness into meaningful outputs, leveraging the score function’s insights at each step to ensure that each modification makes the output more realistic and aligned with the training data’s characteristics.
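Concretely, inference is just the DDPM reverse loop run on an action sequence, conditioned on the current observation. Below is a toy sampler consistent with that description, reusing the hypothetical model and noise schedule from the earlier sketches (the paper additionally uses DDIM to reduce the number of denoising steps at inference time).

```python
import torch

@torch.no_grad()
def sample_actions(model, obs, horizon, action_dim, betas):
    """Toy DDPM sampler: start from pure noise and iteratively denoise it into an
    action sequence, conditioned on the observation. Not the paper's exact code."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    A = torch.randn(obs.shape[0], horizon, action_dim)     # A_t^K ~ N(0, I)
    for k in reversed(range(len(betas))):
        k_batch = torch.full((obs.shape[0],), k)
        eps_hat = model(obs, A, k_batch)                    # predicted noise at step k
        coef = (1.0 - alphas[k]) / (1.0 - alpha_bars[k]).sqrt()
        A = (A - coef * eps_hat) / alphas[k].sqrt()         # mean of the reverse step
        if k > 0:
            A = A + betas[k].sqrt() * torch.randn_like(A)   # re-inject a little noise, except at the end
    return A                                                 # A_t^0: the denoised action sequence

# Usage with the hypothetical model from the previous sketch:
# actions = sample_actions(model, torch.randn(4, 10), horizon=8, action_dim=2,
#                          betas=torch.linspace(1e-4, 0.02, 100))
```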

The loss function of Diffusion Policy: L = MSE(εᵏ, ε_θ(Oₜ, Aₜ⁰ + εᵏ, k))
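In code, this objective boils down to: pick a random step k, corrupt the demonstrated action sequence Aₜ⁰ with Gaussian noise εᵏ at that step’s noise level, and regress the network’s output onto that noise. Another hedged sketch, again reusing the hypothetical model and schedule from above:

```python
import torch
import torch.nn.functional as F

def diffusion_policy_loss(model, obs, actions, alpha_bars):
    """MSE between the injected noise eps^k and the network's prediction for the
    noised demonstration actions. Toy version: the noising uses the standard
    closed-form forward step rather than the shorthand A_t^0 + eps^k."""
    B, K = actions.shape[0], alpha_bars.shape[0]
    k = torch.randint(0, K, (B,))                    # random denoising step per sample
    eps = torch.randn_like(actions)                  # the noise we will try to recover
    ab = alpha_bars[k].view(-1, 1, 1)
    noisy_actions = ab.sqrt() * actions + (1.0 - ab).sqrt() * eps
    return F.mse_loss(model(obs, noisy_actions, k), eps)

# Usage with the hypothetical model from the earlier sketch:
# alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 100), dim=0)
# loss = diffusion_policy_loss(model, torch.randn(4, 10), torch.randn(4, 8, 2), alpha_bars)
```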

Does it work though?

Diffusion Policy got put to the test in all sorts of tasks, both in the matrix (simulation) and out here in the real world. In the simulated domain, the model was tested on tasks like Lift, Can, Square, Transport, Tool Hang, and Push-T, showcasing its versatility in shorter-horizon and mid-difficulty tasks.

In more complex benchmarks like the Kitchen task and Block Push, Diffusion Policy demonstrated its strength in handling long-horizon planning and effectively managing multi-step object manipulations. The real-world evaluations further highlighted the model’s robustness, as it excelled in challenging tasks such as mug flipping, sauce pouring, and bimanual manipulation, which required precise coordination and handling of both rigid and non-rigid objects.

Simulated Environments

Across all these tasks, Diffusion Policy achieved an impressive average improvement of 46.9% in success rates, significantly outperforming state-of-the-art methods like BET[4], LSTM-GMM[5], and IBC[6]. The model’s ability to capture multimodal action distributions was particularly evident in short-horizon tasks, while its superior handling of sequential planning allowed it to do great in long-horizon tasks too. For instance, in the Push-T task, the Diffusion Policy managed multiple approaches to the target more effectively than the baselines, and in the Kitchen task, it completed complex, multi-step sequences with greater efficiency.

More Simulated Environments

Additionally, the model maintained high performance even with up to 4 steps of latency, crucial for real-time applications. These findings underscore Diffusion Policy’s advanced capabilities and establish it as a leading approach for visuomotor policy learning, aka BC, in both simulated and real-world settings.

Mug Flipping
Sauce Pouring and Spreading
Bi-Manual Manipulation (Egg Beater)

And that’s a wrap!

If you’re feeling this, smash that clapping button, save it for later, and maybe even hit follow for more 🔥 content!

Disclaimer: I am not affiliated with the work discussed in this article. I just found these papers challenging to grasp initially, so I decided to compose this article to provide keen readers with a smoother introduction to the concepts.

References

[1] Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., & Song, S. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Proceedings of Robotics: Science and Systems (RSS).

[2] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., & Song, S. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. The International Journal of Robotics Research.

[3] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2006.11239.

[4] Shafiullah, N. M. M., Cui, Z. J., Altanzaya, A., & Pinto, L. (2022). Behavior Transformers: Cloning k modes with one stone. Advances in Neural Information Processing Systems (NeurIPS).

[5] Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., Fei-Fei, L., Savarese, S., Zhu, Y., & Martín-Martín, R. (2021). What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. arXiv preprint arXiv:2108.03298.

[6] Florence, P., Lynch, C., Zeng, A., et al. (2021). Implicit Behavioral Cloning. In Proceedings of the Conference on Robot Learning (CoRL).
