Adelta: Automatic Differentiation for Discontinuous Programs — Part 1: Math Basics
[Adelta: Automatic Differentiation for Discontinuous Programs — Part 2: Introduction to the DSL]
[Adelta Tutorial — Part 1: Differentiating a Simple Shader Program]
[Adelta Tutorial — Part 2: Raymarching Primitive]
[Adelta Tutorial — Part 3: Animating the SIGGRAPH logo]
[Adelta Tutorial — Part 4: Animating the Celtic Knot]
Over the past decade, automatic differentiation (AD) has profoundly impacted the computer science community, in areas such as deep learning and inverse rendering. AD allows programmers to conveniently and automatically obtain correct gradients for the forward programs they implement within frameworks such as TensorFlow or PyTorch. The term "backpropagation," typically used in learning contexts, is just a special case of reverse-mode automatic differentiation. All of these AD methods ignore program discontinuities, such as if-else branching, and instead treat the programs as if they were continuous.
However, many programs, such as the shader programs used in rendering, intrinsically rely on discontinuities to characterize object boundaries, visibility, and ordering. By discontinuities we mean, mathematically, that some function value calculated in the forward pass is not C0 continuous, so it has a "step." Gradients generated by traditional AD usually struggle to optimize parameters involved in such discontinuities, because AD computes no gradient for the discontinuity itself. Similar discontinuities occur naturally in programs in other contexts, such as certain physics simulations, reinforcement learning, discrete-continuous optimization, and audio processing, so we wished to develop a general approach that extends AD to handle a wide variety of programs.
To address such discontinuities, our paper "Aδ: Autodiff for Discontinuous Programs — Applied to Shaders" introduces an extension of automatic differentiation that works by replacing the usual rules of calculus. The Aδ framework generalizes reverse-mode automatic differentiation so that it can approximately differentiate discontinuous programs, with both theoretical and empirical evidence for its approximation accuracy. Because the usual rules of calculus have stood relatively unchanged for centuries, and it was not obvious that they could be safely changed in a somewhat radical manner to differentiate through discontinuities using AD, we sketch rigorous proofs of our new rules' accuracy in the paper.
This post is the first in a multi-part tutorial: here we give a more intuitive motivation and derivation of the basic math behind our new derivative rules for discontinuous programs, presented with a lower level of rigor than in our paper. Readers who are more interested in coding aspects can also skip to subsequent posts where we cover how to write code that uses our implementation of the Aδ automatic differentiation framework. For additional technical details and our implementation of the framework, please refer to the research paper and the project page.
Motivating Example
Here we show a simple program that draws a circle. The shader program uses if-else branching to test whether a certain pixel is inside the circle, then draws the color accordingly.
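A minimal sketch of such a circle shader in Python (assuming NumPy; the parameter names ox, oy for the circle center and r for the radius match the discussion below, but the exact code is illustrative, not the original program):

```python
import numpy as np

def circle_shader(ox, oy, r, width=64, height=64):
    """Render a white circle on a black background.

    The branch on the sign of `cond` (an if-else, written here with
    np.where) makes the output discontinuous in ox, oy, and r.
    """
    y, x = np.mgrid[0:height, 0:width]
    cond = r**2 - ((x - ox)**2 + (y - oy)**2)  # > 0 inside the circle
    return np.where(cond > 0, 1.0, 0.0)

img = circle_shader(ox=32.0, oy=32.0, r=10.0)
```

Away from the circle boundary, small perturbations of ox, oy, or r leave every pixel value unchanged, which is why AD reports zero gradient everywhere.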
If we want to optimize the radius and circle position to best match a target image, using the gradient from the AD framework will not work, because its gradients wrt all three parameters (ox, oy, r) are zero.
One alternative is to use finite differences (FD). However, unlike AD, whose runtime is a constant factor times that of the forward program regardless of the number of parameters, FD requires extra forward evaluations that scale linearly with the number of parameters, and is therefore impractical for many applications.
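As a point of comparison, here is a minimal central-difference sketch (a hypothetical helper, not part of Aδ); the loop makes the per-parameter cost explicit:

```python
import numpy as np

def finite_difference_grad(f, params, h=1e-4):
    """Central finite differences: two extra evaluations of f
    PER PARAMETER, so cost grows linearly with len(params)."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(len(params)):
        e = np.zeros_like(params)
        e[i] = h
        grad[i] = (f(params + e) - f(params - e)) / (2 * h)
    return grad

g = finite_difference_grad(lambda p: (p ** 2).sum(), [1.0, 2.0, 3.0])
```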
Fortunately, in many cases, the gradient at discontinuities can be expressed using the Dirac delta distribution. In this post, we will introduce the concept of the Dirac delta distribution, and show how Aδ incorporates terms based on such Dirac deltas into an Automatic Differentiation framework to accurately differentiate programs with discontinuities.
Introducing the Heaviside Step Function and Dirac Delta Distribution
Mathematically, we can express piecewise functions that are similar to “if” statements in a programming language by using compositions of Heaviside step functions, which evaluate to 0 on one side of the origin and evaluate to 1 on the other side.
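Concretely (the value exactly at the origin is a convention and does not affect what follows):

```latex
H(x) =
\begin{cases}
1 & x \geq 0 \\
0 & x < 0
\end{cases}
```

For example, an `if cond > 0` branch selecting between two values a and b can be written as a · H(cond) + b · (1 − H(cond)).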
There is some difference between the piecewise function notation used in math and “if” statements as typically implemented in programming languages, since “if” statements are typically implemented as conditionals that use branching execution, whereas in math notation there is usually no notion of execution. But for the purposes of developing the math, we can disregard details of program execution.
The derivative of the Heaviside step function is a Dirac delta distribution. Formally, the Dirac delta distribution is a linear functional that is best understood within the theory of distributions, which is beyond the scope of this post. But informally, we can present some intuition for the Dirac delta by treating it as if it were a “function” (though it is actually a distribution) that evaluates to infinity at the origin, and evaluates to 0 everywhere else. Moreover, its integral over the entire real line (or any segment that includes the origin) evaluates to one.
Two important properties for the Dirac delta are the sifting and scaling properties, which we will be using to derive gradients for discontinuous programs.
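For a continuous test function f and a nonzero constant a, these two properties read:

```latex
\int_{-\infty}^{\infty} f(x)\, \delta(x - a)\, dx = f(a)
\qquad \text{(sifting)}
\qquad\qquad
\delta(a x) = \frac{\delta(x)}{\lvert a \rvert},\ a \neq 0
\qquad \text{(scaling)}
```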
Introducing Our Pre-Filtering and Sampling Axis
In the above intuition of the Dirac delta as a “function,” δ(x) evaluates to infinity only precisely at x = 0, so this can obviously be challenging to treat computationally. Thus, real-world applications often first pre-filter the discontinuous function (or program) to “fatten” the “spike to infinity” in δ(x) at x = 0, and avoid needing to directly sample precisely the location of the discontinuity.
To simplify the math, we assume the parameter being differentiated wrt is always orthogonal to the pre-filtering kernel direction. For example, given a function f(x, θ), we can first pre-filter with a 1D kernel ϕ(x) before differentiating wrt θ:
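One standard convention for this pre-filtering is the following (a sketch; the paper's exact parameterization may differ):

```latex
\hat{f}(x, \theta) = \int_{-\infty}^{\infty} f(u, \theta)\, \phi(u - x)\, du
```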
Throughout this post, we use the hat symbol (^) over a function to indicate this same prefiltering. The pre-filter ϕ(x) is usually chosen to have compact support and integrate to one over the real line. We denote the direction of the 1D pre-filtering kernel (i.e. x in the above example) as the sampling axis, although we also note that our 1D pre-filter can be oriented arbitrarily within the parameter space used by the function. As will be seen in later sections, the sampling axis is used by our framework to locate the discontinuity and make approximations to the gradient.
Differentiating Discontinuous Functions
We first look at a simple example:
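The example composes a Heaviside step with a smooth function g of the sampling axis x and the parameter θ (this matches the later reference f₀ = H(g(x, θ))):

```latex
f_0(x, \theta) = H(g(x, \theta))
```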
Differentiating f₀ wrt θ can be carried out as follows:
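A sketch of this derivation, where the xᵢ are the roots of g in x (i.e. the discontinuity locations):

```latex
\begin{aligned}
\frac{\partial f_0}{\partial \theta}
&= \frac{\partial}{\partial \theta} H(g(x, \theta)) \\
&= \delta(g(x, \theta))\, \frac{\partial g}{\partial \theta}
  && \text{(chain rule)} \\
&= \sum_i \delta\!\left( \frac{\partial g}{\partial x}\,(x - x_i) \right) \frac{\partial g}{\partial \theta}
  && \text{(linearize } g \text{ near each root } x_i \text{)} \\
&= \sum_i \frac{\delta(x - x_i)}{\left\lvert \partial g / \partial x \right\rvert}\, \frac{\partial g}{\partial \theta}
  && \text{(scaling property)}
\end{aligned}
```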
In the above derivation, the second line comes from the chain rule, and the remaining lines apply the Dirac delta's scaling property together with a change of variables that rewrites the delta in terms of the sampling axis x; this rewriting is what allows us to pre-filter using the kernel ϕ(x) before differentiating. The xᵢ correspond to the discontinuity locations for the given θ. We can now repeat this process for the pre-filtered function, using a pre-filter kernel ϕ(x):
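Pre-filtering along x and applying the sifting property gives (a sketch):

```latex
\frac{\partial \hat{f}_0}{\partial \theta}
= \int_{-\infty}^{\infty} \phi(u - x)\, \sum_i \frac{\delta(u - x_i)}{\left\lvert \partial g / \partial u \right\rvert}\, \frac{\partial g}{\partial \theta}\, du
= \sum_i \phi(x_i - x)\, \left. \frac{\partial g / \partial \theta}{\left\lvert \partial g / \partial x \right\rvert} \right\rvert_{x = x_i}
```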
The above derivation simply combines the equation above it with the sifting property of the Dirac delta. In practice, our framework can be shown to be approximately correct if the pre-filtering kernel ϕ(x) has small enough support such that at most one discontinuity is sampled. Under this assumption, if we denote the closest (if any) discontinuity location to x as xd, the gradient for the pre-filtered example becomes:
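Under this single-discontinuity assumption, the sum collapses to one term (a hedged reconstruction):

```latex
\frac{\partial \hat{f}_0}{\partial \theta}
\approx \phi(x_d - x)\, \left. \frac{\partial g / \partial \theta}{\left\lvert \partial g / \partial x \right\rvert} \right\rvert_{x = x_d}
```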
This is now the same as Equation (1) in our research paper. Now that we have derived the pre-filtered gradient for f₀ = H(g(x, θ)), we are one step closer to differentiating arbitrary programs with discontinuities. But we are not quite there yet. Specifically, two particular challenges need to be addressed:
- How to determine the discontinuity location, xd?
- How to differentiate more complex functions when the Heaviside step function is combined with other operators, such as multiplication, or function composition?
Sampling the Discontinuity
We first address the challenge of determining the discontinuity location.
If g(x, θ) has a simple form, e.g. being affine or invertible, xd can be conveniently expressed as xd = g⁻¹(0; θ). However, this may not always be the case, especially for arbitrary programs. In Aδ, the discontinuity location xd is determined based on sampling that corresponds to the support of the pre-filtering kernel.
In Aδ, we always choose a 1D box kernel for ϕ(x):
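With support endpoints at −ϵ and ϵ, the box kernel is:

```latex
\phi(x) =
\begin{cases}
\dfrac{1}{2\epsilon} & x \in [-\epsilon, \epsilon] \\[6pt]
0 & \text{otherwise}
\end{cases}
```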
(Note that our paper scales the endpoints -ϵ, ϵ of the box kernel by scale factors α, β, respectively, to allow the kernel to be positioned flexibly, but in this tutorial we remove those scale factors to simplify the derivation). A box kernel has the extra benefit of being piecewise constant: its value can be easily computed based on the boolean condition of whether xd ∈ [-ϵ, ϵ]. We can see this by plugging the box kernel into the derivative for the pre-filtered f₀:
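Evaluating the gradient at the kernel center (x = 0, without loss of generality), a sketch:

```latex
\frac{\partial \hat{f}_0}{\partial \theta}
= \frac{1}{2\epsilon}\, \mathbb{1}\!\left[ x_d \in [-\epsilon, \epsilon] \right]
\left. \frac{\partial g / \partial \theta}{\left\lvert \partial g / \partial x \right\rvert} \right\rvert_{x = x_d}
```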
After the substitution, xd is still used in two places: in the function evaluation of g and in the Boolean condition.
For the function evaluation of g, as long as the function being evaluated is locally Lipschitz continuous, the error of the following approximation is bounded by first-order terms wrt ϵ:
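Since |x_d| ≤ ϵ, replacing the evaluation point x_d by the kernel center gives (a sketch):

```latex
\left. \frac{\partial g / \partial \theta}{\left\lvert \partial g / \partial x \right\rvert} \right\rvert_{x = x_d}
= \left. \frac{\partial g / \partial \theta}{\left\lvert \partial g / \partial x \right\rvert} \right\rvert_{x = 0} + O(\epsilon)
```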
We now discuss the Boolean condition xd ∈ [-ϵ, ϵ]. Recall that xd denotes the discontinuity location such that g(xd, θ) = 0 and we assume that the kernel support is small enough such that there is at most one discontinuity within the support. Under these assumptions, we can instead check whether the Heaviside step function H(g(x, θ)) has flipped state between the two ends of the kernel support:
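In symbols (a sketch):

```latex
x_d \in [-\epsilon, \epsilon] \iff H(g^+) \neq H(g^-),
\qquad g^- = g(-\epsilon, \theta), \quad g^+ = g(\epsilon, \theta)
```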
Here, we use g⁻, g⁺ to denote function evaluations at two ends of the box kernel’s support.
Before we piece together everything, there’s one extra trick we could apply to make the gradient program more efficient. Since we are already evaluating on two ends of the box kernel’s support, we can use finite difference to replace the denominator |∂ g/∂ x| in the derived gradient assuming locally Lipschitz continuity of g.
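A sketch of the finite-difference substitution:

```latex
\left\lvert \frac{\partial g}{\partial x} \right\rvert \approx \frac{\left\lvert g^+ - g^- \right\rvert}{2\epsilon}
```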
Note that because we already assumed the sampling axis x is orthogonal to the parameter θ being differentiated wrt, partial and total differentiation are equivalent here. Additionally, approximating |∂ g/∂ x| using finite differences avoids an extra backward pass for differentiating wrt x.
Combining everything together, we have the following approximation:
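Substituting the finite-difference denominator into the box-kernel gradient, the 1/(2ϵ) factors cancel (a sketch):

```latex
\frac{\partial \hat{f}_0}{\partial \theta}
\approx \mathbb{1}\!\left[ H(g^+) \neq H(g^-) \right]
\frac{\partial g / \partial \theta}{\left\lvert g^+ - g^- \right\rvert}
```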
Now we are ready to summarize our first gradient rule for differentiating the Heaviside step function. We use ∂O f (O indicates “our method” to distinguish from the usual calculus derivative) to denote differentiating a pre-filtered function using ϕ(x):
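Writing the operator as ∂O, the rule reads (a sketch):

```latex
\frac{\partial^O H(g)}{\partial \theta}
= \mathbb{1}\!\left[ H(g^+) \neq H(g^-) \right]
\frac{1}{\left\lvert g^+ - g^- \right\rvert}\,
\frac{\partial^O g}{\partial \theta}
```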
First, note that this is quite different from the ordinary differentiation operator ∂ that traditional AD uses: the ordinary derivative of a Heaviside step is zero almost everywhere and undefined where g = 0. Our derivative operator, in contrast, recovers a sampled approximation to a pre-filtered Dirac delta. Note also that instead of directly computing ∂ g/∂ θ on the right-hand side as in our previous derivation, we use ∂O g/∂ θ, which allows reverse-mode automatic differentiation to be applied recursively to g and all the compute nodes it depends on. Finally, we omit O(ϵ) error terms for simplicity; in the paper, we prove by induction that recursively applying our gradient rules to a certain subset of the programs defined in our DSL gives a first-order error bound O(ϵ). We thus refer to our rules as being "first-order correct."
Aδ’s Product Rule
A next challenge is to ask how we might differentiate an arbitrary program (or function) that could involve the composition of the Heaviside step function with other operators, such as multiplication or atomic function composition (sin, cos, sqrt, etc).
One straightforward approach would be to apply the ordinary rules of calculus to these operators, just as traditional AD does. However, as the following counterexamples show, these rules break down because they only sample on one side of the discontinuity.
Let’s first look at a simple multiplication example:
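A reconstruction consistent with the simplification noted below (H · H = H, since H only takes the values 0 and 1):

```latex
f_1(x, \theta) = H(g(x, \theta)) \cdot H(g(x, \theta))
```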
On one hand, because H(g) only evaluates to 0 or 1, we can make the simplification f₁(x, θ) = H(g(x, θ)), so its pre-filtered derivative can be computed directly using the rule derived in the last section:
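Applying the Heaviside rule from the previous section (a sketch):

```latex
\frac{\partial^O f_1}{\partial \theta}
= \mathbb{1}\!\left[ H(g^+) \neq H(g^-) \right]
\frac{1}{\left\lvert g^+ - g^- \right\rvert}\,
\frac{\partial^O g}{\partial \theta}
```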
On the other hand, if we directly combine the traditional calculus product rule (i.e. what AD uses) with our rule for the Heaviside step function, we get the following incorrect result:
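The naive combination introduces a spurious one-sided factor 2H(g), which evaluates to 0 or 2 rather than the correct factor 1 (a sketch):

```latex
\frac{\partial^O f_1}{\partial \theta}
\overset{?}{=} 2\, H(g(x, \theta))\, \frac{\partial^O H(g)}{\partial \theta}
= 2\, H(g(x, \theta))\, \mathbb{1}\!\left[ H(g^+) \neq H(g^-) \right]
\frac{1}{\left\lvert g^+ - g^- \right\rvert}\, \frac{\partial^O g}{\partial \theta}
```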
The result based on the calculus product rule used by AD clearly differs from what we derived before. The error arises because naively applying the calculus product rule assumes it is safe to differentiate under the integral associated with the pre-filter. Attempting to do so, however, produces a discontinuous term (i.e. 2H(g(x, θ))) multiplied by a Dirac delta term (i.e. δ(g(x, θ))), both singular at the same location (where g(x, θ) = 0), which is ill-defined.
Instead, we develop a product rule that can correctly handle such cases by sampling on both sides of the discontinuity:
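For a product a · b, the two-sided rule averages samples from both ends of the kernel support, denoted ⁺ and ⁻ (a sketch; see the paper for the precise statement):

```latex
\frac{\partial^O (a \cdot b)}{\partial \theta}
= \frac{a^+ + a^-}{2}\, \frac{\partial^O b}{\partial \theta}
+ \frac{b^+ + b^-}{2}\, \frac{\partial^O a}{\partial \theta}
```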
We can verify its gradient is correct for f₁(x, θ).
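Applying the two-sided product rule with both factors equal to H(g) (a sketch):

```latex
\frac{\partial^O f_1}{\partial \theta}
= \frac{H(g^+) + H(g^-)}{2}\, \frac{\partial^O H(g)}{\partial \theta}
+ \frac{H(g^+) + H(g^-)}{2}\, \frac{\partial^O H(g)}{\partial \theta}
= \left( H(g^+) + H(g^-) \right) \frac{\partial^O H(g)}{\partial \theta}
= \frac{\partial^O H(g)}{\partial \theta}
```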
The last line of the equation above makes use of the fact that H(g) can only evaluate to two values: 0 or 1, and in that case we are given that H(g⁺) ≠ H(g⁻).
Aδ’s Chain Rule
We now motivate the design of Aδ's chain rule — used for function composition — by rewriting the same functions f₀ and f₁ we used before in a new way: as H(g(x, θ)) composed with a square atomic function.
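That is:

```latex
f_2(x, \theta) = \left( H(g(x, \theta)) \right)^2
```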
We can easily compute its gradient using the equivalence f₂(x, θ) = H(g(x, θ)) together with our rule for the Heaviside step function:
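Since H² = H, this is the same gradient as before (a sketch):

```latex
\frac{\partial^O f_2}{\partial \theta}
= \frac{\partial^O H(g)}{\partial \theta}
= \mathbb{1}\!\left[ H(g^+) \neq H(g^-) \right]
\frac{1}{\left\lvert g^+ - g^- \right\rvert}\,
\frac{\partial^O g}{\partial \theta}
```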
Applying the traditional calculus chain rule to the function composition and using our differentiation rule for Heaviside step again gives us an incorrect result:
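The same spurious one-sided factor 2H(g) appears as in the multiplication case (a sketch):

```latex
\frac{\partial^O f_2}{\partial \theta}
\overset{?}{=} 2\, H(g(x, \theta))\, \frac{\partial^O H(g)}{\partial \theta}
```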
The incorrect result computed when we rely on the traditional calculus chain rule is caused by a similar issue as in the multiplication case above: it ends up being unsafe to differentiate under the integral because this generates a multiplication between a discontinuous term (i.e. H(g(x, θ))) and a Dirac delta term (i.e. δ(g(x, θ))), both singular at the same location. Our rule, on the other hand, avoids this by sampling function values before and after composition on both sides of the discontinuity:
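For a composition h(f), the two-sided rule can be sketched as a difference quotient across the kernel support, falling back to the ordinary chain rule when no discontinuity is sampled:

```latex
\frac{\partial^O h(f)}{\partial \theta} =
\begin{cases}
h'(f)\, \dfrac{\partial^O f}{\partial \theta} & f^+ = f^- \\[10pt]
\dfrac{h(f^+) - h(f^-)}{f^+ - f^-}\, \dfrac{\partial^O f}{\partial \theta} & f^+ \neq f^-
\end{cases}
```

For f₂, with h(y) = y² and f = H(g), the second case gives (H(g⁺) + H(g⁻)) · ∂O H(g)/∂ θ, which equals ∂O H(g)/∂ θ when H(g⁺) ≠ H(g⁻).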
The first condition uses the ordinary calculus chain rule; it serves both for efficient computation when no discontinuity is present, and as a safeguard for the scenario where the sampling axis x is unable to sample the discontinuity. We will not discuss this safeguard in detail; it suffices to say that we use the second line of our function composition rule whenever a Heaviside step function is involved. Again, we can verify that the gradient is correct for f₂(x, θ).
Summary
This post motivated and explained the fundamental formulas used in Aδ, a framework with a new set of gradient rules for automatically differentiating arbitrary functions with discontinuities. In the table below, we compare the rules of Aδ (denoted O for "Ours") with those of traditional AD, which uses the ordinary rules of calculus.
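As a hedged summary of the rules derived above (⁺ and ⁻ denote evaluations at the two ends of the kernel support; see the paper for precise statements):

```latex
\begin{array}{lll}
\textbf{Function} & \textbf{Traditional AD} & \textbf{A}\delta \text{ (O)} \\[4pt]
H(g) & 0 \ \text{(a.e.)} &
\mathbb{1}\!\left[ H(g^+) \neq H(g^-) \right] \dfrac{1}{\lvert g^+ - g^- \rvert}\, \dfrac{\partial^O g}{\partial \theta} \\[12pt]
a \cdot b & a\, \dfrac{\partial b}{\partial \theta} + b\, \dfrac{\partial a}{\partial \theta} &
\dfrac{a^+ + a^-}{2}\, \dfrac{\partial^O b}{\partial \theta} + \dfrac{b^+ + b^-}{2}\, \dfrac{\partial^O a}{\partial \theta} \\[12pt]
h(f) & h'(f)\, \dfrac{\partial f}{\partial \theta} &
\dfrac{h(f^+) - h(f^-)}{f^+ - f^-}\, \dfrac{\partial^O f}{\partial \theta} \ \ (f^+ \neq f^-)
\end{array}
```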
Aδ's gradient rules can be used to efficiently carry out a new kind of reverse-mode automatic differentiation, where the usual single-point samples used in AD are replaced with two samples, denoted ⁺ and ⁻ in the above table. In our paper, we prove these rules are first-order correct for a large set of programs constructed using our domain-specific language (DSL) syntax. In the next post, we will introduce Aδ's DSL and the broad range of discontinuous programs it can express.
Reference
Aδ: Autodiff for Discontinuous Programs — Applied to Shaders [project page]