Adelta: Automatic Differentiation for Discontinuous Programs — Part 1: Math Basics

Yuting Yang
13 min read · Apr 30, 2022


[Adelta: Automatic Differentiation for Discontinuous Programs — Part 2: Introduction to the DSL]

[Adelta Tutorial — Part 1: Differentiating a Simple Shader Program]

[Adelta Tutorial — Part 2: Raymarching Primitive]

[Adelta Tutorial — Part 3: Animating the SIGGRAPH logo]

[Adelta Tutorial — Part 4: Animating the Celtic Knot]

Over the past decade, automatic differentiation (AD) has had a profound impact on computer science, most notably in deep learning and inverse rendering. AD lets programmers conveniently and automatically obtain correct gradients for the forward programs they implement in frameworks such as TensorFlow or PyTorch. In learning contexts the term “backpropagation” is typically used, which is simply a special case of reverse-mode automatic differentiation. All of these AD methods ignore program discontinuities, such as if-else branching, and instead treat the program as if it were continuous.

However, many programs, such as the shader programs used in rendering, intrinsically rely on discontinuities to characterize object boundaries, visibility, and ordering. By discontinuities we mean, mathematically, that some function value computed in the forward pass is not C0 continuous, so it has a “step.” Gradients generated by traditional AD usually fail to optimize parameters involved in such discontinuities, because AD does not compute any gradient contribution from the discontinuity itself. Similar discontinuities occur naturally in programs from other domains, such as certain physics simulations, reinforcement learning, discrete-continuous optimization, and audio, so we set out to develop a general approach that extends AD to handle a wide variety of programs.

To address such discontinuities, our paper “Aδ: Autodiff for Discontinuous Programs — Applied to Shaders” introduces an extension of automatic differentiation that works by replacing the usual rules of calculus. The Aδ framework generalizes reverse-mode automatic differentiation so that it can approximately differentiate discontinuous programs, with both theoretical and empirical evidence for its approximation accuracy. Because the usual rules of calculus have stood essentially unchanged for centuries, and it is not obvious that they can safely be modified in such a radical manner to differentiate through discontinuities, our paper sketches rigorous proofs of the new rules’ accuracy.

This post is the first in a multi-part tutorial: here we give a more intuitive motivation and derivation of the basic math behind our new derivative rules for discontinuous programs, presented with a lower level of rigor than in our paper. Readers who are more interested in coding aspects can also skip to subsequent posts where we cover how to write code that uses our implementation of the Aδ automatic differentiation framework. For additional technical details and our implementation of the framework, please refer to the research paper and the project page.

Motivating Example

Here we show a simple program that draws a circle. The shader uses if-else branching to test whether a given pixel is inside the circle, then colors the pixel accordingly.

Simple shader program that draws a circle
Target circle image
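To make the example concrete, here is a minimal Python sketch of such a circle shader. The pixel loop, the 0/1 colors, and the exact parameterization are illustrative assumptions, not the DSL code from the figure above.

```python
# A minimal sketch of a circle shader (illustrative, not the Adelta DSL code).
# Pixel coordinates (x, y); circle center (ox, oy); radius r.
import numpy as np

def circle_shader(x, y, ox, oy, r):
    # Discontinuous branch: the output jumps as the pixel crosses the circle boundary.
    if (x - ox) ** 2 + (y - oy) ** 2 < r ** 2:
        return 1.0  # inside color
    return 0.0      # background color

def render(width, height, ox, oy, r):
    # Evaluate the shader once per pixel center.
    img = np.zeros((height, width))
    for py in range(height):
        for px in range(width):
            img[py, px] = circle_shader(px + 0.5, py + 0.5, ox, oy, r)
    return img

img = render(64, 64, ox=32.0, oy=32.0, r=20.0)
```

Because each pixel’s output is piecewise constant in (ox, oy, r), ordinary differentiation of this program returns zero for all three parameters, which is exactly the problem described next.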

If we want to optimize the radius and circle position to best match a target image, using the gradient from an AD framework will not work, because AD’s gradients with respect to all three parameters (ox, oy, r) are zero.

Gradient for the circle program generated by traditional AD

One alternative is to use finite differences (FD). However, unlike reverse-mode AD, whose cost is a constant multiple of the forward pass regardless of the number of parameters, FD requires extra forward evaluations per parameter. Its cost therefore scales linearly with the parameter count, which is impractical for many applications.

Gradient for the circle program generated by FD

Fortunately, in many cases the gradient at discontinuities can be expressed using the Dirac delta distribution. In this post, we introduce the Dirac delta distribution and show how Aδ incorporates terms based on such Dirac deltas into an automatic differentiation framework to accurately differentiate programs with discontinuities.

Introducing the Heaviside Step Function and Dirac Delta Distribution

Mathematically, we can express piecewise functions analogous to “if” statements in a programming language using compositions of Heaviside step functions, which evaluate to 0 on one side of the origin and to 1 on the other.

Heaviside Step Function
Heaviside Step Function: https://en.wikipedia.org/wiki/Heaviside_step_function#/media/File:Dirac_distribution_CDF.svg
Mathematical equivalence between Heaviside step function and piecewise “if” function
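For reference, with the common convention that the step occurs at the origin (the value chosen exactly at 0 is a convention and does not affect the derivation):

$$H(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}
\qquad\qquad
\begin{cases} a(x), & g(x) \ge 0 \\ b(x), & g(x) < 0 \end{cases}
= H\bigl(g(x)\bigr)\,a(x) + \Bigl(1 - H\bigl(g(x)\bigr)\Bigr)\,b(x)$$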

There is some difference between the piecewise function notation used in math and “if” statements as typically implemented in programming languages: “if” statements are usually implemented as conditionals with branching execution, whereas math notation has no notion of execution. For the purposes of developing the math, however, we can disregard details of program execution.

The derivative of the Heaviside step function is the Dirac delta distribution. Formally, the Dirac delta is a linear functional best understood within the theory of distributions, which is beyond the scope of this post. Informally, though, we can build intuition by treating it as if it were a “function” (it is actually a distribution) that is infinite at the origin and 0 everywhere else, and whose integral over the entire real line (or over any interval containing the origin) equals one.

Dirac delta distribution
Dirac delta always integrates to one
Approximating the Dirac delta distribution as a limit of normal distributions as a → 0: https://upload.wikimedia.org/wikipedia/commons/b/b4/Dirac_function_approximation.gif
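In formulas:

$$\frac{d}{dx}H(x) = \delta(x), \qquad \delta(x) = 0 \ \text{for } x \ne 0, \qquad \int_{-\infty}^{\infty} \delta(x)\,dx = 1$$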

Two important properties of the Dirac delta are the sifting and scaling properties, which we will use to derive gradients for discontinuous programs.

Sifting Property
Scaling Property
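For a test function f and a function g with simple roots xᵢ (i.e. g(xᵢ) = 0 and g′(xᵢ) ≠ 0), these properties read:

$$\text{Sifting:}\quad \int_{-\infty}^{\infty} f(x)\,\delta(x - x_0)\,dx = f(x_0)
\qquad\qquad
\text{Scaling:}\quad \delta(ax) = \frac{\delta(x)}{|a|}, \qquad \delta\bigl(g(x)\bigr) = \sum_i \frac{\delta(x - x_i)}{\bigl|g'(x_i)\bigr|}$$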

Introducing Our Pre-Filtering and Sampling Axis

In the above intuition of the Dirac delta as a “function,” δ(x) evaluates to infinity only precisely at x = 0, so this can obviously be challenging to treat computationally. Thus, real-world applications often first pre-filter the discontinuous function (or program) to “fatten” the “spike to infinity” in δ(x) at x = 0, and avoid needing to directly sample precisely the location of the discontinuity.

To simplify the math, we assume the parameter being differentiated wrt is always orthogonal to the pre-filtering kernel direction. For example, given a function f(x, θ), we can first pre-filter with a 1D kernel ϕ(x) before differentiating wrt θ:

Pre-filtered function

Throughout this post, we use the hat symbol (^) over a function to indicate this pre-filtering. The pre-filter ϕ(x) is usually chosen to have compact support and to integrate to one over the real line. We call the direction of the 1D pre-filtering kernel (i.e. x in the above example) the sampling axis, noting that the 1D pre-filter can be oriented arbitrarily within the function’s parameter space. As we will see in later sections, our framework uses the sampling axis to locate the discontinuity and approximate the gradient.
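Concretely, the pre-filtered function is a 1D convolution of f with ϕ along the sampling axis. A sketch of the definition (for a symmetric kernel, the sign convention inside the argument does not matter):

$$\hat f(x, \theta) = \int_{-\infty}^{\infty} \phi(u)\, f(x + u, \theta)\, du, \qquad \int_{-\infty}^{\infty} \phi(u)\, du = 1$$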

Differentiating Discontinuous Functions

We first look at a simple example:

Simple Example 1

Differentiating f₀ wrt θ can be carried out as follows:

Differentiation of Example 1

In the above derivation, the second line comes from the chain rule, and the third line uses the Dirac delta’s scaling property. The last equation applies a change of variables to the argument inside the Dirac delta, which is needed if we pre-filter using the kernel ϕ(x) before differentiating. The xᵢ are the discontinuity locations for the given θ. We can now repeat this process using the pre-filtered function with a pre-filter kernel ϕ(x):

Differentiate Pre-filtered Example 1

The above derivation simply combines the previous equation with the sifting property of the Dirac delta. In practice, our framework can be shown to be approximately correct if the pre-filtering kernel ϕ(x) has small enough support that at most one discontinuity is sampled. Under this assumption, denoting the discontinuity location closest to x (if any) as xd, the gradient for the pre-filtered example becomes:

Differentiate Pre-filtered Example 1 assuming at most one discontinuity within kernel support
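Spelled out as a sketch: with xᵢ denoting the points where g(xᵢ, θ) = 0, the scaling property gives the unfiltered derivative, and convolving with ϕ and applying the sifting property (under the single-discontinuity assumption) collapses the sum to the single term at xd:

$$\frac{\partial f_0}{\partial \theta}
= \delta\bigl(g(x,\theta)\bigr)\,\frac{\partial g}{\partial \theta}
= \sum_i \frac{\delta(x - x_i)}{\bigl|\partial g/\partial x\bigr|}\,\frac{\partial g}{\partial \theta}\bigg|_{x = x_i}
\qquad\Longrightarrow\qquad
\frac{\partial \hat f_0}{\partial \theta}(x, \theta)
\approx \phi(x_d - x)\,\frac{\partial g/\partial \theta}{\bigl|\partial g/\partial x\bigr|}\bigg|_{x = x_d}$$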

This is the same as Equation (1) in our research paper. Having derived the pre-filtered gradient for f₀ = H(g(x, θ)), we are one step closer to differentiating arbitrary programs with discontinuities, but we are not quite there yet. Two particular challenges need to be addressed:

  • How do we determine the discontinuity location xd?
  • How do we differentiate more complex functions in which the Heaviside step function is combined with other operators, such as multiplication or function composition?

Sampling the Discontinuity

We first address the challenge of determining the discontinuity location.

If g(x, θ) has a simple form, e.g. it is affine or invertible, xd can be conveniently expressed as xd = g⁻¹(0; θ). However, this is not always the case, especially for arbitrary programs. In Aδ, the discontinuity location xd is instead detected by sampling at the two ends of the pre-filtering kernel’s support.

In Aδ, we always choose a 1D box kernel for ϕ(x):

Box pre-filtering kernel
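A sketch of this kernel, normalized so that it integrates to one over its support [-ϵ, ϵ]:

$$\phi(x) =
\begin{cases}
\dfrac{1}{2\epsilon}, & x \in [-\epsilon, \epsilon] \\[6pt]
0, & \text{otherwise}
\end{cases}$$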

(Note that our paper scales the endpoints -ϵ, ϵ of the box kernel by scale factors α, β, respectively, to allow the kernel to be positioned flexibly; in this tutorial we drop those scale factors to simplify the derivation.) A box kernel has the extra benefit of being piecewise constant: its value can easily be computed from the Boolean condition of whether xd ∈ [-ϵ, ϵ]. We can see this by plugging the box kernel into the derivative for the pre-filtered f₀:

Differentiate Pre-filtered Example 1 using box kernel

After the substitution, xd is still used in two places: in the function evaluation of g and in the Boolean condition.

For the function evaluation of g, as long as the function being evaluated is locally Lipschitz continuous, the error of the following approximation is always bounded by first-order terms in ϵ:

Approximating function evaluation using locally Lipschitz continuity

We now discuss the Boolean condition xd ∈ [-ϵ, ϵ]. Recall that xd denotes the discontinuity location, i.e. g(xd, θ) = 0, and that we assume the kernel support is small enough to contain at most one discontinuity. Under these assumptions, we can instead check whether the Heaviside step function H(g(x, θ)) has flipped state between the two ends of the kernel support:

The Boolean condition can be equivalently evaluated as whether the Heaviside step function has flipped state at the two ends of the kernel support

Here, we use g⁻, g⁺ to denote the function evaluations at the two ends of the box kernel’s support.
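In symbols, writing the kernel support in coordinates centered at the current sample x:

$$x_d \in [x - \epsilon,\, x + \epsilon]
\quad\Longleftrightarrow\quad
H(g^-) \ne H(g^+),
\qquad g^{\pm} = g(x \pm \epsilon, \theta)$$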

Before we piece everything together, there is one extra trick we can apply to make the gradient program more efficient. Since we are already evaluating at the two ends of the box kernel’s support, we can use a finite difference to replace the denominator |∂g/∂x| in the derived gradient, again assuming local Lipschitz continuity of g.

Finite difference approximation along the sampling axis
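That is, using the two end-point samples g⁻ and g⁺:

$$\left|\frac{\partial g}{\partial x}\right| \approx \frac{\bigl|g^+ - g^-\bigr|}{2\epsilon}$$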

Note that because we already assumed the sampling axis x is orthogonal to the parameter θ being differentiated wrt, partial and total differentiation are equivalent here. Additionally, approximating |∂g/∂x| with a finite difference avoids an extra backward pass for differentiating wrt x.

Combining everything together, we have the following approximation:

Differentiate Pre-filtered Example 1 using box kernel with approximation

Now we are ready to summarize our first gradient rule, for differentiating the Heaviside step function. We use ∂O f (the O indicates “our method,” to distinguish it from the usual calculus derivative) to denote differentiating a pre-filtered function using ϕ(x):

Aδ gradient rule for Heaviside step function
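Combining the pieces above, the rule is presumably of the following form (a sketch; see the paper for the exact statement). The 1/(2ϵ) factor from the box kernel cancels against the 2ϵ in the finite-difference denominator:

$$\frac{\partial^O}{\partial \theta} H\bigl(g(x, \theta)\bigr)
= \frac{\mathbb{1}\bigl[H(g^+) \ne H(g^-)\bigr]}{\bigl|g^+ - g^-\bigr|}\,
\frac{\partial^O g}{\partial \theta}$$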

First, note that this is quite different from the ordinary differentiation operator taught in high school, which traditional AD uses: that operator gives a derivative of zero almost everywhere, and is undefined where g = 0. Our derivative operator, in contrast, recovers a sampled approximation to a pre-filtered Dirac delta. Note also that instead of directly computing ∂g/∂θ on the right-hand side as in our previous derivation, we use ∂O g/∂θ, which allows reverse-mode automatic differentiation to be applied recursively to g and to all the compute nodes it depends on. Finally, we omit O(ϵ) error terms for simplicity, but in the paper we prove by induction that recursively applying our gradient rules to a certain subset of the programs expressible in our DSL gives a first-order error bound O(ϵ). We therefore refer to our rules as being “first-order correct.”
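As a quick numerical illustration of this rule (not the Adelta library itself; the setup, parameter values, and reference computation below are all illustrative assumptions), we can compare the two-sample estimate against a brute-force derivative of the box-pre-filtered step for the circle’s radius, at a pixel on the circle boundary:

```python
# Numeric sanity check of the two-sample Heaviside rule on the circle example.
# All values below (center, radius, scanline, epsilon) are illustrative assumptions.
import numpy as np

ox, oy, r = 0.0, 0.0, 1.0   # circle center and radius
y = 0.3                     # fixed scanline; the sampling axis is x
eps = 1e-3                  # half-width of the box pre-filtering kernel

def g(x, rr):
    # Branch condition of the circle shader: positive inside the circle.
    return rr ** 2 - (x - ox) ** 2 - (y - oy) ** 2

def H(v):
    return 1.0 if v >= 0 else 0.0

def adelta_dH_dr(x, rr):
    # Two-sample rule: indicator that the branch flips across the kernel support,
    # divided by the finite difference of g along the sampling axis,
    # times dg/dr at the current sample (g is smooth in r, so dg/dr = 2r here).
    g_minus, g_plus = g(x - eps, rr), g(x + eps, rr)
    if H(g_minus) == H(g_plus):
        return 0.0
    return (2.0 * rr) / abs(g_plus - g_minus)

def reference_dH_dr(x, rr, dr=1e-4, n=20001):
    # Brute force: pre-filter H(g) with the box kernel by numerical quadrature,
    # then central-difference the pre-filtered value with respect to r.
    u = np.linspace(x - eps, x + eps, n)
    prefiltered = lambda r_: np.mean(np.heaviside(g(u, r_), 1.0))
    return (prefiltered(rr + dr) - prefiltered(rr - dr)) / (2 * dr)

# A pixel right on the circle boundary along this scanline.
x_boundary = ox + np.sqrt(r ** 2 - (y - oy) ** 2)
print(adelta_dH_dr(x_boundary, r))     # two-sample estimate, roughly 524
print(reference_dH_dr(x_boundary, r))  # quadrature reference, roughly the same
```

The two estimates agree up to the expected O(ϵ) error, whereas ordinary AD would return exactly zero at this pixel.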

Aδ’s Product Rule

The next challenge is how to differentiate an arbitrary program (or function) that combines the Heaviside step function with other operators, such as multiplication or composition with atomic functions (sin, cos, sqrt, etc.).

One straightforward approach would be to apply to these operators the ordinary rules of calculus taught in high school, exactly as traditional AD does. However, as the following counterexamples show, these rules break down because they only sample on one side of the discontinuity.

Let’s first look at a simple multiplication example:

Simple multiplication example
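(From the simplification discussed below, this example is presumably the step function multiplied by itself:)

$$f_1(x, \theta) = H\bigl(g(x, \theta)\bigr)\cdot H\bigl(g(x, \theta)\bigr)$$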

On one hand, because we can make the simplification f₁(x, θ) = H(g(x, θ)), its pre-filtered derivative can easily be computed using the rule derived in the last section:

First-order correct derivative for f₁ using its equivalence to H(g(x, θ))

On the other hand, if we directly combine the traditional calculus product rule (i.e. what AD uses) with our rule for the Heaviside step function, we get the following incorrect result:

Incorrect derivative for f₁ when using the calculus product rule for multiplication

The above result, which relies on the calculus product rule used by AD, is clearly different from what we derived before. The error is that naively applying the calculus product rule assumes it is safe to differentiate under the integral associated with the pre-filter; doing so, however, produces a discontinuous term (2H(g(x, θ))) multiplied by a Dirac delta term (δ(g(x, θ))), both singular at the same location (where g(x, θ) = 0), and such a product is ill-defined.

Instead, we develop a product rule that can correctly handle such cases by sampling on both sides of the discontinuity:

Aδ product rule
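The rule is presumably of the following form, where the +/− superscripts denote evaluation at the two ends of the kernel support, so each factor is replaced by the average of its two-sided samples (a sketch; see the paper for the exact statement):

$$\frac{\partial^O (f \cdot h)}{\partial \theta}
= \frac{h^+ + h^-}{2}\,\frac{\partial^O f}{\partial \theta}
+ \frac{f^+ + f^-}{2}\,\frac{\partial^O h}{\partial \theta}$$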

We can verify that this rule gives the correct gradient for f₁(x, θ):

First-order correct derivative for f₁ when using Aδ’s product rule

The last line of the equation above uses the fact that H(g) can only take two values, 0 or 1, and that in this case we are given H(g⁺) ≠ H(g⁻), so H(g⁺) + H(g⁻) = 1.

Aδ’s Chain Rule

We now motivate the design of Aδ’s chain rule, which handles function composition, by rewriting the same functions f₀ and f₁ from before in a new way: as H(g(x, θ)) composed with a squaring atomic function.

Simple function composition example
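(That is, the example is presumably:)

$$f_2(x, \theta) = \Bigl(H\bigl(g(x, \theta)\bigr)\Bigr)^2$$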

We can easily compute its gradient using the equivalence f₂(x, θ) = H(g(x, θ)) together with our rule for the Heaviside step function:

First-order correct derivative for f₂ using its equivalence to H(g(x, θ))

Applying the traditional calculus chain rule to the function composition, together with our differentiation rule for the Heaviside step, again gives an incorrect result:

Incorrect derivative for f₂ when using calculus chain rule

The incorrect result obtained with the traditional calculus chain rule is caused by the same issue as in the multiplication case above: it is again unsafe to differentiate under the integral, because doing so generates a multiplication between a discontinuous term (H(g(x, θ))) and a Dirac delta term (δ(g(x, θ))), both singular at the same location. Our rule, on the other hand, avoids this by sampling function values before and after composition on both sides of the discontinuity:

Aδ’s chain rule used for function composition
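A sketch of the composition rule for h(f), with the exact branch condition deferred to the paper: when no discontinuity is involved it reduces to the ordinary chain rule, and otherwise it replaces the derivative of the outer function h by a two-sided secant slope:

$$\frac{\partial^O h(f)}{\partial \theta}
=
\begin{cases}
h'(f)\,\dfrac{\partial^O f}{\partial \theta}, & \text{no discontinuity involved} \\[10pt]
\dfrac{h(f^+) - h(f^-)}{f^+ - f^-}\,\dfrac{\partial^O f}{\partial \theta}, & \text{a Heaviside step is involved}
\end{cases}$$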

The first case uses the ordinary calculus chain rule; it serves both for efficient computation when no discontinuity is present and as a safeguard for the scenario where the sampling axis x is unable to sample the discontinuity. We will not discuss this case in detail; it suffices to say that we use the second line of our composition rule whenever a Heaviside step function is involved. Again, we can verify that the rule gives the correct gradient for f₂(x, θ):

First-order correct derivative for f₂ when using Aδ’s gradient rule for function composition

Summary

This post motivated and explained the fundamental formulas used in Aδ, a framework with a new set of gradient rules for automatically differentiating arbitrary functions with discontinuities. The table below compares the rules of Aδ (denoted O for “Ours”) and traditional AD, which uses the ordinary rules of calculus.

Gradient rule summary for Aδ (O) and traditional AD

Aδ’s gradient rules can be used to efficiently carry out a new kind of reverse-mode automatic differentiation, in which the usual single-point samples used by AD are replaced with two samples, denoted + and − in the table above. In our paper, we prove these rules are first-order correct for a large set of programs constructed using our domain-specific language (DSL) syntax. In the next post, we will introduce Aδ’s DSL and the broad range of discontinuous programs it can express.

Reference

Yuting Yang, Connelly Barnes, Andrew Adams, and Adam Finkelstein. “Aδ: Autodiff for Discontinuous Programs — Applied to Shaders.” ACM SIGGRAPH 2022. [project page]
