I. Introduction

This is the second part of my blog series on the implementation of a nonlinear optimization solver, which I have resumed after a brief hiatus. In the previous post, I touched upon the fundamental building blocks of such a solver: the differentiation engine, the linear algebra engine, the numerics, and so on. In this article, I would like to focus specifically on the implementation of the differentiation engine and discuss the various methods for computing derivatives of mathematical functions.

Derivatives of mathematical functions are part and parcel of an optimization solver. The Jacobians and Hessians of the objective and the equality/inequality constraints form the basis for solving the underlying nonlinear optimization problem. Derivative-based methods are typically classified into two types: first order methods and second order methods. First order methods primarily require two pieces of information to solve the underlying nonlinear optimization problem:

  • The objective and constraint functions as oracles, i.e. the ability to query each function and obtain the appropriate deterministic output.
  • The Jacobians, i.e. the gradients a.k.a. first order derivatives of those functions (if the function is scalar-valued, the Jacobian is a vector; if the function is vector-valued, the Jacobian is a matrix).

The second order methods require the aforementioned information of the first order methods as well as the Hessians of the mathematical functions, a.k.a. the second order derivatives (predominantly the Hessian of the objective function).
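
As an illustration of why the Hessian enters the picture, a textbook full Newton step for an unconstrained problem solves

\[ \nabla^2 f(x_k)\, \Delta x_k = -\nabla f(x_k), \qquad x_{k+1} = x_k + \Delta x_k , \]

so every iteration needs both the gradient and the Hessian of the objective.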

Thus, in order to solve a nonlinear optimization problem using either first or second order methods, the computation of derivatives becomes indispensable. The simplest way to obtain the derivatives of the mathematical functions is to work them out by hand using the basic rules of calculus and implement them in code. However, this approach is error prone and becomes untenable when the mathematical expressions are large and deeply nested, or when the number of optimization variables grows very large. Therefore, a computer aided method is the only practical recourse for computing derivatives in large optimization problems.

In the literature on computer aided derivative computation, one comes across three different classes of methods, which I have listed below

  • Symbolic differentiation (SD)
  • Automatic/Algorithmic differentiation (AD)
  • Numerical differentiation

For the last category, as the name suggests, the derivatives are approximated using simple mathematical tricks (e.g. the complex-step trick [1]) or directly from the limit definition of a derivative. The following example illustrates a simple way to approximate the first order derivative of a univariate function using a forward difference (the same approximation that underlies the forward Euler method)

Fig1. Forward-difference approximation of the derivative of a function f, where x is the optimization variable and h is the step size
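
As a minimal sketch of this idea in code (the function names are illustrative and not part of the solver), both the forward difference and the complex-step approximation of [1] fit in a few lines:

#include <cmath>
#include <complex>
#include <iostream>

// Forward difference: f'(x) ≈ (f(x + h) - f(x)) / h, accurate to O(h).
template <class F>
double forward_difference(F f, double x, double h = 1e-8) {
    return (f(x + h) - f(x)) / h;
}

// Complex-step approximation [1]: f'(x) ≈ Im(f(x + ih)) / h,
// which avoids the subtractive cancellation of finite differences.
template <class F>
double complex_step(F f, double x, double h = 1e-20) {
    return std::imag(f(std::complex<double>(x, h))) / h;
}

int main() {
    auto f = [](auto x) { return std::sin(x) * std::cos(x); };   // works for double and complex
    std::cout << forward_difference(f, 0.7) << '\n';
    std::cout << complex_step(f, 0.7) << '\n';
    std::cout << std::cos(1.4) << '\n';                          // exact derivative cos(2x) at x = 0.7
}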

Numerical differentiation methods provide numerical approximations of the derivatives, whereas the SD and AD classes compute the exact derivative of the underlying function. If you look through articles, blogs, books, forums, etc., it is not uncommon to get perplexed about the definitions of and differences between the two classes, as there is no common consensus among users and authors on this topic. An enquirer may well be dismayed by the variegated replies and explanations offered for the two classes. Or are they really two different classes of methods? Well, I will try to provide my side of the explanation and hopefully convince you of my point of view :).

Going by the age old adage that "a picture is worth a thousand words", in the same spirit, the best way to explain and understand a technical concept is to code it up and visualize the algorithm. Therefore, in this article I will first talk about a modern C++ implementation of an SD module (which I have presented through a series of video lectures), and in subsequent articles I will cover AD (both forward and reverse mode) and, if possible, the state of the art in numerical differentiation methods.

II. Symbolic Differentiation (SD)

The textbook definition of SD is the following statement: "A symbolic differentiation program finds the derivative of a given formula with respect to a specified variable, producing a new formula as its output. In general, symbolic mathematics programs manipulate formulas to produce new formulas, rather than performing numeric calculations based on formulas" [2]. I believe this definition is self-explanatory; the next big question, however, is "What is the difference between SD and AD?". I have planned to talk about AD in my next blog, but in the interest of quelling this conundrum, let me provide a small heads-up on the topic and demystify the differences between AD and SD.

The definition of AD as per Wikipedia is the following statement: "In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation, computational differentiation, auto-differentiation, or simply autodiff, is a set of techniques to evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program" [3]. This is once again self-explanatory; however, a little more depth into the subject reveals a great deal about the differences between AD and SD.

AD has two modes of operation: forward mode (a.k.a. tangent mode) and reverse mode (a.k.a. adjoint/co-tangent mode), where the former propagates derivatives from the inputs to the outputs and the latter from the outputs to the inputs. The working principle of SD is essentially the same as that of forward mode AD; the differences lie at the implementation level, or more specifically, in the underlying data structure each method uses. In [4], the author observes that forward mode AD typically represents the underlying mathematical expression as a directed acyclic graph (DAG), whereas SD uses an expression tree (ET). It is always possible to convert a DAG to an ET and vice versa and thereby establish an operational equivalence between SD and forward mode AD, but this conversion comes at a computational overhead.

It is also important to consider the context in which the two apply: in an SD framework, ultimately what matters is the correctness of the output, whereas in an AD framework the manner of implementation matters along with the correctness of the output. In short, from an operational perspective, SD can be viewed as a "careless" implementation [5] of forward mode AD, and this applies the other way around too. Such carelessness leads to a phenomenon known as expression swell, which slows down both the derivative computation and the expression evaluation. As an example, consider the following mathematical expression

Fig2. f is the mathematical function of interest and x1, x2 are the optimization variables
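
Judging from the AST class structure shown in part (a) below, the expression in question is presumably

\[ f(x_1, x_2) = \sin(x_1 + x_2)\,\cos(x_1 + x_2) . \]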

The computational flow diagrams of the above mathematical function in ET and DAG forms are given below

Fig3. The left image represents the ET form and the right image the DAG form of the mathematical expression

From the above computational flow diagrams, it is evident that the ET repeats the common sub-expression (i.e. the additive term), whereas the DAG avoids this duplication. In the case of SD, the chain rule is applied repeatedly to the expression when computing the derivative, and several common sub-expressions can seep in, leading to expression swell; a DAG-based framework avoids this. The coding part of the SD will be covered later in this article, but a quick sneak peek into the abstract syntax tree (AST), shown in (a) and (b) below, already expounds the phenomenon of expression swell through the explosion of nested template classes. First, however, a toy sketch of the ET-versus-DAG distinction.
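
The sketch below is purely illustrative: the Node type and the node() helper are hypothetical and are not the classes used in the actual implementation. It only shows that a tree duplicates the shared sub-expression, whereas a DAG lets two parents point to the same node.

#include <memory>
#include <string>
#include <vector>

// A toy node type: just enough structure to show duplication versus sharing.
struct Node {
    std::string op;                                 // "x1", "x2", "+", "*", "sin", "cos"
    std::vector<std::shared_ptr<Node>> kids;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr node(std::string op, std::vector<NodePtr> kids = {}) {
    return std::make_shared<Node>(Node{std::move(op), std::move(kids)});
}

int main() {
    // Expression tree: the common sub-expression (x1 + x2) is built twice.
    NodePtr tree = node("*", { node("sin", { node("+", { node("x1"), node("x2") }) }),
                               node("cos", { node("+", { node("x1"), node("x2") }) }) });

    // DAG: (x1 + x2) is built once; both sin and cos point to the same node.
    NodePtr sum = node("+", { node("x1"), node("x2") });
    NodePtr dag = node("*", { node("sin", { sum }), node("cos", { sum }) });

    (void)tree; (void)dag;                          // evaluation and differentiation omitted
}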

a) Symbolic mathematical expression AST class structure

class VariableProduct<class VariableSin<class VariableSum<class Variable,class Variable>>,class VariableCos<class VariableSum<class Variable,class Variable>>>

The above class structure is the realization of the ET for the mathematical expression illustrated in Fig3 (left diagram).

b) Symbolic derivative expression AST class structure

class VariableSum<class VariableProduct<class VariableProduct<class VariableCos<class VariableSum<class Variable,class Variable>>,class VariableSum<class Variable,class Variable>>,class VariableCos<class VariableSum<class Variable,class Variable>>>,class VariableProduct<class VariableSin<class VariableSum<class Variable,class Variable>>,class VariableProduct<class VariableProduct<double,class VariableSin<class VariableSum<class Variable,class Variable>>>,class VariableSum<class Variable,class Variable>>>>

So, by analogy, you can conjure up the ET for the above class structure, which corresponds to the derivative of the mathematical expression shown in Fig2.
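
Assuming, as inferred earlier, that the Fig2 expression is f = sin(x1 + x2) · cos(x1 + x2) and writing u = x1 + x2, one plausible reading of this class structure is the product and chain rule applied naively, with the derivative of u kept symbolic:

\[ \dot f = \big(\cos(u)\,\dot u\big)\cos(u) + \sin(u)\,\big((-1)\,\sin(u)\,\dot u\big), \qquad \dot u = \dot x_1 + \dot x_2 , \]

where the dotted terms are the seed variables introduced in Section III (there written xd1 and xd2). Note how u, sin(u) and cos(u) each appear several times without being shared, which is exactly the duplication visible in the nested template type.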

Let's not get into the implementation details just yet; all I am trying to say is that the derivative's computational tree contains several duplicates of common sub-expressions and has devolved into a complicated mess, yet what really matters is that it works and the output is correct! This is a typical scenario of expression swell in the case of SD.

III. SD Working Principle

To understand the working principle of SD, it is important to understand seeding and the computation of partial derivatives by cycling the seed vector. This concept is best understood through an example, so consider the following function

Fig4. f is the function of interest and x1 and x2 are the optimization variables

The goal here is to compute the partial derivatives of the above function with respect to x1 and x2. Assume that both x1 and x2 depend on some variable, say t, and that we compute the total derivative of the function f with respect to t. Then, by the rules of calculus (chain rule), the total derivative is computed as

Fig5. Total derivative of the function f using the law of total derivative

This can also be represented in the following form

Fig6. Compact form of the total derivative
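
In symbols, the law of total derivatives gives

\[ \frac{df}{dt} = \frac{\partial f}{\partial x_1}\frac{dx_1}{dt} + \frac{\partial f}{\partial x_2}\frac{dx_2}{dt}, \qquad \text{or compactly} \qquad \dot f = \frac{\partial f}{\partial x_1}\,xd_1 + \frac{\partial f}{\partial x_2}\,xd_2 , \]

where xd1 and xd2 denote the total derivative terms dx1/dt and dx2/dt used below.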

Now, to compute the partial derivatives with respect to x1 and x2, one has to seed values into the total derivative terms (xd1, xd2). If xd1 = 1 and xd2 = 0, then substituting these into the above equation yields the partial derivative with respect to x1, i.e.

Fig7. Partial derivative with respect to x1

If xd1 = 0 and xd2 = 1, then substituting these into the above equation yields the partial derivative with respect to x2, i.e.

Fig8. Partial derivative with respect to x2

If you compute the partial derivatives by hand, you obtain the same results. If you have n variables, the seed vector has to be cycled n times, each time with one component set to one and the rest set to zero. By doing so, you obtain all n partial derivatives. The condition on a seed vector with n variables can be represented mathematically in the following form

Fig9. Mathematical representation of a seed vector
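
Written out, the admissible seed vectors s for n variables are

\[ s \in \{0, 1\}^n, \qquad \sum_{i=1}^{n} s_i = 1 , \]

i.e. the canonical basis vectors e_1, ..., e_n; choosing s = e_i recovers the partial derivative with respect to x_i.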

Just for the mathematically inclined: these are precisely the integer points, i.e. the vertices, of the standard simplex.

The concept of seed cycling can be implemented in multiple ways; in this article, I use a preorder traversal of the generated ET to evaluate the expression and compute the corresponding derivatives. A generic sketch of the cycling loop itself is given below.
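
The sketch assumes the expression object exposes some seeded derivative evaluation, here written f.deriv(x, seed); this is a hypothetical interface (Section IV sketches one concrete compile-time representation), but the cycling loop looks the same regardless of how the expression is stored.

#include <cstddef>
#include <vector>

// Seed cycling: n passes over the expression, the i-th pass seeded with e_i.
template <class Expr>
std::vector<double> gradient(const Expr& f, const std::vector<double>& x) {
    std::vector<double> grad(x.size(), 0.0), seed(x.size(), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        seed[i] = 1.0;                  // seed vector e_i: i-th component one, rest zero
        grad[i] = f.deriv(x, seed);     // i-th partial derivative of f
        seed[i] = 0.0;                  // reset before the next cycle
    }
    return grad;
}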

IV. SD Implementation

Although I have explained some of the theory behind SD, when it comes to practice the implementation has its own divisions. In the context of the C++ programming language, SD can be implemented in two different ways: compile-time SD and run-time SD.

A compile-time SD implementation in C++ utilizes advanced concepts such as template meta-programming, operator overloading, design patterns such as CRTP for static polymorphism, expression templates, mixin inheritance, etc., which help structure and process the AST at compile time. One therefore obtains fast performance with little or no runtime overhead (e.g. no construction of v-tables as in dynamic polymorphism). The downsides include longer compilation times and, most importantly, the inability to change expressions dynamically: every expression has to be written explicitly and remain static, so that the compiler can process it at the very beginning of the compilation stage.
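
To make this concrete, below is a minimal, self-contained sketch of the idea, restricted to the four operations of the POC discussed later in this section (+, *, sin, cos). It deliberately skips operator overloading, CRTP and any expression simplification, and the type names (Var, Seed, Sum, Product, Sin, Cos) are illustrative rather than the VariableXxx classes from the video lectures.

#include <array>
#include <cmath>
#include <cstddef>
#include <iostream>

using Vec2 = std::array<double, 2>;   // values of (x1, x2) or the seeds (xd1, xd2)

// Constant node (used for the -1 that appears when differentiating cos).
struct Const {
    double v;
    double eval(const Vec2&, const Vec2&) const { return v; }
};

// Seed node xd_i: evaluates to the i-th entry of the seed vector.
struct Seed {
    std::size_t idx;
    double eval(const Vec2&, const Vec2& xd) const { return xd[idx]; }
};

// Optimization variable x_i; its derivative is the corresponding seed node.
struct Var {
    std::size_t idx;
    double eval(const Vec2& x, const Vec2&) const { return x[idx]; }
    Seed diff() const { return Seed{idx}; }
};

template <class A> struct Sin;
template <class A> struct Cos;

template <class L, class R>
struct Sum {
    L l; R r;
    double eval(const Vec2& x, const Vec2& xd) const { return l.eval(x, xd) + r.eval(x, xd); }
    auto diff() const {                                     // (u + v)' = u' + v'
        return Sum<decltype(l.diff()), decltype(r.diff())>{l.diff(), r.diff()};
    }
};

template <class L, class R>
struct Product {
    L l; R r;
    double eval(const Vec2& x, const Vec2& xd) const { return l.eval(x, xd) * r.eval(x, xd); }
    auto diff() const {                                     // product rule: (uv)' = u'v + uv'
        auto a = Product<decltype(l.diff()), R>{l.diff(), r};
        auto b = Product<L, decltype(r.diff())>{l, r.diff()};
        return Sum<decltype(a), decltype(b)>{a, b};
    }
};

template <class A>
struct Sin {
    A a;
    double eval(const Vec2& x, const Vec2& xd) const { return std::sin(a.eval(x, xd)); }
    auto diff() const {                                     // chain rule: (sin u)' = cos(u) u'
        return Product<Cos<A>, decltype(a.diff())>{Cos<A>{a}, a.diff()};
    }
};

template <class A>
struct Cos {
    A a;
    double eval(const Vec2& x, const Vec2& xd) const { return std::cos(a.eval(x, xd)); }
    auto diff() const {                                     // (cos u)' = (-1) sin(u) u'
        auto neg_sin = Product<Const, Sin<A>>{Const{-1.0}, Sin<A>{a}};
        return Product<decltype(neg_sin), decltype(a.diff())>{neg_sin, a.diff()};
    }
};

int main() {
    Var x1{0}, x2{1};
    Sum<Var, Var> u{x1, x2};                                // u = x1 + x2
    Product<Sin<Sum<Var, Var>>, Cos<Sum<Var, Var>>> f{Sin<Sum<Var, Var>>{u},
                                                      Cos<Sum<Var, Var>>{u}};

    auto df = f.diff();                                     // derivative expression (a new AST type)
    Vec2 x{0.3, 0.4};
    std::cout << "f      = " << f.eval(x, {0.0, 0.0}) << '\n';
    std::cout << "df/dx1 = " << df.eval(x, {1.0, 0.0}) << '\n';   // seed (1, 0)
    std::cout << "df/dx2 = " << df.eval(x, {0.0, 1.0}) << '\n';   // seed (0, 1)
}

Note that the deduced type of df mirrors, up to naming, the swelled class structure shown in Section II(b): nothing is shared or simplified, which is precisely the expression swell discussed earlier. Evaluating df with the seeds (1, 0) and (0, 1) cycles through the partial derivatives.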

A run-time SD implementation can be programmed in two ways:

  1. By using the same approach as compile-time SD, except that a separate run-time data structure has to be maintained in place of the compile-time AST. This provides the flexibility to manipulate the elements of the data structure and thus dynamically modify the mathematical expression.
  2. Alternatively, one can start from a textual form of the mathematical expression and, by means of tokenizing, parsing, lexical analysis, tree reduction, etc. (just like a compiler), build a data structure that stores all the elementary operations along with their operands. Differentiating the mathematical expression then amounts to differentiating the generated data structure, which yields a new data structure for the derivative expression. A minimal sketch of such a run-time node hierarchy is given below.
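
Either way, the run-time representation boils down to some polymorphic node hierarchy. The following is a minimal sketch under that assumption, with illustrative names and only +, * and variables (the same idea extends to sin, cos, etc.); it is not the structure used in the videos.

#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

// Base node of a run-time expression AST.
struct Expr {
    virtual ~Expr() = default;
    virtual double eval(const std::vector<double>& x) const = 0;
    virtual std::shared_ptr<Expr> diff(std::size_t i) const = 0;   // d/dx_i as a new tree
};
using ExprPtr = std::shared_ptr<Expr>;

struct Const : Expr {
    double v;
    explicit Const(double v) : v(v) {}
    double eval(const std::vector<double>&) const override { return v; }
    ExprPtr diff(std::size_t) const override { return std::make_shared<Const>(0.0); }
};

struct Var : Expr {
    std::size_t idx;
    explicit Var(std::size_t idx) : idx(idx) {}
    double eval(const std::vector<double>& x) const override { return x[idx]; }
    ExprPtr diff(std::size_t i) const override { return std::make_shared<Const>(i == idx ? 1.0 : 0.0); }
};

struct Add : Expr {
    ExprPtr l, r;
    Add(ExprPtr l, ExprPtr r) : l(std::move(l)), r(std::move(r)) {}
    double eval(const std::vector<double>& x) const override { return l->eval(x) + r->eval(x); }
    ExprPtr diff(std::size_t i) const override {            // (u + v)' = u' + v'
        return std::make_shared<Add>(l->diff(i), r->diff(i));
    }
};

struct Mul : Expr {
    ExprPtr l, r;
    Mul(ExprPtr l, ExprPtr r) : l(std::move(l)), r(std::move(r)) {}
    double eval(const std::vector<double>& x) const override { return l->eval(x) * r->eval(x); }
    ExprPtr diff(std::size_t i) const override {            // product rule
        return std::make_shared<Add>(std::make_shared<Mul>(l->diff(i), r),
                                     std::make_shared<Mul>(l, r->diff(i)));
    }
};

int main() {
    // f(x1, x2) = (x1 + x2) * x2, assembled (and modifiable) at run time.
    auto x1 = std::make_shared<Var>(0), x2 = std::make_shared<Var>(1);
    ExprPtr f = std::make_shared<Mul>(std::make_shared<Add>(x1, x2), x2);

    ExprPtr df_dx2 = f->diff(1);                            // a brand new derivative tree
    std::vector<double> x{2.0, 3.0};
    std::cout << f->eval(x) << " " << df_dx2->eval(x) << '\n';   // prints 15 and 8
}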

In this article, as a proof of concept (POC), I only cover compile-time SD with just four operations (+, *, sin, cos) for the expression shown in Fig2, along with the numerical evaluation of the expression (the "careless" forward AD implementation). So, with all that has been said and explained, let me jump right into the coding part of a simple compile-time SD. I have attached the videos (in four parts) below.

Video1. Symbolic differentiation C++ implementation Part 1
Video2. Symbolic differentiation C++ implementation Part 2
Video3. Symbolic differentiation C++ implementation Part 3
Video4. Symbolic differentiation C++ implementation Part 4

V. Conclusions

In this article, I have explained a simple implementation of SD and presented my observations on the differences between SD and AD. I strongly encourage the reader to research the topic themselves and understand the differences and nuances of both. In the next part, I will discuss forward and reverse mode AD in detail.

References

[1] URL https://nhigham.com/2020/10/06/what-is-the-complex-step-approximation/

[2] URL https://www.cs.utexas.edu/users/novak/asg-symdif.html

[3] URL https://en.wikipedia.org/wiki/Automatic_differentiation

[4] Laue, S., 2019. On the equivalence of forward mode automatic differentiation and symbolic differentiation. arXiv preprint arXiv:1904.02990.

[5] Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18, pp.1–43.
