M2.1 Softmax in Julia

josh bowles
6 min read · Dec 30, 2018


The goal of this post is not to motivate and explain softmax, nor to provide the best softmax function for Julia, nor to introduce the Julia language. I’ll dispense with philosophical talk of the “mathematical Ideal” and why I think mathematics is a poetic endeavour. But, keeping consistent with the theme of my first post, I’ll be implicitly exemplifying why the act of doing mathematics today means accepting a form of abstraction towards simplicity, i.e. an Ideal, by embracing an ever-growing deconstruction of the elegance of a mathematical function, proof, or equation.

Doing math on and with computers is an ever-growing domain, and it is fundamentally different from the way we learn math in school. Namely, it requires deconstructing elegant equations, accepting approximations, reducing to discretization, and encoding mathematical primitives as programming language types.

The Softmax function is the focus, and it is a good choice: in the previous post I showed how you can get to softmax simply by playing around with Euler’s number. There is no need to start from an elegant equation and break it down; instead, one can start at “the messy bottom”, so to speak, and work up to abstraction (i.e., the mathematical Ideal).

Softmax is also a good choice because it is a heavily used function in some of the world’s most advanced software. Machine Learning applications have used softmax as a “differentiable” argmax for multi-class classification for many years. The avant-garde Deep Learning frameworks, Tensorflow, PyTorch, MXNet, Flux, Gorgonia, etc., all use softmax (typically for the loss function in the last layer of a network where one needs multi-class classification). For this reason, one can find many resources on softmax, including software implementations embedded within the advanced software projects I mention. Note that the Python projects (Tensorflow, PyTorch, MXNet) all depend on low-level C++ execution, which means you’ll likely find both Python and C++ implementations and it’ll be quite noisy. If you want to see an implementation in a cutting-edge software project, I’d suggest either FluxML (written in Julia) or Gorgonia (written in Go).

Julia Implementation

The easiest way to begin is with a naive version, which comes about as close as one can to matching in software what one finds in mathematical representations.

softmax function

f(x) = exp.(x) ./ sum(exp.(x))
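As a quick sanity check of the naive version (the score vector below is the same one used later in the post; the call itself is my addition), the outputs are positive and sum to 1:

scores = [1.1, 5.0, 2.8, 7.3]
f(scores)         # ≈ [0.0018, 0.0900, 0.0100, 0.8982] (rounded)
sum(f(scores))    # ≈ 1.0 — the outputs form a probability distribution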

What I find interesting here is that the elegance of the mathematical representation is considered naive from a computational point of view. In all branches of mathematics, and to all working mathematicians, evolving ideas toward representational elegance, simplicity, and interpretability (e.g., abstraction towards simplicity, what I called in the previous post the mathematical Ideal) is the pinnacle of mathematical achievement. However, in the computational world, where finite resources are the reality and discrete values the necessity, such Idealism is cast as naive.

It’s a funny tension. Software writers appreciate elegance, simplicity, and interpretability too, but not at the cost of broken computation and/or buggy programs.

What is broken in f(x) = exp.(x) ./ sum(exp.(x))? The Julia implementation will eventually overflow its numerical capacity if it is given numbers that are too large.
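A minimal way to see the failure with the f defined above (a sketch of my own, not from the original post):

f([1.1, 5.0, 2.8, 7.3])        # fine — the inputs are small and exp stays finite
exp(710.0)                      # Inf — exp already overflows Float64 just above x ≈ 709.78
f([1000.0, 2000.0, 3000.0])     # exp.(x) is all Inf, and Inf/Inf is NaN → [NaN, NaN, NaN]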

Python, Go, and Rust will too. Any programming language will. One needs to understand the numerical limits of both the programming language and the device. In Julia’s case, (typemin(Float64), typemax(Float64)) comes back as (-Inf, Inf): these are sentinel values for “beyond the representable range”, not finite numbers you can compute with. The largest finite Float64 is floatmax(Float64) ≈ 1.8e308, and exp(x) already overflows to Inf once x exceeds roughly 709.8. Machine epsilon, eps(Float64) == 2.220446049250313e-16, measures the spacing between adjacent floats near 1.0 (note that eps(Inf) is NaN, since Inf has no meaningful neighbor); it is not an upper limit, but it is a reminder of how coarse floating point really is. The naive softmax above blows past the finite range as soon as its inputs get even moderately large.

julia> (typemin(Float64), typemax(Float64))
(-Inf, Inf)
julia> (typemin(Inf), typemax(Inf))
(-Inf, Inf)
julia> eps(Inf)
NaN
julia> eps(Float64)
2.220446049250313e-16

To provide numerical stability, subtract the largest value from every element. Now [1.1, 5.0, 2.8, 7.3] becomes [1.1-7.3, 5.0-7.3, 2.8-7.3, 7.3-7.3]. The softmax result is unchanged, because the common factor exp(-max) cancels out of the numerator and denominator, and since the largest exponent is now 0, exp can no longer overflow.

[image: theta_sub_max]
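The screenshot above isn’t recoverable as text, so here is a sketch of what it most likely showed: a softmax that subtracts the maximum before exponentiating, with the scaling parameter θ discussed next (softmax_stable and its argument names are my own, not necessarily the post’s exact code):

function softmax_stable(x, θ = 1.0)
    z = (x .- maximum(x)) .* θ    # shift so the largest exponent is exactly 0, then scale by θ
    e = exp.(z)
    return e ./ sum(e)
end

softmax_stable([1000.0, 2000.0, 3000.0])   # ≈ [0.0, 0.0, 1.0] — finite probabilities now, instead of NaN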

One last thing we do is provide a scaling parameter θ (theta) that lets us spread the values out. Creating greater distance between the small and large values makes the relative boundaries between them easier to see. We defined θ = 2.5. Being mindful of the computations is a good thing too. That is, we get the same result whether we multiply all our numbers by θ and then subtract the max [two scalar-vector multiplications, (scores4 * θ .- maximum(scores4 * θ))], or subtract the max and then multiply by θ [one scalar-vector multiplication, (scores4 .- maximum(scores4)) * θ].
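A quick check of that equivalence, reusing the post’s θ = 2.5 and score values (the variable names scores4 and θ follow the text; the comparison itself is my addition):

θ = 2.5
scores4 = [1.1, 5.0, 2.8, 7.3]
a = scores4 * θ .- maximum(scores4 * θ)   # scale first, then subtract the max: two scalar-vector multiplications
b = (scores4 .- maximum(scores4)) * θ     # subtract the max first, then scale: one scalar-vector multiplication
a ≈ b                                     # true (for any θ > 0), so the cheaper second form is preferable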

We’ve gone from original values [1.1, 5.0, 2.8, 7.3], to subtracted maximum values [-6.199999999999999, -2.3, -4.5, 0.0], to scaled and subtracted maximum values [-15.499999999999998, -5.75, -11.25, 0.0]. You can see we’ve increased the relative distances quite a bit.

[image: softmax in julia]

Since we are doing this in Julia we can be mindful of our types. Julia provides a rich set of numeric types as well as vector and matrix types. I can use the Real type for all real numbers, or various abstract types that allow me to construct functions that accept a wide variety of Vector, Matrix, and Float types.
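For instance, one might annotate a vector method and a matrix method with abstract types so that a single name covers Float64, Float32, integers, and so on (a sketch using my own name softmax_typed, not the post’s exact code):

# accepts any vector of real numbers (Float64, Float32, Int, ...)
function softmax_typed(x::AbstractVector{<:Real}, θ::Real = 1.0)
    z = (x .- maximum(x)) .* θ
    e = exp.(z)
    return e ./ sum(e)
end

# accepts any real matrix, applying softmax column by column
softmax_typed(X::AbstractMatrix{<:Real}, θ::Real = 1.0) =
    mapslices(col -> softmax_typed(col, θ), X; dims = 1)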

Julia also has what is called multiple dispatch, wherein the types of a function’s arguments determine which method gets called. This means I can write many different functions called softmax that accept different parameters, and Julia’s compiler will generate specialized methods for each combination of argument types. I can inspect those methods for type stability and work towards faster-executing programs by choosing types and type combinations that yield better compiler output (the methods shown in the original screenshot include a Union; we probably don’t want that, and in an optimization pass we’d likely refactor a bit depending on our goals for the program).
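With methods like the sketch above defined, dispatch and type stability can be inspected directly in the REPL (InteractiveUtils, which provides @code_warntype, ships with Julia and is loaded in the REPL by default):

methods(softmax_typed)                                    # lists the vector and matrix methods defined above
@code_warntype softmax_typed([1.1, 5.0, 2.8, 7.3], 2.5)   # highlights non-concrete inferred types (e.g. Union, Any)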

The goal of this post is not to motivate and explain softmax, nor to provide the best softmax function for Julia, nor to introduce the Julia language. Most software needing softmax already has it implemented, and implemented in ways that best suit the program’s goals, the programming language’s properties, and the devices and hardware engineers expect it to run on.

Instead, the goal here is to highlight just how far away this softmax implementation is from the Ideal representation found in mathematical texts. In considering Julia’s type system, how functions and methods are compiled, and the numerical guts of this specific language, we’ve traveled far from the simplicity and elegance of the equation.

The likelihood that any of us will be doing impactful, real-world math on paper is pretty low (outside of teaching and higher-order theory). And if you are in the applied or computational math world, then you likely use paper for learning, notes, ideas, sketches, and personal enjoyment, none of which take the place of software. And in software, the gap between the mathematical Ideal and a real-world computable value is wide.

Other Resources for softmax

Honestly, softmax is everywhere these days. Just take a Udacity or Coursera class on data science or deep learning, pick up any recent book on deep learning, or even older books on machine learning (randomly grabbing two books in my office, published in 2009 and 2004 respectively, I find both have appendix entries for softmax).

But if you want some quick reads and/or a refresher, here are some resources I recommend. Two blog posts:

And a selection from an online book:

