Machine learning: Sigmoid function, softmax function, and exponential family

The sigmoid and softmax functions are commonly used in machine learning. Like the least-squares error in linear regression, they can be derived from a few basic assumptions using the general form of the exponential family. Some basic linear regression and classification algorithms can also be derived from this general form. Let’s dig in and see where these seemingly mysterious functions come from.

Exponential family

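The exponential family includes the Gaussian, binomial, multinomial, Poisson, Gamma and many other distributions. Loosely speaking, a distribution belongs to the exponential family if it can be written in the general form:

$$p(x \mid \eta) = h(x) \exp\left( \eta^{\top} T(x) - A(\eta) \right)$$

where

η is the canonical (natural) parameter
T(x) is the sufficient statistic
A(η) is the cumulant function (log-partition function)
h(x) is the base measure, which does not depend on η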
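
To make the general form concrete, here is a minimal sketch that evaluates h(x)·exp(η·T(x) − A(η)) for a scalar η and compares it with the textbook Poisson formula. The Poisson choices (η = log λ, T(x) = x, A(η) = e^η, h(x) = 1/x!) are the standard ones; the function names and the rate λ = 3 are illustrative assumptions, not part of the original article.

```python
import math


def exp_family_pmf(x, eta, T, A, h):
    """Evaluate h(x) * exp(eta * T(x) - A(eta)) for a scalar parameter eta."""
    return h(x) * math.exp(eta * T(x) - A(eta))


# Illustrative rate parameter (an assumption for this sketch).
lam = 3.0
eta = math.log(lam)  # canonical parameter of the Poisson distribution


def poisson_general_form(x):
    """Poisson pmf written in the exponential-family general form."""
    return exp_family_pmf(
        x,
        eta,
        T=lambda v: v,                        # sufficient statistic
        A=lambda e: math.exp(e),              # cumulant function
        h=lambda v: 1.0 / math.factorial(v),  # base measure
    )


def poisson_textbook(x):
    """Poisson pmf in its usual form: lam^x * e^(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


# Largest disagreement between the two forms over x = 0..19.
max_gap = max(abs(poisson_general_form(k) - poisson_textbook(k)) for k in range(20))

# The general form should still sum to (approximately) one.
total_mass = sum(poisson_general_form(k) for k in range(100))

print(max_gap, total_mass)
```

The two forms should agree up to floating-point error, which shows that the general form is only a re-parameterization of the same distribution.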

The exponential family is also subject to regularity conditions that are mathematically rigorous. They can be found here:

Nice properties of the general form

The general form of the exponential family has several properties that are useful for constructing machine learning models.

  1. Calculating moments 
    The first derivative of the cumulant function A(η) with respect to η is the mean of T(x), and the second derivative is its variance. A(η) plays the role of a cumulant generating function, which gives an alternative way to calculate the moments of a distribution. With a moment generating function we first have to evaluate an integral; with the cumulant function we only have to take derivatives, which is much simpler.
  2. Obtaining sufficient statistics 
    The sufficient statistic, T(x), can be read off by inspection once the distribution is in the general form. The intuitive explanation of sufficiency is: having observed T(x), we can throw away x for the purposes of inference with respect to the parameter η.
     For example, T(x)=x is the sufficient statistic for the Bernoulli distribution and T(x)=[x,x²] is the sufficient statistic for the Gaussian distribution.
  3. Obtaining a general formula for maximum likelihood estimation 
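    We can obtain a general formula for the maximum likelihood estimates of the parameters of exponential family distributions. For example, for estimating the mean μ = E[T(x)] from N independent observations, we have:

    $$\hat{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} T(x_n)$$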
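
As a numerical check of the moment and maximum likelihood properties, the sketch below uses the Bernoulli distribution, whose cumulant function is A(η) = log(1 + e^η). It differentiates A(η) by finite differences and compares the results with the known mean π and variance π(1 − π), then estimates the mean from simulated data. The helper names, the step size, the seed and the sample size are all assumptions chosen for illustration.

```python
import math
import random


def A(eta):
    """Bernoulli cumulant function A(eta) = log(1 + e^eta)."""
    return math.log1p(math.exp(eta))


def derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)


def second_derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + step) - 2 * f(x) + f(x - step)) / step ** 2


pi = 0.3                          # illustrative Bernoulli parameter
eta = math.log(pi / (1 - pi))     # corresponding canonical parameter

mean_from_A = derivative(A, eta)         # should be close to pi
var_from_A = second_derivative(A, eta)   # should be close to pi * (1 - pi)

# Maximum likelihood: the estimated mean is the sample average of T(x) = x.
rng = random.Random(0)
samples = [1 if rng.random() < pi else 0 for _ in range(10_000)]
mean_hat = sum(samples) / len(samples)

# The matching canonical parameter satisfies A'(eta_hat) = mean_hat.
eta_hat = math.log(mean_hat / (1 - mean_hat))

print(mean_from_A, var_from_A, mean_hat, derivative(A, eta_hat))
```

No integral is evaluated anywhere: the moments come from derivatives of A(η), and the estimate only needs the average of the sufficient statistic.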

Transforming distributions into general form

It is easy to transform a distribution into the general form, and doing so gives insight into the distribution.

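Consider the Bernoulli distribution with parameter π = p(x = 1). Its probability mass function can be rewritten as:

$$p(x \mid \pi) = \pi^{x} (1 - \pi)^{1 - x} = \exp\left( x \log\frac{\pi}{1 - \pi} + \log(1 - \pi) \right)$$

Matching this to the general form gives T(x) = x, η = log(π / (1 − π)) and A(η) = −log(1 − π) = log(1 + e^η).

Solving for π in terms of η, we have:

$$\pi = \frac{1}{1 + e^{-\eta}}$$

which is the sigmoid function.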
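
The relationship between π and η for the Bernoulli distribution can be sketched in a few lines. The code below assumes the standard log-odds mapping η = log(π / (1 − π)) and checks that the sigmoid inverts it, and that exp(η·x − A(η)) with A(η) = log(1 + e^η) reproduces the Bernoulli probabilities. The function names and the value π = 0.8 are illustrative assumptions.

```python
import math


def sigmoid(eta):
    """Sigmoid 1 / (1 + e^(-eta)), written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def logit(pi):
    """Canonical parameter of the Bernoulli distribution (log-odds)."""
    return math.log(pi / (1.0 - pi))


def bernoulli_general_form(x, eta):
    """Bernoulli pmf as exp(eta * x - A(eta)) with A(eta) = log(1 + e^eta)."""
    return math.exp(eta * x - math.log1p(math.exp(eta)))


pi = 0.8                               # illustrative success probability
eta = logit(pi)

round_trip = sigmoid(eta)              # should recover pi
p1 = bernoulli_general_form(1, eta)    # should be close to pi
p0 = bernoulli_general_form(0, eta)    # should be close to 1 - pi

print(round_trip, p1, p0)
```

Splitting on the sign of η is a common way to keep the exponential from overflowing; it does not change the function being computed.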

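Similarly, we can transform the multinomial distribution over K classes with class probabilities π_i and obtain:

$$\pi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{K} e^{\eta_j}}$$

which is the softmax function.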
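
A minimal sketch of the softmax function follows. It subtracts the largest input before exponentiating, a common numerical-stability trick that leaves the result unchanged because the common factor cancels in the ratio. It also checks that with two classes and the second canonical parameter fixed at 0, softmax reduces to the sigmoid. The function name and the input values are illustrative assumptions.

```python
import math


def softmax(etas):
    """Map canonical parameters to probabilities e^eta_i / sum_j e^eta_j."""
    m = max(etas)                              # shift for numerical stability
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [v / total for v in exps]


# Three-class example: the outputs are positive and sum to one.
probs = softmax([2.0, 1.0, 0.1])

# Two-class special case: softmax([eta, 0]) matches sigmoid(eta).
eta = 1.5
two_class = softmax([eta, 0.0])
sigmoid_eta = 1.0 / (1.0 + math.exp(-eta))

print(probs, two_class[0], sigmoid_eta)
```

The two-class check illustrates why the sigmoid can be seen as the binary special case of softmax.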