Sparsemax: from paper to code

Alvaro Durán Tovar
Deep Learning made easy
May 25, 2021

Here we are with another paper implementation. Veeeery slowly, but I’m on my way to implementing the TabNet paper. I think this will be the last component I need for it.

Paper: From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Another GitHub repo implementing sparsemax:
* https://github.com/aced125/sparsemax

What can we learn from implementing this paper?

  1. A new layer capable of producing sparse outputs while still being differentiable (almost everywhere), so we can backpropagate through it.
  2. How to implement the backward step.
  3. How to check that the gradient is correct.

1) Sparsemax, a new layer

From the abstract:

We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. […] We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve a similar performance as the traditional softmax, but with a selective, more compact, attention focus.

So it can be a useful tool in intermediate layers (like in TabNet, as an attention mechanism) or as the last layer for multi-label classification problems.

The algorithm described in the paper is pretty straightforward:

  1. Sort the input in descending order.
  2. Find the largest index k (in the sorted set) such that 1 + k·z_(k), with z_(k) the k-th largest element, is greater than the cumulative sum of the first k sorted values.
  3. Calculate the threshold tau(z) from that cumulative sum and k.
  4. Subtract the threshold from each element of the original set and apply a ReLU operation (set negative values to 0).

A naive way of implementing it might look like this in numpy:
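Something along these lines, as a minimal 1-D sketch (apart from “col_range”, which the note below refers to, the variable names are mine):

```python
import numpy as np

def sparsemax(z):
    # 1. sort the input in descending order
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    # 2. find the largest k such that 1 + k * z_(k) > sum of the first k sorted values
    col_range = np.arange(1, len(z) + 1)      # k = 1, 2, ..., K, vectorised
    k_z = col_range[1 + col_range * z_sorted > cumsum][-1]
    # 3. compute the threshold tau(z)
    tau_z = (cumsum[k_z - 1] - 1) / k_z
    # 4. subtract the threshold and apply ReLU
    return np.maximum(z - tau_z, 0.0)
```

For example, sparsemax(np.array([1.0, 0.2, -0.3])) gives [0.9, 0.1, 0.0]: it still sums to one like softmax would, but now contains an exact zero.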

Note the variable “col_range” for vectorised execution over all elements of the set in one go.

But this won’t work well when using a batch. With batch execution (the 2D case, where we have “rows” and “columns”) it looks like this in PyTorch:
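A sketch of the batched version, assuming the input has shape (batch, K) and sparsemax is applied row-wise (the function name and details are mine):

```python
import torch

def sparsemax_2d(z):
    # z: (batch, K); sparsemax is applied independently to each row
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    cumsum = z_sorted.cumsum(dim=-1)
    col_range = torch.arange(1, z.shape[-1] + 1, dtype=z.dtype, device=z.device)
    # boolean mask of valid k's; the condition is monotone along each sorted row,
    # so the support size k(z) is just the number of True entries
    is_valid = 1 + col_range * z_sorted > cumsum
    k_z = is_valid.sum(dim=-1, keepdim=True)            # (batch, 1)
    tau_z = (cumsum.gather(-1, k_z - 1) - 1) / k_z      # threshold per row
    return torch.clamp(z - tau_z, min=0.0)
```

The trick for finding k(z) per row is that once the condition turns false it stays false, so counting the True entries gives the support size directly, without any loop.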

We can confirm it’s working fine by feeding the same test data to other existing implementations, like I do in the demo notebook:…

2) The backward step

Typically, custom PyTorch functions are implemented this way:
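Roughly like this, a bare skeleton (sparsemax_2d is the batched function sketched above; the body of backward is what the rest of this post derives):

```python
import torch

class SparsemaxFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        output = sparsemax_2d(input)      # the batched forward pass from before
        ctx.save_for_backward(output)     # stash whatever backward will need
        return output

    @staticmethod
    def backward(ctx, grad_output):
        output, = ctx.saved_tensors
        grad_input = ...                  # gradient w.r.t. the input, filled in below
        return grad_input
```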

ctx: a “bag” where you can store information required for the backward step. If you need something from the forward pass (the output, the input, etc.), you can store it in the “ctx” variable to use it later.

grad_output: the gradient flowing in from the previous step of backpropagation.

grad_input: the gradient of the loss with respect to this function’s input. Yeah, it’s a bit confusing to call something grad_input when it’s a variable you are returning.

Now the tricky thing is… how to implement it for sparsemax? In the paper you’ll find this formula that looks promising:
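With S(z) denoting the support of sparsemax(z), i.e. the set of indices with non-zero output, it reads (rewriting it here in LaTeX):

```latex
\frac{\partial \operatorname{sparsemax}_i(\mathbf{z})}{\partial z_j} =
\begin{cases}
  \delta_{ij} - \dfrac{1}{|S(\mathbf{z})|} & \text{if } i, j \in S(\mathbf{z}), \\[4pt]
  0 & \text{otherwise,}
\end{cases}
```

where δ_ij is 1 if i = j and 0 otherwise.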

Ok, that’s the Jacobian matrix of the function. Oh wait, that sounds scary, what is a Jacobian matrix? It’s the matrix of all partial derivatives of the output with respect to each element of the input.

So now we know how to implement the backward step… not so fast! If you implement just that, it won’t work (trust me, I already tried :D). We are missing the gradient coming from the previous step.

Luckily for us, the authors explain how to do it in the paper:
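Here v stands for the incoming gradient (grad_output in our code), and the product of the Jacobian with it is, in the paper’s notation:

```latex
\big(J_{\operatorname{sparsemax}}(\mathbf{z})\,\mathbf{v}\big)_i =
\begin{cases}
  v_i - \hat{v} & \text{if } i \in S(\mathbf{z}), \\
  0 & \text{otherwise,}
\end{cases}
\qquad \text{where } \hat{v} = \frac{1}{|S(\mathbf{z})|} \sum_{j \in S(\mathbf{z})} v_j .
```

In words: inside the support, subtract from each incoming gradient the mean of the incoming gradient over the support; outside the support, the gradient is zero.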

It can be implemented like this:
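For example, filling in the backward of the SparsemaxFunction skeleton from before (a sketch; it reuses the sparsemax_2d function sketched earlier, and the helper names are mine):

```python
import torch

class SparsemaxFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        output = sparsemax_2d(input)                  # batched forward pass from before
        ctx.save_for_backward(output)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        output, = ctx.saved_tensors
        support = (output > 0).to(grad_output.dtype)  # indicator of S(z), per row
        # v_hat: mean of the incoming gradient over the support of each row
        v_hat = (grad_output * support).sum(dim=-1, keepdim=True) \
                / support.sum(dim=-1, keepdim=True)
        grad_input = support * (grad_output - v_hat)  # zero outside the support
        return grad_input
```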

3) Checking the gradient

PyTorch offers the function “torch.autograd.gradcheck” to check that the gradients are correct: https://pytorch.org/docs/stable/autograd.html#torch.autograd.gradcheck

Using it is pretty straightforward.
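For example, assuming the SparsemaxFunction sketched earlier (gradcheck compares the analytical gradient against a finite-difference estimate, so it wants double-precision inputs with requires_grad=True):

```python
import torch
from torch.autograd import gradcheck

x = torch.randn(4, 10, dtype=torch.double, requires_grad=True)
# raises an error if the backward we wrote does not match the numerical gradient
print(gradcheck(SparsemaxFunction.apply, (x,), eps=1e-6, atol=1e-4))
```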

Conclusion

Implementing the gradient was much easier than I expected.

Google colab notebook with the code:
https://colab.research.google.com/drive/1YMIQgegWN2b_m1k5nwrOHOBd5B9ASV9E?usp=sharing
