Convolutional Competitive Learning vs. Sparse Autoencoders (2/2)
This is the second part of our comparison between convolutional competitive learning and convolutional or fully-connected sparse autoencoders. To understand our motivation for this comparison, have a look at the first article.
We decided to compare two specific algorithms that have most of the features we require: K-Sparse autoencoders, and Growing Neural Gas with Utility (GNG-U) (Fritzke1997). The latter has a proven ability to learn statistically optimal representations even given nonstationary input, under certain conditions.
K-Sparse Autoencoders
We found the K-Sparse Autoencoder scheme of Makhzani and Frey (Makhzani2013) particularly appealing due to the simple manner of achieving the desired sparsity: they find the k cells with the highest hidden-layer activity, and mask the activity of the remaining hidden cells to zero. It is also possible to directly target any desired sparsity. One downside of the first paper is that the only mechanism to ensure high cell utilization is to train very slowly, with good weight initialization. In their follow-up paper, Winner-Take-All Convolutional Sparse Autoencoders (Makhzani2015), they introduced the concept of lifetime sparsity: cells that aren't used often are trained on the batch samples that fit them best, ensuring high cell utilization over time.
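To make the basic mechanism concrete, here is a minimal Java sketch of the top-k masking step; the class, method and array names are ours, not from (Makhzani2013).

```java
// Sketch of the k-sparse hidden-layer mask: keep the k largest activations,
// zero the rest. Names are illustrative, not from the original paper.
import java.util.Arrays;
import java.util.Comparator;

public class KSparseMask {

    // Returns a copy of 'hidden' with all but the k largest values set to zero.
    public static double[] mask(double[] hidden, int k) {
        Integer[] order = new Integer[hidden.length];
        for (int i = 0; i < order.length; i++) order[i] = i;

        // Sort cell indices by activation, largest first.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> hidden[i]).reversed());

        double[] masked = new double[hidden.length]; // initialised to zero
        for (int rank = 0; rank < k && rank < order.length; rank++) {
            int cell = order[rank];
            masked[cell] = hidden[cell]; // only the top-k cells keep their activity
        }
        return masked;
    }
}
```

During training, reconstruction-error gradients flow only through the unmasked cells, which is what drives each cell to specialize.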
In fact, the realization that almost any method can be used to include or exclude cells from the active population in the hidden layer means that the k-sparse scheme is very flexible, and this flexibility could potentially be exploited to deal with nonstationary data.
A nice feature of the k-sparse autoencoder is that the completeness of the models learned by the algorithm depends on the sparsity: with greater sparsity (smaller k), cells represent more complete patterns; with less sparsity (larger k), cells learn to represent stable parts of patterns (see figure).
The complexity of features encoded by individual cells increases as k is reduced (i.e. as sparsity increases). At k=10, cells represent entire digits, whereas at higher k, cells represent the strokes that make up each digit. Various subsets of cells effectively share the responsibility for representing an input digit-form. Source: (Makhzani2013)
Convolutional Growing Neural Gas with Utility (GNG-U)
This section describes how we can make GNG or GNG-U a convolutional algorithm. For readers who are unfamiliar with GNG, the first part of this article gives an introduction and links to relevant sources.
Receptive Field Geometry
We can use conventional convolutional geometry for a GNG network layer. The number of cells available to the GNG is Z, the depth of the layer. This is also known as the number of filters in a normal convolutional network. We can define a fixed receptive field and field tiling / stride as normal. The GNG is applied to each convolutional x,y location and the content of the receptive field at these locations is computed as normal. The output of the GNG at each location is two vectors of Z elements each:
- e, the sum-square-error of each GNG cell, and
- m, a ternary mask indicating the winner (m[winner]=2) and runner-up cell (m[runnerup]=1), with all other cells set to m[i]=0.
The GNG is trained at each convolutional location, albeit with a proportionally reduced learning rate to account for the number of training steps per image (batch training with a higher learning rate would likely be superior, but was not implemented). A winner and runner-up are selected at each convolutional location, producing a 3-D ternary matrix with fixed sparsity.
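The sketch below illustrates this per-location step, assuming flat arrays for the receptive-field content and the cell prototypes; the class and method names are illustrative rather than our exact implementation, and the full GNG bookkeeping (edges, utility, growth) is omitted.

```java
// Illustrative per-location step for a convolutional GNG layer.
// 'weights[z]' is the prototype of GNG cell z; 'field' is the flattened
// receptive-field input at one x,y location. Names are ours.
import java.util.Arrays;

public class ConvGngLocation {

    // Outputs for one location: e = sum-square-error per cell,
    // m = ternary mask (winner=2, runner-up=1, others=0).
    public static void apply(double[][] weights, double[] field,
                             double[] e, int[] m) {
        int best = -1, second = -1;
        for (int z = 0; z < weights.length; z++) {
            double sumSq = 0.0;
            for (int i = 0; i < field.length; i++) {
                double d = field[i] - weights[z][i];
                sumSq += d * d;
            }
            e[z] = sumSq;
            if (best < 0 || sumSq < e[best]) { second = best; best = z; }
            else if (second < 0 || sumSq < e[second]) { second = z; }
        }
        Arrays.fill(m, 0);
        m[best] = 2;
        if (second >= 0) m[second] = 1;
    }

    // Training step: move the winner (and, more weakly, its neighbours)
    // towards the input. The rate is divided by the number of convolutional
    // locations so one image contributes roughly one "effective" update.
    public static void train(double[] winnerWeights, double[] field,
                             double learningRate, int numLocations) {
        double rate = learningRate / numLocations;
        for (int i = 0; i < field.length; i++) {
            winnerWeights[i] += rate * (field[i] - winnerWeights[i]);
        }
    }
}
```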
We can build a stack of convolutional GNG layers using interleaved convolutional and pooling layers. In our experiments, we tested 1, 2 and 3 convolutional + pooling layers. Note that for simultaneous online training, this assumes that the GNG(U) can handle nonstationary input.
Pooling functions
GNG is inherently a winner-take-all algorithm, or perhaps a k=2 sparse algorithm (it selects a “winner” and a “runner-up”, which are modelled as neighbours for training). Normally, the hidden-layer activity in convolutional networks is used as the input to pooling: we take the max value for each z over the x,y locations being pooled.
However, with GNG we only have the sum-square-error values e and the ternary mask indicating the winners. Neither is obviously equivalent to hidden layer activity in autoencoders or convolutional networks; sum-square-error becomes smaller when the cell is “better”, and is not sparse. The winner and runner-up ternary mask is very sparse, and it is not clear how we should represent the difference in error between the two (sometimes they have similar error, but not always).
Choosing a good output function is a balance between signal loss and noise removal. It is important to produce an output that is stable under reasonable conditions, enabling additional layers to learn. We investigated sparse binary outputs, sparse ternary outputs, functions derived from sum-square-error, functions derived from cell ranking at each location, and nonlinear functions of sum-square-error (see the list of pooling functions below).
In all cases, we define a function f( e, m ) that returns a pooling input value v[z] for each cell z at each receptive field position x,y. These values are then pooled using a max operator over the x,y locations within the pooling window, for each z index.
1. Winner-Take-All Mask. The best-matching cell is set to 1 and all other cells to zero. We output the max for each cell z in the pooling window. The output is therefore binary and of fixed sparsity.

if( z is the winner ) v[z] = 1
else v[z] = 0

2. Ternary Mask. As above, but runner-up cells are set to 0.5. We apply a max over the pooled cells as above. The output is ternary and of approximately fixed sparsity.

if( z is the winner ) v[z] = 1
else if( z is the runner-up ) v[z] = 0.5
else v[z] = 0

3. Inverse Error. Assuming that inputs are in the interval [0, 1], cell error can be normalized to a unit value:

v[z] = 1 - ( sqrt( sumSqError[z] ) / numInputs )

4. Relative Rank.

v[z] = exp( -Q * rank( z ) / maxRank )

… where Q = 10, 12 or 14, and maxRank = Z (the depth of the convolutional layer, in cells).

5. Negative Log-Likelihood.

v[z] = min( 1, Q * -log( sqrt( sumSqError[z] ) / numInputs ) )

… where Q = 0.1.
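To make these functions concrete, the following sketch shows how a few of them could be computed from the per-location outputs e and m, together with the spatial max-pool; the class and method names are illustrative, and the constants follow the values listed above.

```java
// Sketch of pooling functions f(e, m) -> v[z] and the spatial max-pool.
// sumSqError and mask are the per-location GNG outputs described earlier.
public class GngPooling {

    // 2. Ternary mask: winner=1, runner-up=0.5, others=0.
    public static double ternary(int[] mask, int z) {
        if (mask[z] == 2) return 1.0;
        if (mask[z] == 1) return 0.5;
        return 0.0;
    }

    // 3. Inverse error, assuming inputs in [0, 1].
    public static double inverseError(double[] sumSqError, int z, int numInputs) {
        return 1.0 - (Math.sqrt(sumSqError[z]) / numInputs);
    }

    // 5. Negative log-likelihood style output, with Q = 0.1.
    public static double negLogLikelihood(double[] sumSqError, int z, int numInputs) {
        double q = 0.1;
        double normError = Math.sqrt(sumSqError[z]) / numInputs;
        return Math.min(1.0, q * -Math.log(normError));
    }

    // Spatial max-pool: v[x][y][z] are the per-location values for cell z;
    // the output is the max over the pooling window for each z.
    public static double maxPool(double[][][] v, int x0, int y0, int window, int z) {
        double max = 0.0;
        for (int x = x0; x < x0 + window; x++) {
            for (int y = y0; y < y0 + window; y++) {
                max = Math.max(max, v[x][y][z]);
            }
        }
        return max;
    }
}
```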
Convolutional parameters
These parameters were developed from a series of ad-hoc tests and with reference to successful configurations used in conventional convolutional networks applied to MNIST.
To ensure the results are comparable to the autoencoder algorithms, we tried to keep the number of “cells” at approximately 1,000 and the number of learned weights at approximately the following:
- Layer 1: Z=64 with 6x6x1 receptive field = 2,304 parameters
- Layer 2: Z=256 with 3x3x64 receptive field = 147,456 parameters
- Total: 149,760 learned parameters
This is a small number of parameters compared to modern deep networks, but the MNIST test problem also has small images.
Note that although the 2nd layer has a larger number of parameters, the inputs are very sparse given most of the pooling functions evaluated. We did not reduce the width and height of the top unsupervised layer to 1x1 because this function could be performed by the supervised classifier. Although a range of values were tested, the best configuration is shown.
| Parameter name | Best value[s] tested (layers separated by commas) |
| --- | --- |
| Layers | 2 |
| Layer depths (Z) | 64, 256 |
| Receptive field size (W,H) | 6x6, 3x3 |
| Receptive field stride (X and Y) | 2, 1 |
| Input padding | 0, 0 |
| Layer convolutional W and H | 12, 4 |
| Pooling window size | 2, 2 |
GNG-U parameters
One of the nice qualities of the GNG algorithm is that almost all the hyperparameters are defined in terms of time rather than properties of the data. In theory, this should make it especially easy to adapt to novel problems. Although a range of values were tested, only the best configuration is shown.
| Parameter name | Best value[s] tested |
| --- | --- |
| Growth interval | 100 |
| Edge Max Age | 600 |
| Weight Learning Rate | 0.015 |
| Neighbour Weight Learning Rate | Weight Learning Rate * 0.2 |
| Utility Learning Rate | 0.01 |
| Utility Threshold | 0 |
| Stress Threshold | 0 |
| Stress Learning Rate | 0.005 |
Experimental results
We chose to test the convolutional GNG-U algorithm on the MNIST dataset with a variety of pooling functions. For comparison, we implemented and tested a K-Sparse Autoencoder with a Batch-Sparsity constraint (each cell must be trained on at least p samples from each batch, which are selected retrospectively).
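Our understanding of the batch-sparsity constraint is sketched below: for each hidden cell, the p batch samples to which it responded most strongly are selected retrospectively and used to train that cell. The class and array names here are illustrative assumptions, not our exact code.

```java
// Sketch of the batch-lifetime-sparsity constraint: each hidden cell must be
// trained on at least 'p' samples per batch, chosen retrospectively as the
// samples it responded to most strongly. 'activations[sample][cell]' is the
// hidden activity of each cell for each sample in the batch.
import java.util.Arrays;
import java.util.Comparator;

public class BatchLifetimeSparsity {

    // Returns, for one cell, the indices of the p batch samples with the
    // highest activation for that cell.
    public static int[] selectSamplesForCell(double[][] activations, int cell, int p) {
        Integer[] order = new Integer[activations.length];
        for (int s = 0; s < order.length; s++) order[s] = s;

        Arrays.sort(order, Comparator.comparingDouble(
                (Integer s) -> activations[s][cell]).reversed());

        int n = Math.min(p, order.length);
        int[] selected = new int[n];
        for (int i = 0; i < n; i++) selected[i] = order[i];
        return selected; // train this cell on these samples
    }
}
```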
The algorithms were tested in 2 phases.
Phase 1 — Unsupervised network training
For both algorithms we performed unsupervised pre-training of the GNG-U and K-Sparse representations for N epochs. Next, we disabled learning and exposed the network to 1 epoch of the MNIST training set (60,000 images) and 1 epoch of the MNIST test set (10,000 images). The output of the deepest layer was recorded as a matrix of features x images, in addition to the vector of image labels.
All layers were trained simultaneously in online fashion.
Phase 2 — Supervised Classifier
Finally, in all cases, a logistic-regression supervised classifier was trained on the unsupervised training-set output and then tested on the unsupervised test-set output. This provides a classification accuracy score. We considered using a more powerful classifier (e.g. a nonlinear SVM), but this transfers more of the responsibility to the classifier and reduces our ability to measure the contribution of the unsupervised representation, especially on the MNIST dataset where scores are already very high.
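In outline, the two-phase protocol looks like the sketch below. UnsupervisedNetwork and LogisticRegression are hypothetical placeholder interfaces used only to make the procedure concrete; they are not our actual API.

```java
// Outline of the two-phase evaluation protocol. The interfaces are
// hypothetical placeholders; the structure mirrors the procedure in the text.
public class EvaluationProtocol {

    interface UnsupervisedNetwork {
        void trainOneEpoch(double[][] images);   // Phase 1: unsupervised learning
        void freeze();                           // disable learning
        double[] encode(double[] image);         // deepest-layer output (features)
    }

    interface LogisticRegression {
        void fit(double[][] features, int[] labels);
        double accuracy(double[][] features, int[] labels);
    }

    static double evaluate(UnsupervisedNetwork net, LogisticRegression clf,
                           double[][] trainImages, int[] trainLabels,
                           double[][] testImages, int[] testLabels, int epochs) {
        for (int e = 0; e < epochs; e++) net.trainOneEpoch(trainImages);
        net.freeze();

        double[][] trainFeatures = encodeAll(net, trainImages);
        double[][] testFeatures = encodeAll(net, testImages);

        clf.fit(trainFeatures, trainLabels);           // Phase 2: supervised classifier
        return clf.accuracy(testFeatures, testLabels); // reported test accuracy
    }

    static double[][] encodeAll(UnsupervisedNetwork net, double[][] images) {
        double[][] features = new double[images.length][];
        for (int i = 0; i < images.length; i++) features[i] = net.encode(images[i]);
        return features;
    }
}
```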
Classification Accuracy
The convolutional GNG-U algorithm, in conjunction with a logistic regression classifier, consistently achieved high training accuracy (often close to 100%) but had more difficulty generalizing to high test accuracy.
| Algorithm | Pooling Function / Notes | Best Training Accuracy (%) | Best Testing Accuracy (%) | Unsupervised Training Epochs |
| --- | --- | --- | --- | --- |
| Conv-GNG-U | 1. Winner-Take-All Mask | 55.1 | 50.0 | 1 |
| Conv-GNG-U | 2. Ternary Mask | 61.9 | 55.5 | 1 |
| Conv-GNG-U | 3. Inverse Error | 88.4 | 80.0 | 1 |
| Conv-GNG-U | 4. Relative Rank | 94.1 | 88.2 | 2 |
| Conv-GNG-U | 5. Negative Log-Likelihood | 99.4 | 97.0 | 8 |
| K-Sparse Autoencoder with Batch-Sparsity constraint* (non-convolutional, can be stacked) | Non-convolutional. Parameters: batch size 32, momentum 0.9, sparsity 25, batch-lifetime sparsity 2, sparsity output factor 3x | 99.56 | 98.52* | 256 |
| K-Sparse Autoencoder (Makhzani2013) w. Logistic Regression layer | n/a | n/a | 98.65 | 5000 |
| Conv-WTA Autoencoder (Makhzani2015) w. Linear SVM layer | Max hidden-layer values within pooled area | n/a | 99.09 | n/a |
| Stacked Conv-WTA Autoencoder (Makhzani2015) w. Linear SVM layer | Max hidden-layer values within pooled area | n/a | 99.52 | n/a |
* Results from our Java re-implementation of the K-Sparse autoencoder with batch-lifetime-sparsity constraint from the later Conv-WTA paper. Since our implementation is written from scratch in Java without use of thoroughly tested third-party libraries, small numerical issues could account for the slight drop in performance.
Conclusions
The MNIST dataset is very well studied, making it good for comparative results; however, it is also “too easy”, which results in good algorithms clustering in the 98–99% accuracy interval (at some point, higher accuracy just means overfitting to the dataset, as perhaps 0.1–1% of the digits are extremely distorted and arguably unrecognizable).
Despite a large hyperparameter search, we were unable to get state-of-the-art performance from a convolutional GNG-U algorithm on the MNIST dataset (best result: 97%). Under the same unsupervised training regime with a supervised logistic regression classifier, we were able to reproduce approximately state-of-the-art performance (98.6%) with a k-sparse autoencoder despite a complete re-implementation of the entire backpropagation framework in Java, without optimizations.
The choice of GNG pooling function is critical. The simpler binary or ternary pooling functions probably do not contain enough information to make accurate classifications in ambiguous examples. We attempted to use more sophisticated encodings without introducing data-dependent hyperparameters, with reasonable but inadequate results.
Interestingly, the GNG family of algorithms trains very rapidly to a good level of performance, but does not benefit from slower training over a larger number of epochs. In addition, the training-set classification accuracy is very high in many configurations (over 99%), but this does not generalize to correspondingly high test accuracy. We tried dropout and stronger regularization of the logistic regression to reduce overfitting to the training data, without improvement.
It is unsatisfying that we were not able to establish why a convolutional GNG-U cannot benefit from more and slower training, as a convolutional autoencoder trained with stochastic gradient descent does. However, without promising results we cannot justify further investigation.
One possibility is that the lack of batch training makes it harder for GNG-U to learn without a strong recency bias. Learning rate is normally balanced against batch size to enable effective training; with our GNG-U implementation we have only one of these levers (the learning rate) rather than both (learning rate and batch size).
Another likely possibility is that the pooling functions tested did not effectively capture the available information in the GNG. However, we tried a variety of obvious functions given the available data, attempting to balance signal and noise, without success.
Most likely, the root cause of the representational limit is the nature of the GNG-U algorithm, which requires a fixed (winner-take-all) sparsity and does not variably distribute representational responsibility between cells as a k-sparse autoencoder does. On the face of it, this seems like a fundamental constraint that limits the flexibility of the approach and might explain the reduced classification performance.
We performed ad-hoc tests of the nonstationary capabilities of GNG-U by initially withholding half the digit classes and confirmed it works as advertised; the algorithm returns to optimal classification performance after all digit classes are restored.
So, can convolutional Growing-Neural-Gas (representing a family of competitive learning algorithms) compete with modern sparse autoencoders? No.
Resources
Source code for the implementations described here can be found in our Java repository on Github:
Contributions
David Rawlinson and Gideon Kowadlo wrote the Java code for the algorithm implementations, executed and supervised experiments. Abdelrahman Ahmed set up and executed many of the experiments, and analyzed results.
References
Fritzke1997 — “A self-organizing network that can follow non-stationary distributions”. Bernd Fritzke. Proceedings of International Conference on Artificial Neural Networks, p. 613–618 (1997)
Makhzani2013 — “k-Sparse Autoencoders”. Makhzani, Alireza and Frey, Brendan. arXiv preprint arXiv:1312.5663 (2013)
Makhzani2015 — “Winner-Take-All Autoencoders”. Makhzani, Alireza and Frey, Brendan J. Advances in Neural Information Processing Systems (2015)