How to Perform Supervised Random Projections with Light

7 min readApr 23, 2020

A delicious cocktail I drank a few months ago

Today, we’re going to talk again about supervised random projections (SRP) [1]. The regular reader may remember our article on the subject. For the others, here is a quick reminder.

SRP is a dimensionality reduction technique that takes into account not only the data features but also the corresponding labels, when they are available. A simple and popular dimensionality reduction technique is Principal Component Analysis (PCA). PCA is label-agnostic, but it has a supervised counterpart: Supervised PCA (SPCA) [2]. Besides, some of you may know that random projections can often replace PCA for a much smaller computational cost, thanks to the Johnson-Lindenstrauss lemma. SRP is the next logical step: accelerating SPCA with random projections. Furthermore, all of those have non-linear counterparts using kernels: KPCA, KSPCA, and KSRP.

In the article linked above, which gives much more details on the inner workings of all this, we attempted SRP with a LightOn Optical Processing Unit (OPU). This post gives you the details you need to reproduce this — not so — legendary feat.

From Formulae to Code

There are often surprises in how a paper is actually implemented in practice. What’s mathematically convenient to write rarely corresponds to efficient code, and vice-versa. Those two sentences don’t describe this paper, however: it’s pretty straightforward to code. Or it would be, if not for a major sin: the authors use the (features, datapoints) convention for data matrices. Statisticians sometimes do so; they are forgiven. Still, I had to change this in my implementation: practical machine learning works with the (datapoints, features) convention, period.

Machine learning has several advantages as a field: open research, lots of funding, great open-source software, and some useful conventions such as this one. As practitioners and researchers, it is our duty to defend those against negative external influences. I played my role as a brick in the wall against the (features, datapoints) convention, because I’m not the hero ̶G̶o̶t̶h̶a̶m̶ the machine learning community needs, but the one it deserves.

We’ll work mostly in a CPU environment, using scikit-learn and NumPy, as well as LightOnML, the Python API for the LightOn OPU. Let’s start with SRP. In the paper, with the — disgusted tone — (features, datapoints) convention, the projection matrix we’re looking for is U=XHΨᵀ, where X is the data matrix of size d x n, with n the number of samples and d that of features. H is the centering matrix of size n x n, and Ψ is a matrix of shape k x n, with k the dimension we are projecting to, containing random features that approximate a kernel applied to the labels. We’ll come back to it. For now, let’s convert the formula to the — angels’ choir — (datapoints, features) convention. First, we transpose everything to obtain a projection matrix of shape k x d: ΨHᵀXᵀ, and then we remember that H is symmetric and that we actually use the transpose of the other two matrices, leaving U=ΨᵀHX.

What about H and Ψ? H is a special matrix that’s useful in equations: HX is X minus its mean row (and XH would be X minus its mean column). In NumPy, it’s much clearer and more efficient to write Xc = X - X.mean(0). Ψ contains the random features. In a classification setting, we want the same labels to be close together, so we use the Delta kernel, which is 1 when two entries are equal and 0 otherwise. Or that’s what we do in SPCA. In SRP, the delta function is approximated by a very narrow Radial Basis Function (RBF, a Gaussian). Scikit-learn offers the sklearn.kernel_approximation.RBFSampler, that we use with a large gamma to obtain narrow RBFs. You can do that and go on with your life. Or you can use an OPU with these three lines instead and, like Gandalf, be reborn from light:

Am I slightly overstating things? Totally, but you could also stop being a killjoy and appreciate that you just used light to approximate a kernel.

That’s it for SRP: now we multiply Psi.T and Xc using my favorite matrix multiplication operator @ and wrap that in a class that fits well into scikit-learn’s pipelines. Et voilà!

KSRP will be quick, now. In the paper, the formula is ΦHΨᵀ (I changed the notation slightly to accommodate Medium’s dreadful lack of typesetting features). It translates to ΨᵀHΦ in the (datapoints, features) convention. Φ is obtained in the same way as Ψ, but with the random features applied to the feature matrix X instead of the labels. Note that they generally approximate a different kernel than that used for the labels. The rest is the same as for SRP, except that X is replaced by Φ. In my code, I also added the option of using the exact kernel matrix K for the features and approximating that of the labels with SRP, in which case the formula is KHΨ.

Nitpicking

What’s under the rug that I’ve been standing on so far? I tried to protect you, but you deserve to know. I’ll move aside to let you take a peek: the entries of the OPU must be binary so the labels and features need to be encoded before being fed to it, and the output is an integer between 0 and 255 so it might need a slight post-processing to yield good-looking visualizations. Now, can you stop looking under the rug? It’s weird.

The good news is that LightOnML has several encoders that should suit your needs most of the time. Because we’re not used to projecting labels, I implemented a special one for SRP that is similar to one-hot encoding but adapted to our OPU. I called it… onehotopu!

As for the post-processing, it’s actually just a rescaling, for instance dividing the output by 255 to get something in [0,1]. It can also be a cosine function for something more similar to the Random Fourier Features algorithm of Rahimi and Recht [3].

Finally, I found that, in some cases, not centering the data yields better-looking visualizations. The centering seems to sometimes induce a linear dependency between the columns of the results, which results in awful plots with points on a straight line. The reason would be obvious in 2D if the mean column was removed, but here it’s the mean row. Never mind: I added an option to not center the data when that happens.

This is the end result:

You’ll find the code for the other techniques (SPCA, KSPCA, KSRP) in the GitHub code accompanying this post.

Use Case: Visualization

I’ve told you how to code SRP, but we need a use case. The easy one is visualization. What cool example could we take? Ah, yes, I have the perfect one: the one that requires no additional work! My initial thought was of MNIST, but it looked terrible. A few convolutions in pre-processing would have been needed, I suppose.

Behold, the sonar dataset! It was used by the authors, and by myself in my first post on the subject:

Because I show the result of every technique in this plot, I just have to show you how to draw it and you’ll know everything. Isn’t that marvelous? Here’s the code:

Do you wonder why I created the dictionary one entry after another like this? No? Please don’t ruin my rhetorical question and let me answer it for the curious minds. You see, SRP and KSRP are randomized algorithm, so the result will change at every run if you don’t control the random seed. They often need to be run a few times before yielding satisfying visualizations. Thus, it’s convenient to run this code in a notebook and fill the dictionary key by key in separate cells. This way you can rerun any of them until you’re satisfied with the result.

Ok, I think I’ve taken enough of your time. Let’s wrap-up! We provide you with a unified Python implementation of SPCA, KSPCA, SRP, and KSRP, available on GitHub. SRP and KSRP can be optionally run with an OPU with a few additional lines of code. Like I said in the original post, the OPU is likely not needed in such a low-dimensional setting. But remember: the computation time of random projections with the OPU does not depend on the input dimension! This property will be particularly useful if you’re working with high-dimensional labels (SRP) or features (KSRP). So, if you encounter a high-dimensional problem, don’t hesitate to try out a LightOn OPU. You can do so on LightOn Cloud.

Author

François Boniface, Machine Learning Engineer at LigthOn AI Research

Aknowledgements

Thanks to Amélie Chatelain, Iacopo Poli, and Victoire Louis for reviewing this post and tolerating the lame jokes.

About Us

LightOn is a hardware company that develops new optical processors that considerably speed up big data computation. LightOn’s processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding your computations up? Try out our solution on LightOn Cloud ! 🌈

Follow us on Twitter at @LightOnIO , subscribe to our newsletter and register to our workshop series. We live stream, so you can join from anywhere. 🌍

References

[1] Karimi, A. H., Wong, A., & Ghodsi, A. (2018). SRP: Efficient class-aware embedding learning for large-scale data via supervised random projections. arXiv preprint arXiv:1811.03166.

[2] Barshan, E., Ghodsi, A., Azimifar, Z., & Jahromi, M. Z. (2011). Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7), 1357–1371.

[3] Rahimi, A., & Recht, B. (2008). Random features for large-scale kernel machines. In Advances in neural information processing systems (pp. 1177–1184)