On The ‘Real’ Dimension Of Your Data

Sean McPherson
Published in Phase Analytics · 5 min read · Oct 17, 2017

This post is going to be a little off topic, a little theoretical, but hopefully somewhat interesting and novel. One of my favorite things is discovering concepts from different fields that are essentially two ways of stating the same idea. I enjoy this because I find it fascinating to compare how the same idea was derived in different fields, and I enjoy seeing our seemingly infinitely complex world explained through latent patterns that suggest an underlying simplicity or structure.

Recently I came across a new technique in data analytics called topological data analysis, and discovered that one of its central theorems is very similar to one of my favorite topics from graduate school, compressed sensing.

Topological Data Analysis

I discovered the field of topological data analysis when I was introduced to the Ayasdi data platform. Topological data analysis is a way to define structure in high dimensional datasets. In a talk at MLConf 2015, Ayasdi data scientist Allison Gilmore describes how topological data analysis can, for certain types of datasets, beat the "curse of dimensionality". The curse of dimensionality is a common problem in data science: as the number of features (i.e., the dimension of your vector space) increases, the distance between points that are actually similar grows exponentially. The curse of dimensionality is one reason why it took the massive amounts of data we now encounter for machine learning models to begin to show what they can do.
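To make the curse a bit more concrete, here is a minimal numpy sketch (my own toy illustration, not from Gilmore's talk) of one common symptom: as the dimension grows, the distance to the nearest point and the distance to the farthest point become nearly the same, so "similar" stops meaning much.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 500 random points in the unit cube [0, 1]^d and compare the nearest
# and farthest distances from a random query point as the dimension d grows.
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:4d}  nearest={dists.min():7.3f}  "
          f"farthest={dists.max():7.3f}  ratio={dists.min() / dists.max():.3f}")

# As d grows, the nearest/farthest ratio creeps toward 1: every point is
# roughly the same distance away, which is the curse of dimensionality at work.
```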

How can topological data analysis beat the curse of dimensionality? It stems from a theory on lower dimensional manifolds embedded in high dimensional spaces, described in a paper by Niyogi, Smale, and Weinberger [1]. The theory, as Allison describes it in her talk, is:

If a dataset is supported near a manifold, its key topological features can be detected from a sample whose size is independent of the dimension of ambient space.

Essentially, this means that if you take a lower dimensional manifold, e.g., the surface of a sphere sitting in three-dimensional space, and embed it in some larger space of dimension N, the number of samples needed to learn the properties of the sphere is proportional to the dimension of the sphere, NOT the dimension N. Thus, even as N grows you still only need a fixed number of samples to learn the properties of the sphere.
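Here is a rough sketch of that idea, using PCA as a simple stand-in for the topological machinery (my own illustration, not the method from [1]): sample points on an ordinary sphere, embed them in an ambient space of dimension N, and notice that the structure you recover does not change as N grows.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def sphere_in_ambient(n_samples, ambient_dim):
    """Sample points on the unit sphere in R^3 and embed them in an
    ambient space of dimension `ambient_dim` via a random orthonormal map.
    The intrinsic object never changes; only the ambient dimension does."""
    pts = rng.normal(size=(n_samples, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)           # points on the sphere
    basis, _ = np.linalg.qr(rng.normal(size=(ambient_dim, 3)))  # 3 orthonormal directions
    return pts @ basis.T                                        # shape (n_samples, ambient_dim)

# The same fixed sample size is used no matter how large N is.
for N in (10, 100, 1000):
    X = sphere_in_ambient(500, N)
    var = PCA(n_components=5).fit(X).explained_variance_ratio_
    print(f"N={N:4d}  top-5 explained variance ratios: {np.round(var, 3)}")

# All of the variance sits in 3 components (the sphere's coordinates),
# regardless of N; the extra ambient dimensions contribute nothing to learn.
```

Topological data analysis uses far more sophisticated tools than PCA, but the punchline is the same: the ambient dimension is not what matters.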

Compressed Sensing

the recovery of sparse signals from a few carefully constructed, and seemingly random measurements.

I appreciate the above description. How can measurements be carefully constructed yet seemingly random?

Compressed sensing is a field in digital signal processing that exploits a particular structure of a digital signal to seemingly violate the Nyquist-Shannon theorem. The Nyquist-Shannon theorem is the most fundamental law in digital signal processing, and states that in order to accurately reconstruct an analog signal from its samples, you must sample at at least twice the maximum frequency present in the signal. For example, if you want to create a digital recording of a human voice, which for telephone-quality speech has a maximum frequency of about 4 kHz, you need to sample at twice that, or 8 kHz.
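As a quick sanity check of the theorem, here is a small numpy sketch (my own) of what happens when you break it: a tone below the Nyquist limit is captured at its true frequency, while a tone above it "aliases" and shows up at the wrong frequency.

```python
import numpy as np

fs = 8_000                       # sampling rate in Hz, as in the voice example
t = np.arange(0, 1.0, 1 / fs)    # one second of samples

# A 3 kHz tone sits below the 4 kHz Nyquist limit and is captured correctly;
# a 5 kHz tone sits above it and aliases down to |8000 - 5000| = 3000 Hz.
for f in (3_000, 5_000):
    x = np.sin(2 * np.pi * f * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    print(f"true tone {f} Hz -> apparent frequency {freqs[np.argmax(spectrum)]:.0f} Hz")
```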

Compressed sensing was developed in the mid-2000s by the mathematicians Emmanuel Candès, Terence Tao, and David Donoho, who proved that, given a signal that is k-sparse in some basis, the signal can be recovered from m > k measurements (in practice m on the order of k log(N/k) for a length-N signal), even when m is far less than the number of samples dictated by the Nyquist-Shannon theorem [2, 3]. The important concept here is k-sparse. A signal is described as sparse if, in some basis, the number of non-zero values is much less than the dimension of the basis. A simple example of a sparse signal is a spike train, as shown below, where all values are zero except for a few. Spike trains are commonly used to represent neuronal activation.

Other sparse signals include sine waves containing only a few frequencies, which are sparse in the Fourier basis. Similarly, certain black and white images that have only a few transitions from black to white are sparse in a basis that measures those transitions. This idea has been applied successfully to medical images such as CT and MRI scans.

CT brain scan: sparse in the number of transitions from black to white
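As a tiny illustration of sparsity in the Fourier basis (my own example, not from the papers), consider a signal built from just three sine waves: it is dense in time, but only a handful of its Fourier coefficients are nonzero.

```python
import numpy as np

n = 1024
t = np.arange(n) / n             # one "second" sampled at n points

# Dense in time, sparse in frequency: a sum of just three tones.
x = (np.sin(2 * np.pi * 50 * t)
     + 0.5 * np.sin(2 * np.pi * 120 * t)
     + 0.2 * np.sin(2 * np.pi * 313 * t))

coeffs = np.fft.rfft(x)
significant = np.sum(np.abs(coeffs) > 1e-6 * np.abs(coeffs).max())
print(f"{n} time-domain samples, {significant} significant Fourier coefficients")
```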

Compressed sensing, also known as compressive sampling, can be seen as solving an underdetermined system of linear equations. In the example below, assume we have a measurement vector y = Dx that is a compressed version of the signal x, where D is an m × N measurement matrix with m < N. Typically, underdetermined systems of linear equations have infinitely many solutions; however, under certain constraints such systems have a unique solution, and sparsity is one of those constraints. Thus, if the signal x is k-sparse and the dimension m of y is greater than k (in practice a small multiple of k), it is possible to find a unique solution to the system of linear equations.

In [3], Candès and Tao showed that by using a Gaussian random matrix for D, one can accurately reconstruct x. This is a fascinating result: from a few random projections of the sparse vector x, one can reconstruct the original vector.
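Here is an end-to-end sketch of that result in code. It is my own toy example, and it uses scikit-learn's Lasso as the L1 solver rather than the exact optimization programs analyzed in [2, 3], but it shows the core phenomenon: a 200-dimensional, 5-sparse spike train recovered from only 40 Gaussian random measurements.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, k, m = 200, 5, 40              # signal length, sparsity, number of measurements

# A k-sparse spike train: all zeros except for k randomly placed spikes.
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)

# Gaussian random measurement matrix D and the compressed measurements y = Dx.
D = rng.normal(size=(m, n)) / np.sqrt(m)
y = D @ x

# Recover x by L1-regularized least squares (the Lasso), a convex stand-in
# for finding the sparsest solution of the underdetermined system y = Dx.
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100_000)
lasso.fit(D, y)
x_hat = lasso.coef_

print("relative recovery error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
print("recovered spike locations:", np.sort(np.flatnonzero(np.abs(x_hat) > 1e-2)))
print("true spike locations:     ", np.sort(np.flatnonzero(x)))
```

The regularization strength alpha is a practical knob here; the theory is usually stated in terms of exact L1 minimization (basis pursuit), for which the Lasso with a small alpha is a convenient approximation.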

How does this relate to the earlier theory from topological data analysis? While these sparse signals may not necessarily lie on a lower dimensional manifold, they can be seen as embedding a lower dimensional signal in a higher dimensional space. Thus both theories show that the number of measurements necessary to describe the features of a signal is related to the dimension of the inherent feature space, not the dimension of whatever higher dimensional space those features are embedded in.

To summarize from a data science perspective: if we have a dataset that can be described by some intrinsic, fixed-dimension feature space, and we continue to add dimensions to our measured space, the number of samples needed to learn the underlying feature space depends on the size of the intrinsic space and not the size of the measured space. This is a promising result, because surprisingly many problems that data scientists care about can be described by lower dimensional spaces.

Hopefully this brief foray into two abstract mathematical concepts was useful, enjoyable, or at least not painful. I was excited to write this to connect topics that are similar in concept.

The timing of this post is also rather fortunate as I was lucky enough to attend the USF Data Institute Conference this week where one of the tracks featured Compressed Sensing, which is making an impact in data science and benefiting from new methods being developed in machine learning.

Additionally, the latest set of MacArthur Fellows was recently announced, and Emmanuel Candès was selected as a fellow for his work in compressed sensing and statistics.

[1] P. Niyogi, S. Smale, and S. Weinberger, "A Topological View of Unsupervised Learning from Noisy Data," SIAM J. Comput., 40(3), 646–663. Link

[2] D. Donoho, "Compressed Sensing," Department of Statistics, Stanford University. Link

[3] E. J. Candès and T. Tao, "Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?," IEEE Trans. Inform. Theory, 52, 5406–5425. Link
