When Big Data Goes Local, Small Data Gets Big

Kirk Borne
6 min read · Sep 15, 2021


Swiss Jelly Roll Manifold. Source: http://mdp-toolkit.sourceforge.net/examples/lle/lle.html

In an earlier article, “The Importance of Location in Real Estate, Weather, and Machine Learning,” we discussed various meanings and applications of location-based discovery in data science and machine learning. One of the algorithms described there was the powerful but strangely named Support Vector Machine (SVM).

In the remarks below, we summarize the significance and utility of another powerful but strangely named machine learning algorithm that focuses on location: Locally Linear Embedding (LLE). LLE is a specific example from the general category of Manifold Learning algorithms. The most famous example of manifold learning with LLE is the Swiss jelly roll example (illustrated above). Learn how to model that case with scikit-learn here.
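
For readers who want to try it right away, here is a minimal sketch along the lines of that scikit-learn example (the parameter choices, such as n_neighbors=12, are just illustrative):

```python
# Minimal sketch: unroll the Swiss roll with Locally Linear Embedding (scikit-learn).
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Sample 3-D points that lie on the rolled-up 2-D manifold.
X, t = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Embed into 2-D using each point's local linear neighborhood structure.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_unrolled = lle.fit_transform(X)

# Color by the latent roll parameter t to confirm the roll was unrolled, not torn apart.
plt.scatter(X_unrolled[:, 0], X_unrolled[:, 1], c=t, s=5, cmap="viridis")
plt.title("Swiss roll unrolled by LLE")
plt.show()
```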

General Remarks

Before we look specifically at LLE, you might wonder how many machine learning algorithms have strange names like this. You would be surprised and/or amused to discover that most of them do, as a quick perusal of the article titles in the Journal of Machine Learning Research reveals.

In fact, we are not innocent in this regard. Our own work on Novelty / Outlier / Anomaly Detection yielded our own contribution to the eclectic algorithm universe: KNN-DD, K-Nearest Neighbor Data Distributions. You can even find a research paper online that combines the complex manifold embedding, unfolding, and boundary-discovery classification capabilities of LLE plus SVM here: “Embedding Propagation: Smoother Manifold for Few-Shot Classification.”

Source: https://www.alexdrouin.com/publication/rodriguez-2020/

KNN-DD: K-Nearest Neighbor Data Distributions

In our KNN-DD algorithm for novelty detection, which we prefer to call surprise discovery, we define a surprise (or anomaly, or novelty) as a data point whose behavior (i.e., whose location in the multi-dimensional parameter space of our big dataset) deviates in an unexpected way from the rest of the data distribution. That surprising location may prompt us to say, “that’s funny!” Such surprises may be the most important things in your data that need your attention. As Isaac Asimov said, “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’.”

Our algorithm evaluates the local data distribution around a test data point and compares that distribution with the inter-point data distribution within the surrounding sample defined by that data point’s K nearest neighbors (not including the data point itself). Because the KNN-DD algorithm focuses only on the local data (refined from the larger dataset) in the neighborhood of the test data point, this local approach to big data essentially enables big surprise discovery from small data.
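
To make the idea concrete, here is a simplified sketch of that comparison (not the published KNN-DD code; it assumes Euclidean distances and uses a two-sample Kolmogorov-Smirnov test as the distribution-comparison step):

```python
# Simplified sketch of the KNN-DD idea (not the published implementation):
# compare the distribution of distances from the test point to its K nearest
# neighbors against the distribution of distances among those neighbors.
import numpy as np
from scipy.spatial.distance import cdist, pdist
from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors

def knn_dd_score(data, test_point, k=20):
    """Return (KS statistic, p-value); a tiny p-value flags a potential surprise."""
    nn = NearestNeighbors(n_neighbors=k).fit(data)
    _, idx = nn.kneighbors(test_point.reshape(1, -1))
    neighbors = data[idx[0]]                                  # the K nearest neighbors
    d_test = cdist(test_point.reshape(1, -1), neighbors)[0]   # test point -> neighbors
    d_local = pdist(neighbors)                                # neighbor <-> neighbor
    return ks_2samp(d_test, d_local)                          # do the distributions differ?

# Toy example: a dense Gaussian blob plus one far-away test point.
rng = np.random.default_rng(0)
blob = rng.normal(size=(500, 3))
print(knn_dd_score(blob, np.array([6.0, 6.0, 6.0])))   # outlier: large statistic, tiny p
print(knn_dd_score(blob[1:], blob[0]))                 # typical point: small statistic
```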

LLE: Locally Linear Embedding

LLE localization infers the true global structure of the data by analyzing local segments of a complex hyper-dimensional data space. In some cases, LLE may be the only way to uncover truly complex interdependencies and interrelationships within high-dimensional data (as illustrated in examples shown here).

Source: https://www.researchgate.net/figure/Clockwise-from-top-right-Helix-Broken-Swiss-Roll-Twin-Peaks-Swiss-Roll4_fig3_228416827

LLE helps us to solve a particular type of problem that occurs when we attempt to build predictive models: specifically, the awkward situation in which we discover that apparently the same set of inputs (independent variables) leads to completely different predicted output values of the dependent variable. In mathematics, this is called a multivalued function. To grasp how this could happen, take some time to examine the graphic below, which unfolds (unrolls) the Swiss jelly roll, thus visualizing the solution and resolving the apparent contradiction described at the start of this paragraph.

Source: https://medium.com/analytics-vidhya/under-and-over-autoencoders-3d695f428c1a

When we learn a predictive model f(x,y) from our data (for example, from data values {x,y}) such that the model predicts z=f(x,y), then that model function should (hopefully) predict just one output value of z from one set of inputs {x,y}. That is what we call a single-valued function. However, that is not true in the LLE examples that we examined earlier. Why? Because those data distributions represent multivalued functions: multiple different values of z correspond to the same pair of input values {x,y}. This occurs simply because there is actually another independent variable (another feature, possibly not yet known, called a latent or hidden variable) that corresponds to the location along the natural hypersurface (the curved surface, or manifold) that holds the data points.
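
A tiny numerical illustration (a hand-rolled spiral, not any particular dataset from this article) shows the same effect: z is multivalued as a function of the measured feature x, but single-valued once the hidden parameter t along the curve is included:

```python
# A spiral in the (x, z) plane: z is a multivalued function of the measured
# feature x, but a single-valued function of the hidden parameter t that
# measures position along the curve (the manifold) itself.
import numpy as np

t = np.linspace(0.5, 4 * np.pi, 2000)   # latent variable: position along the roll
x = t * np.cos(t)                       # measured input feature
z = t * np.sin(t)                       # measured output we would like to predict

# Find every point on the spiral whose x value is (nearly) 2.0.
hits = np.where(np.abs(x - 2.0) < 0.05)[0]
print("z values observed near x = 2.0:", np.round(z[hits], 2))
print("t values at those same points:", np.round(t[hits], 2))
# Several very different z values share the same x, but each has its own t,
# so adding the latent variable turns the multivalued z(x) into a single-valued z(x, t).
```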

LLE is an example of a topological approach to exploratory data analysis. Another example is Topological Data Analysis (TDA). TDA is used by the company Ayasdi to analyze geometrically complex datasets. Discovering and making use of the natural “shape” of the data distribution is essential for effective analytics and data-driven decision-making.

So, in a nutshell, how does LLE work? Basically, it examines the structural distribution of data points in very localized regions in order to find the natural directions in which the data percolates away from each region. The percolation path will follow the natural surface of the data distribution, and will not “jump the gaps” (e.g., in the vertical direction in the LLE diagram discussed above). (Note: Percolation is a mathematical concept that focuses on discovering long-range connectivity in large systems, as determined in small steps through the local structure of the network.)
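
Behind that percolation metaphor, the classic LLE recipe (Roweis & Saul) has three concrete steps, sketched below in bare-bones form for illustration only; in practice you would use scikit-learn's LocallyLinearEmbedding:

```python
# Bare-bones sketch of the classic LLE recipe, for illustration only.
import numpy as np
from scipy.linalg import eigh, solve
from sklearn.neighbors import NearestNeighbors

def lle(X, n_neighbors=12, n_components=2, reg=1e-3):
    n = X.shape[0]

    # Step 1: find each point's local neighborhood (excluding the point itself).
    nbrs = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]

    # Step 2: reconstruct each point as a weighted sum of its neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[idx[i]] - X[i]                              # neighbors, shifted to the local origin
        C = Z @ Z.T                                       # local Gram (covariance) matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)      # regularize for numerical stability
        w = solve(C, np.ones(n_neighbors))                # solve C w = 1
        W[i, idx[i]] = w / w.sum()                        # reconstruction weights sum to 1

    # Step 3: find low-dimensional coordinates that preserve those local weights.
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, vecs = eigh(M)                                     # eigenvectors, ascending eigenvalues
    return vecs[:, 1:n_components + 1]                    # skip the trivial constant eigenvector

if __name__ == "__main__":
    from sklearn.datasets import make_swiss_roll
    X, t = make_swiss_roll(n_samples=1000, random_state=0)
    print(lle(X).shape)   # (1000, 2): the unrolled coordinates
```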

An interesting aspect of the manifold (surface) learning process (in either LLE or TDA) is the fact that the semantically correct distance metric between two data points is the distance along the manifold (i.e., the geodesic distance along the data surface). The correct distance is not the apparent distance (the Euclidean distance) in the (x,y,z) coordinate space of measured features. As we can see in the diagram below, distance and similarity calculations can be wildly wrong if we do not take proper account of the “shape” of our data distribution.

Source: https://indico.cern.ch/event/967970/contributions/4118959/attachments/2151681/3628080/Burnaev_Manifold_Knowledge_Transfer_v2.pdf

The true interdependencies, associations, trends, and correlations within our data collection are traced out by the manifold (data surface) learned by manifold learning (LLE in this case). Consequently, it is entirely possible that two points A and B that sit right on top of each other in (x,y,z) coordinate space are in fact very far apart on the natural hypersurface of the data space. This means that any similarity metric comparing A and B should return a very low similarity value, and any distance metric between A and B should reveal a large distance (in the latent, natural coordinate space of the data).
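
One way to see the size of this effect (an illustrative sketch that approximates geodesic distance with shortest paths through a k-nearest-neighbor graph) is to compare the two distances for points at opposite ends of the Swiss roll:

```python
# Illustrative sketch: Euclidean distance vs. (approximate) geodesic distance on the
# Swiss roll, where the geodesic is approximated by shortest paths through a k-NN graph.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X, t = make_swiss_roll(n_samples=1500, random_state=0)

# Two points at opposite ends of the roll parameter t (innermost vs. outermost layer).
a, b = int(np.argmin(t)), int(np.argmax(t))

# Straight-line distance in the measured (x, y, z) coordinates.
euclidean = np.linalg.norm(X[a] - X[b])

# Path distance constrained to travel along the data surface: the k-NN graph
# cannot "jump the gap" between adjacent layers of the roll.
graph = kneighbors_graph(X, n_neighbors=10, mode="distance")
geodesic = shortest_path(graph, method="D", directed=False, indices=[a])[0, b]

print(f"Euclidean distance: {euclidean:.1f}")
print(f"Geodesic distance : {geodesic:.1f}")   # typically several times larger
```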

Since distance and/or similarity metrics are required in essentially all machine learning clustering algorithms as well as in some classification algorithms (e.g., K-nearest neighbors), it is imperative to discover the natural shape of the data in order to develop and apply correct and meaningful distance and similarity metrics.
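
As a quick generic illustration of that point (scikit-learn toy data, not a dataset from this article): k-means, which relies on straight-line Euclidean distance, mixes up two concentric rings, while a method that respects the local connectivity of the data (spectral clustering here) separates them cleanly:

```python
# Why the "shape" of the data matters for clustering: Euclidean k-means vs. a
# connectivity-aware method (spectral clustering) on two concentric rings.
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y_true = make_circles(n_samples=1000, factor=0.4, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
).fit_predict(X)

# Agreement with the true ring membership (1.0 = perfect, ~0.0 = no better than chance).
print("k-means  ARI:", adjusted_rand_score(y_true, kmeans_labels))    # near 0
print("spectral ARI:", adjusted_rand_score(y_true, spectral_labels))  # near 1
```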

Summary

In the end, focusing on very small local regions of a massive hyper-dimensional dataset enables the correct clusters, segments, categorizations, and classifications of the data points to be assigned. That makes “small data” a very big deal in such complex data distributions.

So, when we get local with our big data by concentrating on the behavior of objects in smaller localized units, we have the potential for significant discoveries from those small data subsets. Therefore, don’t get distracted by all of the talk about big data’s bigness. You can go local with big data, and get big results from small data.

Finally, as an added bonus, it is refreshing and important to note that Small Data and Wide Data (hyper-dimensional data) were listed by Gartner in their Top 10 Data and Analytics Trends for 2021.

Source: https://www.gartner.com/smarterwithgartner/gartner-top-10-data-and-analytics-trends-for-2021

Follow me on Twitter at @KirkDBorne

Learn more about my freelance consulting / training business: Data Leadership Group LLC

See what we are doing at AI startup DataPrime.ai
