Topological Data Analysis and Guacamole Cats

Change one pixel of a photograph, and you can fool even the best computer-vision algorithms. The solution: build models that don’t fall for those tricks and that can generalize far beyond what you’ve taught your algorithm.

Alex Georges
GAMMA — Part of BCG X
11 min read · Apr 7, 2020

This article is laid out in four sections, which take us on a journey of discovery: description of the problem → description of the tool → how this tool can solve the problem → further implications.

Note: An in-depth mathematical treatment of some of the ideas presented here can be found in our published research paper on IEEE (alternatively, on arXiv).

SECTION I: The Problem

It’s another bright sunny day in San Diego and you’ve decided to walk downtown and grab some lunch. On the way there, you come to a four-way stop. Unbeknownst to both you and the self-driving car barreling down the street toward the intersection, the stop sign just ahead has been altered:

While the pithy comment spray-painted on the sign doesn’t hide from you the fact that it’s still a stop sign, the story might be completely different for the car. The algorithm running the autonomous vehicle could misinterpret this object in any number of ways: It could see this stop sign as a cat or an airplane or, if it’s an especially smart car, just an “other” object.

The fact is that you can fool current state-of-the-art computer-vision algorithms by changing a single pixel in a photograph (that’s right, a single pixel!). Of course, this kind of alteration would be completely imperceptible to a human. We’ll refer to these kinds of examples as “adversarial examples.”

Here’s an actual adversarial example constructed by researchers at MIT, in which the neural network’s outputs show the probability distribution over classes for the input images:

Cat → Guacamole. Image courtesy of MIT researchers.

This article presents a novel approach to constructing robust classifiers that are not easily fooled by adversarial examples, along with an application of it. More broadly, this approach could be used for any classification problem that requires some forgiveness in either the data or the parameters of the model. Said differently, if you need an algorithm that can generalize far beyond what you’ve taught it, our TDA+ML approach might work well for you. It also has the potential to be much less biased than current ML algorithms: those algorithms tend to do a good job of learning precise mathematical structures, but they have a hard time generalizing beyond them. (We’ll leave a discussion of the ethical concerns inherent to ML algorithms for another article.)

The approach presented here is based on research I did while working on my physics PhD; I presented it at the IEEE ICML 2019 conference, and it was published in the proceedings. In particular, the approach is built on tools from topological data analysis (“TDA”), used in conjunction with conventional machine learning methods.

Topological data analysis, which is based on algebraic topology, can identify significant global mathematical structures that are out of reach of many ML methods, which focus more on specific, local structure. Without an extensive mathematical background, it can be difficult to follow a purely mathematical explanation of TDA, so I will opt instead for a more intuitive exposition; for the mathematical details, please see our research paper.

The novel combination TDA+ML results in a robust algorithm that resolves some fundamental issues present in other current state-of-the-art methods. We’ll analyze this performance in the context of noisy image classification.

SECTION II: An Overview of TDA

Let’s briefly review TDA and its motivations from both mathematical and human-based learning perspectives.

Mathematical Motivation

Many machine learning methods are fundamentally built upon notions from geometry including length, area, volume, local curvature and others. These geometric concepts can be thought of as inferring the structure of the data by building upon local information. This usually means that if we perturb the data in some way, or change some of the parameters in the workflow, the outcome of the algorithm may completely change.

If we think of geometry as a bottom-up approach to understanding structure, topology can then be thought of as a top-down approach: We start by looking at the object’s whole structure, then infer information from this. If you ask a topologist to describe an object with a hole in it, he or she might just say the object is in one piece and that there is an empty space in the object, which is considered to be a two-dimensional hole. To a topologist, a sphere is only a single object that has a hole in it — nothing more, nothing less.

In general, the bias-variance tradeoff typically requires us to give up specificity if we want an algorithm that’s great at generalization. But by using TDA+ML we can resolve this tradeoff: rather than relying on a purely geometric or purely topological approach, we benefit immensely when we ensemble the two and leverage both specificity and generality. This is the power of using TDA+ML.

TDA is a relatively new field of data science that has emerged out of ideas from pure mathematics. Although TDA may seem too abstract or contrived to have any real utility, the tools associated with it can be extremely powerful in helping us understand real-world data. These tools come with three fundamental properties that make TDA so powerful:

  1. Compressed Representation
  2. Coordinate Invariance
  3. Deformation Invariance

Compressed representation can be thought of as a summary of the data, distilled to its most important structure. Because of this property, TDA is often used solely as an exploratory data analysis tool, since we can visualize this distilled structure.

But what if we want something more systematic than exploration, and want to put more weight on the other two properties? At this point, we’re led very nicely into the world of classification. By placing TDA into some of the steps of the classification process, we can incorporate the other two fundamental properties, coordinate invariance and deformation invariance. These properties are important for a wide class of problems, including computer vision.

Coordinate invariance means our results should not change based on the coordinate system we choose to impose on our problem. Deformation invariance means we can stretch and modify the space without changing the results. If you ask a topologist whether there is any difference in the mathematical structure between a coffee cup and a donut, they will deny any such difference: each is a single connected piece with exactly one hole, and one can be continuously deformed into the other.

Below, we can see what this deformation invariance means for a continuous shape.

This is deformation invariance at its tastiest: Coffee cup = Donut. In this example, the structure is continuous, as opposed to a discrete data space.

An implication of deformation invariance is that we don’t need precisely defined data. This means that we can potentially limit inherent biases that are harder to avoid in other methods that learn only on precise structure.

Human-Based Learning Motivation

In general, error can be decomposed into two parts: (1) the error coming from being too specific (also called having high variance) and (2) the error coming from being too simple (also called having high bias). The TDA+ML approach can be thought of as a means to minimize both the bias and variance errors by capturing first global, then local structure in the data. In other words, the TDA coarse-graining initially produces an underfit model, but we regain complexity by using an ensemble approach in conjunction with conventional classifiers.
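
For reference, this is just the standard squared-error decomposition from the textbooks (nothing specific to our approach), where the last term is the irreducible noise:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}} \;+\; \sigma^2
$$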

It appears that humans also resolve the bias-variance tradeoff by building up a complex system from simple heuristics in a “less-is-more” approach. Neural nets and random forests, on the other hand, typically start out overly complex and are then made simpler. This difference in approach may explain why machines can perform much worse than humans at certain tasks.

Some issues specific to neural networks include:

  • The ability of convolutional neural nets to perfectly fit random noise, which is theoretically consistent with the universal approximation theorem
  • The existence of adversarial examples for neural nets, which can be thought of as minimally intrusive ways to trick the nets
  • The necessity of many data points to train the nets
  • The computational cost of training the nets
  • The difficulty in optimizing the nets

These issues point to the fact that, although deep learning and ML methods have been widely successful, they process data differently than do humans.

SECTION III: Using TDA to Solve the Problem

Our task is to build a robust classifier based on TDA+ML. When we use this classifier for computer-vision problems, “robust” means that it won’t typically be fooled by adversarial or noisy examples. More than that, it should be robust enough to handle all sorts of noisy examples. CNNs, for example, have to explicitly build in translation invariance, which is done through the convolutional architecture. We don’t want to, and in the real world we aren’t able to, specify every type of perturbation our algorithm might see and then build in that specific invariance. A beauty of using TDA is that it should be noise agnostic: unaffected by how we perturb the data. Remember, this property comes fundamentally from its deformation invariance.

The TDA approach we’ll use for this treatment is based on the mapper algorithm. We’ll briefly walk through this algorithm, then discuss how it can be used in conjunction with neural nets to produce robust computer vision systems.

Image from Soudot’s excellent discussion of mapper.

Step 1: Start with your data, X.
Step 2: Map X to ℝ, using a function of your choice. The function f, referred to as a lens, effectively encodes some unique aspect of the data.
Step 3: Choose any arbitrary open cover for ℝ so that it covers the entire range of f (i.e. the blue, green, red, purple).
Step 4: Find all the points that land in each interval (for instance, all the points that map to Green). In other words, compute the preimage of each interval.
Step 5a: Cluster the inverse image from step 4, such as by finding clusters of the green points in X.
Step 5b: One way to accomplish 5a is to define a parameter (𝛅) that determines if two points are close or far in X.
Step 6: Represent all points in a cluster by a single node in the graph (i.e. the green nodes in the mapper).
Step 7: Connect nodes if there is at least one shared point between them in the inverse images (i.e. the left green cluster’s bottom intersects with the left red upper arm).

The intuition here is that if we deform the original space X in some way, the final mapper object should remain relatively unchanged due to deformation invariance.
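
To make the steps concrete, here is a minimal sketch of mapper in Python, using numpy and scikit-learn. The lens, the interval cover, and the DBSCAN clustering are illustrative stand-ins rather than the exact choices used in our paper, and parameter names like `n_intervals` and `overlap` are my own.

```python
# A minimal mapper sketch (illustrative, not the paper's implementation).
import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X, lens, n_intervals=4, overlap=0.25, eps=0.5, min_samples=3):
    """Nodes are clusters of interval preimages; edges connect nodes
    that share at least one data point."""
    f = lens(X)                                    # Step 2: project X to the real line
    lo, hi = f.min(), f.max()
    width = (hi - lo) / n_intervals
    nodes, edges = [], []
    for i in range(n_intervals):                   # Step 3: overlapping cover of the range of f
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((f >= a) & (f <= b))[0]     # Step 4: preimage of this interval
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])  # Step 5
        for lab in set(labels) - {-1}:             # Step 6: one node per cluster (ignore noise)
            nodes.append(set(idx[labels == lab]))
    for i in range(len(nodes)):                    # Step 7: edge if two nodes share a point
        for j in range(i + 1, len(nodes)):
            if nodes[i] & nodes[j]:
                edges.append((i, j))
    return nodes, edges

# A noisy circle with the x-coordinate as the lens: the resulting graph
# should roughly recover the circle's loop, even if we jiggle the points.
theta = np.random.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(300, 2)
nodes, edges = mapper_graph(X, lens=lambda Z: Z[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```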

To play with this process interactively, check out the amazing open-source TDA project below; they have excellent code, documentation, and research.

A proof that these mapper objects capture topological information in the data is beyond the scope of this article, but it can be established mathematically in general. This means that the mapper objects ultimately inherit from TDA the three fundamental properties we require: compressed representation, coordinate invariance, and deformation invariance.

With this in mind, we can go over the general details of creating the robust classifier using these mappers (as mentioned above, you can find the mathematical details in our published research paper on IEEE). The general procedure is as follows:

  1. Repeat the mapper algorithm numerous times to produce different mapper objects. If you have a background in CNNs, think about each of these objects as a different kernel, each capturing distinct information from the data.
  2. Collect all these mapper objects together and represent the data by where it lands in each of the nodes. This ultimately provides an entirely new data space to work with, one which has encoded topological information along with the class labels.
  3. Feed this new data space, along with the labels, into a classifier. Again, if you have a background in CNNs, the mapper objects can be thought of as entirely replacing the convolutional layers. In a CNN, those layers can blow up tiny perturbations (hence the adversarial examples); in the mapper layers, the effect of little perturbations is bounded. A rough sketch of this pipeline follows the list.
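
Here is a hedged sketch of that three-step pipeline, reusing the `mapper_graph()` helper from the earlier snippet. The toy data (two noisy circles), the choice of lenses, and the logistic-regression classifier are all illustrative assumptions on my part, not the exact setup from the paper.

```python
# Sketch of the TDA+ML pipeline on toy data (reuses mapper_graph from above).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def noisy_circle(radius, n):
    t = rng.uniform(0, 2 * np.pi, n)
    return np.c_[radius * np.cos(t), radius * np.sin(t)] + 0.05 * rng.normal(size=(n, 2))

# Two classes: a small circle (label 0) and a large circle (label 1).
X_train = np.vstack([noisy_circle(1.0, 200), noisy_circle(2.0, 200)])
y_train = np.r_[np.zeros(200), np.ones(200)]

def node_membership_features(X, mapper_nodes_list):
    """One binary column per mapper node, across all mapper objects:
    entry (i, j) = 1 if point i fell into node j."""
    cols = []
    for nodes in mapper_nodes_list:
        for node in nodes:
            col = np.zeros(X.shape[0])
            col[list(node)] = 1.0
            cols.append(col)
    return np.stack(cols, axis=1)

# Step 1: build several mapper objects, each with a different lens.
lenses = [lambda Z: Z[:, 0], lambda Z: Z[:, 1], lambda Z: np.linalg.norm(Z, axis=1)]
mapper_nodes_list = [mapper_graph(X_train, lens)[0] for lens in lenses]

# Step 2: represent each point by which nodes it lands in.
F_train = node_membership_features(X_train, mapper_nodes_list)

# Step 3: train a conventional classifier on this topological representation.
clf = LogisticRegression(max_iter=1000).fit(F_train, y_train)
print("train accuracy:", clf.score(F_train, y_train))
```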

Very roughly speaking, this is how the end classifier would think:

Roughly this is how our algorithm thinks. It groups different instances of the same class in nearby nodes — even if an instance has been perturbed.

Our approach is to first construct a trained mapper object that might look like the above, then train a classifier on it. When a new test data point comes in (let’s say it’s a 3), it will land in the grouping of nodes where the other 3s are. This is the case even if the test 3 has whatever noise you throw at it.
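
To see how a new, perturbed point gets handled, here is one rough way to do it (my own simplification, not the paper’s procedure): assign the point to every node containing its nearest training neighbor, then reuse the node-membership features with the trained classifier. It reuses `X_train`, `mapper_nodes_list`, and `clf` from the sketch above.

```python
import numpy as np

def features_for_new_point(x, X_train, mapper_nodes_list):
    # The nearest training point stands in for the new point's node assignments.
    nearest = int(np.argmin(np.linalg.norm(X_train - x, axis=1)))
    cols = []
    for nodes in mapper_nodes_list:
        for node in nodes:
            cols.append(1.0 if nearest in node else 0.0)
    return np.array(cols).reshape(1, -1)

# Perturb a known point; its node membership (and thus its prediction) should barely move.
x_noisy = X_train[0] + 0.1 * np.random.randn(2)
print(clf.predict(features_for_new_point(x_noisy, X_train, mapper_nodes_list)))
```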

At this point, you might be thinking: “Okay, I absolutely loved the math theory, but I’m still curious as to how it performs on actual data.” To which I will respond: “Great, I loved the theory too, and here you go!”

The test accuracy of our methods vs. the CNN under Gaussian-blur perturbations of the MNIST data. All mapper classifiers are significantly better than the CNN (shown in purple). We get a 58% increase in performance, and the gap only grows as the noise increases.

If you skipped absolutely everything in this article and just landed here, you would probably see the purple curve and say “oh, that looks bad.” And indeed it is — that’s what we get with a CNN. On the other hand, all the other curves are mapper-classifier based (each of which differs only in how we choose to project the data) — and all of them completely outperform the CNN. Pretty exciting, right…?

SECTION IV: Summary

I have left out a great deal of technical details in the above explanation, including the precise mathematical formulation and an exposition of all the parameters/hyperparameters involved.

At this point, I mainly hope to have related that:

  • TDA can be useful in situations where we need to allow for “stretchiness” in the data or in the parameters to our data science approach.
  • The algorithm we developed is a general robust classifier (we also applied the classifier to Fashion MNIST and obtained similar results).
  • Mapper objects ultimately provide this robustness by capturing global information about the data space, so perturbations leave the structure relatively unchanged.
  • An important consequence is that TDA approaches don’t care about the direction of the perturbation. In other words, they are noise agnostic.
  • TDA can be used when we just have a notion of similarity in the data, as opposed to other methods that require us to have precise descriptions.
  • Because we don’t need precisely defined data, and because TDA focuses on broad structure, we can potentially limit inherent biases that are harder to avoid in other methods. This can possibly help us walk the path of a more ethical data science.

Data science is becoming more ubiquitous in almost all aspects of life, and its transformative potential has been proven time and time again in widespread domains. Hopefully the work presented here will, to some extent, push the boundary of how we interact with data science.
