Topological Data Analysis for Practicing Data Scientists

How TDA can help you capture local and global properties of your data in exciting new ways.

Joe Tenini, PhD
Red Ventures Data Science & Engineering
20 min read · Sep 6, 2022

--


The goal of this article is to provide a basic and self-contained introduction to a topological perspective on data. By the end of this article you’ll have intuition, a formal understanding of the basics, and some practical examples to get the wheels turning and unlock new value from data within your area of interest.

For me, the motivation for this topological line of thinking comes from a trick I picked up in my past life as a geometer: distinguishing in your mind whether your thought process takes a “local” or a “global” perspective. Is your function differentiable? This can be checked locally — a function is differentiable when it is differentiable at each point. Is your space connected? This is something that cannot be checked locally. If your space is two disjoint spheres, a neighborhood of each point will look connected. To tackle a concept like connectedness you need to consider the relationship between your local pictures of the space.

If you work in data science, chances are you feel pretty comfortable thinking in terms of the “local geometry” of your data — given a user, what article should you recommend? what ad suits their personal taste best? what products would they be most interested in? Or — more generally — what point in my parameter set minimizes this loss function? This is a very useful perspective, but it’s still a limited one — in particular, it leaves out the “global geometry” of the data as a whole. Questions about the overall shape of your data are harder to answer with the traditional ML tool kit — what is the shape of a user journey? what content sequence gets a user ready to buy a new house? what are the gaps in my site? These questions require a new perspective — one provided by TDA.

Setting the stage

Most of the data science done in the current corporate setting concerns itself with the analysis of a finite collection of points in a vector space. Granted, there can be (sometimes many) difficult steps involved in getting your data into the realm of a vector space, particularly when your data consists of text, images, video, or sound. However, your average data science team is unlikely to be adding the bulk of its value by creating novel ways to encode or embed data as a vector. It is much more likely (though by no means unheard of) that they are leveraging open source tools to do this embedding.

With this in mind, as the starting point of our data science process, consider a finite set of points in a finite dimensional vector space.

There are two common non-technical business asks associated with this setup:

  1. (Supervised Learning) Choose a function that predicts certain coordinates of points from other coordinates. Given a user’s data, tell me their lifetime value, their risk of churn, or which content they’re most likely to read.
  2. (Unsupervised Learning) Identify structure present within the set of data. Build a set of labels that let me understand my content portfolio, build me a meaningful set of audiences that I can personalize ads to.

As data scientists, we commonly identify ask (1) as one of function approximation, where — writing a data point as (x, y) in D — we seek to choose a function from a class F in such a way that some error (or loss) metric L is minimized. Concretely, usually something like:

f* = argmin_{f ∈ F} Σ_{(x, y) ∈ D} L(f(x), y)

The actual processes of evaluating L and searching F generally leverage some nice (and local!) mathematical structure of the data and function spaces — usually the fact that D sits in a metric space and the functions in F are differentiable.

By contrast, ask (2) often amounts to an ask to find meaningful “clusters” within the data. Perhaps they are groups of users who behave similarly (audiences) or groups of content about the same topic (tags). In any case, this ask has an interpretation that does not explicitly rely on strong structural notions like metrics or differentiability. Instead it amounts to a notion of connectedness — a topological concept!

An example of article clustering

Let’s make this more concrete with an example. Suppose that we have a collection of articles, and we would like to construct a hierarchical labelling system. We can (via our favorite text embedding tool — say Google’s universal sentence encoder) consider these articles as points in a vector space, which we will now — in the style of topological data analysis — refer to as a point cloud. Our intuition is that the embedding tool encodes the semantic similarity of our content. That is, similar articles are close in the vector space, while very different articles are far apart.

A natural approach might be hierarchical agglomerative clustering. That is, we consider each point as its own separate connected component. We then put a ball around each point of radius r. As we increase r from 0, the connected components start to merge until we are left with only a single component. These components can be tracked via a dendrogram. A particular snapshot that exists during this process of growing the radius can be thought of as a horizontal slice of the dendrogram. We can picture this as below.
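A minimal sketch of this process with scipy (the blob data here is made up, standing in for three loose clusters of embedded articles): single linkage is exactly the “components merge when their closest points touch” rule, and cutting the dendrogram at a height gives the snapshot described above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# three tight blobs standing in for three topics of embedded articles
cloud = np.vstack([rng.normal(center, 0.1, size=(10, 2))
                   for center in [(0, 0), (3, 0), (1.5, 2.5)]])

# single linkage merges two components exactly when their closest points touch
Z = linkage(cloud, method="single")

# a horizontal slice of the dendrogram at height 1.0: three components remain
labels = fcluster(Z, t=1.0, criterion="distance")
print(len(set(labels)))
```

Cutting at a much smaller height would return many short-lived singleton clusters; cutting above the largest merge height returns one.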

In looking at the point cloud and dendrogram, it’s likely that you had a fairly useful spatial intuition. I imagine each point as a growing glob of liquid mercury that starts to merge with other globs until it’s one giant glob. Can we make this (potentially useful) intuition precise? The answer is yes, and in fact, in many ways.

The missing concept we need at this time is that of a simplicial complex — it will help us move from an image of mercury globs into something more formal. While we won’t define it precisely at this moment, you can think of it informally as a generalization of a graph. Instead of stopping at a vertex set and an edge set though, we will allow “edges” of higher dimension (we call such a generalized edge a simplex). For example three vertices might share a 2-dimensional face along with its 1-dimensional edges and 0-dimensional points.

One simplicial complex that we can associate to a point cloud is called the Vietoris-Rips complex. For a given radius r, a collection of points spans a simplex if and only if their balls intersect pairwise — for every pair of points in the collection. (The Cech complex is another popular construction, where the intersection of all the balls must be non-empty, not just the pairwise intersections — a slightly different condition.) For the smallest (red) radius, the Vietoris-Rips complex would just be 9 points in the plane. The Vietoris-Rips complex for the blue radius is drawn in figure 3. Drawing the VR complex for the orange radius is tricky, but I invite you to try!
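To make the construction concrete, here is a small sketch of building a Vietoris-Rips complex directly from pairwise distances (the function name is mine; it checks every vertex subset up to a chosen dimension, which is fine for toy point clouds but far too slow for real ones — libraries like Ripser do this efficiently):

```python
import numpy as np
from itertools import combinations

def vietoris_rips(points, alpha, max_dim=2):
    """Simplices whose vertices are pairwise closer than alpha."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):  # k vertices span a (k - 1)-simplex
        for idx in combinations(range(n), k):
            if all(dist[i, j] < alpha for i, j in combinations(idx, 2)):
                simplices.append(idx)
    return simplices

# corners of a unit square: at alpha = 1.2 only the four sides appear;
# at alpha = 1.5 the diagonals (length ~1.414) and all four triangles appear
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(len(vietoris_rips(square, 1.2)), len(vietoris_rips(square, 1.5)))
```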

In any case, for a given radius, the VR complex is a topological space (with the topology it inherits from the embedding space), and so we can ask questions about connectedness or (more generally) something called its homology — a very useful concept we’ll make precise later.

As you’ve no doubt noticed, our discussion in the previous section was sort of unsatisfying due to the presence of the radius parameter. Which radius should I choose after all?! As is often the case with these questions, the best answer should somehow be to “use all the radii at once.” Let’s go back to our example point cloud to try to think this through in more detail.

Looking at the point cloud in figure 1, how many clusters would you say there are? I would say there are three. Is this visible in the dendrogram in any way? (After all, it represents all radii). Yes, it is! Most clusters are quite short lived (measuring time in the value of the radius), but there is a long period where there are three clusters. Moreover, this lifespan (or persistence) of the topological property of the three connected components quantifies the extent to which it exists in the point cloud at a “global level”.

This makes sense if all we care about is the number of connected components, but we would really like to consider the more complex topological properties as well. To do this, we need the concept of a filtration of simplicial complexes — essentially a family of nested simplicial complexes indexed by some subset of the real numbers. In our case, the family of Vietoris-Rips complexes we get by letting the radius range over positive real numbers is a filtration.

What we need now is a tool that lets us study how topological properties like homology change over a filtration. If we had such a thing, the short-lived homological properties would do a great job of describing the local geometry (something we still care about!) while the more persistent properties would give us the new “global” information we crave. It turns out that such a tool does exist — persistent homology.

Formalizing our intuition

With our intuition fully primed, we now take some steps toward formalizing simplicial complexes, homology, and persistence.

What is a topological space?

A topological space is a set X along with a topology τ on that set. A topology on a set is a collection of subsets — typically called open sets — that have to obey certain rules. Namely:

  1. The topology must contain X and the empty set.
  2. An arbitrary union of sets in the topology is in the topology (topologies are closed under arbitrary unions).
  3. A finite intersection of sets in the topology is in the topology (topologies are closed under finite intersections).

For our discussion, we will primarily be interested in the topology induced by the Euclidean metric in a finite dimensional real vector space V usually called the Euclidean topology. More concretely, we consider the topology generated by open balls B(p, r) = {x ∈ V | d(p, x) < r} where d is the Euclidean metric (you can play this game with any metric though). This means that open sets will be arbitrary unions and finite intersections of open balls.

The topological spaces of interest to us will not be Euclidean spaces, but rather simplicial complexes sitting inside a Euclidean space. In cases like this, the subspace S ⊂ X can inherit the topology from the ambient space, something called the subspace topology. This is done simply by saying that a set U ⊂ S is open in the subspace topology if and only if it can be written as the intersection of an open set in the larger space with the subspace, e.g. U = W ∩ S for some open set W ⊂ X.

From point clouds to topological spaces

As we’ve mentioned before, for many data science projects the initial setting is one of a point cloud, and we would like to transition to the setting of a topological space. This can be done naturally in a couple of ways (we’ve mentioned the Vietoris-Rips and Cech complexes) by creating a simplicial complex structure on the points. We will now formally define these concepts and give some examples.

A note to the reader now — I can’t seem to get subscripts and superscripts to work out how I’d like on medium, so I’ll just be appending subscripts as a normal character. For example, “x sub k” will just be “xk”. My apologies on the unsightly nature of this — if you know how to fix this, please drop a comment! The formulas appearing as images should look correct as I was able to TeX them.

Let X = {x0, . . . , xk} be k + 1 affinely independent points in real n-dimensional space (no three lie on a line, no four lie on a plane, etc). The k-dimensional simplex σ = [x0, . . . , xk] spanned by X is the convex hull of X. The points of X are called vertices of σ and the simplexes spanned by subsets of X are called faces.

A geometric simplicial complex K in real n-dimensional space is a set of simplexes such that:

  1. Any face of a simplex of K is a simplex of K.
  2. The intersection of any two simplexes of K is empty or a common face of both.

For example, a graph consists of a collection of vertices (0-simplexes) and edges (1-simplexes). The intersection of two edges will either be a common vertex or the empty set. In this way, a graph is a 1-dimensional simplicial complex.

That said, with a simplicial complex we can generalize the relationships usually modeled by graphs to higher cardinality sets of vertices (but in a more structured way than the concept of a hypergraph). For example, with a simplicial complex we can specify a relationship between three vertices that extends beyond mutual pairwise relationships as described in figure 4.

Another source of examples are the Vietoris-Rips and Cech complexes — simplicial structures we can put on a point cloud. Both definitions depend on a real number parameter α > 0, which will specify an incidence threshold within the underlying metric space.

In particular, under the Vietoris-Rips definition, we say that a collection of points {x0, . . . , xk} forms a simplex if and only if d(xi, xj) < α for all pairs of points. Under the Cech definition, we say that a collection of points {x0, . . . , xk} forms a simplex if and only if the intersection of all the balls is non-empty.
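The gap between the two definitions can be checked concretely for three points in the plane. Here is a sketch (the helper names are mine) using the ball-of-radius-α convention for both complexes, under which the VR pairwise condition reads d(xi, xj) < 2α. It shows the classic discrepancy: for a unit equilateral triangle there is a range of α where the VR complex is already a filled triangle while the Cech complex still has a hole.

```python
import numpy as np

def vr_2simplex(p, q, r, alpha):
    """VR: balls of radius alpha around the points intersect pairwise."""
    return all(np.linalg.norm(a - b) < 2 * alpha
               for a, b in [(p, q), (q, r), (p, r)])

def cech_2simplex(p, q, r, alpha):
    """Cech: the three balls share a common point, i.e. the minimum
    enclosing ball of {p, q, r} has radius < alpha."""
    a, b, c = sorted([np.linalg.norm(p - q), np.linalg.norm(q - r),
                      np.linalg.norm(p - r)])
    if c ** 2 >= a ** 2 + b ** 2:      # obtuse/right: ball spans longest side
        radius = c / 2
    else:                              # acute: circumradius via Heron's formula
        s = (a + b + c) / 2
        area = (s * (s - a) * (s - b) * (s - c)) ** 0.5
        radius = a * b * c / (4 * area)
    return radius < alpha

# unit equilateral triangle: circumradius is 1/sqrt(3) ~ 0.577, so at
# alpha = 0.55 the VR triangle is filled in but the Cech triangle is not
tri = [np.array(v) for v in [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]]
print(vr_2simplex(*tri, alpha=0.55), cech_2simplex(*tri, alpha=0.55))
```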

So how do we distinguish between complexes at the level of topological spaces? For example, how would we identify that one of the simplicial complexes in figure 4 is a loop, while the other is “filled in”? This distinction can be captured by the homology of the space, which (among other things) identifies things like holes that exist in the space.

The homology of a simplicial complex

We find ourselves in the situation now where we can go from a point cloud to a topological space by endowing it with a simplicial structure. What we lack currently is:

  1. A way to identify properties of a topological space.
  2. A way to understand and quantify differences between topological spaces.

We will start by discussing different notions of equivalence in the category of topological spaces (when are two spaces the same?) and then move on to this mythical concept we’ve mentioned a few times — homology.

The strictest (and most natural) notion of equivalence between two topological spaces would be that of homeomorphism. As in any category, two topological spaces X and Y are homeomorphic if there exist (continuous) maps f : X → Y and g : Y → X such that f ◦ g is the identity map on Y and g ◦ f is the identity map on X.

It turns out, this notion of homeomorphism may be a bit too strong for our purposes, so we also introduce the concept of homotopy equivalence of two spaces. We say that two continuous maps f, g : X → Y are homotopic if there exists a continuous map H : X × [0, 1] → Y such that H(x, 0) = f(x) and H(x, 1) = g(x) for all x ∈ X. Two spaces X and Y are homotopy equivalent if there exist maps f : X → Y and g : Y → X so that f ◦ g and g ◦ f are homotopic to the identity maps on their respective spaces. While there are many nice things about the notion of homotopy equivalence, one thing we will leverage is that homotopy equivalent spaces have the same homology.

Defining and computing homology groups

So what is this mysterious homology that keeps being brought up? Informally, it is a way of describing the structure of a topological space that is invariant under homotopy equivalence. For each integer k, there is a kth homology group Hk(X) that describes the k-dimensional structure of the space. Moreover, this description is functorial — something mathematicians tend to get excited about. This means that a map between topological spaces X → Y induces a map on homology groups Hk(X) → Hk(Y). This can be a very powerful property as it allows us to push questions between categories — in our case from the category of (sometimes mystical) topological spaces to the category of (very concrete) abelian groups.

Let’s look at how to compute it and then some examples.

The computation of homology on a topological space X begins with the definition of a chain complex C(X) on X, which encodes the information of X into a sequence of abelian groups C0, C1, C2, . . . . Informally, this captures the simplexes that X is built out of and how they are glued together. This gluing information is encoded in boundary operators ∂n : Cn → Cn−1 (note the funny subscript typesetting convention I’m using, sorry!) which yields a sequence:

· · · → Cn → Cn−1 → · · · → C1 → C0 → 0

where consecutive boundary maps compose to zero (∂n ◦ ∂n+1 = 0). The ith homology group is defined as Hi(X) = ker(∂i)/im(∂i+1) and should intuitively capture the gaps, holes, voids, etc. present in the ith dimension. The kernel of a map (ker) is the set of elements that are mapped to 0. The image of a map (im) is the set of elements in the target that are “hit” by the map. Let’s look at some examples.

In figure 5 we have a “filled in triangle”, which we could think of as a single 2-simplex bounded by three 1-simplexes bounded by three 0-simplexes (or vertices). We first choose an orientation on this complex — say vertices v0, v1, v2, edges e0 = [v0, v1], e1 = [v1, v2], e2 = [v0, v2], and face f = [v0, v1, v2] — which can be described with the chain complex:

0 → C2 = Z → C1 = Z³ → C0 = Z³ → 0

where the nontrivial maps are given by:

∂2(f) = e0 + e1 − e2
∂1(e0) = v1 − v0, ∂1(e1) = v2 − v1, ∂1(e2) = v2 − v0
We leave it as an exercise to the reader to verify that the homology groups in this setting are H2(X) = H1(X) = 0 and H0(X) = Z (Z here means the integers). This does require some idea of how quotients of abelian groups work, but I hope that it’s still intuitive if you haven’t had a course in group theory. In any case, these homology groups are usually explained by saying that there are no “voids” present in the structure (H2(X) = 0), no “holes” (H1(X) = 0), and only one connected component (rank(H0(X)) = 1).

Incidentally, these are the homology groups of a point as well! Indeed, this filled in triangle is homotopy equivalent to a point (but not homeomorphic to a point). To see this, you can imagine continuously shrinking the triangle to a point — something you can do without tearing or gluing up any part of the space.

Let’s contrast this now with a hollow (non-filled-in) triangle. What changes is that C2 is now zero (there are no 2-simplexes), so there is no longer an image of ∂2 to quotient by, and H1(X) = ker(∂1) = Z. H2(X) remains zero and H0(X) remains Z. This would be interpreted as a triangle not having any two-dimensional structure, having a 1-dimensional hole, and having one connected component.
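The reader exercise above can be checked mechanically. Over the rationals, the ranks of the homology groups (the Betti numbers) follow from the ranks of the boundary matrices, so a few lines of numpy verify both triangles (a sketch; the helper name is mine, and the matrices encode one standard orientation of the triangle):

```python
import numpy as np

def betti_numbers(n_simplices, boundary):
    # b_k = dim C_k - rank(d_k) - rank(d_{k+1}), computed over the rationals
    ranks = {k: np.linalg.matrix_rank(m) for k, m in boundary.items()}
    return [n_simplices[k] - ranks.get(k, 0) - ranks.get(k + 1, 0)
            for k in range(len(n_simplices))]

# Oriented triangle on vertices v0, v1, v2 with edges e0 = [v0, v1],
# e1 = [v1, v2], e2 = [v0, v2] and (for the filled case) face f = [v0, v1, v2]
d1 = np.array([[-1,  0, -1],   # rows: vertices, columns: edges
               [ 1, -1,  0],
               [ 0,  1,  1]])
d2 = np.array([[1], [1], [-1]])  # boundary of f is e0 + e1 - e2

filled = betti_numbers([3, 3, 1], {1: d1, 2: d2})
hollow = betti_numbers([3, 3], {1: d1})
print(filled, hollow)  # filled: [1, 0, 0]; hollow: [1, 1]
```

One component and no holes for the filled triangle; one component and one hole for the hollow one, matching the discussion above.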

The persistence diagram

As we have noticed in the above examples, when considering Vietoris-Rips or Cech complexes, the topology (and thus the homology) depends on the choice of radius. In the case of a Cech complex, if we start with three points and increase the radius about each one, we will eventually get a triangle, and then increasing more, find ourselves with a filled in triangle. A solution to this problem is considering how the homology changes or persists as the radius changes. Generally speaking, we will take the view that homological features (number of connected components, holes, voids, etc) which persist for a large range of radii are more fundamental to the point cloud structure than features that are very short lived. Note however, that in some cases the fingerprint of these short lived and noisy homological features can be very useful for prediction tasks.

A persistence diagram (or barcode) encodes the lifespans (in terms of radius) of homological features for a point cloud. Though there are a few ways to visualize this, we include a natural one in figure 6. Take our point cloud to be the corners of an equilateral triangle. For very small radii, there are three components H0 = Z³ , which eventually merge into a single triangle H0 = Z, H1 = Z, which eventually merges into a filled in triangle H0 = Z, H1 = 0.
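For dimension zero, the barcode can be read straight off a single-linkage dendrogram: every point is born at radius 0, and a component dies at each merge height. A sketch for the equilateral-triangle cloud (H0 only; higher-dimensional bars need a library such as Ripser, and merge heights here are in distance units, which you would halve under the ball-radius convention):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 3 ** 0.5 / 2]])
deaths = sorted(linkage(triangle, method="single")[:, 2])  # merge heights
# each merge kills one component; one component lives forever
bars = [(0.0, d) for d in deaths] + [(0.0, np.inf)]
print(bars)
```

For the unit equilateral triangle both merges happen at distance 1, so two bars die together and one infinite bar remains: the cloud is, persistently, one component.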

There are many ways to extract properties from this barcode for use as features of the point cloud, both by encoding it as a vector or pulling out persistent homological properties.

Considering other simplicial complexes with lenses

While building simplicial structures on our data in the ways described so far is useful, it’s not the only game in town. In particular, there is a “relative” version of endowing this simplicial structure, which is quite satisfying from a mathematical point of view, and quite useful from a practical one. Specifically, the motivation is that we might want to view our data from a particular perspective. If we can encode that perspective as a map to a simple topological space, then we can pull back the topology of the simple space — having the informative map shape the topology on the original space. This idea is called the mapper algorithm, which we make precise now.

Given a space X and a cover by open sets U = {Ui} for i ∈ I, the nerve of U, written C(U), is the abstract simplicial complex whose vertices are the Ui’s and such that σ = [Ui0, . . . , Uik] is a simplex in C(U) if and only if the intersection Ui0 ∩ · · · ∩ Uik is non-empty. (Note that the Cech complex is a special case of this!)
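This definition is easy to implement directly for a finite cover (the function name is mine; each cover element is represented as a set of point indices):

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """A set of cover elements spans a simplex iff they share a point."""
    sets = [set(u) for u in cover]
    simplices = []
    for k in range(1, max_dim + 2):  # k cover elements -> (k - 1)-simplex
        for idx in combinations(range(len(sets)), k):
            if set.intersection(*(sets[i] for i in idx)):
                simplices.append(idx)
    return simplices

# three overlapping "open sets" covering a circle of six sample points:
# the nerve is a hollow triangle, recovering the loop in the circle
cover = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
print(nerve(cover))
```

All three pairwise intersections are non-empty but the triple intersection is empty, so the nerve has three vertices and three edges and no 2-simplex — a loop, just like the space being covered.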

Let’s now state the so-called nerve theorem. It says that the nerve of an open cover U = {Ui} of a topological space X is homotopy equivalent to the original space, provided every non-empty finite intersection of covering elements is contractible. Intuitively, this means that we have good reason to think of a space in an abstract way through this concept of the nerve of a cover.

So we have, in the nerve concept, a way to understand topological spaces via the incidence structure of an open cover. How then should we go about choosing open covers? One natural source would be in pulling back open covers of well-behaved spaces (like low-dimensional Euclidean spaces, even R) via well-behaved maps. We’ll introduce a little more language here to make this precise.

If f : X → R^d is a continuous map and U = {Ui} is a cover of R^d, the pull-back cover of X is the collection of open sets given by pulling back the Ui under f. The refined pullback cover is the collection of connected components of these open sets.

With this language in hand, we can now very concisely describe the mapper algorithm. Given a point cloud X and a well-chosen function f : X → R^d, we consider the nerve of the refined pull-back cover of a simple cover of R^d (say by overlapping hypercubes).
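In practice you would reach for KeplerMapper, but the algorithm is small enough to sketch end to end for a 1-dimensional lens (all names and parameter values here are mine; the refinement step uses single-linkage clustering as its stand-in for connected components):

```python
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage

def mapper(X, lens, n_intervals=5, overlap=0.3, eps=0.5):
    """Mapper with a 1-D lens: overlapping interval cover of the lens image,
    refined by single-linkage clustering, then the nerve as a graph."""
    lo, hi = lens.min(), lens.max()
    length = (hi - lo) / n_intervals
    nodes = []  # each node is one element of the refined pull-back cover
    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((lens >= a) & (lens <= b))[0]
        if len(idx) == 0:
            continue
        if len(idx) == 1:
            nodes.append({int(idx[0])})
            continue
        labels = fcluster(linkage(X[idx], method="single"),
                          t=eps, criterion="distance")
        for c in set(labels):
            nodes.append({int(j) for j in idx[labels == c]})
    # nerve of the refined cover: connect nodes that share a data point
    edges = [(i, j) for i, j in combinations(range(len(nodes)), 2)
             if nodes[i] & nodes[j]]
    return nodes, edges

# toy example: 50 points along a segment, with the x-coordinate as the lens;
# mapper recovers a path-shaped graph
X = np.column_stack([np.linspace(0, 1, 50), np.zeros(50)])
nodes, edges = mapper(X, lens=X[:, 0])
print(len(nodes), edges)
```

Each node collects nearby points with similar lens values, and consecutive nodes overlap, so the output is a path graph — exactly the global shape of the data.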

Okay — at this point we’ve got a new (topological) perspective, an intuition about how it’s different and how it works, and we’ve got a formal understanding of what’s going on underneath that intuition. All that’s needed now are some practical examples to illustrate why all of this work was worth it. Let’s attempt that now.

Applications

Goal directed recommender systems

(This work is joint with Senior Data Scientist Martha Sheridan)

At Red Ventures, we’re generally in the business of helping you make decisions and answer questions that require some forethought and research — things like: can I buy a house? can I get my health under control? or even, what car should I buy? In any case, these topics tend to have a goal state, a starting point, and a non-trivial path between them.

Standard recommender systems can effectively keep people engaged by serving up relevant content while also helping users discover new interests, but there aren’t many conceptual frameworks that allow you to naturally consider the global topology of your content portfolio. Specifically, we would like to be able to recommend “paths” of content starting with where the user is now (the article that brought them to the property) and ending at the goal state. Moreover, each step needs to be gradual (jumps aren’t so big that the reader gets lost), natural (each piece of content is semantically similar to the next), and goal-directed (we take meaningful steps toward the goal state, and we know we’re on a path that doesn’t dead-end).

To accomplish this goal, we’d really like a topological structure (even a graph structure) on our available content store, where incidence corresponds to our progress toward this goal as well as semantic similarity. This sounds like a perfect job for the mapper!

Indeed, we can achieve this by first leveraging open source text embedding tools that preserve semantic similarity to place our content within a common vector space. From there, we can develop models that estimate where along a scale from 0 to 1 a given piece of content lies in terms of progress toward the goal state (0 here could be a very early stage of research, whereas 1 would represent a state where the user is ready to take an action). Using this model as a lens, we can apply the mapper algorithm to endow our content embedding with a graph structure that shows us all of the goal-directed paths that meet our reader’s needs. Moreover, the nodes of this graph are collections of semantically similar articles, so additional personalization can be applied to further tailor the user experience.

Here’s an example of this in action in the case where a user is interested in getting a new 5G phone but doesn’t even know what the technology is yet.

The coarseness of the cover of the target space (here the unit interval) will largely determine the length of the path, the amount of personalization that can be applied to a particular user’s other interests or characteristics, and the level of similarity between articles. We use KeplerMapper to apply the mapper algorithm. The result is a flexible and powerful recommendation engine that allows you to support goal-directed behaviors.

Audience creation based on user journey shape

(This work is joint with Staff Data Scientist Florian Singhoff)

If you work on a data science team at an organization making products with a large user base, then chances are you’ve received the ask to help “understand our users better.” Your first passes might involve clustering (audiences of similar users) or creating ranges of users based on propensity models (high intent, high risk). TDA, though, can allow you to understand your users at yet another level. At Red Ventures, we see users come in with a big question, and we can (hopefully) help them solve it. I imagine these users as traveling down a (sometimes twisty) path. We used this intuition in the previous section to build an article recommendation system.

But we also see users come in and perform a more cyclical interaction style — pick a topic, research it, go down a rabbit hole or two, then come back to where they had started, research something else, rinse — repeat. I imagine these users as going around in a circle (or maybe several circles). These seem like two different types of people who would benefit from two different experiences and sets of tools. Could we learn to differentiate these “path users” from “circle users” in an automated way?

If you’re thinking this sounds like a perfect job for the first homology group, then you’ve been paying very close attention! Those with trivial first homology should be traveling a path, while those with non-trivial first homology will be traveling in loops! To make this idea work, however, we need two important pieces of technology: (1) a reliable way to place a user at any point in time in a vector space that encodes their position in their user experience and (2) a way to calculate the persistent homology of a point cloud.

Luckily, we’ve been investing in (1) for years and have developed the Generalized Live User Embedding framework (GLUE for short) used to encode our users in meaningful ways just like the text encoding tools do for articles. Also luckily, we can leverage Ripser to endow our point clouds (here we have a point cloud for each user which encodes the trace of their user journey) with the structure of a simplicial complex and calculate the persistent homology.

With this strategy, we can use the “homology of a user journey” as a meaningful feature in understanding user behavior.

Conclusion

To sum up, TDA is a powerful mental framework as well as a collection of practical tools (like Ripser and KeplerMapper) that allow you to think about your data from both a local geometric perspective and a global topological one. This can unlock new value from data (especially the new world of embedded data), and I can only imagine that the power of this perspective will grow as we continue to find novel ways to pair it with our existing toolkits. Happy topologizing!
