Topological Data Analysis — A Very Short Introduction

Varad Deshmukh
7 min readMar 27, 2019

--

Topological Data Analysis (TDA) has been a successfully applied to a range of applications in the recent years — whether it is to process and segment a digital image, gain insights into patterns formed by biological systems such as flocks of birds or a herd of buffaloes, positioning of sensor networks, or simply detect from a set of discrete data-points the underlying shape of the object they reside on. On the lines of machine learning, TDA belongs to a category of mathematical tools that aim to determine mathematical associations or patterns in data from complex systems, without claiming to understand their inner mechanisms. These tools don’t need to understand the biological functioning of birds to provide a interpretation of how they group together in a flock over time. This is a bold approach; however, the age of big data has made this a viable, and sometimes a more preferred one. The difference between TDA and general ML is that TDA is specifically concerned with the analyzing patterns or properties pertinent to the shape of the data.

Figure 1: A flock of birds forming a beautiful pattern in the sky

What is topology?

Very simplistically, topology is the study of shape of objects subject to smooth operations such as stretching and twisting. However, topology is not concerned with the exact geometric description of an object: it does not bother itself with how many edges an object has, or whether an object is round or oval in shape. Topology tries to describe an object in more general terms — how many distinct components does an object have, how many holes on its surface, or how many cavities. As an example, look at the two familiar objects below — a doughnut and a coffee cup. Very different in terms of the exact geometrical shape, but perform a few transformations on their mouldable clay models — and your doughnut becomes a coffee cup. Both of them are equivalent in a sense. They have a single visible cavity and are one complete connected object. They are the same object topologically.

Figure 2: A Coffee Cup has the same topology as a …
… a doughnut

At the cost of being more formal, topology of an object is described by a set of numbers called as the Betti numbers, each number β(k) describing the number of holes an object contains in k-dimensions; β(0) is then the number of connected components of your object, β(1) is the number of 1-dimensional holes, β(2) is the number of 2-dimensional holes, and so on.

“Mathematically, the Betti numbers of the above donut/coffee mug/torus are: β(0) = 1, β(1) = 2, β(2) = 1.”

For now, we don’t need to get into the mess of these Betti numbers. Just let it sink in that β(0) is the contiguous components in the object, β(1) is the flat total gaps in the 2D plane area of the object, and β(2) is the total 3D gaps in the object. We will discuss this in details over the course of these articles.

But then what is the topology of a real world data?

It is easy to define a topology on a well-defined surface. But what do you do with discrete data-points observed from a real-world complex system? How do you define a topology for a flock of birds or a digital image? The solution is Topological Data Analysis — using a tool called as Persistent Homology. Persistent Homology builds relationships between data-points by connecting them together through some well-defined rules. We’ll see what those rules are soon, but for now assume that we have defined connections between data in a particular fashion. What are these connections? Are they simple line segments (1-dimensional objects) between points so that we have generated a complex network from our 2-dimensional data? But then, one would argue — why stop at 0 and 1-dimensional connections? Why not define triangles, tetrahedrons and higher-dimensional connections?

Figure 3: Connections (called simplices) in various dimensions (Source: Harold Serrano)

A simplicial complex is a structure generated from such generalized connections between data-points. Figure 4 below shows a 2D projection of Lorenz attractor in its full majesty, its discrete point-data, and a simplicial complex generated from it respectively. The simplicial complex in the figure is generated from triangles, lines and points.

Figure 4: A 2D projection of the Lorenz system trajectory on the left and its point cloud data on the right
Figure 5: A simplicial complex generated from the point cloud data of Lorenz.

We can easily spot the commonality between the full dynamics in Figure 4 and the simplicial complex in Figure 5 — the presence of the two dominant holes. So what does this mean for us? Essentially a simplicial complex is able to reconstruct the topological structure of an object from discrete data. And so, generally speaking,

“Determining a simplicial complex from a point cloud data, if done right, can extract the exact number of holes (in different dimensions), as the original object”

Or equivalently,

“A simplicial complex of a discrete point cloud data can reconstruct the topology of the underlying object”

This is pretty powerful, because now we can use this tool to define a topology for systems that inherently don’t possess shape, and so, attempt to explain them better. By defining a simplicial complex over an arrangement of flock in the sky, we can make claims about the topology of this arrangement and associate it with the state of the underlying system. This is where we come back full circle — studying a system based on its geometric/topological manifestation and circumventing a detailed complex analysis that requires domain knowledge.

The Catch?

There’s a potential challenge in this rosy picture we just presented. The simplicial complex and the associated connections are based on certain rules. How do you decide which two points should be joined by a line, and which shouldn’t? Which points should be connected together by a triangle, instead of pairwise segments? We won’t go through these rules in detail, but a crucial point to be driven here is that the simplicial complex like the one shown in the above figure, are defined only for a specific range of distances. Their structure changes as you change the distance. The figure below shows how the complex changes as you change a spatial scale parameter — Ɛ.

Figure 4: Changes in the simplicial complex as a function of the distance parameter — Ɛ (Source: Persitence Homology of Embedded Time Series).

For a small Ɛ, we had a disconnected skeleton with hardly a few points connected, whereas for a large Ɛ, the connections were so dense that the two holes of the Lorenz attractor were filled up. Neither of these situations represent the attractor; its only a certain range of the spatial scale parameter that a simplicial complex more closely resembles the original attractor.

Of course, there’s ways to handle that — and eventually we will discuss filtration techniques and complexes (the rules that define the connections), and barcodes/persistence diagrams (rules that define which persistent features are important) in future articles.

TDA and ML

TDA provides a new approach of understanding patterns in your data that are associated with its shape. Its hard to decide whether machine learning would do a better job of figuring out these patterns. One advantage that TDA definitely offers is interpretability — something that ML falls short in current times. TDA defines formalized methods and tools to explicit define shape, whereas ML is restricted to making possible inferences between data and shape definitions. And its not to say that they cannot work together. Properties extracted by TDA can serve as effective features for ML algorithms, whereas the interpretation of algorithms can validate the claims by TDA about system behavior. But more on this later.

Conclusions

Topological Data Analysis (or TDA) is an exciting new tool that is being rapidly applied to a variety of complex systems by investigating their shape. Through the course of my yammering, I hope to have impressed on you that TDA doesn’t make any assumptions about the mechanism of the system while analyzing its shape.

There’s a lot of stuff I have skipped and love to go into the details of:

  1. What are the rules for creating simplicial complexes?
  2. What is a “hole” in a simplicial complex?
  3. What tools do I use for doing TDA? Should I have to write from scratch the code for finding k-dimensional holes in my data?
  4. What is the mathematics underlying these fascinating set of tools? Can I develop an intuition without going through Introduction to Topology?

Through a series of articles, I will cover these topics in detail. I am an aspiring Computer Scientist who aims to make complex systems his bread and butter for life. As I am learning the tools and mathematics and applying them in my research, I thought it would be a good idea to share them with the community as I go along. As a CS student trying to understand mathematics, I will attempt to fill these articles with as much intuition and hands-on experience as possible.

I hope you find the information useful. Until next time — where I discuss the intuition behind simplicial complexes and Betti numbers!

References and Further Coding/Reading/Watching

  1. Chad Topaz’s video on applying TDA to biological aggregation data (one of my favorites): https://www.youtube.com/watch?v=mkvdUtZ79jk
  2. Chad Topaz’s tutorial on Homology: https://drive.google.com/file/d/0B3Www1z6Tm8xblBVRFBpeWpfREk/view
  3. Persistent Homology of Embedded Time-Series Data (PHETS): https://github.com/lizbradley/PHETS
  4. A Mathematician’s perspective on Topological Data Analysis: https://rviews.rstudio.com/2018/11/14/a-mathematician-s-perspective-on-topological-data-analysis-and-r/

--

--