Topological Data Analysis w/ Scikit-tda

Aditya Yadav
Nov 8, 2019 · 6 min read

Continuing from where we left off earlier…

Lets go through some Theory (of Topology)

Persistent homology

Persistent homology is a method for computing topological features of a space at different spatial resolutions.

How do we do that?

We decide on a radius e.g. r = 0.5

And for each point in our dataset, we connect that point with lines to ALL other points which are within this radius from it.

We gradually and arbitrarily increase this radius. So that more and more points get connected to each point. To the extreme situation when the radius is so big that all points get connected to all points and our entire space of data points are covered and connected to each other.

Why do we do this in the first place?

Simply because this is the easiest way to connect arbitrary data points in arbitrary dimensional space. And once we do it, in the next step we can proceed with analysing and studying its topology i.e. shape features. Which is the entire purpose of our analysis i.e. to study the shape of our data (Aka Topology of our data).

What are Barcodes?

If we increase the radius and plot the radius along the horizontal x-axis. And for each value of the radius, you connect the points in your dataset to create a simplicial complex, you will see various features (e.g. holes) appearing and disappearing in your simplicial complex. We plot these features as bar’s (aka barcodes) along the horizontal axis. The short bars correspond to noise and the long bars correspond to features.

It is very difficult to explain on paper. Please see the video below to understand this visually. (trust me you will understand it very easily on seeing the video in just a few minutes, even though you will struggle to explain it to someone else yourself)

The barcode is computable using Linear Algebra in O(N³) complexity. [Remember this. This is the only reason why Topology and Topological Analysis is computationally tractable as it doesn’t take exponential compute)

Simplicial complex

In mathematics, a simplicial complex is a set composed of points, line segments, triangles, and their n-dimensional counterparts

The basic idea — Connect nearby points and create a Simplicial Complex.

Algebraic topology

Official Definition — Algebraic topology is the study of intrinsic qualitative aspects of spatial objects (e.g., surfaces, spheres, tori, circles, knots, links, configuration spaces, etc.) that remain invariant under both-directions continuous one-to-one (homeomorphic) transformations.

In layman’s english, we would like to create algebraic representations of shape properties (e.g. holes) which do not change if we stretch or twist the shapes. Without cutting (with a scissor) or pasting (with a glue) the shape in any way.

Algebraic topology offers a mature set of tools for (i) counting and (ii) collating holes. Examples below…

  • Euler Characteristic
  • Betti Numbers

What are they? lets see below.

Euler Characteristic (Algebraic Topology)

In algebraic topology and polyhedral combinatorics, the Euler characteristic (or Euler number, or Euler–Poincaré characteristic) is a topological invariant, a number that describes a topological space’s shape or structure regardless of the way it is bent.

Betti Numbers (Algebraic Topology)

Betti numbers are topological objects which were proved to be invariants by Poincaré, and used by him to extend the polyhedral formula to higher dimensional spaces. Informally, the Betti number is the maximum number of cuts that can be made without dividing a surface into two separate pieces (Gardner 1984, pp. 9–10).

The first betti number ‘b0’ counts the number of 1 dimensional holes.

The second betti number ‘b1’ counts the number of 2 dimensional holes.

The third betti number ‘b2’ counts the number of 3 dimensional holes.

3 Properties of Topological Analysis

  • Coordinate invariance
  • Deformation invariance
  • Compressed Representations — of shapes

Enter Scikit-tda…

We assume you have python/conda installed.

You can now install scikit-tda by typing

pip install scikit-tdaconda install jupyter notebook

if you get an error and ripser fails to ‘build’ and install… don’t sweat… try doing this… (convoluted but works)

conda update -q condaconda install libpython m2w64-toolchain cython numpyecho [build] > C:\Anaconda3\Lib\distutils\distutils.cfgecho compiler = mingw32 >> C:\Anaconda3\Lib\distutils\distutils.cfgpip install ripserpip install scikit-tdaconda install jupyter notebook

Hello World

import numpy as np
from ripser import ripser
from persim import plot_diagrams
data = np.random.random((100,2))
diagrams = ripser(data)['dgms']
plot_diagrams(diagrams, show=True)

Now lets do two things…

  • compute persistence homology of data sets,
  • visualize persistence diagrams
[1]:%load_ext autoreload
%autoreload 2
[2]:from ripser import ripser
from persim import plot_diagrams
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
[3]:data = datasets.make_circles(n_samples=100)[0] + 5 * datasets.make_circles(n_samples=100)[0]

Default args make it very easy.

Generate diagrams for H0 and H1

Plot both diagrams

[4]:dgms = ripser(data)['dgms']
plot_diagrams(dgms, show=True)
[5]:# Plot each diagram by itself
plot_diagrams(dgms, plot_only=[0], ax=plt.subplot(121))
plot_diagrams(dgms, plot_only=[1], ax=plt.subplot(122))

Homology over any prime basis

We can compute homology over any p≥2 by supplying the argument coeff=p.

[6]:# Homology over Z/3Z
dgms = ripser(data, coeff=3)['dgms']
plot_diagrams(dgms, plot_only=[1], title="Homology of Z/3Z", show=True)
[7]:# Homology over Z/7Z
dgms = ripser(data, coeff=3)['dgms']
plot_diagrams(dgms, plot_only=[1], title="Homology of Z/7Z", show=True) # Only plot H_1

Specify which homology classes to compute

We can compute any order of homology, H0,H1,H2,… By default, we only compute H0 and H1. You can specify a larger order by supplying the argument maxdim=p. It practice, anything above H1 is very slow.

[8]:dgms = ripser(data, maxdim=2)['dgms']
plot_diagrams(dgms, show=True)

Specify maximum radius for Rips filtration

We can restrict the maximum radius of the VR complex by supplying the argument thresh=r. Certain classes will not be born if their birth time is under the threshold, and other classes will have infinite death times if their ordinary death time is above the threshold

[9]:dgms = ripser(data, thresh=0.2)['dgms']
plot_diagrams(dgms, show=True)
[10]:dgms = ripser(data, thresh=1)['dgms']
plot_diagrams(dgms, show=True)
[11]:dgms = ripser(data, thresh=2)['dgms']
plot_diagrams(dgms, show=True)
[12]:dgms = ripser(data, thresh=999)['dgms']
plot_diagrams(dgms, show=True)

Plotting Options

[13]:# This looks weird, but it's possible
plot_diagrams(dgms, xy_range=[-2,10,-1,20])


See all available color maps with

import matplotlib as mpl
[14]:plot_diagrams(dgms, colormap='seaborn')

Plot lifetime of generators

[15]:plot_diagrams(dgms, lifetime=True)

Source Code

Download ALL the jupyter notebooks in one .zip file from here …

The examples above are in the notebook called ‘Basic Usage.ipynb’

Continue Learning…

About us

This is our website


The Meaning of Life, Universe & Everything

Aditya Yadav

Written by

Millennium Inventor



The Meaning of Life, Universe & Everything

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade