Topological Data Analysis w/ Scikit-tda
Continuing from where we left off earlier…
Lets go through some Theory (of Topology)

Persistent homology
Persistent homology is a method for computing topological features of a space at different spatial resolutions.
How do we do that?
We decide on a radius e.g. r = 0.5
And for each point in our dataset, we connect that point with lines to ALL other points which are within this radius from it.
We gradually and arbitrarily increase this radius. So that more and more points get connected to each point. To the extreme situation when the radius is so big that all points get connected to all points and our entire space of data points are covered and connected to each other.
Why do we do this in the first place?
Simply because this is the easiest way to connect arbitrary data points in arbitrary dimensional space. And once we do it, in the next step we can proceed with analysing and studying its topology i.e. shape features. Which is the entire purpose of our analysis i.e. to study the shape of our data (Aka Topology of our data).
What are Barcodes?
If we increase the radius and plot the radius along the horizontal x-axis. And for each value of the radius, you connect the points in your dataset to create a simplicial complex, you will see various features (e.g. holes) appearing and disappearing in your simplicial complex. We plot these features as bar’s (aka barcodes) along the horizontal axis. The short bars correspond to noise and the long bars correspond to features.
It is very difficult to explain on paper. Please see the video below to understand this visually. (trust me you will understand it very easily on seeing the video in just a few minutes, even though you will struggle to explain it to someone else yourself)
The barcode is computable using Linear Algebra in O(N³) complexity. [Remember this. This is the only reason why Topology and Topological Analysis is computationally tractable as it doesn’t take exponential compute)

Simplicial complex
In mathematics, a simplicial complex is a set composed of points, line segments, triangles, and their n-dimensional counterparts
The basic idea — Connect nearby points and create a Simplicial Complex.
Algebraic topology
Official Definition — Algebraic topology is the study of intrinsic qualitative aspects of spatial objects (e.g., surfaces, spheres, tori, circles, knots, links, configuration spaces, etc.) that remain invariant under both-directions continuous one-to-one (homeomorphic) transformations.
In layman’s english, we would like to create algebraic representations of shape properties (e.g. holes) which do not change if we stretch or twist the shapes. Without cutting (with a scissor) or pasting (with a glue) the shape in any way.
Algebraic topology offers a mature set of tools for (i) counting and (ii) collating holes. Examples below…
- Euler Characteristic
- Betti Numbers
What are they? lets see below.
Euler Characteristic (Algebraic Topology)
In algebraic topology and polyhedral combinatorics, the Euler characteristic (or Euler number, or Euler–Poincaré characteristic) is a topological invariant, a number that describes a topological space’s shape or structure regardless of the way it is bent.

Betti Numbers (Algebraic Topology)
Betti numbers are topological objects which were proved to be invariants by Poincaré, and used by him to extend the polyhedral formula to higher dimensional spaces. Informally, the Betti number is the maximum number of cuts that can be made without dividing a surface into two separate pieces (Gardner 1984, pp. 9–10).
The first betti number ‘b0’ counts the number of 1 dimensional holes.
The second betti number ‘b1’ counts the number of 2 dimensional holes.
The third betti number ‘b2’ counts the number of 3 dimensional holes.
…
3 Properties of Topological Analysis
- Coordinate invariance
- Deformation invariance
- Compressed Representations — of shapes

Enter Scikit-tda…
We assume you have python/conda installed.
You can now install scikit-tda by typing
pip install scikit-tdaconda install jupyter notebook
if you get an error and ripser fails to ‘build’ and install… don’t sweat… try doing this… (convoluted but works)
conda update -q condaconda install libpython m2w64-toolchain cython numpyecho [build] > C:\Anaconda3\Lib\distutils\distutils.cfgecho compiler = mingw32 >> C:\Anaconda3\Lib\distutils\distutils.cfgpip install ripserpip install scikit-tdaconda install jupyter notebook
Hello World
import numpy as np
from ripser import ripser
from persim import plot_diagramsdata = np.random.random((100,2))
diagrams = ripser(data)['dgms']
plot_diagrams(diagrams, show=True)
Now lets do two things…
- compute persistence homology of data sets,
- visualize persistence diagrams
[1]:%load_ext autoreload
%autoreload 2[2]:from ripser import ripser
from persim import plot_diagrams
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets[3]:data = datasets.make_circles(n_samples=100)[0] + 5 * datasets.make_circles(n_samples=100)[0]
Default args make it very easy.
Generate diagrams for H0 and H1
Plot both diagrams
[4]:dgms = ripser(data)['dgms']
plot_diagrams(dgms, show=True)

[5]:# Plot each diagram by itself
plot_diagrams(dgms, plot_only=[0], ax=plt.subplot(121))
plot_diagrams(dgms, plot_only=[1], ax=plt.subplot(122))

Homology over any prime basis
We can compute homology over any p≥2 by supplying the argument coeff=p.
[6]:# Homology over Z/3Z
dgms = ripser(data, coeff=3)['dgms']
plot_diagrams(dgms, plot_only=[1], title="Homology of Z/3Z", show=True)

[7]:# Homology over Z/7Z
dgms = ripser(data, coeff=3)['dgms']
plot_diagrams(dgms, plot_only=[1], title="Homology of Z/7Z", show=True) # Only plot H_1

Specify which homology classes to compute
We can compute any order of homology, H0,H1,H2,… By default, we only compute H0 and H1. You can specify a larger order by supplying the argument maxdim=p. It practice, anything above H1 is very slow.
[8]:dgms = ripser(data, maxdim=2)['dgms']
plot_diagrams(dgms, show=True)

Specify maximum radius for Rips filtration
We can restrict the maximum radius of the VR complex by supplying the argument thresh=r. Certain classes will not be born if their birth time is under the threshold, and other classes will have infinite death times if their ordinary death time is above the threshold
[9]:dgms = ripser(data, thresh=0.2)['dgms']
plot_diagrams(dgms, show=True)

[10]:dgms = ripser(data, thresh=1)['dgms']
plot_diagrams(dgms, show=True)

[11]:dgms = ripser(data, thresh=2)['dgms']
plot_diagrams(dgms, show=True)

[12]:dgms = ripser(data, thresh=999)['dgms']
plot_diagrams(dgms, show=True)

Plotting Options
[13]:# This looks weird, but it's possible
plot_diagrams(dgms, xy_range=[-2,10,-1,20])

Colormaps
See all available color maps with
import matplotlib as mpl
print(mpl.styles.available)[14]:plot_diagrams(dgms, colormap='seaborn')

Plot lifetime of generators
[15]:plot_diagrams(dgms, lifetime=True)

Source Code
Download ALL the jupyter notebooks in one .zip file from here … http://bit.ly/36JJ20h
The examples above are in the notebook called ‘Basic Usage.ipynb’
Continue Learning…
About us
This is our website http://automatski.com

