# Topological Data Analysis w/ Scikit-tda

Continuing from where we left off earlier…

# Persistent homology

Persistent homology is a method for computing topological features of a space at different spatial resolutions.

How do we do that?

We decide on a radius e.g. r = 0.5

And for each point in our dataset, we connect that point with lines to ALL other points which are within this radius from it.

We gradually and arbitrarily increase this radius. So that more and more points get connected to each point. To the extreme situation when the radius is so big that all points get connected to all points and our entire space of data points are covered and connected to each other.

Why do we do this in the first place?

Simply because this is the easiest way to connect arbitrary data points in arbitrary dimensional space. And once we do it, in the next step we can proceed with analysing and studying its topology i.e. shape features. Which is the entire purpose of our analysis i.e. to study the shape of our data (Aka Topology of our data).

# What are Barcodes?

If we increase the radius and plot the radius along the horizontal x-axis. And for each value of the radius, you connect the points in your dataset to create a simplicial complex, you will see various features (e.g. holes) appearing and disappearing in your simplicial complex. We plot these features as bar’s (aka barcodes) along the horizontal axis. The short bars correspond to noise and the long bars correspond to features.

It is very difficult to explain on paper. Please see the video below to understand this visually. (trust me you will understand it very easily on seeing the video in just a few minutes, even though you will struggle to explain it to someone else yourself)

The barcode is computable using Linear Algebra in O(N³) complexity. [Remember this. This is the only reason why Topology and Topological Analysis is computationally tractable as it doesn’t take exponential compute)

# Simplicial complex

In mathematics, a simplicial complex is a set composed of points, line segments, triangles, and their n-dimensional counterparts

The basic idea — Connect nearby points and create a Simplicial Complex.

# Algebraic topology

Official Definition — Algebraic topology is the study of intrinsic qualitative aspects of spatial objects (e.g., surfaces, spheres, tori, circles, knots, links, configuration spaces, etc.) that remain invariant under both-directions continuous one-to-one (homeomorphic) transformations.

In layman’s english, we would like to create algebraic representations of shape properties (e.g. holes) which do not change if we stretch or twist the shapes. Without cutting (with a scissor) or pasting (with a glue) the shape in any way.

Algebraic topology offers a mature set of tools for (i) counting and (ii) collating holes. Examples below…

• Euler Characteristic
• Betti Numbers

What are they? lets see below.

# Euler Characteristic (Algebraic Topology)

In algebraic topology and polyhedral combinatorics, the Euler characteristic (or Euler number, or Euler–Poincaré characteristic) is a topological invariant, a number that describes a topological space’s shape or structure regardless of the way it is bent.

# Betti Numbers (Algebraic Topology)

Betti numbers are topological objects which were proved to be invariants by Poincaré, and used by him to extend the polyhedral formula to higher dimensional spaces. Informally, the Betti number is the maximum number of cuts that can be made without dividing a surface into two separate pieces (Gardner 1984, pp. 9–10).

The first betti number ‘b0’ counts the number of 1 dimensional holes.

The second betti number ‘b1’ counts the number of 2 dimensional holes.

The third betti number ‘b2’ counts the number of 3 dimensional holes.

# 3 Properties of Topological Analysis

• Coordinate invariance
• Deformation invariance
• Compressed Representations — of shapes

# Enter Scikit-tda…

We assume you have python/conda installed.

You can now install scikit-tda by typing

`pip install scikit-tdaconda install jupyter notebook`

if you get an error and ripser fails to ‘build’ and install… don’t sweat… try doing this… (convoluted but works)

`conda update -q condaconda install libpython m2w64-toolchain cython numpyecho [build] > C:\Anaconda3\Lib\distutils\distutils.cfgecho compiler = mingw32 >> C:\Anaconda3\Lib\distutils\distutils.cfgpip install ripserpip install scikit-tdaconda install jupyter notebook`

# Hello World

`import numpy as npfrom ripser import ripserfrom persim import plot_diagramsdata = np.random.random((100,2))diagrams = ripser(data)['dgms']plot_diagrams(diagrams, show=True)`

# Now lets do two things…

• compute persistence homology of data sets,
• visualize persistence diagrams
`:%load_ext autoreload%autoreload 2:from ripser import ripserfrom persim import plot_diagramsimport matplotlib.pyplot as pltimport numpy as npfrom sklearn import datasets:data = datasets.make_circles(n_samples=100) + 5 * datasets.make_circles(n_samples=100)`

# Default args make it very easy.

Generate diagrams for H0 and H1

Plot both diagrams

`:dgms = ripser(data)['dgms']plot_diagrams(dgms, show=True)`
`:# Plot each diagram by itselfplot_diagrams(dgms, plot_only=, ax=plt.subplot(121))plot_diagrams(dgms, plot_only=, ax=plt.subplot(122))`

# Homology over any prime basis

We can compute homology over any p≥2 by supplying the argument `coeff=p`.

`:# Homology over Z/3Zdgms = ripser(data, coeff=3)['dgms']plot_diagrams(dgms, plot_only=, title="Homology of Z/3Z", show=True)`
`:# Homology over Z/7Zdgms = ripser(data, coeff=3)['dgms']plot_diagrams(dgms, plot_only=, title="Homology of Z/7Z", show=True) # Only plot H_1`

# Specify which homology classes to compute

We can compute any order of homology, H0,H1,H2,… By default, we only compute H0 and H1. You can specify a larger order by supplying the argument `maxdim=p`. It practice, anything above H1 is very slow.

`:dgms = ripser(data, maxdim=2)['dgms']plot_diagrams(dgms, show=True)`

# Specify maximum radius for Rips filtration

We can restrict the maximum radius of the VR complex by supplying the argument `thresh=r`. Certain classes will not be born if their birth time is under the threshold, and other classes will have infinite death times if their ordinary death time is above the threshold

`:dgms = ripser(data, thresh=0.2)['dgms']plot_diagrams(dgms, show=True)`
`:dgms = ripser(data, thresh=1)['dgms']plot_diagrams(dgms, show=True)`
`:dgms = ripser(data, thresh=2)['dgms']plot_diagrams(dgms, show=True)`
`:dgms = ripser(data, thresh=999)['dgms']plot_diagrams(dgms, show=True)`

# Plotting Options

`:# This looks weird, but it's possibleplot_diagrams(dgms, xy_range=[-2,10,-1,20])`

# Colormaps

See all available color maps with

`import matplotlib as mplprint(mpl.styles.available):plot_diagrams(dgms, colormap='seaborn')`

`:plot_diagrams(dgms, lifetime=True)`

# Source Code

Download ALL the jupyter notebooks in one .zip file from here … http://bit.ly/36JJ20h

The examples above are in the notebook called ‘Basic Usage.ipynb’

# Continue Learning…

This is our website http://automatski.com

Written by