Pollen analysis of the thorny scrub in Python

Telling the story of a pollen

Elaine Grenot
MCD-UNISON
5 min read · Jun 2, 2023


Pollen analysis is often complex, especially when you want to analyze its temporal component and, at the same time, its magnitude. If you are one of those who end up with a headache after reading a pollen diagram, or you simply want to draw conclusions at a glance, what I propose here may be useful.

In a few lines of Python code we will plot our own mini pollen diagram, using different pollens of the thorn scrub as a study case. We will use the libraries plotly.express and matplotlib.pyplot to plot; pandas and numpy to load and transform the data; and sklearn to finally tell the pollen story, using PCA as a dimensionality reduction method and KMeans to perform clustering.

The Data and its Characters

As I mentioned before, the data that we will use corresponds only to pollen of the thorny scrub type. Each record corresponds to the magnitude vector of the different pollen types for a given depth (or time) level.
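To fix ideas, here is a minimal sketch of the layout the code below expects: one Depth column plus one magnitude column per pollen type. The taxon names other than Amaranthaceae (which appears later in the article) are placeholders, and the numbers are made up.

import pandas as pd

# Hypothetical layout: a 'Depth' column plus one magnitude column per pollen type
# (taxon names other than Amaranthaceae are placeholders; values are invented)
df_example = pd.DataFrame({
    "Depth": [1, 15, 30, 45],
    "Amaranthaceae": [35.2, 28.1, 12.4, 5.0],
    "Acacia": [4.1, 6.3, 2.2, 0.8],
    "Prosopis": [10.5, 9.7, 3.1, 1.2],
})
print(df_example)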

# Importing libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

plt.style.use('ggplot')

In our case the data is in xlsx format, and for that reason we will use read_excel from pandas.

# Loading data
df = pd.read_excel("./data_polen/data.xlsx")

Now let’s normalize the data.

# Indexing data according to depth
df_indexed = df.set_index('Depth')

# Normalizing data
scaler = StandardScaler()
scaler.fit(df_indexed)
df_scaled = scaler.transform(df_indexed)

dataframe_scaled = pd.DataFrame(data=df_scaled,
                                index=df_indexed.index,
                                columns=df_indexed.columns)
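
A quick optional check that the scaling did what we expect: each taxon column should now have a mean close to 0 and a standard deviation close to 1.

# Optional sanity check: after StandardScaler every taxon column should have
# mean ~0 and (population) standard deviation ~1
print(dataframe_scaled.mean().round(2).head())
print(dataframe_scaled.std(ddof=0).round(2).head())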

Building the Autocovariance Matrix

One of the challenges in pollen analysis is to include the temporal factor when clustering to determine pollen zones. A valid way may be to use the covariance between a taxon vector (pollen column) and itself with a lag:

Cov(X, Xₖ) = (1/(n − k)) · Σ (xᵢ − x̄)(xᵢ₋ₖ − x̄ₖ), summing over the n − k overlapping pairs.

Where:

  • 𝑋 is the taxon vector.
  • Xₖ is the vector of the taxon k times out of phase.
  • xᵢ and xᵢ₋ₖ are the corresponding elements of the taxon and the lag taxon, respectively.
  • x̄ and x̄ₖ are the means of the elements of the taxon and of the taxon with lag k, respectively.
  • 𝑛 is the number of elements of the taxon vector.
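
To make the formula concrete, here is a tiny toy example (the numbers are made up) that mirrors the slicing used later in the temporal_covariance function:

import numpy as np

# Toy taxon vector (made-up values) and a lag of k = 1
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
k = 1

# Pair each x_i with x_{i-k}: drop the last k values of one copy
# and the first k values of the other
x_a = x[:-k]   # x_1 ... x_{n-k}
x_b = x[k:]    # x_{k+1} ... x_n

# Covariance between the series and its lagged copy
# (np.cov divides by n-1, a small difference from the 1/(n-k) in the formula above)
print(np.cov(x_a, x_b)[0, 1])  # ~6.67 for this toy vector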

We can then build an autocovariance matrix where each column corresponds to a taxon (pollen type) and each row corresponds to the lag 𝑘 at the corresponding depth level:

  • Each element of the matrix is the covariance of taxon 𝑖 with lag 𝑗 (depth of index 𝑗)
  • Each element 𝑗 of column 𝑖 can be interpreted as the covariance of taxon 𝑖 at depth 𝑗
  • Each row 𝑗 can be interpreted as the covariance of the different taxa at depth 𝑗

The following code builds the autocovariance matrix.

def temporal_covariance(x, y, lag):
    '''Calculates the covariance between two time series given a lag'''
    n = len(x)
    # Creating two new arrays with the lag
    x_shifted = x[:n-lag]
    y_shifted = y[lag:]
    # Calculate the covariance
    covariance = np.cov(x_shifted, y_shifted)[0, 1]

    return covariance

# Constructing the autocovariance matrix
feature_mtx = dataframe_scaled.values

leng_deep = feature_mtx.shape[0]  # number of depth levels (rows)
leng_tax = feature_mtx.shape[1]   # number of taxa (columns)
AutoCorr = np.zeros([leng_deep-2, leng_tax])
for i in range(leng_tax):
    for j in range(leng_deep-2):
        # Covariance of taxon column i with itself, lagged by j+1 depth levels
        AutoCorr[j, i] = temporal_covariance(feature_mtx[:, i], feature_mtx[:, i], j+1)
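
Before moving on, it is worth checking that the matrix has the intended orientation: one row per lag (two fewer than the number of depth levels) and one column per taxon.

# Quick shape check: rows = lags, columns = taxa
print(feature_mtx.shape)  # (number of depth levels, number of taxa)
print(AutoCorr.shape)     # (number of depth levels - 2, number of taxa)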

Now we have our data ready to go on the scene!

df_AutoCorr = pd.DataFrame(data=AutoCorr,
                           index=dataframe_scaled.index[0:109],
                           columns=dataframe_scaled.columns)

The Scene

Our data now has 41 columns, one per pollen type. But what if we reduce the dimension of the dataframe df_AutoCorr and obtain one with far fewer columns that still represents the greatest possible variability of our data? PCA is made just for that; let’s see how to code it.

pca = PCA(n_components=16)

pca.fit_transform(AutoCorr)
prop_var = pca.explained_variance_ratio_
eigenvalues = pca.explained_variance_

PCA_numbers = np.arange(pca.n_components_) + 1

plt.plot(PCA_numbers,
         prop_var.cumsum(),
         'o-')
plt.ylabel('Cumulative variance ratio', fontsize=8)
plt.show()

In the previous plot we can see that with 5 components we capture close to 60% of the variability of our data. That’s great!

components = 5
pca = PCA(n_components=components)
scores_pca = pca.fit_transform(AutoCorr)
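
A quick check that the five retained components keep roughly the variability suggested by the scree plot:

# Total variability captured by the 5 retained components (close to 0.6 per the plot above)
print(pca.explained_variance_ratio_.sum())
print(scores_pca.shape)  # one row per lag, one column per component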

Our next step is clustering the depth levels in order to see similarities and changes between them. There are several clustering methods, and it is your job to see which one is best for your data and your problem.

I will use the well-known KMeans method. Please note below that I chose 4 as the number of clusters, but you can change it according to your data.
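If you want a data-driven hint for that number, a common option is the elbow method: fit KMeans for several values of k on the PCA scores and look for the bend in the inertia curve. This is only a sketch; the choice of 4 below is the one used for the rest of the article.

# Elbow method: plot the within-cluster sum of squares (inertia) for several k
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=20, random_state=123)
    km.fit(scores_pca)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, 'o-')
plt.xlabel('Number of clusters', fontsize=8)
plt.ylabel('Inertia', fontsize=8)
plt.show()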

n_clusters = 4
kmeans_pca = KMeans(n_clusters = n_clusters, n_init = 20, random_state = 123)
kmeans_pca.fit(scores_pca)

The Story

We only have to build our mini pollen diagram and see what story it is telling us. For this we are going to make a scatter plot where the size will be the magnitude of our original data and the color will be determined by the cluster.

Let’s see the code.

# Concatenating the original data with the principal component scores and the clusters
df_pca_kmeans = pd.concat([df[:109].reset_index(drop=True),
                           pd.DataFrame(scores_pca)], axis=1)
df_pca_kmeans.columns.values[-5:] = ['Component_1', 'Component_2', 'Component_3',
                                     'Component_4', 'Component_5']
df_pca_kmeans['Cluster'] = kmeans_pca.labels_

# Graphing
df_pca_kmeans_ = df_pca_kmeans.melt(id_vars=['Depth', 'Cluster',
                                             'Component_1', 'Component_2', 'Component_3',
                                             'Component_4', 'Component_5'])

fig = px.scatter(df_pca_kmeans_, x="Depth", y="variable",
                 size="value", color="Cluster",
                 hover_name="variable")
fig.update_layout(xaxis_title="Depth", yaxis_title="")
fig.show()

Pretty amazing, don’t you think?

In this plot we can see an overview of the thorny scrub. For example, there was a drastic change at depth 75 (purple to blue). At depth 171 (yellow), something happened, but it wasn’t drastic enough for the next ones to cluster the same. Something similar happens with depth 216 (purple) and 276 (orange). Also, level 216 is more similar to the first group (from 1 to 75).

Another thing we can look at is the magnitudes. There is a greater quantity of Amaranthaceae pollen, and after depth 270 all the magnitudes are very small.

What else is the plot telling you?

Note: For reference, you can find all code here.

Thanks For Reading
