Pollen analysis of the thorny scrub in python
Telling the story of a pollen
Pollen analysis is often complex, especially when you want to analyze its temporal component and at the same time, its magnitude. If you are one of those who end up with a headache after analyzing a pollen diagram or you want just to draw conclusions at a quick view, what I propose here is may be useful.
In a few lines of python code we will plot our own mini pollen diagram using as a study case different pollens of the thorn scrub. We will use the libraries plotly.express and matplotlib.pyplot to plot; pandas and numpy to load and transform the data; and sklearn to finally make the pollen story, using PCA as a dimensionality reduction method and KMeans to perform clustering.
The Data and its Characters
As I mentioned before, the data that we will use corresponds only to pollen of the thorny scrub type. Each record corresponds to the magnitude vector of the different pollen for a level of depth or time.
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
plt.style.use('ggplot')
In our case the data is in xlsx format, and for that reason we will use read_excel
from pandas.
# Loading data
df = pd.read_excel("./data_polen/data.xlsx")
Now let’s going to normalize.
# Indexing data according to depth
df_indexed = df.set_index('Depth')
# Normalizing data
scaler = StandardScaler()
scaler.fit(df_indexed)
df_scaled = scaler.transform(df_indexed)
dataframe_scaled = pd.DataFrame(data=df_scaled,
index = df_indexed.index,
columns=df_indexed.columns)
Performance Autocovariance Matrix
One of the challenges in pollen analysis is to include the temporal factor when making clustering to determine pollen zones. A valid way may be to use the covariance between a taxon vector (pollen column) with itself but with a lag.
Where:
- 𝑋 is the taxon vector.
- Xₖ is the vector of the taxon k times out of phase.
- xᵢ and xᵢ₋ₖ are the corresponding elements of the taxon and the lag taxon, respectively.
- x̄ y x̄ₖ are the means of the elements of the taxon and the taxon with lag k.
- 𝑛 is the number of elements of the taxon vector.
We can then build an autocovariance matrix where each column corresponds to the taxon or pollen type and each row corresponds to the lag at depth level 𝑘
- Each element aᵢⱼ of the matrix corresponds to the covariance between taxon 𝑖 with lag 𝑗 (depth of index 𝑗)
- Each element 𝑗 of column 𝑖 can be interpreted as the covariance of taxon 𝑖 at depth 𝑗
- Each row 𝑗 can be interpreted as the covariance of the different taxons at depth j
In the following, some codes to perform the autocovariance matrix
def temporal_covariance(x, y, lag):
'''Calculates the covariance between two time series given a lag'''
n = len(x)
# Creating two new arrays with the lag
x_shifted = x[:n-lag]
y_shifted = y[lag:]
# Calculate the covariance
covariance = np.cov(x_shifted, y_shifted)[0,1]
return covariance
# Constructing autocovariance matrix
feature_mtx = dataframe_scaled.values
leng_deep = feature_mtx.shape[1]
leng_tax = feature_mtx.shape[0]
AutoCorr = np.zeros([leng_deep-2,leng_tax])
for i in range(leng_tax):
for j in range(leng_deep-2):
AutoCorr[j,i] = temporal_covariance(feature_mtx[i], feature_mtx[i], j+1)
Now we have our data ready to go on the scene!
df_AutoCorr= pd.DataFrame(data = AutoCorr,
index = df_scaled.columns[0:109],
columns = df_scaled.index.get_level_values(0))
The Scene
Our data has now 41 columns that correspond to each pollen, but what if we reduce the dimension of the dataframedf_AutoCorr
and obtain one with much fewer columns that represents the greatest possible variability of our data. PCA is just made for that, let’s see how to code it.
pca = PCA(n_components=16)
pca.fit_transform(AutoCorr)
prop_var = pca.explained_variance_ratio_
eigenvalues = pca.explained_variance_
PCA_numbers = np.arange(pca.n_components_) + 1
plt.plot(PCA_numbers,
prop_var.cumsum(),
'o-')
plt.ylabel('Cumulative variance ratio', fontsize=8)
plt.show()
In the previous plot we can see how with 5 components we capture close to 60% of the variability of our data. That’s great!
components = 5
pca = PCA(n_components=components)
scores_pca = pca.fit_transform(AutoCorr)
Our next step is clustering the depth levels in order to see similarities and changes between them. There are several types of cluster methods, and your job it is to see which one is the best for your data and your problem.
I will use the well known KMeans method. Please note below that I chose 4 as the cluster number, but you can change it according to your data.
n_clusters = 4
kmeans_pca = KMeans(n_clusters = n_clusters, n_init = 20, random_state = 123)
kmeans_pca.fit(scores_pca)
The Story
We only have to build our mini pollen diagram and see what story is telling us. For this we are going to make a scatter plot where the size will be the magnitude of our original data and the color will be determined by the cluster.
Let’s see the code.
# Concatenated original data with the values of principal components and clusters
df_pca_kmeans = pd.concat([df[:109].reset_index(),
pd.DataFrame(scores_pca)], axis=1)
df_pca_kmeans.columns.values[-5: ] = ['Component_1', 'Component_2', 'Component_3',
'Component_4', 'Component_5']
df_pca_kmeans['Cluster'] = kmeans_pca.labels_
# Graphing
df_pca_kmeans_ = df_pca_kmeans.melt(id_vars=['Depth', 'Cluster',
'Component_1', 'Component_2', 'Component_3',
'Component_4', 'Component_5'])
fig = px.scatter(df_pca_kmeans_, x="Depth", y="variable",
size="value", color="Cluster",
hover_name="variable")
fig.update_layout(xaxis_title="Depth", yaxis_title="")
fig.show()
Pretty amazing, don’t you think?
In this plot we can see an overview of the thorny scrub. For example, there was a drastic change at depth 75 (purple to blue). At depth 171 (yellow), something happened, but it wasn’t drastic enough for the next ones to cluster the same. Something similar happens with depth 216 (purple) and 276 (orange). Also, level 216 is more similar to the first group (from 1 to 75).
Another thing we can look at are the magnitudes. There is a greater quantity for the Amaranthaceae pollen and after depth 270 all the magnitudes are very small.
What else does the plot is telling you?
Note: For reference, you can find all code here.
Thanks For Reading