Principal Component Analysis on Time Series Data

Ross Brancati
9 min read · Jan 13, 2023


Image created by the author.

Introduction

Time series data analysis involves analyzing a sequence of data points collected over an interval of time. Time-series data could be a physiological signal such as heart rate, weather trends, stock prices, or movement patterns. There are many ways to analyze these types of data. In human movement science applications, discrete values are typically selected from time series signals; however, this process disregards a large portion of the data, which could still be significant for detecting differences in patterns between signals. Principal component analysis (PCA) is an alternative method that reduces the dimensionality of the signal while retaining most of its information. In short, this method can automatically detect the components of the signal with the most variation and the regions where the largest differences between groups occur. This is particularly useful for differentiating movement patterns, such as those associated with a certain pathology, in human movement science. In this article, I will review some basics of PCA and show how it can be applied to time-series signals, specifically with a joint angle signal example.

Some basic math behind PCA

Let’s start with our data matrix of n observations and v variables (i.e., a matrix with dimensions [n x v], where each variable represents a timepoint across the time-series signal). Mathematically, the variables X = x₁, x₂, …, xᵥ are transformed to v new, uncorrelated variables Z = z₁, z₂, …, zᵥ, where z₁ represents the direction of the highest amount of variation in the data. The second principal component, z₂, represents the direction of the second-highest amount of variation in the data and is orthogonal to the first principal component. Z = UᵗX, where U is a matrix of the eigenvectors of the covariance matrix of X. That is, the principal components, Z, can be modeled with the eigenvectors and the original data. I won’t go into eigendecomposition, but here is a link to a video that explains eigenvectors and eigenvalues: https://youtu.be/PFDu9oVAE-g. With this decomposition, we get v eigenvalues, where each eigenvalue quantifies how much of the total variation in the data its respective eigenvector explains. Using these eigenvalues, we can compute the ratio of the total variation explained by each principal component (more to come on this later). Additionally, we can reconstruct the original data from the eigenvectors and the principal components by rearranging the equation above: X = UZ.
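To make the notation concrete, here is a minimal NumPy sketch of this decomposition. The data matrix here is random and purely illustrative; in practice, each row would be one time-normalized signal.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 101))  #n = 50 observations, v = 101 timepoints

X_centered = X - X.mean(axis=0)  #center each variable (timepoint)
C = np.cov(X_centered, rowvar=False)  #[v x v] covariance matrix

eigenvalues, U = np.linalg.eigh(C)  #eigendecomposition of the covariance matrix
order = np.argsort(eigenvalues)[::-1]  #sort components from most to least variance
eigenvalues, U = eigenvalues[order], U[:, order]

Z = X_centered @ U  #principal component scores (Z = UᵗX, with observations as rows)
explained_ratio = eigenvalues / eigenvalues.sum()  #variance explained per component

X_reconstructed = Z @ U.T + X.mean(axis=0)  #X = UZ, plus the mean we removed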

PCA generates some important information that we need for fitting statistical or machine learning models and for interpretation purposes — the principal component scores and the eigenvectors. In short, the principal component scores represent the relationship between a data point and its respective principal component. Since these scores are discrete values, we can run statistics on them or use them as training/testing data for machine learning models. The eigenvectors represent the important patterns of the time-series data, which help us to interpret the meaning of each principal component. PCA interpretation is a challenging and somewhat subjective task, but it is feasible using the eigenvectors. More to come later regarding principal component scores and eigenvectors. There are a ton of resources online that take a deeper dive into computing the covariance matrix, eigenvectors and eigenvalues, and other mathematical concepts behind PCA.

Background on human movement time series data

Biomechanists use several variables to explore patterns of human movement. Some examples of these variables are joint angles (kinematics), joint moments/loadings (kinetics), muscle activity data (electromyography), angular velocities, and linear accelerations. Optical motion capture is the gold standard for collecting kinematic data, which is the example that I will walk through in the rest of this post. Kinematic signals are typically time normalized from 0–100% of the gait cycle, that is, from when one heel hits the ground to when the same heel hits the ground again. In this example, I will use the knee flexion angle as the signal because it is intuitive and a common metric in biomechanical studies. The following image depicts the entire gait cycle, where the blue box represents the start of the gait cycle (first heel strike) and the red box represents the end of the gait cycle (second heel strike).

Image credit: https://www.orthobullets.com/foot-and-ankle/7001/gait-cycle
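As a quick aside, the time normalization itself is straightforward. Here is a minimal sketch using linear interpolation, where raw_signal is a hypothetical knee angle trace sampled between two consecutive heel strikes at the capture frame rate:

import numpy as np

def time_normalize(raw_signal, n_points=101):
    #resample one gait cycle to n_points (0-100% of the cycle in 1% increments)
    original_time = np.linspace(0, 100, num=len(raw_signal))
    normalized_time = np.linspace(0, 100, num=n_points)
    return np.interp(normalized_time, original_time, raw_signal)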

The corresponding knee flexion signal looks something like this, with the blue and red arrows corresponding to the first and second heel strikes depicted in the image above. The green arrow corresponds to the first peak value and the orange arrow corresponds to the second peak value (more on this later).

Image created by the author.

A typical analysis methodology is to extract discrete values from the kinematic signals, such as the first and second peak values shown in the plot above. Once these are extracted, you have variables that you can compare in a statistical model. The limitation of this approach is that you are throwing out 98–99% of the data, which could still contain important information for answering your research question. An alternative approach, as you could guess by now, is applying PCA to the data matrix consisting of these kinematic signals.
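For reference, here is a sketch of that discrete extraction. Splitting the cycle at 50% to separate the two peaks is a simplifying assumption for illustration; real pipelines often use detected gait events instead. knee_angle is a hypothetical 101-point normalized signal.

import numpy as np

def first_and_second_peaks(knee_angle):
    #crude split at 50% of the gait cycle to separate the two flexion peaks
    mid = len(knee_angle) // 2
    first_peak = knee_angle[:mid].max()  #first (stance-phase) flexion peak
    second_peak = knee_angle[mid:].max()  #second (swing-phase) flexion peak
    return first_peak, second_peak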

Applying PCA to time series data

Let’s start with a data matrix consisting of n observations and v variables, where each variable is a percentage of the gait cycle. This data matrix would look something like this:

Image created by the author.
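In code, assembling this matrix amounts to stacking the time-normalized cycles as rows. A minimal sketch, with random placeholder values standing in for real signals:

import numpy as np
import pandas as pd

n_subjects, n_points = 50, 101
rng = np.random.default_rng(0)
#placeholder rows; in practice each row is one subject's normalized cycle
X = pd.DataFrame(rng.standard_normal((n_subjects, n_points)),
                 columns=[f'{p}%' for p in range(n_points)])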

Most programming languages (Python, R, MATLAB) have built-in PCA functions that do the math for you. For this analysis, and most of my other analyses, I use Python with the scikit-learn and pandas libraries. Once the data is imported and in matrix format like the one shown above, it is simple to run PCA and export the principal component scores and eigenvectors. The code below shows how to fit the PCA and export the scores, eigenvectors, and explained variance ratio. The explained variance ratio is key for determining how much of your data you want to keep for building statistical or machine learning models (more to come on this later). Here is a snippet of code showing these processes:

#import libraries: PCA from scikit-learn, pandas for data frames, os for file paths
import os

import pandas as pd
from sklearn.decomposition import PCA

#initialize PCA (keeping all components for now)
pca = PCA()

#fit the PCA model
#fitting the PCA model using fit_transform returns the principal component scores
scores = pca.fit_transform(X=X)
#create a data frame of the PC scores
scores_df = pd.DataFrame(scores)

#export the scores
scores_df.to_csv(os.path.join(os.getcwd(), 'PCA_Analysis', 'PC_scores.csv'), sep=',', index=False)

#get the principal component loadings (also known as eigenvectors)
loadings = pd.DataFrame(pca.components_)
#export the loadings
loadings.to_csv(os.path.join(os.getcwd(), 'PCA_Analysis', 'PC_loadings.csv'), sep=',', index=False)

#get the explained variance ratio of each principal component
explained_variance = pd.DataFrame(pca.explained_variance_ratio_.round(2))
#export the explained variance
explained_variance.to_csv(os.path.join(os.getcwd(), 'PCA_Analysis', 'PC_explained_variance.csv'), sep=',', index=False)

Now we have everything we need (scores, eigenvectors, and explained variance) to interpret the principal components and fit statistical or machine learning models.

Quick aside on explained variance: remember that PCA is a dimensionality reduction method, meaning it reduces the dimensionality of the data while retaining most of its information. In movement science, we typically want to retain at least 90% of the information in the original data, which is usually captured by the first 3–4 principal components. The explained variance ratios quantify the share of the total variance explained by each principal component, so we keep the components that collectively explain 90% of the variation in the data and disregard the rest. This way, we have reduced the dimension of our data from 100% of a gait cycle to a few principal components that explain the majority of the original data.
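Continuing from the fitted pca object and scores above, a short sketch of that 90% retention rule:

import numpy as np

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.90)) + 1  #components needed to reach 90%
retained_scores = scores[:, :n_keep]  #keep only the retained components' scores
print(f'Keeping {n_keep} components ({cumulative[n_keep - 1]:.1%} of the variance)')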

Early on, I mentioned that the original data can be recreated from the eigenvectors (also known as principal component loadings) and the principal component scores. To visualize this, we can multiply each eigenvector by its respective score, sum all of these products, and plot the summation. Visually, this would look something like this:

Image created by the author.
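With scikit-learn, this reconstruction is one line, and it matches the manual score-times-loading summation (again continuing from the fitted pca object and scores above):

import numpy as np

X_manual = scores @ pca.components_ + pca.mean_  #sum of score-weighted loadings, plus the mean
X_sklearn = pca.inverse_transform(scores)  #the equivalent built-in route
assert np.allclose(X_manual, X_sklearn)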

That’s pretty much it for the principal component analysis. The next step would be to fit a statistical or machine learning model using the principal component scores and interpret the results. In this project, I was comparing two groups of runners: a healthy group and a group with a running-related knee injury (I’ll refer to this group as the symptomatic group). I used all of the principal components that collectively explained at least 90% of the variation in the data for each kinematic, kinetic, and electromyography variable collected in the study. I fed the principal component scores into an interpretable machine learning classification model (which classified the healthy and symptomatic groups) and extracted the components that were most influential on the classification process; a rough sketch of this step follows below (I’ll spare the study specifics, but if you’re interested, we are currently publishing a paper on the full study). The next step is to interpret the principal components.
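The study's exact model isn't specified here, so this sketch uses logistic regression as a stand-in interpretable classifier on the retained PC scores; labels is a hypothetical array of group labels (0 = healthy, 1 = symptomatic) aligned with the rows of retained_scores.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(clf, retained_scores, labels, cv=5).mean()  #cross-validated accuracy
clf.fit(retained_scores, labels)
influential = abs(clf.coef_[0]).argsort()[::-1]  #components ranked by coefficient magnitude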

Interpreting Principal Components

Interpreting the principal components can be done using a few different methods. I used a method that has been used in other movement science studies (see Boyer et al., 2012). The first step is to create an ensemble average waveform of all of the signals in the original matrix, X. Next, multiply the eigenvector of the principal component by the average score of each group (and a scaling factor if necessary), and add the resulting vector to the ensemble average; do this for both groups (healthy and symptomatic) independently. Lastly, plot these two new vectors along with the original ensemble average. If the principal component carries important information for differentiating the two groups, you should see some points across the gait cycle where these newly generated vectors (from the principal component scores and eigenvectors) deviate from the overall ensemble average. The time points where these deviations occur are where the two groups differ in the signal. A visualization of this procedure should help:

Image created by the author.

In our example of the knee flexion angle, we see that this principal component reveals differences between the healthy and symptomatic runners at three different time points across the gait cycle, depicted by the green arrows: early gait cycle, mid-gait cycle (around the toe-off phase), and the peak knee flexion angle. Had we analyzed this signal by only extracting discrete values such as the first or second peaks, we would have missed these aspects of the gait cycle that are important for differentiating the two groups. As it turns out, the knee flexion angle is an important signal for this particular running-related injury.
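For completeness, here is a sketch of how such an overlay could be generated, continuing from the fitted pca object, the data matrix X, and the scores above; healthy and symptomatic are hypothetical boolean masks over the rows of X, and k is the component being interpreted.

import numpy as np
import matplotlib.pyplot as plt

k = 0  #index of the component being interpreted
mean_curve = np.asarray(X).mean(axis=0)  #ensemble average waveform
loading = pca.components_[k]  #eigenvector (loading) for component k

#offset the ensemble average by each group's mean score along this component
healthy_curve = mean_curve + scores[healthy, k].mean() * loading
symptomatic_curve = mean_curve + scores[symptomatic, k].mean() * loading

pct = np.arange(len(mean_curve))
plt.plot(pct, mean_curve, 'k', label='ensemble average')
plt.plot(pct, healthy_curve, label='healthy')
plt.plot(pct, symptomatic_curve, label='symptomatic')
plt.xlabel('% gait cycle')
plt.ylabel('knee flexion angle (deg)')
plt.legend()
plt.show()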

Conclusion

PCA is a very valuable dimensionality reduction tool for a number of applications, such as image analysis and building high-dimensional machine learning models; however, it can also be applied to time-series data. Through an analysis of a kinematic signal, we were able to show that differences in movement patterns exist between a healthy group of runners and a symptomatic group with just a few lines of code. I hope that this article helps to explain this process so that you can implement a similar analysis with your own time-series data. A Jupyter notebook is available on my GitHub with some data cleaning/pre-processing steps and the principal component analysis (scroll down to Step 8).

References:

Boyer, K. A., Federolf, P., Lin, C., Nigg, B. M., & Andriacchi, T. P. (2012). Kinematic adaptations to a variable stiffness shoe: Mechanisms for reducing joint loading. Journal of Biomechanics, 45(9), 1619–1624. https://doi.org/10.1016/j.jbiomech.2012.04.010

Deluzio, K. J., & Astephen, J. L. (2007). Biomechanical features of gait waveform data associated with knee osteoarthritis. An application of principal component analysis. Gait and Posture, 25(1), 86–93. https://doi.org/10.1016/j.gaitpost.2006.01.007
