Detecting the Fault Line Using Principal Component Analysis (PCA)

Naoko Suga
4 min read · Sep 6, 2018


Principal component analysis (PCA) is often used for dimensionality reduction in machine learning. It linearly transforms a dataset into a lower-dimensional space so that the projected data retains as much of the original variance as possible. This reduces model complexity and helps with multicollinearity while preserving the main trends and patterns. A simpler model runs faster, and lower-dimensional data is easier to visualize, which makes PCA a valuable tool for data scientists.

In addition to dimensionality reduction, PCA can be used to fit a linear model that minimizes the perpendicular distances from the data to the fitted model. This is useful when there is no natural distinction between predictor and response variables, or when all the variables are measured with error. Ordinary least squares (OLS) regression, by contrast, assumes that the predictor variables are measured exactly and that only the response variable has errors.
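The difference between the two fits can be sketched on synthetic 2-D data. The data, the noise levels, and all variable names below are assumptions made for illustration, and the PCA step uses NumPy's SVD rather than any particular library's PCA class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): points on the line
# y = 0.5 * x, with measurement noise added to BOTH coordinates.
t = rng.uniform(-5, 5, size=2000)
x = t + rng.normal(scale=1.0, size=t.size)
y = 0.5 * t + rng.normal(scale=1.0, size=t.size)

# OLS: minimizes vertical distances, treating x as exact.
ols_slope = np.polyfit(x, y, deg=1)[0]

# PCA: the first principal axis minimizes perpendicular distances.
centered = np.column_stack([x, y]) - [x.mean(), y.mean()]
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_axis = vt[0]                      # direction of largest variance
pca_slope = first_axis[1] / first_axis[0]

print(f"OLS slope: {ols_slope:.3f}")
print(f"PCA slope: {pca_slope:.3f}")
```

Because x is noisy here too, the OLS slope is pulled toward zero, while the PCA line treats both coordinates symmetrically.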

Left: OLS, minimizing the error (residual sum of squares). Right: PCA, minimizing the orthogonal distance. Source: http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/

Interestingly, PCA is sometimes used to fit a plane to seismic data in order to estimate the location of a fault. Because earthquakes occur on faults, mapping fault lines and structures is considered key to improving both our understanding of earthquake mechanisms and earthquake forecasts. It remains a major challenge, however: only a limited part of the complex fault network has been revealed, and hidden faults (called blind faults) could lie anywhere underneath.

Locations of earthquakes on the San Andreas fault near Parkfield (x, y, z in km, and magnitude)

The dataset I worked with contains 6,129 earthquake locations on the San Andreas fault. The figure above was plotted as part of the exploratory data analysis. It shows the location of each hypocenter, the point where the earthquake rupture starts (x, y, z positions in km), and a trend across the x-y plane is already visible. The plot below, projected onto the x-y plane, clearly shows that many hypocenters fall around a line in the middle (colors show the magnitude of the earthquakes).

* plot is after standardization

Now that we have a rough picture of how the points are distributed, let's use PCA to fit a plane that estimates the shape of the fault. First, the data was standardized using scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# Rescale each column of df to zero mean and unit variance (returns a NumPy array)
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
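To make the effect of the scaler concrete, here is a small sketch on made-up coordinates. The array `coords` and its values are assumptions standing in for the real data, which is not reproduced here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up coordinates in km (columns: x, y, z), standing in for df.
coords = rng.normal(loc=[-3.0, 8.0, -6.0], scale=[4.0, 10.0, 2.5], size=(500, 3))

scaler = StandardScaler()
scaled = scaler.fit_transform(coords)

print(scaled.mean(axis=0))   # approximately 0 for every column
print(scaled.std(axis=0))    # approximately 1 for every column
print(scaler.scale_)         # per-column standard deviations, in km

# Map standardized values back to kilometres.
restored = scaler.inverse_transform(scaled)
```

Note that each column is divided by its own standard deviation, so distances and angles in the standardized space are not the same as those measured in km.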

Additionally, since the fault structure is very complex, fitting a single plane to such a large area would not be a good idea. So an arbitrary section was chosen (-10 < x < 0 and -5 < y < 20). The plot on the left shows the selected section after standardization. With that done, we can carry out the PCA. For this, scikit-learn's PCA class was used. It takes a parameter, n_components, which specifies the number of components to keep. Here 3 was chosen, since we want all three principal axes. After fitting, the attribute components_ holds the vectors that represent the principal axes of the data, sorted by explained variance.
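A minimal sketch of this step, run on synthetic points rather than the real catalog, might look like the following. The array `points`, the mask, and the spread of the data are assumptions. The window reuses the bounds quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for the hypocenters (columns: x, y, z): widely
# scattered along two directions and thinly along the third.
points = rng.normal(size=(2000, 3)) * [5.0, 2.0, 0.3]

# Keep only a rectangular window in x and y.
mask = (
    (points[:, 0] > -10) & (points[:, 0] < 0)
    & (points[:, 1] > -5) & (points[:, 1] < 20)
)
section = points[mask]

pca = PCA(n_components=3)
pca.fit(section)

print(pca.components_)          # rows are principal axes, largest variance first
print(pca.explained_variance_)  # variance along each axis, in decreasing order
print(pca.mean_)                # per-feature mean of the fitted data
```

Each row of `components_` is a unit vector, and the rows are mutually orthogonal, so the last row points along the direction in which the data varies least.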

Since the first two components form a basis of the plane and the last component is orthogonal to both, the last component is the normal vector of the plane. Using this normal vector and the empirical mean of each feature, which lies on the plane, we can fit a plane to the data.

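A plane can be expressed as a zero dot product between its normal vector (a, b, c) and any vector lying in the plane, that is, the vector from a known point (xi, yi, zi) on the plane to any other point (x, y, z) on it. Written out, the equation of the plane is: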

a(x-xi)+b(y-yi)+c(z-zi) = 0

Because we know the normal vector (a, b, c) of the plane and a point (xi, yi, zi) that lies on it, we can pick arbitrary x and y and solve for z: z = zi - (a(x - xi) + b(y - yi)) / c, provided c is not zero. The orange plane on the left is the fitted plane.
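The plane equation can be turned into a small helper that returns z for any x and y. The function name `plane_z` and the numeric values of `normal` and `point` are hypothetical. In practice they would come from the last principal component and the per-feature mean.

```python
import numpy as np

def plane_z(normal, point, x, y):
    """Solve a(x - xi) + b(y - yi) + c(z - zi) = 0 for z."""
    a, b, c = normal
    xi, yi, zi = point
    if np.isclose(c, 0.0):
        raise ValueError("plane is vertical; z is not a function of x and y")
    return zi - (a * (x - xi) + b * (y - yi)) / c

# Hypothetical values standing in for the fitted normal vector and mean.
normal = np.array([0.1, -0.2, 0.97])
point = np.array([-5.0, 7.5, -0.4])

# Grid of arbitrary x and y values over the selected section.
xx, yy = np.meshgrid(np.linspace(-10, 0, 5), np.linspace(-5, 20, 5))
zz = plane_z(normal, point, xx, yy)

# Every grid point should satisfy the plane equation.
residual = (
    normal[0] * (xx - point[0])
    + normal[1] * (yy - point[1])
    + normal[2] * (zz - point[2])
)
print(np.abs(residual).max())
```

The resulting `xx`, `yy`, `zz` arrays have the shape a 3-D surface plot expects. The explicit check on c matters here because a truly vertical fault plane has c = 0 and cannot be drawn this way.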

Conclusion:

The fault in this area is actually known to be very close to vertical, but the fitted plane came out closer to horizontal than vertical. This could be a result of how I arbitrarily chose the area for the fit. Some of the data points could belong to a different, unknown fault structure near the San Andreas fault, so the fit could be improved by clustering the points before carrying out PCA. Also, all the data points are weighted equally, although it is known that larger earthquakes mostly occur along the main fault line. Weighting each point by the magnitude of its earthquake could therefore help improve the fit.
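One possible way to try the weighting idea is sketched below. It was not part of the analysis above: the helper names, the choice to use the raw magnitude as the weight, and the synthetic data are all assumptions. The sketch also reports how steep the fitted plane is, which makes the "vertical or horizontal" check explicit.

```python
import numpy as np

def weighted_plane_fit(points, weights):
    """Return (normal, centroid) of the plane best fitting weighted points."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    centroid = w @ points
    centered = points - centroid
    cov = (centered * w[:, None]).T @ centered  # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    normal = eigvecs[:, 0]                      # direction of least variance
    return normal, centroid

def dip_degrees(normal):
    """Angle between the plane and the horizontal: 0 is flat, 90 is vertical."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return np.degrees(np.arccos(np.clip(abs(n[2]), 0.0, 1.0)))

rng = np.random.default_rng(1)

# Synthetic, near-vertical "fault": wide in y and z, thin in x.
points = rng.normal(size=(1000, 3)) * [0.2, 5.0, 3.0]
magnitudes = rng.uniform(1.0, 5.0, size=1000)  # made-up magnitudes

normal, centroid = weighted_plane_fit(points, weights=magnitudes)
print(f"dip: {dip_degrees(normal):.1f} degrees")
```

The angle is only physically meaningful when x, y, and z share the same scale, so it would need to be computed in km rather than in standardized units.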


Naoko Suga

Data Scientist and Machine Learning Engineer with a background in Physics research and financial analysis