Matrices in Data Science Are Always Real and Symmetric
Because data science deals with real-world problems, matrices in data science must be real and symmetric
Linear algebra is a branch of mathematics that is extremely useful in data science and machine learning, and most machine learning models can be expressed in matrix form. Because data science deals with real-world, real-valued data, the matrices at the core of these models are real, and the most important ones are typically also symmetric. There are some exceptions to this. Advanced applications such as image processing rely heavily on Fourier analysis, where one can easily encounter matrices defined over the complex numbers. Other than that, for most basic data science and machine learning problems, the key matrices encountered are real and symmetric.
In this article, we will consider three examples of real and symmetric matrix models that we often encounter in data science and machine learning, namely, the regression matrix (R), the covariance matrix, and the linear discriminant analysis matrix (L).
Example 1: Linear Regression Matrix
Suppose we have a dataset that has 4 predictor features and n observations as shown below.
We would like to build a multiple regression model for predicting the y values (column 5). Our model can thus be expressed in the form

$$ y_i = w_1 X_{i1} + w_2 X_{i2} + w_3 X_{i3} + w_4 X_{i4}, \qquad i = 1, \dots, n. $$

In matrix form, this equation can be written as

$$ X w = y, $$

where X is the (n x 4) features matrix, w is the (4 x 1) matrix representing the regression coefficients to be determined, and y is the (n x 1) matrix containing the n observations of the target variable y.
Note that X is a rectangular matrix, so we can’t solve the equation above by taking the inverse of X.
To obtain a square system, we multiply the left-hand side and right-hand side of our equation by the transpose of X, that is

$$ X^T X w = X^T y. $$
This equation can also be expressed as

$$ R w = X^T y, $$

where

$$ R = X^T X $$

is the (4 x 4) regression matrix. R is real because X is real. To see that it is symmetric, note that in linear algebra the transpose of the product of two matrices obeys the following relationship

$$ (AB)^T = B^T A^T, $$

so that $R^T = (X^T X)^T = X^T (X^T)^T = X^T X = R$.
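Now that we’ve reduced our regression problem and expressed it in terms of the (4 x 4) real and symmetric regression matrix R, which is invertible provided the columns of X are linearly independent, it is straightforward to show that the least-squares solution of the regression equation is then

$$ w = R^{-1} X^T y = (X^T X)^{-1} X^T y. $$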
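As an illustration of the steps above, here is a minimal NumPy sketch. The data is synthetic and the names `w_true`, `w_normal`, and `w_lstsq` are hypothetical, chosen only for this example. It builds R from a random (n x 4) features matrix, checks that R is symmetric, and solves the square system for w.

```python
import numpy as np

# Synthetic data for illustration only (not the dataset from the article).
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 4))                    # (n x 4) features matrix
w_true = np.array([1.5, -2.0, 0.5, 3.0])       # hypothetical coefficients
y = X @ w_true + 0.1 * rng.normal(size=n)      # target with a little noise

# Regression matrix R = X^T X, a (4 x 4) real matrix.
R = X.T @ X
is_symmetric = np.allclose(R, R.T)

# Solve R w = X^T y. Using solve() avoids forming R^{-1} explicitly,
# which is usually the numerically safer choice.
w_normal = np.linalg.solve(R, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("R is symmetric:", is_symmetric)
print("w from normal equations:", w_normal)
print("w from lstsq:           ", w_lstsq)
```

In practice, when the columns of X are nearly collinear, R becomes ill-conditioned, and a solver such as `np.linalg.lstsq` applied directly to X is generally preferred over forming R.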
Example 2: Covariance Matrix
Suppose we have a features matrix with 4 highly correlated features and n observations as shown in Table 2 below:
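To visualize the correlations between the features, we can generate a scatter plot. To quantify the degree of correlation between features (multicollinearity), we can compute the covariance matrix using this equation:

$$ \sigma_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_{ij} - \mu_j \right) \left( X_{ik} - \mu_k \right), $$

where $\mu_j$ is the mean of feature $X_j$ over the n observations. In matrix form, the covariance matrix can be expressed as a 4 x 4 real and symmetric matrix:

$$ \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} & \sigma_{14} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} & \sigma_{24} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} & \sigma_{34} \\ \sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_{44} \end{pmatrix}, \qquad \sigma_{jk} = \sigma_{kj}. $$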
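Again, we see that the covariance matrix is real and symmetric. Writing $\Sigma$ for the covariance matrix, it can be diagonalized by performing a unitary transformation (for a real symmetric matrix, an orthogonal matrix U whose columns are the eigenvectors of $\Sigma$), also referred to as the Principal Component Analysis (PCA) transformation, to obtain the following:

$$ U^T \Sigma U = \begin{pmatrix} \lambda_1 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 \\ 0 & 0 & \lambda_3 & 0 \\ 0 & 0 & 0 & \lambda_4 \end{pmatrix}. $$

Since the trace of a matrix remains invariant under a unitary transformation, we observe that the sum of the eigenvalues $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4$ is equal to the trace of the covariance matrix, that is, the total variance contained in features X1, X2, X3, and X4.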
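The covariance and PCA steps can be sketched in NumPy as follows. The four correlated features are synthetic (an assumption made for this example, not the data in Table 2), and the 1/(n-1) normalization is assumed, which matches the default of `np.cov`.

```python
import numpy as np

# Synthetic, highly correlated features for illustration only.
rng = np.random.default_rng(1)
n = 200
base = rng.normal(size=n)
X = np.column_stack([base + 0.3 * rng.normal(size=n) for _ in range(4)])

# Covariance matrix from the definition: center each column, then
# average the outer products with the 1/(n-1) normalization.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n - 1)

# eigh() is designed for real symmetric matrices: it returns real
# eigenvalues (ascending) and orthonormal eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(cov)

# The PCA transformation diagonalizes the covariance matrix.
diag = eigvecs.T @ cov @ eigvecs

total_variance = np.trace(cov)
print("covariance is symmetric:", np.allclose(cov, cov.T))
print("sum of eigenvalues:", eigvals.sum())
print("total variance (trace):", total_variance)
```

Using `np.linalg.eigh` rather than `np.linalg.eig` is a deliberate choice here: it exploits the symmetry of the covariance matrix and guarantees real eigenvalues.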
Example 3: Linear Discriminant Analysis Matrix
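Another example of a real and symmetric matrix model in data science is the Linear Discriminant Analysis (LDA) matrix. This matrix can be expressed in the form

$$ L = S_W^{-1} S_B, $$

where S_W is the within-class scatter matrix, and S_B is the between-class scatter matrix. Both S_W and S_B are real and symmetric. Strictly speaking, their product L is real but not symmetric in general, because $S_W^{-1}$ and $S_B$ need not commute. However, assuming S_W is positive definite, L is similar to the real and symmetric matrix $S_W^{-1/2} S_B S_W^{-1/2}$, so its eigenvalues are real and it can still be diagonalized. The diagonalization of L produces a feature subspace that optimizes class separability and reduces dimensionality. Because the scatter matrices are built from the class labels, LDA is a supervised algorithm, while PCA is not.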
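The following NumPy sketch illustrates how the two scatter matrices and L might be computed. It assumes three synthetic classes with hypothetical class means, and it uses the common textbook definitions of the within-class and between-class scatter matrices; these are assumptions made for the example, not details taken from the article.

```python
import numpy as np

# Synthetic labeled data for illustration only: 3 classes, 4 features.
rng = np.random.default_rng(2)
n_per_class, d = 40, 4
class_means = [                      # hypothetical class centers
    np.zeros(d),
    np.array([2.0, 0.0, 1.0, 0.0]),
    np.array([0.0, 2.0, 0.0, 1.0]),
]
X = np.vstack([m + rng.normal(size=(n_per_class, d)) for m in class_means])
labels = np.repeat(np.arange(3), n_per_class)

overall_mean = X.mean(axis=0)
S_W = np.zeros((d, d))               # within-class scatter
S_B = np.zeros((d, d))               # between-class scatter
for c in np.unique(labels):
    X_c = X[labels == c]
    mean_c = X_c.mean(axis=0)
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)

# L = S_W^{-1} S_B, computed with solve() instead of an explicit inverse.
L = np.linalg.solve(S_W, S_B)

# L is not symmetric in general, so the general eigensolver is used.
eigvals, eigvecs = np.linalg.eig(L)

print("S_W symmetric:", np.allclose(S_W, S_W.T))
print("S_B symmetric:", np.allclose(S_B, S_B.T))
print("max asymmetry of L:", np.max(np.abs(L - L.T)))
print("eigenvalues of L:", eigvals)
```

With 3 classes, S_B has rank at most 2, so at most two eigenvalues of L are nonzero; the corresponding eigenvectors span the discriminant subspace.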
For more details about the implementation of LDA, please see the following references:
In summary, we’ve discussed three examples in which real and symmetric matrices arise in data science and machine learning: the regression matrix (R), the covariance matrix, and the scatter matrices behind the linear discriminant analysis matrix (L). Because data science deals with real-world, real-valued data, the key matrices in these basic models are typically real and symmetric.