Linear Algebra and Its Association with Data Science
Concepts of Mathematics can be so intriguing that we are sometimes convinced Data Science is nothing but Mathematics in disguise.
Linear Algebra is a fundamental branch of mathematics and is heavily used in data science for various purposes, one of the most popular being solving classification problems. Before I get into the concepts of Linear Algebra, I would like to give a brief introduction to the famous Iris data set. For those who don’t know what it is, it’s one of the most basic real-world data science problems and can be considered an absolute beginner-level problem to solve.
Problem Statement :
Given 150 samples of flowers that belong to the Iris genus, we need to classify the samples into three species, namely:
- Iris Setosa
- Iris Virginica
- Iris Versicolor
We will use four measured features of each flower, namely Petal Length, Petal Width, Sepal Length and Sepal Width, to classify it into one of the three categories mentioned above. I will not discuss the actual procedure used to classify them in this blog. I will write about it in a different blog on Exploratory Data Analysis. In this blog my main area of interest is the concepts of Linear Algebra and how they are utilized in real-world Data Science problems. For more information about the Iris data set, refer here
I will deal with Linear Algebra in 2-D space (2-dimensional) and 3-D space (3-dimensional), and then extend the same concepts to n-D space (n-dimensional). In Data Science and Machine Learning, for every concept of Linear Algebra we learn, it’s highly important to connect the concept with its geometrical meaning at the same time. We have to learn a Linear Algebra concept and its corresponding geometrical meaning hand in hand.
Point / Vector :
A point in a 2-D coordinate plane can be represented by a vector in Linear Algebra. In Mathematics and Physics, a vector is a quantity defined by two properties, namely magnitude and direction. We might have encountered this terminology during our high school days, but it is not the definition we will use in Data Science. For Data Science we define a vector the way Computer Science does. Those who have done their Bachelor’s or Master’s in Computer Science or any of its related fields will already know this definition: a vector simply means a one-dimensional array. The elements of this 1-D numerical array are the same as the components of a vector in Mathematics. For example,
[2,3] is a vector (1-D numerical array) whose components are 2 and 3. In a 2-D plane the same can be illustrated as :
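The two views of a vector are easy to connect in code. The following is a minimal sketch using only Python's standard library: the list `v` is the 1-D array, and the variable names `magnitude` and `direction_deg` are just illustrative choices for the high-school quantities recovered from its components.

```python
import math

# A vector in the computer-science sense: a 1-D array of numbers.
v = [2, 3]

# The same array read as a point (x, y) in the 2-D plane.
x, y = v

# The magnitude-and-direction view can be recovered from the components.
magnitude = math.hypot(x, y)                     # length of the arrow from the origin to (2, 3)
direction_deg = math.degrees(math.atan2(y, x))   # angle measured from the positive X-axis

print(len(v))                    # number of components
print(round(magnitude, 3))
print(round(direction_deg, 1))
```

The array carries everything the geometric definition needs, which is why the simpler array definition is enough for our purposes.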
In 2-Dimensional plane : Consider a 2-D coordinate plane which is divided into 4 quadrants by two axes, namely the X-axis and the Y-axis. For a data set having only two variables, we can build a classifier by plotting the data in a 2-D plane like the one above, with one variable along each axis.
Special Note :
>Value of x-coordinate = signed distance from the Y-axis (negative to the left of it).
>Value of y-coordinate = signed distance from the X-axis (negative below it).
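A small sketch of the note above, assuming we also care about the sign of each coordinate: the sign tells us which quadrant the point falls in, and the absolute value gives the distance from the opposite axis. The helper name `quadrant` is hypothetical.

```python
def quadrant(x, y):
    """Return the quadrant number (1-4) of a point, or 0 if it lies on an axis."""
    if x == 0 or y == 0:
        return 0
    if x > 0:
        return 1 if y > 0 else 4
    return 2 if y > 0 else 3

point = (-2, 3)
dist_from_y_axis = abs(point[0])   # distance from the Y-axis comes from the x-coordinate
dist_from_x_axis = abs(point[1])   # distance from the X-axis comes from the y-coordinate

print(quadrant(*point), dist_from_y_axis, dist_from_x_axis)
```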
In 3-Dimensional Space : Similarly, for a data set having three variables, one can build a classifier by plotting the data in a 3-D space using 3 axes, namely the X-axis, Y-axis and Z-axis. A 3-D space is described by three mutually perpendicular planes, namely the XY-plane, YZ-plane and ZX-plane.
For our example, i.e. the Iris data set, four variables are used for classification, namely petal length, petal width, sepal length and sepal width, so we need a 4-dimensional space to plot the data set and classify the data points. Unfortunately, it’s very difficult for the human eye to visualize anything beyond plots constructed in a 3-D space.
In the real world, the number of variables in a data set can be very large (in theory, even unbounded). In such cases it’s very difficult to plot them, because a data set having ‘n’ variables needs an n-dimensional space with ‘n’ axes. That is perfectly well defined mathematically, but it cannot be drawn in practice. So we use some simple hacks to deal with this kind of difficulty, and they are:
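Even though a 4-D space cannot be drawn, the arithmetic works exactly as in 2-D. The sketch below treats each flower as a 4-component vector and measures the straight-line (Euclidean) distance between flowers. The measurements are illustrative values in centimetres chosen to look like typical Iris samples, and the feature order in `FEATURES` is an assumption made for this example.

```python
import math

# Assumed feature order for this example.
FEATURES = ("sepal_length", "sepal_width", "petal_length", "petal_width")

# Illustrative measurements (cm); each flower is one point in 4-D space.
sample_a = [5.1, 3.5, 1.4, 0.2]
sample_b = [7.0, 3.2, 4.7, 1.4]
sample_c = [4.9, 3.0, 1.4, 0.2]

# math.dist (Python 3.8+) computes Euclidean distance in any number of dimensions.
d_ab = math.dist(sample_a, sample_b)
d_ac = math.dist(sample_a, sample_c)

print(round(d_ab, 3), round(d_ac, 3))
```

Here `sample_c` ends up much closer to `sample_a` than `sample_b` does, which is the kind of geometric fact a classifier can exploit without anyone having to see the 4-D plot.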
1. Pair plots
2. Dimensionality reduction
I will write about the above techniques in a different blog.
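As a small preview of the first hack: a pair plot draws one ordinary 2-D scatter plot for every pair of variables. The sketch below only enumerates those pairs for the four Iris features (no plotting library is assumed), to show how many 2-D views are needed.

```python
from itertools import combinations
from math import comb

features = ["petal length", "petal width", "sepal length", "sepal width"]

# Every unordered pair of features gets its own 2-D plot.
pairs = list(combinations(features, 2))

for x_axis, y_axis in pairs:
    print(f"{x_axis} vs {y_axis}")

# In general, n variables give n * (n - 1) / 2 distinct pairs.
print(len(pairs), comb(len(features), 2))
```

The number of pairs grows quadratically with the number of variables, so pair plots are only practical when there are few of them.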
In n-Dimensional Space : As mentioned above, we need an n-dimensional space to plot a data set having ‘n’ variables, and we need ‘n’ axes to achieve classification. Since there are only 26 letters in the English alphabet, naming the axes X, Y, Z, L, K and so on would limit us to at most 26 axes. Therefore, as a matter of general convention, we name the axes X1, X2, X3, …, Xn.
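The naming convention is trivial to generate for any n. In this sketch, `axis_names` and `label_point` are hypothetical helper names, and the sample values are illustrative.

```python
def axis_names(n):
    """Return the conventional axis names X1, X2, ..., Xn."""
    return [f"X{i}" for i in range(1, n + 1)]

def label_point(point):
    """Pair each component of an n-D point with its axis name."""
    return dict(zip(axis_names(len(point)), point))

print(axis_names(4))
print(label_point([5.1, 3.5, 1.4, 0.2]))
print(len(axis_names(30)))   # more axes than the alphabet has letters
```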
The term “classifier” refers to the mathematical function, implemented by a classification algorithm, that maps input data to a particular category.
A line or a curve acts as a classifier in a 2-D plane. Similarly, a plane, a sphere or any other curved surface may act as a classifier in 3-D space, and a hyperplane, a hypersphere or any other hypersurface acts as a classifier in n-D space.
> Intersection of two non-parallel lines in a plane is a point.
> Intersection of two non-parallel planes is a line.
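For the flat cases (line, plane, hyperplane) the idea can be sketched with one formula: the separator is the set of points where w·x + b = 0, and a point's class is decided by the sign of w·x + b. The function name `side_of_hyperplane` and the weights used below are made up for illustration, not taken from a trained model.

```python
def side_of_hyperplane(w, b, x):
    """Return +1 or -1 for the side of w.x + b = 0 that x lies on, or 0 if x is on it."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (score > 0) - (score < 0)

# 2-D: the line x1 + x2 - 4 = 0 splits the plane into two halves.
w_line, b_line = [1, 1], -4
print(side_of_hyperplane(w_line, b_line, [1, 1]))   # below the line
print(side_of_hyperplane(w_line, b_line, [3, 3]))   # above the line

# 3-D: the plane x3 = 2, written as 0*x1 + 0*x2 + 1*x3 - 2 = 0.
w_plane, b_plane = [0, 0, 1], -2
print(side_of_hyperplane(w_plane, b_plane, [5, 7, 3]))
```

The same function works unchanged for any number of dimensions, because it only depends on the length of `w` and `x`.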
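The first fact can be checked by solving two linear equations at once. This is a minimal sketch using Cramer's rule, where each line is written as a·x + b·y = c. The function name `intersect` and the two example lines are assumptions for illustration. Parallel lines have no single intersection point, so the function returns `None` for them.

```python
def intersect(line1, line2):
    """Intersect two lines given as (a, b, c) for a*x + b*y = c.

    Returns (x, y), or None when the lines are parallel or identical.
    """
    a1, b1, c1 = line1
    a2, b2, c2 = line2
    det = a1 * b2 - a2 * b1          # zero exactly when the lines are parallel
    if det == 0:
        return None
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return (x, y)

print(intersect((1, 1, 4), (1, -1, 2)))   # x + y = 4 and x - y = 2
print(intersect((1, 1, 4), (2, 2, 1)))    # parallel lines
```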
In the above figure, a plane is acting as a classifier (separator) between the green points and the yellow points. But we can never claim that the plane is a perfectly accurate classifier, because there are some mismatched points here and there. That’s okay, because constructing a perfectly accurate classifier for a real data set is practically difficult.
For a given data set, many classifiers may be possible. For example, we could have done the above classification using a sphere, or in 2-D a rectangle or a circle. What we look for is the optimal classifier, the one with the minimum number of mismatches. That is all for this introduction to Linear Algebra for Data Science.
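To make "minimum number of mismatches" concrete, here is a minimal sketch that compares three candidate classifiers on a tiny 2-D data set. The points, labels and rules are all made up for illustration, and names such as `count_mismatches` are hypothetical.

```python
# Hypothetical labelled 2-D points (made up for illustration).
points = [
    ((1, 1), "green"), ((2, 1), "green"), ((1, 2), "green"), ((3, 3), "green"),
    ((4, 4), "yellow"), ((5, 4), "yellow"), ((4, 5), "yellow"),
]

# Three candidate classifiers: two lines and one circle.
def line_rule(p):
    return "green" if p[0] + p[1] < 5 else "yellow"

def circle_rule(p):
    return "green" if p[0] ** 2 + p[1] ** 2 < 20 else "yellow"

def vertical_rule(p):
    return "green" if p[0] < 1.5 else "yellow"

def count_mismatches(classifier, data):
    """Count the points whose predicted label differs from the true label."""
    return sum(1 for p, label in data if classifier(p) != label)

candidates = {"line": line_rule, "circle": circle_rule, "vertical line": vertical_rule}
scores = {name: count_mismatches(rule, points) for name, rule in candidates.items()}
best = min(scores, key=scores.get)

print(scores)
print(best)
```

On this toy data the circle happens to separate the two colours with no mismatches, while both lines misclassify at least one point. On another data set the ranking could easily be different.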