How mathematical (linear algebra) objects such as vectors, matrices, and tensors are used in AI to store data of different dimensions.

Linear Algebra: How Is It Used in AI?

Shafi
Published in Analytics Vidhya
12 min read, Aug 27, 2020

Understand how linear algebra is applied in AI.

How linear algebra (mathematical objects) is used in Artificial Intelligence.

Sub-fields in AI

Artificial Intelligence is not a single subject; it has sub-fields such as Learning (Machine Learning and Deep Learning), Communication using NLP, Knowledge Representation & Reasoning, Problem Solving, and Uncertain Knowledge & Reasoning.

This article explains how linear algebra (LA) objects and their properties are used in the algorithms of AI's sub-fields: ML, DL, NLP, and so on.

Sub-field concepts where LA objects can be applied.

For each sub-field, the article briefly explains the relevant topic and how linear algebra is applied to it. The following diagram shows the areas of AI where linear algebra is applied.

LA objects applied in these areas of AI

Note: Data representation and data processing are not sub-areas of AI; they are used within the ML, DL, and NLP areas.

LA objects are also used in the other sub-areas shown in the diagram above, such as problem solving, knowledge representation, and knowledge reasoning, but not as heavily as in Learning (ML/DL) and NLP.

LA objects and their properties in these sub-fields

The linear algebra (mathematical) objects are vectors, matrices, and tensors. Depending on the dimensions of your data, you choose the right object to store and process it; the title diagram illustrates this.

Before looking at how mathematical objects are used in AI, it is worth refreshing your linear algebra.

Data representation: Explained in terms of the mathematical objects vector, matrix, and tensor.

Data set: A collection of examples (also called data points or objects). Each example is a collection of features. Each example is a row and each feature is a column.

Design Matrix: A data set can be described by a design matrix, which is a matrix containing a different example in each row. For example:

Design Matrix representation

If the examples do not all have the same features, i.e., the columns are not the same for each row, the data cannot be stored as a single matrix. In such cases we describe the data set as a set containing m elements, each of which may be a vector of a different size.

In supervised learning, the data set contains a label or target for each example as well as a collection of features.

Design Matrix for Supervised Learning
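As a minimal sketch of the idea above, a design matrix and its labels can be stored as a NumPy matrix and vector. The numbers and the variable names `X`, `y`, `m`, and `n` are made up for illustration and do not come from the article's figures.

```python
import numpy as np

# Hypothetical toy data set: 4 examples (rows), 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])

# Supervised learning adds one label (target) per example.
y = np.array([0, 0, 1, 1])

m, n = X.shape              # m examples, n features
first_example = X[0]        # one example is a vector of n features
second_feature = X[:, 1]    # one feature is a vector of m values
```

Each row of `X` is one example and each column is one feature, so `X.shape` gives the number of examples and features directly.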

Data Processing: Before a data set is used in an ML algorithm or in any other sub-field of AI, it must be made ready (cleansed and filtered).

There are 3 forms of data processing: mean subtraction, normalization, and PCA & whitening. These forms are described briefly in the diagram below.

Data processing operations explained through Numpy
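The first two forms can be sketched in NumPy as follows. The data values are hypothetical, and dividing by the per-feature standard deviation is one common way to normalize; other scalings are possible.

```python
import numpy as np

# Hypothetical design matrix: 3 examples, 2 features on very different scales.
X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
])

# Mean subtraction: center every feature (column) around zero.
X_centered = X - X.mean(axis=0)

# Normalization: scale every centered feature to unit standard deviation.
X_normalized = X_centered / X_centered.std(axis=0)
```

Both operations act on whole columns of the matrix at once, which is why they are usually written without explicit loops.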

All 3 forms take a matrix as input and produce the desired transformed matrix. The third form, PCA, is used for dimensionality reduction and works entirely in linear algebra; the following algorithm describes it.

PCA Algorithm for Training Data
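Since the original figure is not reproduced here, the following is a sketch of one common PCA and whitening recipe (covariance matrix followed by SVD), not necessarily the exact algorithm in the figure. The random data, the choice `k = 2`, and the small constant `1e-5` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: 200 examples, 3 features.
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing

# Step 1: mean subtraction.
X = X - X.mean(axis=0)

# Step 2: covariance matrix of the features.
cov = X.T @ X / X.shape[0]

# Step 3: SVD of the covariance; columns of U are the principal directions,
# S holds the variances along them in decreasing order.
U, S, Vt = np.linalg.svd(cov)

# Step 4: rotate the data into the principal-component basis.
X_rot = X @ U

# Step 5: dimensionality reduction, keeping the top k components.
k = 2
X_reduced = X_rot[:, :k]

# Whitening: rescale each component to (approximately) unit variance.
X_white = X_rot / np.sqrt(S + 1e-5)
```

After the rotation the features are decorrelated, so the covariance of `X_rot` is diagonal with the entries of `S` on the diagonal.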

Operations such as argmin and argmax are used in data selection, feature engineering, data cleansing, and similar data processing steps. They work on vectors and matrices and return the indices of the minimum or maximum values, respectively.

Here the axis can be 0 or 1. In NumPy, axis 0 runs down the rows, so the result has one index per column; axis 1 runs across the columns, so the result has one index per row.

argmin: Returns the indices of the minimum values along an axis.

argmax: Returns the indices of the maximum values along an axis.
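A short NumPy sketch of both operations on a made-up 2 x 3 matrix shows how the axis argument changes the result:

```python
import numpy as np

A = np.array([[3, 7, 1],
              [9, 2, 8]])

# axis=0: compare down each column, one row index per column.
col_min = np.argmin(A, axis=0)   # array([0, 1, 0])

# axis=1: compare across each row, one column index per row.
row_max = np.argmax(A, axis=1)   # array([1, 0])

# No axis: index into the flattened array.
flat_max = np.argmax(A)          # 3, the position of the value 9
```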

Machine Learning (ML): ML is an algorithm-based approach that learns from training data and makes decisions on unseen data. Many ML algorithms exist for supervised and unsupervised learning.

How LA concepts are applied in an ML regression algorithm: This section describes how linear algebra applies to regression analysis, using multiple linear regression as the example. The following diagram describes LA concepts in ML and DL.

LA Objects, properties and usages in ML and DL

Regression analysis explained in terms of vectors, matrices, and their properties.

What is regression? It is a statistical technique for estimating the relationship between a dependent variable and one or more independent variables.

The most common form of regression analysis is linear regression.

The following equations describe simple and multiple linear regression.

Simple & Multiple Regression with examples

This technique predicts continuous responses, for example, forecasting stock prices or house rent.

Residual: In machine learning and statistical terminology, a residual is the difference between the observed value and the estimated value of the target variable.

Notation is given below:

Notation for observed & estimated value of target variable
Residual in Multiple regression

Sum of Squares of Residuals: Let us denote the residual by ‘r’ and the sum of the squared residuals by ‘S’.

Least Squares method: The least squares method is the standard approach; it minimizes the sum of squares of residuals ‘S’.
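In symbols, these two definitions can be written as below. The index i, the number of examples m, and the hat notation for the estimated value are notation assumed here, since the original figures are not reproduced.

```latex
r_i = y_i - \hat{y}_i,
\qquad
S = \sum_{i=1}^{m} r_i^{2} = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^{2}
```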

Ordinary Least Squares (OLS), or linear least squares, estimates the parameters of a regression model by minimizing the sum of the squares of the residuals. It draws a line through the data points that minimizes the sum of squared errors (SSE) between the observed and predicted (fitted or estimated) values.
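A minimal sketch of OLS for multiple linear regression, using NumPy's least-squares solver. The data are hypothetical and generated without noise from y = 1 + 2*x1 + 3*x2, so the fit should recover those coefficients; the names `X_design`, `w`, and `S` are chosen here for illustration.

```python
import numpy as np

# Hypothetical data: 5 examples, 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the intercept is learned as a weight.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve min_w ||X_design @ w - y||^2.
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Residual vector and sum of squares of residuals.
residuals = y - X_design @ w
S = residuals @ residuals
```

Here the whole data set enters the computation as one matrix and one vector, and the sum of squares is simply the dot product of the residual vector with itself.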

The most important application of least squares is data fitting.

Data Fitting: It is the process of constructing a curve, or mathematical function, that has the best fit to a set of data points.

Curve fitting can be linear or non-linear. The following describes both curves.

Linear Curves:

Linear Curve

Having introduced regression analysis, let us define its loss and cost functions.

Loss Function: The loss function of linear regression is defined as follows:

The loss function of Regression

The parameters are found by differentiating the loss with respect to the parameters.

Finding weights or parameters by applying a gradient on the Loss function
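One way to use the gradient is plain gradient descent. The sketch below assumes a mean squared error loss with a 1/(2m) scaling, a learning rate of 0.1, and synthetic noise-free data; these choices are illustrative and are not taken from the article's figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 examples, 2 features, known weights.
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

def loss(w):
    """Assumed loss: (1 / 2m) * ||Xw - y||^2."""
    r = X @ w - y
    return (r @ r) / (2 * len(y))

def grad(w):
    """Gradient of the loss above: (1 / m) * X^T (Xw - y)."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(2)
learning_rate = 0.1
for _ in range(500):
    w -= learning_rate * grad(w)   # step against the gradient
```

The gradient is itself a matrix-vector product, which is why the update needs no loop over individual examples.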

What is Regularization? To avoid the over-fitting problem, regularization is used to shrink the magnitude of the parameters. This is achieved by adding a penalty (a function of the magnitude of the parameters) to the cost function. L1, L2, dropout, and max-norm constraints are used in DL, whereas L1, L2, and L1+L2 (elastic net) are used in ML.

If you are using neural networks for ML algorithms, you can apply all 4 of the above regularization techniques.

L2 Regularization: It is the most common form of regularization. It is implemented by penalizing the squared magnitude of all parameters directly in the objective.

L1 Regularization: For each weight w, we add the term param*|w| to the objective function. Both L1 and L2 are defined as follows:

Generalized Regularization

Use of Vector Norms in Machine Learning for Regularization:

Vector Norms in Regularization to avoid the Over fitting problem
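The penalties can be computed directly from vector norms. In this sketch the weight vector, the regularization strength `lam`, and the placeholder `data_loss` value are hypothetical.

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])   # hypothetical weight vector
lam = 0.1                        # hypothetical regularization strength

l1_norm = np.linalg.norm(w, ord=1)   # |3| + |-4| + |0| = 7
l2_norm = np.linalg.norm(w, ord=2)   # sqrt(9 + 16 + 0) = 5

l1_penalty = lam * l1_norm           # L1: lam * sum of |w|
l2_penalty = lam * l2_norm ** 2      # L2: lam * sum of w^2

# The penalty is added to whatever loss is computed on the data.
data_loss = 1.25                     # placeholder value for illustration
regularized_loss = data_loss + l2_penalty
```

Larger weights produce a larger penalty, so minimizing the regularized loss pushes the parameters toward smaller magnitudes.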

Deep Learning (DL): It is a branch of ML that learns representations from text, images, or videos. Unstructured data such as images or videos can be processed using DL. DL has many applications, such as image processing (computer vision using CNNs), video processing (computer vision using RNNs), and text processing (NLP using RNNs and LSTMs), and it can be combined with reinforcement learning (deep RL).

DL is inspired by biological neurons. Each neuron is connected to multiple other neurons, and an activation function is applied at the neuron.

Vectors, matrices, and tensors are the objects used in DL. The following diagram shows a sample neural network and describes the input, neurons, layers, feedforward propagation, back-propagation, etc.

Many mathematical subjects are involved in deep learning; this article considers only linear algebra and describes how mathematical objects are used at each stage.

Common Neural Network Architecture

Input: The input to the neural network is in the form of vectors, matrices, or tensors. Ultimately, each data object or sample is represented as a vector. Here the input is a vector of n dimensions: one example, or data point, from the data set.

Neurons or Nodes: At each neuron we apply an activation function to the weighted inputs from the previous layer. A neural network is an interconnected group of natural or artificial neurons that uses a mathematical or computational model for information processing based on a connectionist approach to computation.

Connections: Connections of the biological neuron are modeled as weights.

Each neuron is connected to the neurons in the next layer

Layer: Each layer contains a set of neurons, as the following picture depicts.

A layer contains neurons and operates at the vector level.

Feedforward Propagation: Networks that use it are called deep feedforward networks, feedforward neural networks, or multilayer perceptrons (MLPs). They are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y.

Feedforward neural networks are called networks because they are typically represented by composing together many different functions.

For example, our network might have 3 functions f1, f2, and f3 connected in a chain to form f(x) = f3(f2(f1(x))).

These chain structures are the most commonly used structures of neural networks.

Let us see how vectors and matrices are applied in feedforward networks.

  1. Vectorizing inputs, weights, and bias: x is the input vector of n dimensions; W is the weight matrix with n rows and m columns, one column per neuron in the next layer; and b is the bias vector with one entry for each of the m neurons in the next layer.
Overall calculation of input, weights, and bias into the temporary variable Z

From this it is concluded that

Generalized Approach

  2. Apply the activation function to the intermediate variable Z

Feedforward into next layer

  3. The above steps are repeated, and the results are fed forward to the next layer.

At each neuron, the intermediate calculation and activation function are as follows:

Consider an example neural network with 2 input features, 4 hidden layers with 3, 5, 4, and 2 units, and 1 output layer with 1 unit.

Neural Network with 1-Input, 4 Hidden and 1 Output Layers

Let us apply Vector, Matrix operations for forward propagation.

Forward Propagation for 4 layers

Know your Matrix dimensions:

Dimensions of the Matrix, Vector in Feed forward propagation
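A minimal NumPy sketch of forward propagation for the layer sizes 2, 3, 5, 4, 2, 1 from the example. The random weights, zero biases, ReLU hidden activation, and sigmoid output are assumptions for illustration; each weight matrix has one row per input and one column per neuron in the next layer.

```python
import numpy as np

layer_sizes = [2, 3, 5, 4, 2, 1]   # input, 4 hidden layers, output
rng = np.random.default_rng(0)

# One weight matrix (n_in x n_out) and one bias vector (n_out) per layer.
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Return the activations of every layer, starting with the input."""
    a = x
    activations = [a]
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W.T @ a + b                      # intermediate variable Z
        is_output = i == len(weights) - 1
        a = sigmoid(z) if is_output else relu(z)
        activations.append(a)
    return activations

x = np.array([0.5, -1.2])                    # one example with 2 features
activations = forward(x)
```

Checking `W.shape` for every layer against the sizes of the layers it connects is a quick way to catch dimension mistakes before training.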

Feedforward propagation = matrix-vector products and additions, combined with activation functions.

Back propagation = matrix calculus + linear algebra product rules; this will be covered in the next article.

Natural Language Processing (NLP): NLP is concerned with the interactions between humans and computers, in particular how to program computers to process and analyze large amounts of natural language data.

Here we describe the Word2Vec (W2V) technique for NLP. Word2Vec represents each distinct word with a particular list of numbers called a vector. With W2V we can apply vector properties to check the semantic similarity between the words those vectors represent.

In NLP we use vectors and matrices as follows:

Vectors and matrices used in W2V algorithms

W2V is used in many NLP tasks and is the basis for capturing a word as a vector. Natural language text = sequence of discrete symbols.

It produces a dense vector representation based on the context or use of words.

What are target and context words? Consider a text instance with context window size = 2. The following describes them:

Context and Target/Current Words

How do we build a one-hot representation?

Vocabulary: The set of words encoded into the feature vector is called the vocabulary, so the dimension of the vector is equal to the size of the vocabulary. In short, |V| = size of the vocabulary.

Let us say our text data set contains the following lines:

  1. “And the Cute kitten purred and then …”
  2. “Cute furry cat purred and miaowed …”
  3. “That the small kitten miaowed and she …”
  4. “the loud furry dog ran and bit …”

From these 4 sentences, the basis vocabulary is {bit, cute, furry, loud, miaowed, purred, ran, small}, so the vocabulary length is 8. Let us define the target and context words.

Target word: Kitten, Context words: {cute, purred, small, miaowed}

Target word: Cat, Context words: {cute, furry, miaowed}

Target word: Dog, Context words: {loud, furry, ran, bit}

Now we represent each target word as a vector of vocabulary length 8.

Words as Vectors

Each vector has a ‘1’ at the dimensions whose context words appear with the target word and a ‘0’ otherwise.

Checking the similarity between vectors: To check similarity we can use the inner product or the cosine as the similarity kernel.

Sim(Kitten, Cat) = Cosine(Kitten, Cat) ≈ 0.58; Sim(Kitten, Dog) = Cosine(Kitten, Dog) ≈ 0.00; Sim(Cat, Dog) = Cosine(Cat, Dog) ≈ 0.29

Cosine, Dot and Cross product between vectors
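The similarity values for Kitten, Cat, and Dog can be reproduced with a few lines of NumPy, using the vocabulary and context words listed in the text. The helper name `cosine` and the dictionary layout are choices made for this sketch.

```python
import numpy as np

vocab = ["bit", "cute", "furry", "loud", "miaowed", "purred", "ran", "small"]

contexts = {
    "kitten": {"cute", "purred", "small", "miaowed"},
    "cat": {"cute", "furry", "miaowed"},
    "dog": {"loud", "furry", "ran", "bit"},
}

# 1 where the vocabulary word is a context word of the target, else 0.
vectors = {
    target: np.array([1.0 if word in ctx else 0.0 for word in vocab])
    for target, ctx in contexts.items()
}

def cosine(a, b):
    """Cosine similarity: inner product divided by the product of norms."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_kitten_cat = cosine(vectors["kitten"], vectors["cat"])   # about 0.58
sim_kitten_dog = cosine(vectors["kitten"], vectors["dog"])   # 0.0
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])         # about 0.29
```

Kitten and Dog share no context words, so their inner product, and therefore their cosine, is zero.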

Embedding Matrix: In the embedding matrix, the rows correspond to target words and the columns to context words; which context words are counted for a target word depends on the length of the context window.

Embedding Matrix Dimension

Rows are word vectors, so we can retrieve them with one-hot vectors.

Word representation using one hot
Embedding Matrix with row as Target word and its context words
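Retrieving a row with a one-hot vector is just a vector-matrix product. The 3-word vocabulary and the embedding values below are made up for illustration.

```python
import numpy as np

vocab = ["kitten", "cat", "dog"]

# Hypothetical embedding matrix: one row (word vector) per word.
E = np.array([
    [0.2, 0.8, -0.1, 0.4],
    [0.3, 0.7, 0.0, 0.5],
    [-0.6, 0.1, 0.9, -0.2],
])

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

x = one_hot(vocab.index("cat"), len(vocab))
cat_vector = x @ E     # selects the row of E that belongs to "cat"
```

In practice libraries index the row directly (`E[1]`) instead of multiplying, but the result is the same.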

Algorithm for constructing Embedding Matrix:

Steps to construct Embedding Matrix

A word embedding is a vector that captures the meaning of a word; it is also referred to as a word vector. The following algorithms learn such embeddings:

  1. Skip-gram (SG): Predicts the context words given the target word
  2. Continuous bag of words (CBOW): Predicts the target word given the context words
  3. GloVe: It makes use of global co-occurrence statistics. GloVe consists of a weighted least squares model that trains on global word-word co-occurrence counts.

The 3 algorithms above are explained below in terms of linear algebra.

Step 1: Skip-gram (SG): The objective of the skip-gram (SG) model is to maximize the average log probability of the context words given the target word

Describing Context and Target words

Step 2: Project onto the vocabulary and apply softmax

Step 3: Learn to estimate the likelihood of the context words

SKIP-GRAM
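A sketch of the skip-gram probability computation, assuming the usual softmax form in which the score of a context word is the inner product of its output vector with the target word's input vector. The random vectors, the vocabulary size of 8, and the chosen word ids are hypothetical, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 8, 4

V = rng.normal(size=(vocab_size, dim))   # input (target) vectors, one per row
U = rng.normal(size=(vocab_size, dim))   # output (context) vectors, one per row

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def skipgram_probs(target_id):
    """P(context word | target word) for every word in the vocabulary."""
    scores = U @ V[target_id]            # matrix-vector product
    return softmax(scores)

target_id = 2
context_ids = [0, 1, 3, 4]

probs = skipgram_probs(target_id)
# Quantity the model tries to maximize for this target and its context.
avg_log_prob = np.mean(np.log(probs[context_ids]))
```

Training would adjust `U` and `V` so that `avg_log_prob`, averaged over the corpus, increases.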

Continuous bag of words (CBOW): It predicts the target (current) word based on its context words. Its probability distribution is obtained as follows:

  • Embed the context words and add them.
  • Project back to vocabulary size and apply softmax.
Expressing the current word as the softmax of vector-matrix products in LA
CBOW
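The two CBOW steps can be sketched as follows, again with random hypothetical vectors and word ids. The context vectors are summed as described in the text; averaging them is a common variant.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 8, 4

V = rng.normal(size=(vocab_size, dim))   # input vectors for context words
U = rng.normal(size=(vocab_size, dim))   # output vectors for target words

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

context_ids = [1, 2, 4, 5]

# Embed the context words and add them.
h = V[context_ids].sum(axis=0)

# Project back to vocabulary size and apply softmax.
probs = softmax(U @ h)

predicted_target = int(np.argmax(probs))
```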

GloVe: Like word2vec, GloVe is a set of vectors that capture semantic information (i.e., the meaning of words). It consists of a weighted least squares model that trains on global word-word co-occurrence counts.

GloVe makes use of global co-occurrence statistics.

Co-occurrence matrix: We define this matrix using the following corpus.

I like deep learning; I like NLP; I enjoy flying.

Co-occurrence Matrix
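A sketch of how such a matrix can be counted for the corpus above. The window size of 1, whitespace tokenization, and ignoring punctuation are assumptions made here; the original figure may use different settings.

```python
import numpy as np

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
window = 1   # assumed: count only immediate neighbours

sentences = [s.split() for s in corpus]
vocab = sorted({word for sent in sentences for word in sent})
index = {word: i for i, word in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for i, word in enumerate(sent):
        lo = max(0, i - window)
        hi = min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[index[word], index[sent[j]]] += 1
```

With these settings `X[index["I"], index["like"]]` is 2, because "I like" occurs in two sentences, and the matrix is symmetric.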

Let X be the word-word co-occurrence counts matrix.

As in word2vec, each word has 2 vectors: input (v) and output (u).

Cost Function of Glove model
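Because the figure is not reproduced here, the sketch below follows the weighted least squares objective from the published GloVe model, with bias terms and the weighting function f(x) = (x / x_max)^alpha capped at 1. It may differ in detail from the form shown in the original figure; the function names and the tiny test matrix are hypothetical.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: down-weights rare pairs, caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, V, U, b_v, b_u):
    """Sum over non-zero X_ij of f(X_ij) * (v_i . u_j + b_i + b_j - log X_ij)^2."""
    i, j = np.nonzero(X)                     # only pairs that co-occur
    x = X[i, j].astype(float)
    pred = np.einsum("kd,kd->k", V[i], U[j]) + b_v[i] + b_u[j]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)

# Tiny hypothetical example: 2 words that co-occur twice.
X = np.array([[0, 2],
              [2, 0]])
V = np.zeros((2, 3))
U = np.zeros((2, 3))
cost_untrained = glove_cost(X, V, U, np.zeros(2), np.zeros(2))
cost_fitted = glove_cost(X, V, U, np.full(2, np.log(2.0)), np.zeros(2))
```

The cost is zero when the prediction matches log X_ij for every co-occurring pair, which is what the second call checks using the bias terms alone.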

Conclusion: This article described how linear algebra is applied in various fields of AI. It is best to be comfortable with linear algebra before moving on to ML, DL, or NLP. I tried to cover how linear algebra is applied from an algorithmic perspective, and I hope it encourages you to get more involved with linear algebra.

Linear algebra also leads on to other subjects such as matrix calculus, which is heavily used in back propagation in DL.

Thanks for reading this article. Please drop a note if you find any mistakes; your feedback is appreciated.

References :

  1. Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig
  2. Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  3. https://en.wikipedia.org/wiki/Regression_analysis
  4. http://web.stanford.edu/class/cs224n/
  5. Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean
  6. https://nlp.stanford.edu/projects/glove/

Shafi
Analytics Vidhya

Researcher & Enthusiast in AI, Quantum Computing, Blackholes, and Astrophysics.