Vectors, Matrices and Vectorization

Terms heard most often in the data science and ML world.

Hareesha Dandamudi
5 min read · Apr 18, 2022

Vectors and matrices are concepts from linear algebra. In this mini write-up I give a very brief overview of vectors and matrices, plus a short note on vectorization. Vectors and matrices are by far the most fundamental way of representing data in ML and data science. I will do my best to keep it simple and understandable.

A vector is a point in n-dimensional space

A vector has both a direction and a magnitude. The magnitude of a vector is simply its length, which can be calculated with the distance formula (Pythagoras' theorem): the square root of the sum of the squared components.
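As a minimal sketch of the magnitude calculation described above, using plain Python lists (the helper name magnitude is chosen here for illustration and is not from the original article):

```python
import math

def magnitude(v):
    """Length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

# A 2-D vector with components 3 and 4 forms a 3-4-5 right triangle.
v = [3, 4]
print(magnitude(v))  # 5.0
```

The same function works for any number of dimensions, since it only sums over whatever components the list contains.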

In terms of data, every row (an instance or observation) can be thought of as a vector. Vectors are a convenient way to represent numerical data.

The dimensionality of a vector is determined by the space it belongs to, ℝⁿ (the space of n real numbers): a vector in ℝⁿ has n components.

For example, let's take the Iris dataset.

The columns in this dataset are Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm and Species. We can treat each instance (row or observation) of this dataset as having 6 dimensions, one per column. These are also called the features of the dataset (here n = 6).

Multi-dimensional vectors can be plotted in 2-D or 3-D using dimensionality reduction. When plotted, vectors may look like this:

Figure: Geometric vectors

Vector arithmetic:

There will be many situations where we need to add or subtract two or more vectors. By convention, a vector is written as a lowercase bold letter.

𝑤 = [1, 5.1, 3.5, 1.4, 0.2] (a row vector)

E.g., if v and w are Python lists rather than vectors, adding them directly only concatenates the lists, so zip is used to pair up the elements instead.

Adding or subtracting two or more vectors results in a new vector whose elements are added or subtracted component-wise (index by index).

Figure: Adding vectors
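A minimal sketch of component-wise addition and subtraction with plain lists and zip. The function names add and subtract and the sample values are assumptions made for illustration, not the article's original code:

```python
v = [1, 2, 3]
w = [4, 5, 6]

# The + operator on lists concatenates; it does not add element-wise.
print(v + w)  # [1, 2, 3, 4, 5, 6]

def add(v, w):
    """Component-wise sum of two equal-length vectors."""
    assert len(v) == len(w), "vectors must have the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def subtract(v, w):
    """Component-wise difference of two equal-length vectors."""
    assert len(v) == len(w), "vectors must have the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

print(add(v, w))       # [5, 7, 9]
print(subtract(v, w))  # [-3, -3, -3]
```

The length check is there because zip silently stops at the shorter input, which would otherwise hide a mismatch in dimensions.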

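The dot product of two vectors is the sum of their element-wise products, and it results in a scalar (a real number).

Given two vectors v and w, where w has length 1, the dot product v.w is the length of the vector that results when v is projected onto w. For a w of any other length, that projected length is v.w divided by the length of w.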

Figure: Dot product v.w
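The dot product and its projection reading can be sketched as follows. The names dot and scalar_projection are illustrative choices, and the vectors are made-up values:

```python
import math

def dot(v, w):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def scalar_projection(v, w):
    """Signed length of the projection of v onto w: (v . w) / |w|."""
    return dot(v, w) / math.sqrt(dot(w, w))

v = [3, 4]
unit_x = [1, 0]  # unit vector along the first axis

# With a unit vector, the dot product is the projected length itself.
print(dot(v, unit_x))  # 3

# With a longer vector in the same direction, the dot product scales up,
# but the projected length stays the same.
print(dot(v, [2, 0]))                # 6
print(scalar_projection(v, [2, 0]))  # 3.0
```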

It is not a good idea to use Python lists for vectors, as they are not performant enough for huge datasets. I have only used lists to explain the main idea. Python's NumPy library supports multi-dimensional arrays and offers a wide range of arithmetic operations on them that run much faster than equivalent Python loops.
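For comparison, here is a sketch of the same vector operations using NumPy arrays. It assumes NumPy is installed, and the values are made up for illustration:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

print(v + w)              # [5. 7. 9.]
print(v - w)              # [-3. -3. -3.]
print(np.dot(v, w))       # 32.0
print(np.linalg.norm(v))  # magnitude of v, i.e. sqrt(1 + 4 + 9)
```

With arrays, + and - are element-wise, so no zip or explicit loop is needed.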

A matrix is a two-dimensional collection of numbers.

A matrix is primarily used to store numeric tabular data in a compact way. Matrices are composed of rows (horizontal) and columns (vertical) and can be viewed as a list of lists. They are written as uppercase letters.

A matrix with 1 row and n columns, or n rows and 1 column, is simply a vector.

A matrix with 2 rows and 3 columns:

A = [[1, 2, 3],
     [1, 0, 2]]
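Viewing the matrix A as a list of lists, its shape, rows and columns can be read off as sketched below. The helper name shape is an illustrative choice:

```python
A = [[1, 2, 3],
     [1, 0, 2]]

def shape(M):
    """Return (number of rows, number of columns) of a list-of-lists matrix."""
    num_rows = len(M)
    num_cols = len(M[0]) if M else 0
    return num_rows, num_cols

print(shape(A))               # (2, 3)
print(A[1])                   # second row: [1, 0, 2]
print([row[2] for row in A])  # third column: [3, 2]
```

Each row of A is itself a vector of length 3, and each column is a vector of length 2.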

Matrix arithmetic includes addition, subtraction, scalar multiplication, the dot product and more. Here is a useful link (actually pretty thorough) that explains all of the operations and properties of matrices in detail.
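A few of these operations, sketched with NumPy. The matrix B is a made-up example with the same shape as the 2 x 3 matrix A used earlier:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [1, 0, 2]])
B = np.array([[0, 1, 0],
              [2, 2, 2]])

total = A + B       # element-wise addition (shapes must match)
diff = A - B        # element-wise subtraction
scaled = 2 * A      # scalar multiplication
product = A @ B.T   # matrix product: (2 x 3) times (3 x 2) gives (2 x 2)

print(total)
print(diff)
print(scaled)
print(product)
```

B is transposed before the product because a matrix product needs the column count of the left matrix to equal the row count of the right one.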

Matrices have a broad range of applications, such as representing datasets, images, text, the weights of neural networks and more. Viewing a dataset as a matrix builds intuition and makes the relationships between entities easier to understand.

Vectorization is the process of executing a single instruction on a set of data points (a vector) at a time.

Vectorization makes code execute faster. Many loops that process elements one after another can be replaced with vector instructions. Vector operations are faster because they exploit data parallelism within a single core. Fundamentally, vector operations are built into modern CPUs with a SIMD (Single Instruction, Multiple Data) architecture. Primitive linear algebra operations on vectors and matrices are usually good candidates for vectorization.

Example: if a CPU has a 512-bit register, it can hold 16 32-bit single-precision values (or 8 64-bit double-precision values) at a time. If we write a plain loop, each 32-bit value is processed sequentially, one at a time, wasting a lot of register space. All of this happens at the hardware level. To use this hardware power, the code needs to be written in a vectorized form so that the underlying implementation can exploit the capability wherever it applies.
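The register arithmetic above can be checked with a few lines of Python. This is an idealized count that ignores memory bandwidth and other real-world effects, and the function names are illustrative:

```python
import math

def lanes(register_bits, element_bits):
    """How many elements of a given width fit in one register."""
    return register_bits // element_bits

def vector_steps(n_elements, register_bits, element_bits):
    """Idealized number of vector instructions needed to cover n elements."""
    return math.ceil(n_elements / lanes(register_bits, element_bits))

print(lanes(512, 32))               # 16 single-precision values
print(lanes(512, 64))               # 8 double-precision values
print(vector_steps(1000, 512, 32))  # 63 steps instead of 1000 scalar ones
```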

import random
import numpy as np

a = random.sample(range(1, 10), 5)  # e.g. [1, 7, 8, 6, 5]
b = random.sample(range(1, 10), 5)  # e.g. [8, 3, 2, 5, 6]

# non-vectorized: add element by element in a Python loop
z_non_vect = [i + j for i, j in zip(a, b)]  # e.g. [9, 10, 10, 11, 11]

# vectorized: NumPy adds the two arrays as a whole
c = np.array(a)  # e.g. [1 7 8 6 5]
d = np.array(b)  # e.g. [8 3 2 5 6]
z_vect = c + d   # e.g. [ 9 10 10 11 11]

In the code fragment above, z_non_vect is computed in a non-vectorized way and does not use the full power of the underlying hardware. In the second part of the fragment, the addition is handed to NumPy, which can operate on multiple components of c and d at once. The difference in execution time only becomes evident on considerably large inputs.
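One way to observe the difference is to time both versions on a larger input, as in the sketch below. The input size and repeat count are arbitrary choices, and the measured times depend on the machine, so no specific numbers are claimed here:

```python
import timeit
import numpy as np

n = 200_000
a = list(range(n))
b = list(range(n))
c = np.array(a)
d = np.array(b)

# Time 10 repetitions of each approach.
loop_time = timeit.timeit(lambda: [i + j for i, j in zip(a, b)], number=10)
vect_time = timeit.timeit(lambda: c + d, number=10)

print(f"list comprehension: {loop_time:.4f} s")
print(f"NumPy addition:     {vect_time:.4f} s")

# Both approaches produce the same values.
same = np.array_equal(np.array([i + j for i, j in zip(a, b)]), c + d)
print(same)  # True
```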

Note: vectorization is not possible in every scenario. It is best suited to problems that require the same operation to be performed on each element of a vector or matrix. Dependencies between elements (such as read-after-write), indirect memory access and non-straight-line code (branching) can all limit how well code vectorizes.
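To illustrate the read-after-write case, here is a sketch contrasting an element-wise operation with a recurrence in which each output depends on the previous one. The recurrence and its 0.5 coefficient are made up for this example:

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0, 4.0])

# Independent elements: each output depends only on the input at the
# same index, so a single array expression covers all of them.
squared = b * b

# Dependent elements: x[i] needs x[i - 1], which was written in the
# previous iteration (read after write), so this is written as a loop.
x = np.empty_like(b)
x[0] = b[0]
for i in range(1, len(b)):
    x[i] = 0.5 * x[i - 1] + b[i]

print(squared)  # [ 1.  4.  9. 16.]
print(x)
```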

NumPy uses vectorization (in addition to indexing) to make array operations faster. NumPy is not alone in this: other languages such as Octave and R offer the same support.

That’s about it. Happy learning!

References:

Book: Data Science from Scratch: First Principles with Python, 1st Edition

https://en.wikipedia.org/wiki/Single_instruction,_multiple_data

https://www.intel.com/content/www/us/en/developer/articles/technical/vectorization-a-key-tool-to-improve-performance-on-modern-cpus.html
