Understanding Vectors from a Machine Learning Perspective

Manasa Somanchi
10 min read · Feb 16, 2024


Vector:

· Mathematically, vectors encode length and direction.

· More abstractly, they represent a position, or a change in position, in some mathematical framework or space.

· Vectors are used in machine learning because they are the most convenient way to organize data.

· We use vectors as inputs because they encode information in a format our model can process, so the model can produce an output that serves our end goal.

Vectors as input and output

Examples where we represent inputs as vectors in machine learning:

1. Predict House prices

2. Create Bowie-esque lyrics

3. Sentence encoder

3-D Vectors

Vector Operations:

Addition:

We add two vectors together to create a third vector.

Vector Addition
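As a minimal sketch (using NumPy, which the article does not mention but is a common choice; the numbers are arbitrary):

```python
import numpy as np

# Two arbitrary 3-D vectors, e.g. small feature vectors.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

w = u + v  # componentwise addition -> [5.0, 7.0, 9.0]
print(w)
```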

Vector addition plays a crucial role in various aspects of machine learning. Here are some applications:

1. Feature Engineering: In machine learning, datasets are often represented as vectors where each feature is a dimension. Feature engineering involves adding, subtracting, or combining features to create new meaningful features. Vector addition can be used to combine related features into a single new feature, which can enhance the predictive power of a model.

2. Word Embeddings: In natural language processing (NLP), words are often represented as vectors in a high-dimensional space, known as word embeddings. Vector addition can be used to perform operations such as word analogy (e.g., king − man + woman ≈ queen; a toy sketch follows this list) or to represent sentences and documents by combining the embeddings of their individual words.

3. Neural Networks: Neural networks, especially in deep learning, utilize vector addition extensively. In feedforward neural networks, weights are represented as vectors, and during the forward pass, vector addition is performed to compute the activations of each neuron. In recurrent neural networks (RNNs) and transformers, vector addition is used for combining information from different time steps or attention heads.

4. Gradient Descent: In optimization algorithms like gradient descent, model parameters are updated iteratively by adding the negative of the gradient vector, scaled by the learning rate. This process minimizes the loss function and moves the model toward an optimal set of parameters.

5. Ensemble Methods: Ensemble methods combine multiple base models to create a more robust and accurate model. Techniques like bagging and boosting involve combining predictions from individual models through operations like averaging or weighted averaging, which essentially involve vector addition.

6. Data Augmentation: Data augmentation techniques artificially increase the size of the training dataset by applying transformations such as rotation, translation, or scaling to the input data. A translation, in particular, is itself a vector, and vector addition applies it to each original data point.

7. Reinforcement Learning: In reinforcement learning, vectors are often used to represent states, actions, and rewards. Vector addition can be used to update the state representation based on the current state and action taken, or to combine multiple reward signals.

In summary, vector addition is a fundamental operation in machine learning and is used in various stages of the learning process, including data representation, feature engineering, optimization, and model combination.
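To make the word-analogy example from point 2 concrete, here is a toy sketch. The 4-D embeddings below are made up purely for illustration; real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical 4-D word embeddings, invented for illustration only.
embeddings = {
    "king":  np.array([0.8, 0.9, 0.1, 0.2]),
    "man":   np.array([0.7, 0.1, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9, 0.1]),
    "queen": np.array([0.8, 0.9, 0.9, 0.2]),
}

# king - man + woman should land near queen.
query = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# Nearest neighbour (by Euclidean distance) among the remaining words.
candidates = {w: v for w, v in embeddings.items() if w not in ("king", "man", "woman")}
answer = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - query))
print(answer)  # "queen" for these toy numbers
```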

Multiplication:

Scalar multiplication changes the magnitude of a vector; multiplying by a negative scalar also reverses its direction.

Dot Product:

The dot product of two vectors is a scalar: the product of their magnitudes and the cosine of the angle between them, or equivalently the sum of the products of their corresponding components.

Scalar — Dot Product

Cross Product:

The cross product of two 3-D vectors is a vector perpendicular to both, with magnitude equal to the product of their magnitudes and the sine of the angle between them.

Vector — Cross Product
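A minimal NumPy sketch of these operations (the values are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

scaled = 2.0 * a        # scalar multiplication: [2.0, 4.0, 6.0]
dot = np.dot(a, b)      # dot product, a scalar: 1*4 + 2*5 + 3*6 = 32.0
cross = np.cross(a, b)  # cross product, a vector (3-D only): [-3.0, 6.0, -3.0]

print(scaled, dot, cross)
```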

Vector multiplication is also extensively utilized in various applications within machine learning. Here are some key areas where vector multiplication plays a crucial role:

1. Matrix Operations in Neural Networks: Neural networks involve a significant amount of matrix operations, which inherently involve vector multiplication. Operations like matrix multiplication are used to compute the activations of neurons in different layers of a neural network during the forward pass (a minimal sketch follows this list). Additionally, during backpropagation, gradients are computed through matrix operations involving vector multiplication, such as matrix-vector products.

2. Kernel Methods: Kernel methods, such as Support Vector Machines (SVMs) and kernelized versions of algorithms like principal component analysis (PCA) and ridge regression, rely on vector multiplication through the use of kernel functions. These functions implicitly map input data into high-dimensional feature spaces, where vector multiplication captures complex relationships between data points.

3. Graph-based Learning: In graph-based learning tasks such as graph neural networks (GNNs), vector multiplication is used to aggregate information from neighboring nodes in a graph. Techniques like message passing involve multiplying node features with adjacency matrices or learned weight matrices to update node representations.

4. Dimensionality Reduction: Techniques like Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) utilize vector multiplication to decompose high-dimensional data matrices into lower-dimensional representations. These methods aim to capture the most important features of the data by multiplying the original data matrix with transformation matrices.

5. Recommender Systems: In collaborative filtering-based recommender systems, matrix factorization methods like Alternating Least Squares (ALS) and Singular Value Decomposition (SVD) are used to factorize the user-item interaction matrix into low-rank matrices. Vector multiplication is employed in these factorization processes to approximate the original matrix and generate recommendations.

6. Attention Mechanisms: Attention mechanisms, commonly used in sequence-to-sequence models like transformers, rely on vector multiplication to compute attention scores between different parts of input sequences. These attention scores are then used to weight the importance of different elements in the sequences during processing.

7. Graph Embeddings: Vector multiplication is used in graph embedding techniques such as node2vec and DeepWalk, where random walks are performed on graphs to generate sequences of nodes. These sequences are then used to learn embeddings for nodes by utilizing techniques like skip-gram models, which involve vector multiplication to train embeddings.

These are just a few examples of how vector multiplication is applied in machine learning. Overall, vector multiplication is a fundamental operation that enables various complex computations and transformations essential for many machine learning algorithms and models.
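To make point 1 concrete, here is a minimal sketch of a single dense layer's forward pass: a matrix-vector product combines the weights with the input, and a bias vector is added. The shapes and random values are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=3)       # input vector with 3 features
W = rng.normal(size=(4, 3))  # weight matrix of a layer with 4 neurons
b = rng.normal(size=4)       # bias vector

z = W @ x + b                     # matrix-vector multiplication plus vector addition
activations = np.maximum(z, 0.0)  # ReLU nonlinearity
print(activations.shape)          # (4,)
```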

L1 Norm:

The L1 norm, also known as the Manhattan norm or taxicab norm, is a way of measuring the size of a vector in a space. It is defined as the sum of the absolute values of the components of the vector.

Mathematically, for a vector X with n components, the L1 norm (denoted ∣∣X∣∣₁) is calculated as ∣∣X∣∣₁ = |x₁| + |x₂| + … + |xₙ|.

L1 — Norm

Geometrically, the L1 norm represents the distance between the origin and the point defined by the vector components in an n-dimensional space, measured along the axes, as if moving along the grid of city streets. Hence it is termed the Manhattan norm, because it is akin to the distance a taxi would travel along the streets of Manhattan from one point to another.
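A quick NumPy check of this definition (the vectors are arbitrary):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1_manual = np.sum(np.abs(x))        # |3| + |-4| + |1| = 8.0
l1_numpy = np.linalg.norm(x, ord=1)  # same value via NumPy

# The Manhattan distance between two points is the L1 norm of their difference.
y = np.array([1.0, 1.0, 1.0])
manhattan = np.linalg.norm(x - y, ord=1)  # |2| + |-5| + |0| = 7.0
print(l1_manual, l1_numpy, manhattan)
```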

The L1 norm is often used in machine learning for various purposes, including:

  1. Sparse Solutions: The L1 norm tends to produce sparse solutions when used as a regularization term in optimization problems. This property is exploited in techniques like Lasso regression, where the L1 regularization term encourages sparsity by penalizing the absolute values of the coefficients.
  2. Feature Selection: When the L1 norm is used as a regularization penalty (as in Lasso), it drives the coefficients of uninformative features to exactly zero, so the features that keep nonzero coefficients are the ones identified as important.
  3. Robustness to Outliers: The L1 norm is more robust to outliers compared to the L2 norm (Euclidean norm). This robustness makes it suitable for applications where the data may contain outliers or noise.
  4. Distance Metric: The L1 norm can be used as a distance metric in clustering algorithms such as k-means. When calculating distances between data points, using the L1 norm results in a different notion of similarity compared to the more commonly used Euclidean distance.

Overall, the L1 norm is a useful mathematical tool in machine learning, offering distinct properties and applications compared to other norms like the L2 norm.

L1 — Norm

L2 Norm:

The L2 norm, also known as the Euclidean norm or the Euclidean distance, is a way of measuring the size of a vector in a space. It is defined as the square root of the sum of the squares of the components of the vector.

Mathematically, for a vector X with n components, the L2 norm (denoted ∣∣X∣∣₂) is calculated as the square root of the sum of the squared components: ∣∣X∣∣₂ = √(x₁² + x₂² + … + xₙ²).

L2 — Norm

Geometrically, the L2 norm represents the distance between the origin and the point defined by the vector components in an n-dimensional space, measured along a straight line. It is termed the Euclidean norm because it corresponds to the usual distance metric in Euclidean space.
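The same kind of sanity check for the L2 norm (again with arbitrary numbers):

```python
import numpy as np

x = np.array([3.0, 4.0])

l2_manual = np.sqrt(np.sum(x ** 2))  # sqrt(9 + 16) = 5.0
l2_numpy = np.linalg.norm(x)         # ord=2 is NumPy's default

# The Euclidean distance between two points is the L2 norm of their difference.
y = np.array([6.0, 8.0])
euclidean = np.linalg.norm(x - y)    # sqrt(9 + 16) = 5.0
print(l2_manual, l2_numpy, euclidean)
```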

The L2 norm is widely used in various applications in machine learning, including:

  1. Least Squares Optimization: In regression problems, the L2 norm is commonly used as a regularization term in the form of Ridge regression. This regularization term penalizes the sum of the squares of the coefficients, helping to prevent overfitting and promote smoother solutions.
  2. Neural Networks: The L2 norm is often used as a regularization technique in neural networks, known as weight decay. By adding a term proportional to the squared L2 norm of the weights to the loss function, the model's weights are encouraged to stay small, preventing overfitting and improving generalization.
  3. Distance Metric: The L2 norm is frequently used as a distance metric in clustering algorithms such as k-means. When calculating distances between data points, using the L2 norm results in a notion of similarity based on the straight-line distance between points in the feature space.
  4. PCA (Principal Component Analysis): In PCA, the L2 norm is used to calculate the variance of data along principal components. The principal components are chosen to maximize the variance of the data, which is calculated using the L2 norm.
  5. Error Metrics: In various machine learning tasks, such as regression or classification, the L2 norm is used as an error metric to quantify the difference between predicted and actual values. This error, often reported as the mean squared error (MSE) or the root mean squared error (RMSE), is minimized during model training (a short sketch follows this section).

Overall, the L2 norm is a fundamental concept in machine learning, widely used for regularization, distance measurement, error calculation, and variance analysis, among other applications.

L2 — Norm
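As a small sketch of point 5 above, the root mean squared error can be written directly in terms of the L2 norm of the residual vector (the numbers are made up):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

residual = y_true - y_pred
rmse = np.linalg.norm(residual) / np.sqrt(len(residual))  # sqrt of the mean squared error
mse = rmse ** 2
print(mse, rmse)  # 0.375, ~0.612
```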

Inner Product:

Geometrically, the inner product is the product of the magnitudes of the two vectors and the cosine of the angle between them.

Inner Product

One brief application of the inner product, also known as the dot product, is in computing the similarity between two vectors in machine learning and data analysis.

Given two vectors x and y, the inner product is calculated as the sum of the products of their corresponding components: x · y = x₁y₁ + x₂y₂ + … + xₙyₙ.

Inner Product

Cosine:

To measure how closely two vectors point in the same direction, we use the cosine of the angle between them. This angle is the same even if the two vectors differ in magnitude.

Cosine

The cosine similarity is a measure of similarity between two vectors in an inner product space, commonly used in machine learning, natural language processing, and information retrieval. It measures the cosine of the angle between the two vectors and ranges from -1 to 1.

For two vectors a and b, the cosine similarity is calculated as cos(θ) = (a · b) / (∣∣a∣∣ ∣∣b∣∣).
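A minimal implementation of this formula, assuming plain NumPy vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: (a . b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, larger magnitude

print(cosine_similarity(a, b))   # 1.0: scaling does not change the angle
print(cosine_similarity(a, -b))  # -1.0: opposite direction
```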

Projections:

Projection of V onto the direction of U

Projection Vector of V onto the direction of U

Component of V in the direction of U

Component Vector of V in the direction of U

This operation essentially calculates the projection of one vector onto another, measuring the similarity or alignment between them. Here’s a brief application scenario:

Document Similarity: In natural language processing (NLP), documents are often represented as vectors of word frequencies or embeddings. To measure the similarity between two documents, their vector representations can be compared using the inner product. If the documents have similar content, their vectors will be aligned in the vector space, resulting in a higher inner product value. This similarity measurement can be used in tasks such as document retrieval, clustering, or recommendation systems.

For instance, in information retrieval, given a query vector representing the user’s query and a set of document vectors representing the corpus, documents with higher inner product values with the query vector are considered more relevant to the query. This approach is often used in search engines to rank search results based on their similarity to the user’s query.

Projection is a very useful concept and is often applied in one of the most widely used dimensionality reduction techniques, Principal Component Analysis (PCA).

Projection Vector
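A small sketch of the projection itself, using the standard formula proj_u(v) = ((v · u) / (u · u)) u (the vectors are arbitrary):

```python
import numpy as np

def project(v: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Projection of v onto the direction of u."""
    return (np.dot(v, u) / np.dot(u, u)) * u

v = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])

proj = project(v, u)              # [3.0, 0.0]: the part of v that lies along u
component = np.linalg.norm(proj)  # scalar component of v in the direction of u: 3.0
print(proj, component)
```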

Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in machine learning and data analysis to identify patterns in high-dimensional data and reduce its dimensionality while preserving most of its variance. Projections play a key role in PCA.

Here’s a brief application scenario of projections in PCA:

Image Compression: Consider a dataset consisting of images represented as high-dimensional arrays of pixel values. Each image can be viewed as a point in a high-dimensional space, with each dimension corresponding to a pixel. However, high-dimensional data can be computationally expensive to process and store.

PCA can be applied to this dataset to identify the principal components that capture the most variation in the images. These principal components form a set of orthogonal vectors in the high-dimensional space. By projecting the original images onto a lower-dimensional subspace spanned by these principal components, we can represent each image using a reduced set of features.

In the context of image compression, PCA can be used to reduce the dimensionality of the image data while preserving most of the information. The projected images, represented using a smaller number of principal components, can be stored more efficiently and transmitted more quickly. Despite the reduction in dimensionality, the projected images retain much of their visual quality, making PCA a valuable tool for image compression in applications such as digital photography, video streaming, and image processing.
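A hedged sketch of this idea with scikit-learn (not mentioned in the article), using its bundled 8x8 digits dataset as a stand-in for "images as high-dimensional vectors"; keeping 16 components is an arbitrary choice:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each 8x8 digit image becomes a 64-dimensional vector.
X = load_digits().data  # shape (1797, 64)

pca = PCA(n_components=16)        # keep 16 principal components
X_reduced = pca.fit_transform(X)  # project images onto the 16-D subspace
X_restored = pca.inverse_transform(X_reduced)  # approximate reconstruction

print(X_reduced.shape)                                # (1797, 16)
print(round(pca.explained_variance_ratio_.sum(), 3))  # fraction of variance retained
```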

That’s all folks…

See you again in the next article.
