<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Manasa Somanchi on Medium]]></title>
        <description><![CDATA[Stories by Manasa Somanchi on Medium]]></description>
        <link>https://medium.com/@ManasaSomanchi?source=rss-3deeb064f436------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*iCj4-2CzzsZy-LbGr1MV-g.jpeg</url>
            <title>Stories by Manasa Somanchi on Medium</title>
            <link>https://medium.com/@ManasaSomanchi?source=rss-3deeb064f436------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 03:11:30 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ManasaSomanchi/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Understanding Vectors from a Machine Learning Perspective]]></title>
            <link>https://medium.com/@ManasaSomanchi/understanding-vectors-from-a-machine-learning-perspective-fa28f41f5245?source=rss-3deeb064f436------2</link>
            <guid isPermaLink="false">https://medium.com/p/fa28f41f5245</guid>
            <dc:creator><![CDATA[Manasa Somanchi]]></dc:creator>
            <pubDate>Fri, 16 Feb 2024 13:26:34 GMT</pubDate>
            <atom:updated>2024-02-16T13:26:34.481Z</atom:updated>
<content:encoded><![CDATA[<p><strong>Vector:</strong></p><p>· Mathematically, vectors encode length and direction.</p><p>· Theoretically, they represent a position, or even a change, in some mathematical framework or space.</p><p>· Vectors are used in machine learning because they are the most convenient way to organize data.</p><p>· We use vectors as inputs; their main value is the ability to encode information in a format that our model can process and turn into an output useful to our end goal.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/507/1*7GfZl0-L3_412EbUv1b8lA.png" /><figcaption>Vectors as input and output</figcaption></figure><p>Examples where inputs are represented as vectors in machine learning:</p><p>1. Predicting house prices</p><p>2. Generating Bowie-esque lyrics</p><p>3. Encoding sentences</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/300/1*1Q52n8lec0le37aIbK-U9g.png" /><figcaption>3-D Vectors</figcaption></figure><p><strong>Vector Operations:</strong></p><p>Addition:</p><p>We add two vectors together to create a third vector (a small code sketch follows the list below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/362/1*_5K7aRlkGpF-nPu0CrZ-oQ.png" /><figcaption>Vector Addition</figcaption></figure><p>Vector addition plays a crucial role in various aspects of machine learning. Here are some applications:</p><ol><li>Feature Engineering: In machine learning, datasets are often represented as vectors where each feature is a dimension. Feature engineering involves adding, subtracting, or combining features to create new meaningful features. Vector addition can be used to combine features or create interaction terms, which can enhance the predictive power of a model.</li><li>Word Embeddings: In natural language processing (NLP), words are often represented as vectors in a high-dimensional space known as word embeddings. Vector addition can be used to perform operations such as word analogy (e.g., king − man + woman ≈ queen) or sentiment analysis by combining word embeddings of individual words to represent sentences or documents.</li><li>Neural Networks: Neural networks, especially in deep learning, utilize vector addition extensively. In feedforward neural networks, weights are represented as vectors, and during the forward pass, vector addition is performed to compute the activations of each neuron. In recurrent neural networks (RNNs) and transformers, vector addition is used for combining information from different time steps or attention heads.</li><li>Gradient Descent: In optimization algorithms like gradient descent, vectors representing gradients are added to update model parameters iteratively. This process helps in minimizing the loss function and finding the optimal set of parameters for the model.</li><li>Ensemble Methods: Ensemble methods combine multiple base models to create a more robust and accurate model. Techniques like bagging and boosting involve combining predictions from individual models through operations like averaging or weighted averaging, which essentially involve vector addition.</li><li>Data Augmentation: Data augmentation techniques are used to artificially increase the size of the training dataset by applying transformations such as rotation, translation, or scaling to the input data. These transformations can be represented as vectors, and vector addition is used to apply these transformations to the original data.</li><li>Reinforcement Learning: In reinforcement learning, vectors are often used to represent states, actions, and rewards. Vector addition can be used to update the state representation based on the current state and action taken, or to combine multiple reward signals.</li></ol><p>In summary, vector addition is a fundamental operation in machine learning and is used in various stages of the learning process, including data representation, feature engineering, optimization, and model combination.</p>
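<p>A minimal sketch of vector addition in code (NumPy; the values are made up for illustration):</p><pre>import numpy as np

# Two toy feature vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Vector addition: element-wise sums produce a third vector
w = u + v
print(w)   # [5. 7. 9.]
</pre>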
<p>Multiplication:</p><p>Scalar multiplication changes the magnitude of a vector.</p><p><strong>Dot product</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/390/1*bgI81Pv0jZ1p66d8D7y_nw.png" /><figcaption>Scalar — Dot Product</figcaption></figure><p><strong>Cross Product</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/311/1*LSYRF8gAOoqasalyttL8Dw.png" /><figcaption>Vector — Cross Product</figcaption></figure><p>Vector multiplication is also utilized extensively within machine learning. Here are some key areas where it plays a crucial role:</p><ol><li>Matrix Operations in Neural Networks: Neural networks involve a significant amount of matrix operations, which inherently involve vector multiplication. Operations like matrix multiplication are used to compute the activations of neurons in different layers of a neural network during the forward pass. Additionally, during backpropagation, gradients are computed through matrix operations involving vector multiplication, such as matrix-vector products.</li><li>Kernel Methods: Kernel methods, such as Support Vector Machines (SVMs) and kernelized versions of algorithms like principal component analysis (PCA) and ridge regression, rely on vector multiplication through the use of kernel functions. These functions implicitly map input data into high-dimensional feature spaces, where vector multiplication captures complex relationships between data points.</li><li>Graph-based Learning: In graph-based learning tasks such as graph neural networks (GNNs), vector multiplication is used to aggregate information from neighboring nodes in a graph. Techniques like message passing involve multiplying node features with adjacency matrices or learned weight matrices to update node representations.</li><li>Dimensionality Reduction: Techniques like Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) utilize vector multiplication to decompose high-dimensional data matrices into lower-dimensional representations. These methods aim to capture the most important features of the data by multiplying the original data matrix with transformation matrices.</li><li>Recommender Systems: In collaborative filtering-based recommender systems, matrix factorization methods like Alternating Least Squares (ALS) and Singular Value Decomposition (SVD) are used to factorize the user-item interaction matrix into low-rank matrices. Vector multiplication is employed in these factorization processes to approximate the original matrix and generate recommendations.</li><li>Attention Mechanisms: Attention mechanisms, commonly used in sequence-to-sequence models like transformers, rely on vector multiplication to compute attention scores between different parts of input sequences. These attention scores are then used to weight the importance of different elements in the sequences during processing.</li>
<li>Graph Embeddings: Vector multiplication is used in graph embedding techniques such as node2vec and DeepWalk, where random walks are performed on graphs to generate sequences of nodes. These sequences are then used to learn embeddings for nodes via techniques like skip-gram models, which involve vector multiplication to train the embeddings.</li></ol><p>These are just a few examples of how vector multiplication is applied in machine learning. Overall, vector multiplication is a fundamental operation that enables many of the complex computations and transformations essential to machine learning algorithms and models.</p><p>L1 Norm:</p><p>The L1 norm, also known as the Manhattan norm or taxicab norm, is a way of measuring the size of a vector in a space. It is defined as the sum of the absolute values of the components of the vector.</p><p>Mathematically, for a vector <strong>X</strong> with <strong><em>n</em></strong> components, the L1 norm (denoted as ||<strong>X</strong>||<sub>1</sub>) is calculated as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/197/1*f92XIV1CKTZUdHQOmAFPtw.png" /><figcaption>L1 — Norm</figcaption></figure><p>Geometrically, the L1 norm represents the distance between the origin and the point defined by the vector components in an <strong><em>n</em></strong>-dimensional space, measured along the axes, as if moving along the grid of a city street. Hence it is termed the Manhattan norm: it is akin to the distance a taxi would travel along the streets of Manhattan from one point to another.</p><p>The L1 norm is often used in machine learning for various purposes, including:</p><ol><li>Sparse Solutions: The L1 norm tends to produce sparse solutions when used as a regularization term in optimization problems. This property is exploited in techniques like Lasso regression, where the L1 regularization term encourages sparsity by penalizing the absolute values of the coefficients.</li><li>Feature Selection: In feature selection tasks, the L1 norm can be used to rank features based on their importance. Features with higher L1 norm values are considered more important as they contribute more to the overall magnitude of the vector.</li><li>Robustness to Outliers: The L1 norm is more robust to outliers compared to the L2 norm (Euclidean norm). This robustness makes it suitable for applications where the data may contain outliers or noise.</li><li>Distance Metric: The L1 norm can be used as a distance metric in clustering algorithms such as k-means. When calculating distances between data points, using the L1 norm results in a different notion of similarity compared to the more commonly used Euclidean distance.</li></ol><p>Overall, the L1 norm is a useful mathematical tool in machine learning, offering distinct properties and applications compared to other norms like the L2 norm.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/252/1*g41p8F0UF0SO3vQ2AwhOcQ.png" /><figcaption>L1 — Norm</figcaption></figure>
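<p>A quick numeric sketch of the L1 norm (NumPy; the vector is a made-up example):</p><pre>import numpy as np

x = np.array([3.0, -4.0, 1.0])

# L1 norm: sum of absolute values
print(np.abs(x).sum())        # 8.0
print(np.linalg.norm(x, 1))   # 8.0, the same via NumPy's built-in norm
</pre>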
<p>L2 Norm:</p><p>The L2 norm, also known as the Euclidean norm or the Euclidean distance, is a way of measuring the size of a vector in a space. It is defined as the square root of the sum of the squares of the components of the vector.</p><p>Mathematically, for a vector <strong>X</strong> with <strong><em>n</em></strong> components, the L2 norm (denoted as ||<strong>X</strong>||<sub>2</sub>) is calculated as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/220/1*zWPTb57lGuc4388xxWQBfw.png" /><figcaption>L2 — Norm</figcaption></figure><p>Geometrically, the L2 norm represents the distance between the origin and the point defined by the vector components in an <em>n</em>-dimensional space, measured along a straight line. It is termed the Euclidean norm because it corresponds to the usual distance metric in Euclidean space.</p><p>The L2 norm is widely used in machine learning, including for:</p><ol><li>Least Squares Optimization: In regression problems, the L2 norm is commonly used as a regularization term in the form of Ridge regression. This regularization term penalizes the sum of the squares of the coefficients, helping to prevent overfitting and promote smoother solutions.</li><li>Neural Networks: The L2 norm is often used as a regularization technique in neural networks, known as weight decay. By adding a term proportional to the L2 norm of the weights to the loss function, the model’s weights are encouraged to stay small, preventing overfitting and improving generalization.</li><li>Distance Metric: The L2 norm is frequently used as a distance metric in clustering algorithms such as k-means. When calculating distances between data points, using the L2 norm results in a notion of similarity based on the straight-line distance between points in the feature space.</li><li>PCA (Principal Component Analysis): In PCA, the L2 norm is used to calculate the variance of data along principal components. The principal components are chosen to maximize the variance of the data, which is calculated using the L2 norm.</li><li>Error Metrics: In various machine learning tasks, such as regression or classification, the L2 norm is used as an error metric to quantify the difference between predicted and actual values. This error, often referred to as the mean squared error (MSE) or the root mean squared error (RMSE), is minimized during model training.</li></ol><p>Overall, the L2 norm is a fundamental concept in machine learning, widely used for regularization, distance measurement, error calculation, and variance analysis, among other applications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/218/1*RE0miZgsF0R3XHRFK_yTmQ.png" /><figcaption>L2 — Norm</figcaption></figure><p>Inner Product:</p><p>Geometrically, it is the product of the magnitudes of the two vectors and the cosine of the angle between them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/296/1*see9K66h5z3uOdyXCZSWzw.png" /><figcaption>Inner Product</figcaption></figure><p>One brief application of the inner product, also known as the dot product, is in computing the similarity between two vectors in machine learning and data analysis.</p><p>Given two vectors <strong>x</strong> and <strong>y</strong>, the inner product is calculated as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/242/1*zEuQsg39QCVSegSiD6EOnw.png" /><figcaption>Inner Product</figcaption></figure>
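<p>Another small sketch (NumPy, made-up values) tying the L2 norm and the inner product together:</p><pre>import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 2.0])

# L2 norm: square root of the sum of squares
print(np.sqrt((x ** 2).sum()))   # 5.0
print(np.linalg.norm(x))         # 5.0 (the default norm is L2)

# Inner (dot) product: 3*1 + 4*2
print(np.dot(x, y))              # 11.0
</pre>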
<p>Cosine:</p><p>To measure how close two vectors are, we use the cosine of the angle between them. The cosine of the angle between two vectors stays the same even if their magnitudes differ.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/843/1*pGaWTuX5iaIYp5WzzvUZlg.png" /><figcaption>Cosine</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/341/1*Ig-yofXOpSRklD6HC_qpug.png" /></figure><p>The cosine similarity is a measure of similarity between two vectors in an inner product space, commonly used in machine learning, natural language processing, and information retrieval. It measures the cosine of the angle between the two vectors and ranges from -1 to 1.</p><p>For two vectors <strong>a</strong> and <strong>b</strong>, the cosine similarity is calculated as:</p><p>cos(θ) = (<strong>a</strong> · <strong>b</strong>) / (||<strong>a</strong>|| ||<strong>b</strong>||)</p><p><strong>Projections:</strong></p><p><strong>Projection of V onto the direction of U</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/206/1*iFJ3trpufUd440wYG4wZ5w.png" /><figcaption>Projection Vector of V onto the direction of U</figcaption></figure><p><strong>Component of V in the direction of U</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/377/1*fjJq6ZhZUePL_tk4e201zg.png" /><figcaption>Component Vector of V in the direction of U</figcaption></figure><p>This operation essentially calculates the projection of one vector onto another, measuring the similarity or alignment between them. Here’s a brief application scenario:</p><p>Document Similarity: In natural language processing (NLP), documents are often represented as vectors of word frequencies or embeddings. To measure the similarity between two documents, their vector representations can be compared using the inner product. If the documents have similar content, their vectors will be aligned in the vector space, resulting in a higher inner product value. This similarity measurement can be used in tasks such as document retrieval, clustering, or recommendation systems.</p><p>For instance, in information retrieval, given a query vector representing the user’s query and a set of document vectors representing the corpus, documents with higher inner product values with the query vector are considered more relevant to the query. This approach is often used in search engines to rank search results based on their similarity to the user’s query.</p><p>Projection is a very useful concept, often applied in the famous dimensionality reduction technique known as PCA (Principal Component Analysis).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/339/1*70YcNBs7ytL-OxhkhUQ1Tg.png" /><figcaption>Projection Vector</figcaption></figure><p>Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in machine learning and data analysis to identify patterns in high-dimensional data and reduce its dimensionality while preserving most of its variance. Projections play a key role in PCA.</p><p>Here’s a brief application scenario of projections in PCA:</p><p>Image Compression: Consider a dataset consisting of images represented as high-dimensional arrays of pixel values. Each image can be viewed as a point in a high-dimensional space, with each dimension corresponding to a pixel. However, high-dimensional data can be computationally expensive to process and store.</p><p>PCA can be applied to this dataset to identify the principal components that capture the most variation in the images. These principal components form a set of orthogonal vectors in the high-dimensional space.</p>
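<p>To make the projection step concrete, here is a minimal sketch (scikit-learn; random data stands in for flattened images, and the sizes are made up):</p><pre>import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: 200 "images", each flattened to 1024 pixel values
X = np.random.rand(200, 1024)

# Keep the 50 principal components that capture the most variance
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)   # shape (200, 50)

# Approximate reconstruction from the 50-dimensional projection
X_restored = pca.inverse_transform(X_reduced)
print(X_reduced.shape, X_restored.shape)
</pre>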
<p>By projecting the original images onto a lower-dimensional subspace spanned by these principal components, we can represent each image using a reduced set of features.</p><p>In the context of image compression, PCA can be used to reduce the dimensionality of the image data while preserving most of the information. The projected images, represented using a smaller number of principal components, can be stored more efficiently and transmitted more quickly. Despite the reduction in dimensionality, the projected images retain much of their visual quality, making PCA a valuable tool for image compression in applications such as digital photography, video streaming, and image processing.</p><p>That’s all folks…</p><p>See you again in the next article.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fa28f41f5245" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Singular Value Decomposition]]></title>
            <link>https://medium.com/@ManasaSomanchi/singular-value-decomposition-291c7a686de7?source=rss-3deeb064f436------2</link>
            <guid isPermaLink="false">https://medium.com/p/291c7a686de7</guid>
            <category><![CDATA[svd]]></category>
            <dc:creator><![CDATA[Manasa Somanchi]]></dc:creator>
            <pubDate>Tue, 02 Jan 2024 10:53:52 GMT</pubDate>
            <atom:updated>2024-01-02T10:53:52.643Z</atom:updated>
<content:encoded><![CDATA[<p>In a nutshell…</p><p>To extract the important part of a matrix and make computations easier, we make use of Singular Value Decomposition (SVD).</p><p>· It is a matrix factorization technique.</p><p>· It is applicable to matrices of any order.</p><p>· It relies on the eigenvalue and eigenvector properties of symmetric matrices (applied to A<sup>T</sup>A and AA<sup>T</sup>).</p><p>· It overcomes the drawbacks of eigendecomposition, which requires a square matrix.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/238/1*QJw-gsinotH67ZR9nViFOQ.png" /></figure><p>Application in Data Compression:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/488/1*fNW-fLcAM7b4oiNFZyHMyg.png" /></figure><p>We see that beyond about thirty to fifty components, adding more singular values does not seem to improve the visual quality.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/572/1*hrYDHqbN7Vx9keQ8-9mvzw.png" /></figure><p>But, by applying SVD, the 500 × 800 pixel image is compressed into a 500 × 50 matrix (for U), 50 singular values, and an 800 × 50 matrix (for V).</p><p>The math behind SVD will be posted soon…</p><p>Please feel free to comment; claps will serve as motivation to work on more…</p><p>Thank you!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=291c7a686de7" width="1" height="1" alt="">
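<p>As a post-script, a minimal NumPy sketch of the rank-50 reconstruction described above (assuming <em>img</em> is a 500 × 800 grayscale array; placeholder data is used here):</p><pre>import numpy as np

# Placeholder for a 500 x 800 grayscale image
img = np.random.rand(500, 800)

U, s, Vt = np.linalg.svd(img, full_matrices=False)

k = 50  # number of singular values to keep
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Storage drops from 500*800 values to 500*50 + 50 + 50*800
print(img_k.shape)   # (500, 800), now a rank-50 approximation
</pre>]]></content:encoded>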
        </item>
        <item>
            <title><![CDATA[Principal Component Analysis]]></title>
            <link>https://medium.com/@ManasaSomanchi/principal-component-analysis-146e94285836?source=rss-3deeb064f436------2</link>
            <guid isPermaLink="false">https://medium.com/p/146e94285836</guid>
            <dc:creator><![CDATA[Manasa Somanchi]]></dc:creator>
            <pubDate>Tue, 02 Jan 2024 10:29:09 GMT</pubDate>
            <atom:updated>2024-01-02T10:29:09.211Z</atom:updated>
<content:encoded><![CDATA[<p>Understanding it in a simple sense!</p><p><strong>PCA</strong> is a dimensionality reduction technique used in ML, including feature engineering and feature extraction.</p><p>It is a statistical procedure that transforms data linearly into new properties that are not correlated with each other.</p><p>The goal of PCA is to find a set of orthonormal basis vectors for a given data matrix, such that the variance of the dataset projected onto the directions determined by the vectors is maximized.</p><p><strong>Steps for PCA</strong> (a short code sketch of these steps appears at the end of this post):</p><p>1. Compute the covariance matrix of the data.</p><p>2. Compute the eigenvalues and eigenvectors of this covariance matrix.</p><p>3. Use the eigenvalues and eigenvectors to select only the most important feature vectors, and then transform your data onto those vectors for reduced dimensionality!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/469/1*pDvkh8U1FuiDrq7760KM2g.png" /><figcaption>Information preserved by F2 &gt;&gt; information preserved by F1</figcaption></figure><p>We can drop F1 and convert from 2D to 1D.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/1*bODl5OMdMNZvQqj2B63itg.png" /><figcaption>Information preserved by F2 = information preserved by F1</figcaption></figure><p>Here, we can’t reduce the dimension.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/393/1*-pl1heyodfiQWAz1lxUuQA.png" /></figure><p>PCA performs a linear orthogonal transformation on the data to find features F1’ and F2’ such that the variance on F1’ &gt;&gt; the variance on F2’.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=146e94285836" width="1" height="1" alt="">
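<p>As promised, a minimal sketch of the three steps above (NumPy; the 2-D dataset is made up, and we reduce it from 2D to 1D):</p><pre>import numpy as np

# Made-up 2-D data with most of its variance along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.5], [0.5, 0.4]])
X = X - X.mean(axis=0)

# Step 1: covariance matrix of the data
cov = np.cov(X, rowvar=False)

# Step 2: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: keep the direction with the largest eigenvalue and project
top = eigvecs[:, np.argmax(eigvals)]
X_1d = X @ top
print(X_1d.shape)   # (100,) : the 2D data reduced to 1D
</pre>]]></content:encoded>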
        </item>
        <item>
            <title><![CDATA[ML Decision Boundary with Python Code]]></title>
            <link>https://medium.com/@ManasaSomanchi/ml-decision-boundary-with-python-code-13d1eae461d4?source=rss-3deeb064f436------2</link>
            <guid isPermaLink="false">https://medium.com/p/13d1eae461d4</guid>
            <dc:creator><![CDATA[Manasa Somanchi]]></dc:creator>
            <pubDate>Wed, 20 Dec 2023 08:08:28 GMT</pubDate>
            <atom:updated>2023-12-20T08:08:28.861Z</atom:updated>
<content:encoded><![CDATA[<p>Decision boundary and plots — Part B</p><p>Bayes with joint, nonparametric (kernel density estimated) distributions:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/746/1*UOLaoj7nKhAfCzUmdkVp8A.png" /><figcaption>Bayes Decision making with no normality assumption — (a)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/223/1*rkF24SVJzDHDUI33b1TDSA.png" /><figcaption>Bayes Decision making with no normality assumption — (b)</figcaption></figure><p>NN — Neural Networks:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/611/1*3EHBmMVhJCwUTwm7dYPhBQ.png" /><figcaption>NN as the decision boundary (a)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/327/1*el6dK8oyyAkxUFWQLc7Yng.png" /><figcaption>NN as the decision boundary (b)</figcaption></figure><p>NN — Neural Networks with 2 Hidden Layers</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/643/1*SdowiGbcfNR_ih9SoLLKqg.png" /><figcaption>NN with 2 Hidden Layers (a)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/282/1*-LfXcxEv6TkXB_QPPG_xnw.png" /><figcaption>NN with 2 Hidden Layers (b)</figcaption></figure><p>Moreover, the research scope can be expanded by exploring various activation functions. Additionally, the problem can be extended to accommodate more than two categories. While the same algorithms may not be directly applicable, clustering methods and neural networks offer potential solutions for classifying data across multiple categories.</p><p>And the research begins… see you in the next article!</p><p>Thank you!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=13d1eae461d4" width="1" height="1" alt="">
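<p>As a post-script, the general recipe behind plots like the neural-network ones above (a sketch using scikit-learn and matplotlib; the dataset and classifier settings are stand-ins, not the exact ones used for the figures):</p><pre>import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000,
                    random_state=0).fit(X, y)

# Evaluate the classifier on a dense grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)        # colored regions meet at the boundary
plt.scatter(X[:, 0], X[:, 1], c=y, s=20)
plt.show()
</pre>]]></content:encoded>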
        </item>
        <item>
            <title><![CDATA[ML Decision Boundary with Python Code]]></title>
            <link>https://medium.com/@ManasaSomanchi/ml-decision-boundary-with-python-code-bc6059d40f3e?source=rss-3deeb064f436------2</link>
            <guid isPermaLink="false">https://medium.com/p/bc6059d40f3e</guid>
            <dc:creator><![CDATA[Manasa Somanchi]]></dc:creator>
            <pubDate>Wed, 20 Dec 2023 07:42:45 GMT</pubDate>
            <atom:updated>2023-12-20T07:54:15.046Z</atom:updated>
<content:encoded><![CDATA[<p>Decision boundary and plots — Part A</p><p>In any organization, the main business problem is to draw conclusions using the power of Data Science. Most algorithms produce outputs such as probabilities (0 to 1); turning these into a categorical classification requires a ‘Decision Boundary’.</p><p>While training a classifier on a dataset using a specific classification algorithm, it is required to define a set of separating surfaces called the Decision Boundary. Decision boundaries are not confined to just the data points provided; they span the entire feature space you trained on. Here are some illustrations of different problems and how different algorithms create decision boundaries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/562/1*dKXrZYYXt8_KslvrdMyeMQ.png" /></figure><p>All the decision boundaries are explained using different kinds of problems through Python code.</p><p>Generating a random scatter plot with distinct colors is instrumental in effectively illustrating the decision boundary algorithms applied to the provided dataset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/648/1*fLyjtCA6RgvythGZhlawIA.png" /><figcaption>GitHub — <a href="https://github.com/manasavamsi/ML-Decision-Tree/blob/main/Decision%20Boundary%20-%20Machine%20Learning.ipynb">https://github.com/manasavamsi/ML-Decision-Tree/blob/main/Decision%20Boundary%20-%20Machine%20Learning.ipynb</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/990/1*LyjCnTlJs5MBZ8-SdgI2Gg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/856/1*3KzyV5F3d7EKQQGS21USmg.png" /><figcaption>Creating a problem</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/890/1*GbklLbdlFR7315TJ4VvsPw.png" /><figcaption>Problem - Scatter Plot</figcaption></figure><p>KNN — with k = 5</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/675/1*4KHiTNOFc_opfcxSKjCLkA.png" /><figcaption>KNN — Decision Boundary</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/438/1*DLe2Zj2HJkH7Z0bnldphpQ.png" /><figcaption>KNN — Decision Boundary with k = 15</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/236/1*6QsfhGK5nenXXz8cNZHxlA.png" /></figure>
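<p>For readers who want runnable code in-line, a minimal KNN version of plots like these (scikit-learn; the toy dataset is a stand-in for the problem created above):</p><pre>import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=150, centers=2, cluster_std=2.0, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Predict over a dense grid; the color change marks the decision boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20)
plt.show()
</pre>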
src="https://cdn-images-1.medium.com/max/508/1*QSExMihX44mAq5WyYxwN2w.png" /><figcaption>Logistic Regression — smaller alpha — weaker regularization (b)</figcaption></figure><p>Tree Methods — Decision Tree</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/447/1*v9kKwVG0fyTkbPnBiPslaA.png" /><figcaption>Decision Tree — Decision Boundary (a)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/298/1*4172_ELBc4k03gPRouhQYw.png" /><figcaption>Decision Tree — Decision Boundary (b)</figcaption></figure><p>Tree Methods — Random Forest</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/458/1*w4slEQgsnNROpqUD0lrE1A.png" /><figcaption>Random Forest Decision Boundary (a)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/277/1*iZZ58pUTnVW_osRNG__RUg.png" /><figcaption>Random Forest Decision Boundary — (b)</figcaption></figure><p>SVM — Support Vector Machine with 4th order polynomial</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/487/1*rjaHt8RvIJBda7sB13c74A.png" /><figcaption>SVM — 4th order</figcaption></figure><p>SVM — radial basis functions</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/446/1*cPRglC6GFNtx7RTYpeI9BQ.png" /><figcaption>SVM — with radial basic function</figcaption></figure><p>Bayesian &amp; Probabilistic Methods</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/447/1*Wn6_ocEDr7xGZ7U8Zq7zPA.png" /><figcaption>Gaussian Naive Bayes — Bayesian &amp; Probabilistic methods</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/697/1*Uzu8CGj3m7-arRxc545P7g.png" /><figcaption>Gaussian Bayes, non-naive (joint distribution) (a)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/238/1*4bCNNEg-jlJxY3MS4AznYQ.png" /><figcaption>Gaussian Bayes, non-naive (joint distribution) (b)</figcaption></figure><p>Continuation in Part B</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bc6059d40f3e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Common Math Functions]]></title>
            <link>https://medium.com/@ManasaSomanchi/common-math-functions-4d2787ef3830?source=rss-3deeb064f436------2</link>
            <guid isPermaLink="false">https://medium.com/p/4d2787ef3830</guid>
            <dc:creator><![CDATA[Manasa Somanchi]]></dc:creator>
            <pubDate>Fri, 08 Dec 2023 08:46:47 GMT</pubDate>
            <atom:updated>2023-12-08T08:46:47.476Z</atom:updated>
<content:encoded><![CDATA[<p><strong>Math for the day!</strong></p><p>This article gives a gist of the common functions used in Data Science while building a conceptual model for data, beginning with finding the relationship between the dependent variables and the independent variables.</p><p><strong>What is a Function?</strong></p><p>A function is a rule that defines a relationship between independent variables and dependent variables.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/91/1*LWlSp3wtLqY5W8dKMYChZQ.png" /></figure><p>Mathematically, a function f is a relation from a set X to a set Y (two non-empty sets) such that the domain of f is X and no two distinct ordered pairs in f have the same first element.</p><p><strong>List of some functions:</strong></p><p>1. Linear Function</p><p>2. Square Function</p><p>3. Absolute Value Function</p><p>4. Logarithmic &amp; Exponential Function</p><p>5. Tangent Function</p><p>6. Sigmoid Function</p><p>7. Softmax Function</p><p>8. ReLU Function</p><p><strong>Linear Function:</strong></p><p>Imagine you’ve planned a grand party on a yacht. The initial non-refundable deposit is $1500, plus a daily charge of $300.</p><p>Defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/175/1*mLf0N40HC4TpQznrtfR20w.png" /></figure><p>y - the dependent variable, representing the final charge for the yacht</p><p>x - the independent variable, representing the number of days the yacht is hired for</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/463/1*wG2VTgH98rzI1VdMcj4Dbw.png" /></figure><p><strong>Square Function:</strong></p><p>Imagine the entire path of a paper rocket projected upwards. This trajectory is plotted using the square function, which looks like an inverted parabola. The parabolic equation is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/435/1*9mMilMi4VjGlf9S2RD-yDg.png" /></figure><p><strong>Absolute Value of a Function:</strong></p><p>A signed number, whether positive or negative, specifies a direction. However, the absolute value of such a number is still positive, irrespective of the direction.</p><p>Defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/257/1*SbONCq7RYVIicZIS_gDPVA.png" /></figure><p>Example: Owing money to someone is the same as someone owing money to you, as the amount itself is not considered negative.</p><p><strong>Exponential Function &amp; Logarithmic Function:</strong></p><p>The function whose derivative is itself is the exponential function. The inverse of the exponential function is the logarithmic function.</p><p>The exponential function can be illustrated with the growth of the death rate due to COVID-19: it increases exponentially as the virus spreads. The logarithmic function can be illustrated with the survival rate as a function of immunity: if immunity is 0, the chances of survival collapse; otherwise they are positive and stable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/397/1*knKPlOx6YiWWxhEiTBriRw.png" /></figure><p><strong>Tangent Function:</strong></p><p>The ratio of the sine function to the cosine function gives you the tan function. An interesting application is that you can find the height of any object by knowing the distance you are standing away from the object, along with the angle at which you are looking up at the object.</p>
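<p>A tiny worked example of that application (Python; the numbers are made up):</p><pre>import math

distance = 20.0                      # metres from the base of a tower
angle = math.radians(35)             # angle of elevation to the top

height = distance * math.tan(angle)  # tan = opposite / adjacent
print(round(height, 2))              # 14.0 metres
</pre>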
<p><strong>Sigmoid Function:</strong></p><p>Also known as the logistic function. The function introduces a non-linear curve which helps in making decisions about values.</p><p>Defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/132/1*pjJOKn5K98_6fp4cdYPS6Q.png" /></figure><p>For example, based on real data on COVID patients, this function can be modelled in such a way that it predicts the chances of a new patient being affected. It is also used in classification models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/392/1*MbGOckVpxaamQlMW-KpnIA.png" /></figure><p><strong>Softmax Function:</strong></p><p>An activation function used in deep learning that scales values down into probabilities.</p><p>Defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/155/1*hh0KkiYu8QgWURt0C6gnkg.png" /></figure><p>x - the input vector of class scores.</p><p>s(x)<sub>i</sub> - the ratio of the exponential of one input value to the sum of the exponentials of all input values, which will be in the range 0 to 1.</p><p>For example, in a CNN model for image classification, this activation function produces probabilities. These probabilities, ranging from 0 to 1, are used to classify the input image into its respective class.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/501/1*OfvsntmmD5CPjknZzG2NvQ.png" /></figure><p><strong>ReLU Function:</strong></p><p>It is another activation function, which outputs either a positive value or 0. It is majorly used in neural networks. It is advantageous because it overcomes the vanishing or exploding gradient problem where the sigmoid and tanh functions fail.</p><p>It is defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/158/1*m-9lk_DGPr1BtojMdgOu-g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/417/1*rq4LWcH2xlB7uhJl22JlyA.png" /></figure><ul><li>Computationally less expensive.</li><li>Activates only a few inputs at a time, thus making the network sparse and easy to compute.</li></ul><p>(A short code sketch of the sigmoid, softmax, and ReLU functions appears at the end of this post.)</p><p>Data is huge: it is everywhere and is generated every second, or even every picosecond. Data Science is the research that encompasses the extraction of huge data, the preparation of data based on domain knowledge, drawing out non-trivial information from implicit data through math and statistical ideas using a technical approach, identifying patterns, constructing the data to build conceptual models, setting relationships between data items, and lastly visualizing the data for further inference and decision making.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*U0QiMO6JstkgYkrPItCseg.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4d2787ef3830" width="1" height="1" alt="">
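<p>As promised above, a minimal NumPy sketch of the three activation functions (the input values are made up):</p><pre>import numpy as np

def sigmoid(x):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # exponentials normalized to sum to 1 (shifted for numerical stability)
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relu(x):
    # passes positive values through, clips negatives to 0
    return np.maximum(0.0, x)

scores = np.array([2.0, -1.0, 0.5])
print(sigmoid(scores))   # approx [0.88 0.27 0.62]
print(softmax(scores))   # probabilities that sum to 1
print(relu(scores))      # [2.  0.  0.5]
</pre>]]></content:encoded>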
        </item>
        <item>
            <title><![CDATA[Math Behind TF-IDF!]]></title>
            <link>https://medium.com/@ManasaSomanchi/math-behind-tf-idf-ad657bf8e90?source=rss-3deeb064f436------2</link>
            <guid isPermaLink="false">https://medium.com/p/ad657bf8e90</guid>
            <category><![CDATA[tf-idf]]></category>
            <category><![CDATA[tf-idf-explained]]></category>
            <dc:creator><![CDATA[Manasa Somanchi]]></dc:creator>
            <pubDate>Fri, 26 Nov 2021 12:31:18 GMT</pubDate>
            <atom:updated>2023-03-28T09:36:56.192Z</atom:updated>
<content:encoded><![CDATA[<p>Today I am going to explain one of the most popular topics in NLP, “Sentiment Analysis”, a branch of “Text Analysis”.</p><p>TF-IDF is one of the methods used to analyse text; it helps us understand the impact of a word on the relevant sentiment.</p><p>We have different sentiments, in other words emotions, that a normal human brain understands naturally. But if the same emotions have to be understood by a machine, we have to bring in some techniques, and one of them is our topic, TF-IDF.</p><p>It would be easy if data always came in the form of numbers, wouldn’t it? But do you get it the way you want? The answer is “No”. The data we see in plenty can be in the form of text, images, audio, videos, etc. Our brains are well programmed to understand any type of data… but what about machines?</p><p>The only language that they understand is numbers. So how do machines understand the input if we give it in forms other than numbers?</p><p>How do you convert the text form of data into numbers? This is where we land on a discussion about TF-IDF.</p><p>In NLP (Natural Language Processing), one of the major challenges is how to represent the textual form of data in numerical form. There are patterns and similarities among strings that we need to model in numerical form in order to make predictions on new text data, so that what the machine learns stays intact beyond a single piece of text.</p><p>How do we construct the solution for such a scenario?</p><p>By converting text to numbers, or vectors, that convey the meaning well enough that the machine can respond exactly the way we require.</p><p>Now this sounds like a plan… but where would you come across these applications?</p><p>How about the analysis of sentiments? Yes! Right. I need a machine to collect all the comments about a movie and give me a proper justification of how worth watching the movie is. So now, in order to predict that, how should I use the idea of converting text to numbers?</p><p>Now, let’s assume that I take 3 comments by different viewers:</p><p>Comment 1: The movie is calm and pleasant. Best music.</p><p>Comment 2: Dozed off. Not to the level expected.</p><p>Comment 3: Good film! Music is nice.</p><p>Looking at them, we can figure out the likelihood of watching this movie. That again depends on one’s interests.</p><p>If one likes to watch a pleasant, calm-going film and is a slow-music lover, then 2 out of these 3 comments show good similarity to the respective genre, and simultaneously less similarity towards other genres.</p><p>This is exactly the idea behind how TF-IDF works.</p><p>TF-IDF stands for Term Frequency and Inverse Document Frequency.</p><p>So here, we have comments that mention “music”. Even if a comment had used the word “songs” instead, the two words carry the same meaning, so words that are similar are given similar numerical values. The same goes for words like “Best”: its value is slightly higher than that of words like “Good” or “Nice”, though they fall into the same positive category. How much more positive or less positive also matters.</p><p>TF-IDF is the product of Term Frequency and Inverse Document Frequency, where TF, also known as the count vectorizer, gives the number of times each word appears in the given data.</p>
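<p>A minimal sketch of this computation (scikit-learn; the exact numbers depend on the weighting scheme, so treat these settings as illustrative):</p><pre>from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

comments = [
    "The movie is calm and pleasant. Best music.",
    "Dozed off. Not to the level expected.",
    "Good film! Music is nice.",
]

# TF: raw term counts per comment
tf = CountVectorizer().fit_transform(comments)

# TF-IDF: counts reweighted by inverse document frequency
vec = TfidfVectorizer()
tfidf = vec.fit_transform(comments)
print(vec.get_feature_names_out())   # the vocabulary
print(tfidf.toarray().round(2))      # one weighted row per comment
</pre>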
<p>Now let’s count the number of times each term appears in the relevant comment:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/718/1*cbSerzFsgZEs7tbkrruPpA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/818/1*ADZR3WTquDE6cnGXO-gOXg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/843/1*Py8RNz58TiH8j8h1ACnXpA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/851/1*vvjS407ajBPorxH3sVisoA.png" /></figure><p>The idea is that words repeated less often are given more weight in the calculation of TF-IDF, and vice versa. This leads to another topic, TF-IDF weighting schemes, which I shall talk about in detail in my next article.</p><p>Now that we have calculated the TF-IDF scores, what’s next? How does it help with showing the likelihood of the movie based on one’s interest?</p><p>Here comes the query, your question, that is: “Is the movie calm, pleasant, nice, etc.?”. Based on your query, the TF-IDF is calculated as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/282/1*AUPrG-QwYujo2bHuxOQ7fQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/832/1*HDG0EdaadRatM4xZiy8rCQ.png" /></figure><p>What we observe here is that, based on our query, the values of the words relevant to our search have increased and the rest became zero.</p><p>This is how we filter the comments based on our query, showing how many comments are alike to our search. From this we can conclude how many comments were similar to our search.</p><p>This article is for all math lovers who want to learn how the sentiment of words can be analysed using numbers. I hope this article gives a detailed and in-depth look at the math behind TF-IDF and how it works. In the next article, I will be sharing how different weighting schemes change the weights of the words and how that affects text analysis. I appreciate your valuable feedback. Thank you!!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ad657bf8e90" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Probability Distribution and its types]]></title>
            <link>https://medium.com/@ManasaSomanchi/probability-distribution-and-its-types-45d566f77b92?source=rss-3deeb064f436------2</link>
            <guid isPermaLink="false">https://medium.com/p/45d566f77b92</guid>
            <category><![CDATA[probability-distributions]]></category>
            <dc:creator><![CDATA[Manasa Somanchi]]></dc:creator>
            <pubDate>Fri, 26 Nov 2021 12:06:11 GMT</pubDate>
            <atom:updated>2021-11-26T12:06:11.768Z</atom:updated>
<content:encoded><![CDATA[<p>This is my first article, and I wanted it to be light on detail and easy to understand, with pictures.</p><p>A probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment.</p><p><strong>Common Data Types:</strong></p><p><strong>Discrete Data</strong> — takes only specified values. Ex: rolling a die, the possible outcomes are 1, 2, 3, 4, 5, or 6, not 1.5 or 2.45.</p><p><strong>Continuous Data</strong> — takes values within a given range, which may be finite or infinite. Ex: the weight of a person can be 54 kg, 54.5 kg, or 54.536 kg, and so on.</p><p><strong>Types of Probability Distribution:</strong></p><p>Frequency Distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/330/1*Ud_dhx8tIp-hFj0XXCDT2A.png" /></figure><p>Bernoulli Distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/318/1*cPS3v71uC0Z16gUOZ3rS8w.png" /></figure><p>Uniform Distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/315/1*qaH2a6792bWayhv1FSAAkQ.png" /></figure><p>Binomial Distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/340/1*_hbX6DOYjyDuGpFUvNLsMQ.png" /></figure><p>Normal Distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/303/1*mUKGBefhuu7_2m3wjg3nrw.png" /></figure><p>Poisson Distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/331/1*z8LasGfkcvK6gTL7NPdHXQ.png" /></figure><p>Exponential Distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/339/1*NVJIznY8vKzt9iu4EbYmZQ.png" /></figure><p>These are the different types of probability distribution in picture form. In the next article, I shall explain each kind of probability distribution in detail with suitable examples. I hope you like this crisp and handy information. I appreciate your valuable feedback. Thank you!!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=45d566f77b92" width="1" height="1" alt="">
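<p>As a small post-script, a sketch drawing samples from several of the distributions pictured above (NumPy; the parameters are made up):</p><pre>import numpy as np

rng = np.random.default_rng(42)

print(rng.binomial(n=1, p=0.3, size=5))     # Bernoulli (binomial with n=1)
print(rng.uniform(0, 1, size=5))            # Uniform
print(rng.binomial(n=10, p=0.5, size=5))    # Binomial
print(rng.normal(loc=0, scale=1, size=5))   # Normal
print(rng.poisson(lam=3, size=5))           # Poisson
print(rng.exponential(scale=2, size=5))     # Exponential
</pre>]]></content:encoded>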
        </item>
    </channel>
</rss>