Stories by Dhivya on Medium

Hierarchical Clustering Algorithm In Machine Learning

Dhivya — Mon, 14 Nov 2022 23:29:38 GMT

What is Hierarchical Clustering?

Hierarchical clustering is a popular method for grouping objects. It creates groups so that objects within a group are similar to each other and different from objects in other groups. Clusters are visually represented in a hierarchical tree called a dendrogram.

Hierarchical clustering has a couple of key benefits:

1. There is no need to pre-specify the number of clusters. Instead, the dendrogram can be cut at the appropriate level to obtain the desired number of clusters.

2. Data is easily summarized/organized into a hierarchy using dendrograms. Dendrograms make it easy to examine and interpret clusters.

There are mainly two types of hierarchical clustering:

1. Agglomerative hierarchical clustering

2. Divisive Hierarchical clustering

Let’s understand each type in detail.

Agglomerative hierarchical clustering:

Initially, each object is considered to be its own cluster. According to a particular procedure, the clusters are then merged step by step until a single cluster remains. At the end of the cluster merging process, a cluster containing all the elements will be formed.

Agglomerative Clustering

Step 1: Each data point is assigned to its own cluster

Step 2: Merge the two closest clusters, based on a metric for the similarity between clusters (the Euclidean distance function is commonly used for this operation)

Step 3: Update the distance matrix

Step 4: Repeat Step 2 and Step 3 until only a single cluster remains

Divisive Hierarchical clustering:

The divisive method is the opposite of the agglomerative method. Initially, all objects are considered to be in a single cluster. Then the division process is performed step by step until each object forms its own cluster. The splitting procedure follows some principle, such as splitting at the maximum distance between neighboring objects in the cluster.

Divisive Hierarchical clustering

Between agglomerative and divisive clustering, agglomerative clustering is generally the preferred method. The example below focuses on agglomerative clustering because it is the more popular of the two and the easier one to implement.

Hierarchical clustering employs a measure of distance/similarity to create new clusters.

Computing a proximity matrix

The first step of the algorithm is to create a distance matrix. The values of the matrix are calculated by applying a distance function between each pair of objects. The Euclidean distance function is commonly used for this operation. The structure of the proximity matrix will be as follows for a data set with n elements.

Let’s make the 5 x 5 proximity matrix for our example:

We will use the Euclidean distance formula to calculate the rest of the distances. So, let’s say we want to calculate the distance between point P1 and P2

Euclidean Distance Formula:

$$ d(p, q) = \sqrt{\sum_{i=1}^{m} (p_i - q_i)^2} $$

where p and q are two points with m features each.

For Example,

D = √(10–7)² = √9 = 3

Similarly, we can calculate all the distances and fill the proximity matrix.

The diagonal elements of this matrix will always be 0 as the distance of a point with itself is always 0.
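The proximity matrix can be sketched in a few lines of NumPy. The marks below are hypothetical values, chosen only so that they agree with the distances quoted in this example (3 between P1 and P2); they are not taken from the original figures.

```python
import numpy as np

# Hypothetical marks for P1..P5 (assumed for illustration only).
marks = np.array([10, 7, 28, 20, 35], dtype=float)

# For one-dimensional data the Euclidean distance reduces to |a - b|.
# Broadcasting builds the full 5 x 5 matrix of pairwise distances.
proximity = np.abs(marks[:, None] - marks[None, :])

print(proximity)
```

The matrix is symmetric and its diagonal is zero, so in practice only the upper (or lower) triangle needs to be inspected when searching for the closest pair.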

Next, we will look at the smallest distance in the proximity matrix and merge the points with the smallest distance.

The smallest distance is 3, and hence we will merge P1 and P2.

Let’s look at the updated clusters and accordingly update the proximity matrix

P1,P2=C1 (Cluster1)

We then update the proximity matrix

Here, we have taken the maximum of the two marks (7, 10) to represent this cluster (C1 = 10). Instead of the maximum, we could also take the minimum or the average. Now, we will again calculate the proximity matrix for these clusters.

We will repeat these steps until only a single cluster is left.

Next, the minimum distance in the proximity matrix is 7, so we merge the closest pair of clusters (P3, P5 = C2). We will get the merged clusters as shown below after repeating these steps.

Here,

(C1, P4) = C3

Finally,

(C2,C3) = C4(One Cluster)

We started with 5 clusters and finally have a single cluster. This is how agglomerative hierarchical clustering works.
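The walkthrough above can be reproduced with a small loop. This is only a sketch of the simplified rule used in this example, where a merged cluster is represented by the maximum of its members' marks; it is not a standard linkage method. The marks and the helper name `agglomerate` are assumptions made for illustration.

```python
def agglomerate(points, combine=max):
    """Merge the two closest clusters until one remains.

    points  -- dict mapping a point name to a single numeric value
    combine -- how a merged cluster is represented (max, min, ...)
    Returns a list of (members, merge_distance) in merge order.
    """
    clusters = [((name,), value) for name, value in points.items()]
    merges = []
    while len(clusters) > 1:
        best = None
        # Scan every pair of clusters for the smallest distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(clusters[i][1] - clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        members = clusters[i][0] + clusters[j][0]
        representative = combine(clusters[i][1], clusters[j][1])
        merges.append((members, d))
        # Replace the two merged clusters with the new one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, representative))
    return merges


# Hypothetical marks, assumed to match the distances quoted in the text.
marks = {"P1": 10, "P2": 7, "P3": 28, "P4": 20, "P5": 35}
merges = agglomerate(marks)
for members, distance in merges:
    print(members, distance)
```

Passing `combine=min` instead of `max` switches the cluster representative to the minimum, which is one of the alternatives mentioned in the text.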

Dendrogram:

A Dendrogram is a diagram that represents the hierarchical relationship between objects. The Dendrogram is used to display the distance between each pair of sequentially merged objects.

These are commonly used in studying hierarchical clusters before deciding the number of clusters significant to the dataset.

The distance at which the two clusters combine is referred to as the dendrogram distance.

The primary use of a dendrogram is to work out the best way to allocate objects to clusters.

For Example,

Here, we have merged sample P1 and P2. The vertical line represents the distance between these samples.

Similarly, we plot all the steps where we merged the clusters and finally, we will get a dendrogram.

We can clearly visualize the steps of hierarchical clustering. The longer the vertical lines in the dendrogram, the greater the distance between those clusters.

Now, we can set a threshold distance and draw a horizontal line (Generally, we try to set the threshold in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and draw a horizontal line

The number of clusters will be the number of vertical lines intersected by the line drawn at the threshold. In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will contain the samples (P1, P2, P4) and the other will contain the samples (P3, P5).
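As a rough sketch, the same two clusters can be obtained with SciPy. The marks are again hypothetical, and complete linkage is an assumption: the walkthrough above uses a simplified representative rule, so the merge heights SciPy reports can differ from the ones in the example. For that reason the tree is cut by asking for 2 clusters rather than by a fixed distance threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical marks for P1..P5, one feature per point.
marks = np.array([10, 7, 28, 20, 35], dtype=float).reshape(-1, 1)

# Build the merge tree (complete linkage is an assumed choice).
Z = linkage(marks, method="complete")

# Cut the tree so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")

for name, label in zip(["P1", "P2", "P3", "P4", "P5"], labels):
    print(name, "-> cluster", label)
```

Each row of `Z` records one merge, and its third column holds the distance at which the merge happened, which is the height drawn in a dendrogram.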

Distance Between Two Clusters

The different types of linkage describe different approaches to measuring the distance between two sub-clusters of data points. Some of the popular linkage methods are:

· Single Linkage

· Complete Linkage

· Average Linkage

· Centroid Linkage

· Ward’s Linkage

Single Linkage (Min Method): In single linkage, we define the distance between two clusters as the minimum distance between any single data point in the first cluster and any single data point in the second cluster.

Formula:

$$ d(A, B) = \min_{a \in A,\, b \in B} d(a, b) $$

where A and B are the two clusters and d(a, b) is the distance between points a and b.

Pros of Single-linkage:

Single linkage is fast and can perform well on non-globular data.

Cons of Single-linkage:

This approach cannot separate clusters properly if there is noise between clusters.

Complete Linkage (Max Method): In complete linkage, we define the distance between two clusters to be the maximum distance between any single data point in the first cluster and any single data point in the second cluster.

Formula:

$$ d(A, B) = \max_{a \in A,\, b \in B} d(a, b) $$

where A and B are the two clusters and d(a, b) is the distance between points a and b.

Pros of Complete-linkage:

This approach gives well-separated clusters even if there is some noise present between clusters.

Cons of Complete-Linkage:

This approach is biased towards globular clusters.

It tends to break large clusters.

Average Linkage: Defines distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster.

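Formula:

$$ d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b) $$

where A and B are the two clusters, |A| and |B| are their sizes, and d(a, b) is the distance between points a and b.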

Pros of Average Linkage

The average Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Average Linkage

The average Linkage method is biased towards globular clusters.

Centroid Linkage: The centroid method defines the distance between clusters as the distance between their centers/centroids. After calculating the centroid for each cluster, the distance between those centroids is computed using a distance function.

Formula:

$$ d(A, B) = \lVert \mu_A - \mu_B \rVert $$

where \mu_A and \mu_B are the centroids (means) of clusters A and B.

Pros of Centroid Linkage:

The Centroid Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Centroid Linkage:

Similar to Complete Linkage and Average Linkage methods, the Centroid Linkage method is also biased towards globular clusters.

Ward’s Method:

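The Ward approach analyzes the variance of the clusters rather than measuring distances directly: at each step it merges the pair of clusters that leads to the smallest increase in within-cluster variance.

Ward’s method attempts to minimize the sum of the squared distances of the points from their cluster centers. Compared to the distance-based measures described above, the Ward method is less susceptible to noise and outliers. Therefore, Ward’s method is often preferred over the others in clustering.

Formula:

$$ \Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|} \lVert \mu_A - \mu_B \rVert^2 $$

where \mu_A and \mu_B are the centroids of clusters A and B, and \Delta(A, B) is the increase in the within-cluster sum of squares caused by merging them.

Pros of Ward’s Linkage

1. In many cases, Ward’s linkage is preferred as it usually produces better cluster hierarchies.

2. Ward’s method is less susceptible to noise and outliers.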

Cons of Ward’s Linkage

Ward’s linkage method is biased towards globular clusters.
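To make the five linkage definitions concrete, here is a minimal NumPy sketch that evaluates each of them on two tiny clusters. The function names and the sample points are illustrative assumptions, not part of any library.

```python
import numpy as np


def pairwise(a, b):
    """Euclidean distance between every point in a and every point in b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def single_linkage(a, b):
    return pairwise(a, b).min()      # closest pair of points


def complete_linkage(a, b):
    return pairwise(a, b).max()      # farthest pair of points


def average_linkage(a, b):
    return pairwise(a, b).mean()     # mean over all cross-cluster pairs


def centroid_linkage(a, b):
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))


def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    n_a, n_b = len(a), len(b)
    return n_a * n_b / (n_a + n_b) * centroid_linkage(a, b) ** 2


# Two small example clusters (assumed data).
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

for fn in (single_linkage, complete_linkage, average_linkage,
           centroid_linkage, ward_increase):
    print(fn.__name__, fn(A, B))
```

On the same pair of clusters the methods return different values, which is why the choice of linkage changes the order in which clusters are merged.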

Applications

There are many real-life applications of Hierarchical clustering. They include:

Bioinformatics: grouping animals according to their biological features to reconstruct phylogeny trees

Business: dividing customers into segments or forming a hierarchy of employees based on salary.

Image processing: grouping handwritten characters in text recognition based on the similarity of the character shapes.

Information Retrieval: categorizing search results based on the query.

Strengths of Hierarchical Clustering

· It is easy to understand and implement.

· We don’t have to pre-specify any particular number of clusters. We can obtain any desired number of clusters by cutting the dendrogram at the proper level.

· They may correspond to meaningful classification.

· Easy to decide the number of clusters by merely looking at the Dendrogram.

Limitations of Hierarchical Clustering

· Hierarchical Clustering does not work well on vast amounts of data.

· Does not work very well with missing data

· The algorithm can never undo what was done previously.

· Time complexity of at least O(n² log n) is required, where ‘n’ is the number of data points.

· Based on the type of distance measure chosen for merging, different algorithms can suffer from one or more of the following:

i) Sensitivity to noise and outliers

ii) Breaking large clusters

iii) Difficulty handling different sized clusters and convex shapes

Handling Numerical Values, Underfitting, Overfitting and Hyperparameter tuning in Decision Trees:

Dhivya — Tue, 18 Oct 2022 00:51:14 GMT


Steps to split a decision tree using information gain: for each split, individually calculate the entropy of each child node. Calculate the entropy of each split as the weighted average entropy of its child nodes. Select the split with the lowest entropy, that is, the highest information gain.

Decision tree splits for numerical variables with millions of records: for a numerical feature, the tree has to evaluate candidate thresholds between the sorted values, so the cost of finding the best split keeps increasing as the number of records increases. A decision tree with numerical variables therefore takes a lot of time to train.
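A minimal sketch of how a numerical feature can be split: sort the values, try the midpoint between each pair of neighbouring values as a threshold, and keep the one with the highest information gain. The helper names and the tiny dataset are assumptions for illustration; real implementations are considerably more optimized.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (base 2) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def best_threshold(values, labels):
    """Return (threshold, information_gain) of the best binary split."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    parent = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # no threshold fits between equal values
        threshold = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        # Weighted average entropy of the two child nodes.
        child = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(labels)
        gain = parent - child
        if gain > best_gain:
            best_gain, best_t = gain, threshold
    return best_t, best_gain


# Hypothetical data: age versus a yes/no (1/0) target.
ages = np.array([22, 25, 30, 35, 40, 45])
bought = np.array([0, 0, 0, 1, 1, 1])
threshold, gain = best_threshold(ages, bought)
print(threshold, gain)
```

Every record contributes a candidate threshold, and this naive version recomputes the child entropies from scratch for each one, which illustrates why training slows down as the number of records grows.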

Overfitting and Underfitting in Decision Tree :

In decision trees, if we continue to grow the tree fully until each leaf node reaches the lowest impurity, the data have typically been overfitted. If splitting is stopped too early, the error on the training data is not sufficiently low and performance suffers due to bias. Thus, preventing both overfitting and underfitting is pivotal when modeling a decision tree.
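The effect can be observed by comparing training and test accuracy at different depths. This is a sketch on synthetic data with deliberately noisy labels (an assumption made so that a fully grown tree has something to memorize); the exact scores depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y randomly flips 20% of the labels (noise).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for depth in (1, 4, None):  # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train),
                     tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.2f} test={test_acc:.2f}")
```

A depth-1 tree leaves training error high (underfitting), while the fully grown tree fits the training set perfectly, noise included; a gap between its training and test scores is the usual sign of overfitting.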

Hyperparameter Tuning using GridSearchCV both Classification and Regression:

Let’s use GridSearchCV for Hyperparameter tuning in decision tree,

Hyperparameter tuning using GridSearchCV- Classification

Hyperparameter tuning using GridSearchCV- Regression
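The original code for these two steps is not reproduced here, so the following is only a sketch of what such a search could look like with scikit-learn. The datasets (iris and diabetes, both bundled with scikit-learn) and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: search over the split criterion and tree size.
X_cls, y_cls = load_iris(return_X_y=True)
clf_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 2, 3, 5],
    "min_samples_leaf": [1, 2, 5],
}
clf_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          clf_grid, cv=5)
clf_search.fit(X_cls, y_cls)
print(clf_search.best_params_, clf_search.best_score_)

# Regression: same idea, scored by (negated) mean squared error.
X_reg, y_reg = load_diabetes(return_X_y=True)
reg_grid = {
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [5, 10, 20],
}
reg_search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                          reg_grid, cv=5,
                          scoring="neg_mean_squared_error")
reg_search.fit(X_reg, y_reg)
print(reg_search.best_params_, reg_search.best_score_)
```

GridSearchCV fits one model per parameter combination and cross-validation fold, so the grid is best kept small at first and widened around the values that score well.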

In Classification,

criterion = "gini" or "entropy"

Gini impurity is a function that measures how well a split separates the classes. Basically, it helps us determine which split is best so that we can build a pure decision tree. For a two-class problem, Gini impurity ranges from 0 to 0.5 and entropy ranges from 0 to 1.

In Regression,

criterion = "mse" or "mae" (renamed "squared_error" and "absolute_error" in recent scikit-learn versions)

MSE (Mean Squared Error) is the average squared difference between the actual target values and the value predicted for them, which in a regression tree is the mean of the node. The tree chooses the split that results in the smallest MSE.

max_depth=None

The first parameter to tune is max_depth. This indicates how deep the tree can be. The deeper the tree, the more splits it has and the more information it captures about the data.

max_features=None

The number of features to consider when looking for the best split. If this value is not set, the decision tree will consider all features available to make the best split.

max_leaf_nodes=None

This hyperparameter limits the number of leaf nodes and hence restricts the growth of the tree.

min_impurity_decrease=0.0

The node impurity is a measure of the homogeneity of the labels at the node. A node is split only if the split decreases the impurity by at least this value. Impurity is measured with Gini impurity or entropy for classification and with variance for regression.

min_impurity_split=None

The min_impurity_split parameter can be used to control the tree based on impurity values. It sets a threshold on Gini. For instance, if min_impurity_split is set to 0.3, a node needs to have a Gini value of more than 0.3 to be split further. This parameter was deprecated in favor of min_impurity_decrease and is no longer available in recent scikit-learn versions.

min_samples_leaf =1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

min_samples_split=2

min_samples_split specifies the minimum number of samples required to split an internal node, while min_samples_leaf specifies the minimum number of samples required to be at a leaf node. For instance, if min_samples_split = 5 , and there are 7 samples at an internal node, then the split is allowed.

min_weight_fraction_leaf =0.0

It is the minimum weighted fraction of the input samples required to be at a leaf node, where the weights are determined by sample_weight. This is a way to deal with class imbalance.

DECISION TREES IN MACHINE LEARNING:

Dhivya — Fri, 14 Oct 2022 06:24:49 GMT

Decision trees are used in supervised machine learning. The approach can be used to solve both regression and classification problems, although classification trees are the main use of decision trees in machine learning. The main difference is in the type of problem and data. Classification trees are used for decisions such as yes or no, with a categorical decision variable. Regression trees are used for a continuous outcome variable such as a number. Classification And Regression Tree (CART) is the general term for this.

Important Terminology related to Decision Trees :

1. Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.

2. Splitting: It is a process of dividing a node into two or more sub-nodes.

3. Decision Node: When a sub-node splits into further sub-nodes, then it is called the decision node.

4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.

5. Pruning: When we remove sub-nodes of a decision node, this process is called pruning. It is the opposite of splitting.

6. Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.

7. Parent and Child Node: A node, which is divided into sub-nodes is called a parent node of sub-nodes whereas sub-nodes are the child of a parent node.

ID3 Algorithm in Decision Trees:

ID3 (Iterative Dichotomiser 3) is an algorithm that uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best feature at the present moment to create a node.

ID3 uses Entropy and Information Gain to construct a decision tree for classification.

The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction.

· Start from the root node with all data.

· For each node, calculate the information gain of all possible features.

· Choose the feature with the highest information gain. Split the data of the node according to the feature

· Do the above recursively for each leaf node, until there is no information gain for the leaf node or there is no feature left to select.

Decision Tree — Classification :

Entropy :

ID3 uses Entropy and Information Gain as attribute selection measures to construct a Decision Tree.

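A Decision Tree is built top-down from a root node and involves the partitioning of data into homogeneous subsets. To check the homogeneity of a sample, ID3 uses entropy, which measures the purity of the split.

Formula:

$$ H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i $$

where p_i is the proportion of samples in S that belong to class i, and c is the number of classes.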

Example :

Let’s refer to the sample below to predict whether the player will play golf or not.

We calculate the entropy of the target variable.

Now, we will find the entropy of each column with respect to the target variable to find the most homogeneous split.

Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an example of an action that provides information that is random.

From the above graph, it is quite evident that the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because this represents perfect randomness in the data and there is no chance of perfectly determining the outcome.

Entropy tells us whether a split is pure or not. But a decision tree has many features to choose from, and we need to know which split is the most effective. Information gain helps us find the split that gives the highest value.
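The shape of that curve is easy to check numerically. The sketch below evaluates the entropy of a two-outcome variable at a few probabilities; the function name `binary_entropy` is just an illustrative choice.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a two-outcome variable with P(outcome) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no randomness
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
```

A fair coin (p = 0.5) gives the maximum of 1 bit, and the value falls symmetrically towards 0 as the outcome becomes more predictable.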

Information Gain:

Information gain or IG is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.

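$$ Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) $$

S: dataset of the parent node

A: feature used for the split

S_v: subset of S after splitting on value v of feature A

|S|: total number of samples in S

H(S): entropy of S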

Information gain is a decrease in entropy. It computes the difference between entropy before split and average entropy after split of the dataset based on given attribute values.

Example:

We will find the information gain using the formula.

Information gain

Notes:

1. Calculate the entropy of the root node.
2. Calculate the entropy for each branch, then add them up, weighted by branch size, to get the total entropy of the split.
3. The feature with the largest information gain is chosen as the root node/decision node.
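These three steps can be sketched as follows. The class counts are an assumption: they are the counts commonly used with the play-golf toy dataset (9 "yes" and 5 "no" overall, split by Outlook), since the original table is not reproduced here.

```python
import math


def entropy(counts):
    """Entropy (in bits) from a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts)
    return entropy(parent_counts) - weighted


# Assumed counts as [yes, no]: whole dataset, then one list per Outlook value.
parent = [9, 5]
outlook = [[2, 3], [4, 0], [3, 2]]

gain = information_gain(parent, outlook)
print(round(entropy(parent), 3), round(gain, 3))
```

Repeating the calculation for every feature and picking the largest gain gives the feature to split on at the current node.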

Gini Impurity :

Gini Impurity is a measurement used to build Decision Trees to determine how the features of a dataset should split nodes to form the tree. More precisely, the Gini Impurity of a dataset is a number between 0 and 0.5 for a two-class problem, which indicates the likelihood of new, random data being misclassified if it were given a random class label according to the class distribution in the dataset.

It only performs binary splits either yes or no, success or failure, and so on. So it will only split a node into two sub-nodes. These are the properties of Gini impurity.

Formula :

Consider a dataset D that contains samples from k classes. The probability of a sample belonging to class i at a given node is denoted p_i. Then the Gini Impurity of D is defined as:

$$ Gini(D) = 1 - \sum_{i=1}^{k} p_i^2 $$

Example:

Yes = 3

No = 1

Total no. of samples = 3 + 1 = 4

Gini index = 1 − [(3/4)² + (1/4)²]

= 1 − [(0.75)² + (0.25)²]

= 1 − 0.625

= 0.375

Gini Impurity
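The same calculation as a short function, shown here as a sketch that takes the class counts at a node (the function name is an illustrative choice):

```python
def gini_impurity(counts):
    """Gini impurity from a list of class counts at a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


print(gini_impurity([3, 1]))   # the Yes = 3, No = 1 example
print(gini_impurity([2, 2]))   # evenly mixed node
print(gini_impurity([4, 0]))   # pure node
```

An evenly mixed two-class node gives the maximum value, and a pure node gives zero, which is why splits that lower the Gini impurity are preferred.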

Gini Vs Entropy :

For a two-class problem, entropy ranges from 0 to 1 and Gini Impurity ranges from 0 to 0.5. The internal working of both methods is very similar, and both are used for choosing the feature/split at every new split. But if we compare the two, Gini Impurity is cheaper to compute than entropy because it does not require logarithms.

Decision Tree — Regression :

A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing values for the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction.

Standard Deviation :

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). We use standard deviation to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous its standard deviation is zero.

Standard deviation for one attribute:

Standard Deviation (S) is for tree building (branching).

Coefficient of Variation (CV), also referred to here as the coefficient of deviation, is used to decide when to stop branching. We can use Count (n) as well.

Average (Avg) is the value in the leaf nodes.

Standard deviation for two attributes (target and predictor) :

Standard Deviation Reduction :

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).

Standard deviation (Hours Played) = 9.32

The dataset is then split on the different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.

The attribute with the largest standard deviation reduction is chosen for the decision node.

The data set is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until all data is processed.

In practice, we need some termination criteria. For example, we stop when the coefficient of deviation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch.

The “Overcast” subset does not need any further splitting because its CV (8%) is less than the threshold (10%). The related leaf node gets the average of the “Overcast” subset.

However, the “Sunny” branch has a CV (28%) greater than the threshold (10%), so it needs further splitting. We select “Windy” as the best node after “Outlook” because it has the largest SDR.

Because the number of data points for both branches (FALSE and TRUE) is less than or equal to 3, we stop further branching and assign the average of each branch to the related leaf node.

Moreover, the “Rainy” branch has a CV (22%) which is more than the threshold (10%). This branch needs further splitting. We select “Temp” as the best node because it has the largest SDR.

Because the number of data points for all three branches (Cool, Hot and Mild) is less than or equal to 3, we stop further branching and assign the average of each branch to the related leaf node.

When the number of instances is more than one at a leaf node we calculate the average as the final value for the target.
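The standard deviation reduction for the “Outlook” split can be sketched as below. The hours are assumed values, chosen to be consistent with the figures quoted in the text (an overall standard deviation of 9.32 and CVs of about 8%, 28% and 22%); the original data table is not reproduced here.

```python
from statistics import mean, pstdev

# Assumed "Hours Played" values grouped by Outlook (illustrative only).
hours_by_outlook = {
    "Rainy": [25, 30, 35, 38, 48],
    "Overcast": [46, 43, 52, 44],
    "Sunny": [45, 52, 23, 46, 30],
}

all_hours = [h for hours in hours_by_outlook.values() for h in hours]

# Standard deviation of the target before the split (population form).
total_sd = pstdev(all_hours)

# Size-weighted standard deviation of the branches after the split.
weighted_sd = sum(len(hours) / len(all_hours) * pstdev(hours)
                  for hours in hours_by_outlook.values())

sdr = total_sd - weighted_sd

# Coefficient of variation per branch, in percent, used as stopping rule.
cv = {name: pstdev(hours) / mean(hours) * 100
      for name, hours in hours_by_outlook.items()}

print(round(total_sd, 2), round(sdr, 2))
print({name: round(value) for name, value in cv.items()})
```

Computing the same reduction for every candidate attribute and choosing the largest one selects the decision node; branches whose CV falls below the threshold become leaves holding the branch average.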

Advantages and Disadvantages of Decision Tree

Advantages:

It can be used for both classification and regression problems: decision trees can predict both continuous and discrete values, i.e. they work well in both regression and classification tasks.
Decision trees are simple, so understanding the algorithm requires little effort.
It can capture nonlinear relationships: they can be used to classify data that is not linearly separable.
The decision tree algorithm does not require any transformation of the features when dealing with non-linear data, because decision trees do not take multiple weighted combinations into account simultaneously.
They are fast at prediction time compared to KNN and some other classification algorithms.
Easy to understand, interpret, and visualize.
A decision tree can handle any type of data, whether numerical, categorical, or boolean.
Normalization is not required in a decision tree.
The decision tree is one of the machine learning algorithms where we don’t need to worry about feature scaling. Another one is the random forest. These algorithms are scale-invariant.
It gives us a good idea of the relative importance of attributes.
Useful in data exploration: a decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. Decision trees can also help us create new variables/features that better predict the target variable.
Less data preparation needed: outliers and missing data in a node have little effect on a decision tree, which is why it requires less data preparation.
Decision tree is non-parametric: a non-parametric method is one that makes no assumptions about the spatial distribution of the data or the classifier structure.

Disadvantages:

Decision tree splits for numerical variables with millions of records: the time needed to find the best split keeps increasing as the number of records increases, so a decision tree with numerical variables takes a lot of time to train.
The same happens in techniques like random forests and XGBoost.
Decision trees with many features: training takes more time, since the time complexity increases as the input grows.
Growing the tree fully on the training set leads to overfitting; remedies include pruning (pre and post) and ensemble methods such as random forests.
Overfitting: this is one of the most difficult problems for decision tree models. It can be addressed by setting constraints on the model parameters and by pruning.
As you know, a decision tree tends to overfit the data. An overfitted tree achieves near-zero bias on the training data but has very high variance in its output, which leads to many errors in the final estimation and can make the output highly inaccurate.
Instability in decision trees: small variations in the data might result in a completely different tree being generated. This is known as variance in the decision tree, which can be decreased by methods like bagging and boosting.
It may not scale to big data: if the data is too big, a single tree may grow a lot of nodes, which adds complexity and can lead to overfitting.
There is no guarantee of finding the optimal decision tree.

Support Vector Machine In Machine Learning Algorithms

Dhivya — Thu, 13 Oct 2022 05:42:42 GMT

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. It is most widely used in classification problems. SVM works well on smaller datasets and on complex datasets as well. It is extremely popular because of its ability to handle multiple continuous and categorical variables.

Types of SVM:

1. Linearly separable

2. Non-linearly separable

How Does SVM Work?

Linearly Separable:

Linearly separable means that we can separate the data with a hyperplane, which in two dimensions is a straight line. An SVM used this way is a linear SVM. It is typically used for linear regression and classification problems.

Linearly separable Dataset

In classification, SVM plots the data in n-dimensional space (where n is the number of features in the dataset) and creates a decision boundary that distinguishes between two or more classes. This decision boundary is also called a hyperplane.

The main objective of the SVM is to select the hyperplane with the maximum possible margin between support vectors in the given data set.

SVM generates hyperplanes that segregate the classes in the best possible way. There are many hyperplanes that might classify the data. We should look for the best hyperplane, the one that represents the largest separation, or margin, between the two classes.

The parallel positive and negative hyperplanes are identified based on the support vectors. They are simply the hyperplanes passing through the closest positive data points and the closest negative data points.

So, we choose the hyperplane so that distance from it to the support vectors on each side is maximized. If such a hyperplane exists, it is known as the maximum margin hyperplane or Hard margin. The wider the margin, the better it is for the classification task.

As margin increases, the “generalization accuracy” (Accuracy of the model on future unseen data points) increases.

Points:

Margin: The distance between the positive and negative hyperplanes is called the margin.

Support Vector: Data points that are closest to the hyperplane are called support vectors.

Hyperplane/Decision Boundary: As we can see in the above diagram, it is a decision plane or space that separates sets of objects belonging to different classes.

Positive hyperplane: The hyperplane that touches the points of the positive class is called the positive hyperplane.

Negative hyperplane: The hyperplane that touches the points of the negative class is called the negative hyperplane.

How to Choose Decision Boundary?

In the above diagram, we can create multiple hyperplanes in a similar way. But when we create a hyperplane, we should focus on the margin. In this picture, the margin distance of Z1 is smaller than that of Z2. The main aim is to maximize the margin distance to get a better result, so we should select the hyperplane with the maximum margin. We may therefore use the Z2 margin for this dataset. It gives a more generalized model.

Hard Margin and Soft Margin:

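Hard Margin Formula:

$$ \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \text{for all } i $$

We have now found our optimization function, but there is a catch: we rarely find this type of perfectly linearly separable data in the industry. This formulation is called Hard Margin SVM.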

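Soft Margin Formula:

$$ \min_{w, b, \zeta} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \zeta_i, \;\; \zeta_i \ge 0 $$

To make the soft margin equation, we add the slack variables zeta to the hard margin equation and multiply their sum by a hyperparameter ‘c’.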

Parameter C is a regularization parameter used to set the tolerance of the model to the misclassification of data points in order to achieve lower generalization error. The higher the value of C, the lower the tolerance, and what is trained approaches a maximum-margin (hard-margin) classifier. The smaller the value of C, the larger the tolerance of misclassification, and what gets trained is a soft-margin classifier that often generalizes better than the maximum-margin classifier.

The C value controls the penalty of misclassification. A large value of C would result in a higher penalty for misclassification and a smaller value of C will result in a smaller penalty of misclassification. With a larger value of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. The model may overfit with the training dataset. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy.

For all the correctly classified points, zeta will be equal to 0. For all the incorrectly classified points, zeta is simply the distance of that particular point from its correct hyperplane. That means that for the wrongly classified green points, the value of zeta is the distance of these points from the positive hyperplane, and for the wrongly classified red point, zeta is the distance of that point from the negative hyperplane.

Hard Margin vs Soft Margin

For this dataset, which SVM model is better: the one where the margin is maximal but 2 points are misclassified, or the one where the margin is very small and all the points are correctly classified?

SVM Error = Margin Error + Classification Error

If you don’t want any misclassification in the model, then you can choose figure 2. That means we increase ‘c’ to decrease the Classification Error. If instead you want the margin to be maximized, then the value of ‘c’ should be decreased. That’s why ‘c’ is a hyperparameter, and we find its optimal value using GridSearchCV and cross-validation.

Cross-validation is used while training the model. Before training the model, we divide the data into two parts: train data and test data. In cross-validation, the train data is divided further into two parts: the train data and the validation data. Grid search combined with cross-validation helps to find the best hyperparameters.
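The search over ‘c’ can be sketched with scikit-learn as follows. The synthetic dataset, the grid of candidate values and the 5-fold setting are assumptions made for illustration, not values from the original experiment:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# First split: train data and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate values of C to try (hypothetical grid).
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation splits the train data again into
# train and validation parts for every candidate value.
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X_train, y_train)

best_c = search.best_params_["C"]
test_accuracy = search.score(X_test, y_test)
print(best_c, search.best_score_, test_accuracy)
```

The test data is only touched once, at the end, so the reported test accuracy is not influenced by the choice of ‘c’.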

Non-Linearly Separable Data (Kernel Trick):

In non-linear datasets, we cannot separate the data with a straight line. In machine learning, a trick known as the “kernel trick” is used to learn a linear classifier that classifies a non-linear dataset. It transforms linearly inseparable data into linearly separable data by projecting it into a higher dimension.

A kernel function is applied on each data instance to map the original non-linear data points into some higher dimensional space in which they become linearly separable.

Kernels transform non-linear spaces into linear spaces. The SVM kernel converts the data from a low dimension (for example 2-D) to a higher dimension (for example 3-D) and then classifies the data with the help of a hyperplane.

Kernel Functions:

1. Radial Basis Function(RBF) Kernel

2. Polynomial Kernel

3. Sigmoid Kernel

RBF Kernel:

When the dataset is not linearly separable, it is recommended to use a kernel function such as RBF. For a linearly separable dataset (linear dataset) one could use the linear kernel function (kernel="linear"). A good understanding of when to use each kernel function helps to train a better model with the SVM algorithm.

Formula:
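The original figure is not reproduced here. As a sketch, the RBF kernel is commonly written in the following form, where x and x' are two data points and gamma is the kernel parameter discussed in the next section (this parameterization, which is the one scikit-learn uses, is assumed):

```latex
K(x, x') = \exp\!\left(-\gamma\,\lVert x - x' \rVert^2\right)
```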

Kernel Parameter — Gamma Values :

The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. Very low values of gamma result in models with lower accuracy, and so do very high values of gamma. It is the intermediate values of gamma that give a model with good decision boundaries.
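To make the idea of ‘far’ and ‘close’ concrete, here is a small sketch that evaluates the RBF similarity exp(-gamma * squared distance) between two fixed points for several values of gamma. The points and gamma values are hypothetical:

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF similarity between two points: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two hypothetical points whose squared distance is 2.
x, z = [0.0, 0.0], [1.0, 1.0]

gammas = [0.01, 0.5, 10.0]
values = [rbf_kernel(x, z, g) for g in gammas]

for g, v in zip(gammas, values):
    print(f"gamma={g}: similarity={v:.6f}")
```

With a low gamma the two points still look almost identical to the model, so every training example influences a wide region. With a high gamma the similarity drops to nearly zero, so each example only influences its immediate neighbourhood.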

The plots below represent decision boundaries for different values of gamma with the value of C set as 0.1 for illustration purposes.

Note that as the Gamma value increases, the decision boundaries classify the points correctly. However, after a certain point (Gamma = 1.0 and onwards in the diagram below), the model accuracy decreases. It can thus be understood that the selection of appropriate values of Gamma is important. Here is the code which is used.

from sklearn.svm import SVC

# RBF-kernel SVM with a small gamma and C = 0.1
svm = SVC(kernel='rbf', random_state=1, gamma=0.008, C=0.1)

svm.fit(X_train, y_train)

· When gamma is very small (0.008 or 0.01), the model is too constrained and cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set. The resulting model will behave similarly to a linear model with a set of hyperplanes that separate the centers of a high density of any pair of two classes. Compare with the diagram in the next section where the decision boundaries for a model trained with a linear kernel is shown.

· For intermediate values of gamma (0.05, 0.1, 0.5), it can be seen on the second plot that good models can be found.

· For larger values of gamma (3.0, 7.0, 11.0) in the above plot, the radius of the area of influence of the support vectors only includes the support vector itself and no amount of regularization with C will be able to prevent overfitting.

Polynomial Kernel:

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.

Figure: the effect of the degree of a polynomial kernel. Higher degree polynomial kernels allow a more flexible decision boundary.

Suppose we have two features X1 and X2 and an output variable Y. Using a polynomial kernel of degree 2, we basically need to find the extra terms X1², X2² and X1·X2. Together with X1 and X2, we can see that 2 dimensions got converted into 5 dimensions.
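The following sketch checks numerically that a degree-2 polynomial kernel gives the same result as a dot product in the expanded feature space. It assumes the common kernel form (x . z + 1)², and the sqrt(2) scaling factors in the feature map are what make the two sides match exactly. The sample points are hypothetical:

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel (x . z + 1)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1] + 1) ** 2


def feature_map(x):
    """Explicit features: a constant plus X1, X2, X1^2, X2^2, X1*X2."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2]


# Two hypothetical data points.
x, z = (1.0, 2.0), (3.0, -1.0)

kernel_value = poly_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))

print(kernel_value, explicit_value)
```

The kernel never builds the expanded features, yet it returns the same number as the dot product of the explicit features. That is the kernel trick in miniature.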

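The advantage of using this kernelized version is that you can specify a large degree, thus increasing the chance that the data will become linearly separable in the high-dimensional space. Because the kernel never builds the high-dimensional features explicitly, this does not slow the model down the way computing those features would.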

Sigmoid Kernel :

It is interesting to note that an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. There are two adjustable parameters in this kernel: the slope alpha and the intercept constant c (not to be confused with the regularization parameter C).

The Sigmoid Kernel comes from the Neural Networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons.
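A minimal sketch of the sigmoid kernel, assuming the common form tanh(alpha * (x . z) + c). The input vectors and parameter values are hypothetical:

```python
import math


def sigmoid_kernel(x, z, alpha=1.0, c=0.0):
    """Sigmoid (tanh) kernel with slope alpha and intercept c."""
    dot = sum(a * b for a, b in zip(x, z))
    return math.tanh(alpha * dot + c)


# Hypothetical inputs: the dot product is 0.5 - 2.0 = -1.5,
# so the kernel evaluates tanh(0.5 * -1.5 + 1.0) = tanh(0.25).
value = sigmoid_kernel([1.0, 2.0], [0.5, -1.0], alpha=0.5, c=1.0)
print(value)
```

Because tanh is bounded, the kernel value always lies between -1 and 1, which mirrors the bipolar sigmoid activation of a neuron.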

How to choose the right Kernel?

It is necessary to choose a good kernel function because the performance of the model depends on it. The choice depends on the kind of dataset you are working with. If it is linearly separable, you should opt for the linear kernel function, since it is very easy to use and its complexity is much lower than that of other kernel functions. I’d recommend starting with the hypothesis that your data is linearly separable and choosing a linear kernel function.

You can then work your way up towards the more complex kernel functions. Usually, we use SVM with RBF and linear kernel function because other kernels like polynomial kernel are rarely used due to poor efficiency.

Applications of SVM in Real World :

As we have seen, SVMs are supervised learning algorithms. The aim of using an SVM is to correctly classify unseen data. SVMs have a number of applications in several fields.

Some common applications of SVM are-

· Face detection — SVMs classify parts of the image as face and non-face and create a square boundary around the face.

· Text and hypertext categorization — SVMs allow text and hypertext categorization for both inductive and transductive models. They use training data to classify documents into different categories. Categorization is based on the score generated, which is then compared with a threshold value.

· Classification of images — Use of SVMs provides better search accuracy for image classification. It provides better accuracy in comparison to the traditional query-based searching techniques.

· Bioinformatics — It includes protein classification and cancer classification. We use SVM for identifying the classification of genes, patients on the basis of genes and other biological problems.

· Protein fold and remote homology detection — Apply SVM algorithms for protein remote homology detection.

· Handwriting recognition — SVMs are widely used to recognize handwritten characters.

· Generalized predictive control(GPC) — Use SVM based GPC to control chaotic dynamics with useful parameters.

Pros :

· Effective on datasets with multiple features, like financial or medical data.

· Effective in cases where number of features is greater than the number of data points.

· Uses a subset of training points in the decision function called support vectors which makes it memory efficient.

· Different kernel functions can be specified for the decision function. You can use common kernels, but it’s also possible to specify custom kernels.

Cons :

· Choosing an appropriate kernel function is difficult: Choosing an appropriate kernel function (to handle non-linear data) is not an easy task. It can be tricky and complex. With a high-dimensional kernel, you might generate too many support vectors, which reduces the training speed drastically.

· Extensive memory requirement: The algorithmic complexity and memory requirements of SVM are very high. You need a lot of memory since you have to store all the support vectors in memory, and this number grows quickly with the training dataset size.

Linear Regression in Machine Learning

Dhivya — Wed, 12 Oct 2022 14:49:13 GMT

Linear Regression is an ML algorithm used for supervised learning. It predicts a dependent variable (target y) based on the given independent variable(s). In other words, this regression technique finds a linear relationship between a dependent variable and the independent variables.

Linear regression fits a straight line that minimizes the mismatch between the predicted and actual output values.

Types of Linear Regression:

The major types of linear regression are,

1. Simple Linear Regression

2. Multiple Linear Regression

3. Polynomial Linear Regression

Simple Linear Regression:

In Simple Linear Regression, we try to find the relationship between a single independent variable (input) and a corresponding dependent variable (output). This can be expressed in the form of a straight line.

Formula:

y = b0 + b1x

b0 — intercept,

b1 — coefficient or slope,

x — independent variable(Input datapoints)

y - dependent variable or target variable

Example:

A simple dataset with 1 independent variable (x = Years of Exp) and 1 dependent variable (y = Salary)

Note:

1. We use the Simple Linear Regression algorithm to find b0 and b1. Using these values we can find the best fit line for linear data.

2. The main aim of a linear regression model is to find the best fit line, that is, the optimal values of the intercept and the coefficient.

3. b1 is the weight of the input variable x.

4. b0 is the offset.

Formula:

y = b0 + b1x

Salary = b0 + b1(years of Experience)

If b0 is 0,

Salary = 0 + b1 (Years of Experience)

Salary = b1 (Years of Experience)

If Years of Experience =0

Salary = 0

As per this result, a fresher with no experience would get a salary of zero. That is not realistic, which is why the intercept b0 is needed.

We will use the Ordinary Least Squares method to find the intercept (b0) and slope (b1) of the best fit line.

Ordinary Least Squares (OLS):

Ordinary least squares (OLS) is a linear regression technique used to find the best-fitting line for a set of data points. It is used to estimate the unknown parameters in a model. The method relies on minimizing the sum of squared residuals between the actual and predicted values. A residual is the difference between the actual value and the predicted value, and the Residual Sum of Squares (RSS) is the sum of the squares of these residuals.

The error is the difference between the actual value and the predicted value, and the goal is to reduce this difference. The main objective of the OLS method is to minimize this residual or error (cost function).

Formula :
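The original figure is not reproduced here. As a sketch, the cost function for n data points (x_i, y_i) can be written as below, where n and the index i are notation assumed for this sketch:

```latex
\mathrm{RSS}(b_0, b_1)
= \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
= \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2
```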

We take the partial derivatives of the above residual or error (cost function) with respect to the coefficients b0 and b1, set the partial derivatives equal to zero, and solve for each of the coefficients.

Formula:
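A sketch of that derivation, writing the cost as the sum of squared residuals over n points and using x̄ and ȳ for the means of x and y (these symbols are assumed here):

```latex
\frac{\partial\,\mathrm{RSS}}{\partial b_0}
  = -2 \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right) = 0
\qquad
\frac{\partial\,\mathrm{RSS}}{\partial b_1}
  = -2 \sum_{i=1}^{n} x_i \left(y_i - b_0 - b_1 x_i\right) = 0

b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
           {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

The first equation gives b0 in terms of b1, and substituting it into the second yields the expression for b1.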

For your reference I have added a clear picture of the coefficients that we derived from partial derivation of cost function(residual)

ŷi = b0 + b1x,

These b0 and b1 values minimize the cost function of a simple linear regression model, where there is only one independent variable xi and one dependent variable yi.
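A minimal sketch of the OLS calculation in plain Python, using the standard closed-form expressions for the slope and intercept. The years-of-experience and salary figures are made up for illustration:

```python
def ols_fit(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    b1 = numerator / denominator
    # Intercept: the line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

b0, b1 = ols_fit(years, salary)
predicted = b0 + b1 * 6  # predicted salary for 6 years of experience
print(b0, b1, predicted)
```

For this made-up data the intercept is the salary of a fresher and the slope is the increase per additional year of experience.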

Multiple Linear Regression:

Multiple linear regression works by changing parameter values to reduce the cost, which is the degree of error between the model’s predictions and the values in the training dataset. With simple linear regression, we had two parameters that needed to be tuned: b_0 (the y-intercept) and b_1 (the slope of the line).

In Multiple Linear Regression, we try to find the relationship between two or more independent variables (inputs) and the corresponding dependent variable (output). The independent variables can be continuous or categorical. It is an extension of simple linear (OLS) regression, which uses just one explanatory variable.

Multiple Linear Regression

In this 3-Dimensional representation, the two horizontal axes represent the independent variables while the vertical axis represents the dependent variable.

So, the regressor tries to create an equation of a plane that best represents the training data it is given.

This means that the regressor will have to try out several different equations to see which plane best fits the data.

Formula:

y = b0 + b1x1 + b2x2 + … + bnxn

Example:

4 Dimensional Multiple linear Regression

y = b0 + b1x1 +b2x2 +b3x3

Salary = b0 + b1 (Years of Exp) +b2(Technology) +b3(Gender)

Multi linear Regression Equation for Prediction,

Ŷ = b0 + b1x1 +b2x2 +b3x3

Math Equation for Cost function:
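The original figure is not available here. Assuming it showed the usual matrix form, with b the vector of coefficients and n the number of training rows (both symbols are assumptions), the cost function and the closed-form solution that minimizes it can be sketched as:

```latex
J(b) = \frac{1}{n}\,\lVert Y - Xb \rVert^2,
\qquad
\hat{b} = (X^\top X)^{-1} X^\top Y
```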

here,

X = X_train

Y = Y_Train

The problem with Multiple Linear Regression is,

Computing the matrix transpose, product and inverse is expensive, and the calculation gets slower on large datasets. To solve this problem we use the Gradient Descent optimization technique.
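A minimal sketch of gradient descent for multiple linear regression in plain Python. It minimizes the mean squared error, and the learning rate, iteration count and tiny noise-free dataset are assumptions chosen so that the example converges quickly:

```python
def predict(b, row):
    """Prediction b0 + b1*x1 + ... + bk*xk for one row of features."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], row))


def gradient_descent(X, y, lr=0.1, n_iter=2000):
    """Fit the coefficients by repeatedly stepping against the gradient."""
    n, k = len(X), len(X[0])
    b = [0.0] * (k + 1)  # start with all coefficients at zero
    for _ in range(n_iter):
        errors = [predict(b, row) - yi for row, yi in zip(X, y)]
        # Gradient of the mean squared error with respect to b0.
        grad = [2.0 / n * sum(errors)]
        # Gradient with respect to each feature coefficient.
        for j in range(k):
            grad.append(2.0 / n * sum(e * row[j] for e, row in zip(errors, X)))
        b = [bj - lr * gj for bj, gj in zip(b, grad)]
    return b


# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

b = gradient_descent(X, y)
print([round(v, 4) for v in b])
```

No matrix inverse is computed at any point. Each iteration only needs the prediction errors, which is why the approach scales better to large datasets.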

Polynomial Linear Regression:

Polynomial Regression is a form of Linear regression known as a special case of Multiple linear regression which estimates the relationship as an nth degree polynomial.

Polynomial Linear Regression

Polynomial regression is a form of linear regression in which, because the relationship between the dependent and independent variables is non-linear, we add polynomial terms to the linear regression equation.

Suppose we have X as independent data and Y as dependent data. Before feeding the data to a model, in the preprocessing stage we convert the input variables into polynomial terms of some degree.

Formula:

y = b0 + b1x + b2x² + … + bnxⁿ

Example:

X- Input value =5

Degree of polynomial = 2.

b0 =2

Then,

xˆ0 = 5ˆ0 =1

xˆ1 = 5ˆ1 = 5

xˆ2 = 5ˆ2 =25

These polynomial input features help to model the non-linear relationship. The degree to use is a hyperparameter, and we need to choose it wisely. A high polynomial degree tends to overfit the data, while a small degree tends to underfit, so we need to find the optimum value of the degree.

When there are multiple features, Polynomial Regression is capable of finding relationships between features. This is made possible by the fact that Polynomial Features also adds all combinations of features up to the given degree.

For example, if there were two features a and b, Polynomial Features with degree=3 would not only add the features a², a³, b², and b³, but also the combinations ab, a²b, and ab².
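The expansion described above can be sketched in plain Python. This is an illustrative helper, not the implementation of any particular library, and the function name and sample values are hypothetical:

```python
from itertools import combinations_with_replacement


def polynomial_features(values, names, degree):
    """Return every product of the inputs with total degree 1..degree."""
    features = {}
    for d in range(1, degree + 1):
        # Each combination picks d inputs, with repetition allowed.
        for combo in combinations_with_replacement(range(len(values)), d):
            label = "*".join(names[i] for i in combo)
            value = 1.0
            for i in combo:
                value *= values[i]
            features[label] = value
    return features


# Single feature x = 5 with degree 2 gives x and x*x.
single = polynomial_features([5.0], ["x"], 2)

# Two features a = 2 and b = 3 with degree 3 give nine terms,
# including the combinations a*b, a*a*b and a*b*b.
pair = polynomial_features([2.0, 3.0], ["a", "b"], 3)

print(single)
print(pair)
```

The constant term x^0 = 1 is left out here because the intercept b0 already plays that role in the regression equation.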

Polynomial Regression is sensitive to outliers, so the presence of one or two outliers can badly affect its performance.

Advantages of Linear Regression

1. Linear Regression performs well when the relationship between the variables is linear. We can use it to find the nature of the relationship among the variables.

2. Linear Regression is easy to implement and interpret, and very efficient to train.

3. Linear Regression is prone to over-fitting, but this can be mitigated using dimensionality reduction techniques, regularization (L1 and L2) techniques and cross-validation.

Disadvantages of Linear Regression

1. The main limitation of Linear Regression is the assumption of linearity between the dependent variable and the independent variables. It assumes that there is a straight-line relationship between them, which is often incorrect, because in the real world relationships are rarely exactly linear.

2. Prone to noise and overfitting: If the number of observations is smaller than the number of features, Linear Regression should not be used; otherwise it may overfit, because it starts fitting noise in this scenario while building the model.

3. Prone to outliers: Linear regression is very sensitive to outliers (anomalies). So, outliers should be analyzed and removed before applying Linear Regression to the dataset.

Git is a most popular Distributed version control system in the world.

Dhivya — Thu, 19 May 2022 17:20:15 GMT

Git is the most popular distributed version control system in the world. It is used for tracking changes in any set of files. In particular, it helps you track different versions of your code, keep the project history, and collaborate with other developers.

Benefits of git

· Free and open source

· Performance

· Speed

· Scalable

· Git branches are cheap and easy to merge

Basic Commands

To start using Git, we are first going to open up our Command shell.

For Windows, you can use Git bash, which comes included in Git for Windows. For Mac and Linux you can use the built-in terminal.

Git Version

Command : git --version

Usage : This command shows your current version of Git.

Configure Git

Commands : git config --global user.name "name"

git config --global user.email "email address"

Usage : These commands set the author’s name and email address, respectively, to be used with your commits.

Creating Git Folder

Command : mkdir myproject

Usage : This command makes a new directory.

Command : cd myproject

Usage : This command changes the current working directory.

Initialize Git

Command : git init

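Usage : This command creates an empty Git repository in the current directory and prints a confirmation such as: Initialized empty Git repository in /Users/documents/myproject/.git/

You have just created your first local Git repository, but it is empty.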

So let’s add some files, or create a new file using text editor. Then save or move it to the folder you just created.

For Example:

Hello World!

And save it to our new folder as index.html.

Create a file named "index.html" in your local repository.

Checking Git Status

Command : git status

On branch main

Your branch is up to date with ‘origin/main’.

Untracked files:

(use "git add <file>..." to include in what will be committed)

index.html

git add

Command : git add [file]

Usage : This command adds a file to the staging area.

For example :

PS C:\Users \.git > git add index1.html

PS C:\Users \.git > git status

On branch main

Your branch is ahead of ‘origin/main’ by 1 commit.

(use “git push” to publish your local commits)

To add more than one file,

Command : git add *

Usage : This command adds one or more files to the staging area.

Git Commit

Since we have finished our work, we are ready to move from stage to commit.

Adding commits keeps track of our progress and changes as we work.

Command : git commit -m "first commit"

For example :

PS C:\Users \.git> git commit -m “First commit”

[main bfec008] First commit

1 file changed, 0 insertions(+), 0 deletions(-)

create mode 100644 index1.html

These are some of the basic commands in Git.