Supervised Machine Learning for Beginners

Aysen Çeliktaş
11 min read · Mar 11, 2023


My first post is on supervised machine learning, which makes up the largest share (roughly 80%) of machine learning in general. For more on unsupervised machine learning, please refer to the provided link, and likewise for deep learning. Additionally, my article on “EDA (Python)” offers insights into using basic functionality to start an exploratory data analysis at a beginner level in the Python environment.

[created by the author in Canva]

To form a mental picture of machine learning, it is useful to first know what artificial intelligence is in general terms. I once read a very good remark on the meaning of AI: the failure to define AI, it says, turns the field into “that of alchemists who are looking for the philosopher’s stone but have almost no idea what they’re looking for.”[1]

Simply defined, artificial intelligence is a set of algorithms that lets machines acquire skills, first from humans, and then use those skills to examine, make sense of, and draw specific conclusions from situations that would take the human mind a long time, or that it might not be able to handle at all. Alan Turing took the first steps in this field in 1950 with a paper that opens on the striking question “Can machines think?”[2] The term machine learning was first coined by Arthur Samuel, who in 1959 introduced learning schemes that played checkers better than the average person.[3]

[created by the author in Canva]

Returning to our main topic, ML can be divided into three main headings. The most widely used of these is supervised machine learning; as a rough generalization, about 80% of ML work is done with these algorithms. Here the data has both features and a ground truth label. The machine trains on all of this and produces a prediction. Its use cases fall broadly into prediction (regression) and classification. In unsupervised machine learning, there is no ground truth attached to the data; the data set is analyzed through its features alone, and a human is needed to make sense of the results. In supervised machine learning there is a hypothesis you want to verify, so the data set is split into train and test sets. Since there is no hypothesis to verify in unsupervised machine learning, the whole data set is used; the purpose there is simply to get to know the data.

Fig.1. input data set given to the model for supervised and unsupervised machine learning [created by the author in Canva]
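
As a quick illustration of the supervised setup, here is a minimal sketch (assuming a generic feature matrix X and label vector y, generated synthetically here) of splitting the data into train and test sets with scikit-learn; in the unsupervised case no such split or label would be needed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy supervised data: features X and a ground-truth label y
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Hold out a test set to verify the hypothesis learned on the train set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (400, 5) (100, 5)
```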

Briefly, regression estimates a continuous value; market forecasting is a typical use case, for example estimating a vehicle’s price from the information given about it. Classification assigns samples to discrete classes, for example deciding whether a credit card charge is genuine or fraudulent.

In this article, I will briefly cover Linear Regression under the heading of regression, and Logistic Regression, Decision Trees and distance-based algorithms under the heading of classification. Note that the algorithms listed under classification, apart from Logistic Regression, can also be used for regression.

[created by the author in Canva]

First of all, it is useful to briefly go over the metrics used with regression algorithms. Keeping the predicted values (ŷ) close to the actual values (y) is the key measure of quality. Metrics are computed separately on the train and test data and compared with each other. If the values are far apart, the model is overfitting: it appears to have learned the training set very well, but the probability of producing correct results on the test set is low, which means the model does not generalize. In the case of underfitting, the model generalizes too much and does not learn enough. The evaluation metrics, or loss functions, used to evaluate regression results on the train and test sets are:

MAE can be used when there are outliers in the data and you want to tolerate them. If you do not want to sacrifice sensitivity to outliers, MSE is the better choice. RMSE keeps that sensitivity to outliers while being easier to interpret, since it is on the same scale as the target.
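
As a small sketch (with hypothetical arrays y_true and y_pred standing in for the actual and predicted values), these metrics can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values (y)
y_pred = np.array([2.5, 5.5, 7.0, 12.0])   # predicted values (y-hat)

mae = mean_absolute_error(y_true, y_pred)   # tolerant of outliers
mse = mean_squared_error(y_true, y_pred)    # penalizes large errors heavily
rmse = np.sqrt(mse)                         # same units as the target
print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}")
```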

Linear Regression: The logic of Linear Regression, a parametric regression algorithm, is simply curve fitting. This curve can be found using the Ordinary Least Squares (OLS) method.

A line is fitted by finding the most appropriate parameters for the given features (xᵢ) and ground truth (y). The fitted line yields the coefficients, and new predictions are obtained from those coefficients.

Fig.2. Linear Regression [created by the author in Canva]
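
A minimal sketch of fitting such a line with scikit-learn (the data here is synthetic, generated only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)    # OLS fit
print(model.coef_, model.intercept_)    # learned slope and intercept (~3, ~2)
print(model.predict([[5.0]]))           # prediction from the fitted coefficients
```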

If regularization is involved (Ridge: L2, Lasso: L1, ElasticNet: a mix of both), the Gradient Descent method comes into play to solve the optimization problem.

Here α is the learning rate, usually chosen as a small value close to zero. The partial derivative of the cost function is taken with respect to the parameters θ, and the update is repeated until the cost reaches its minimum.

Fig.3. Gradient Descent [created by the author in Canva]
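
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression; the variable names and the MSE cost are my own assumptions for illustration. Regularized variants (Ridge, Lasso, ElasticNet) are available directly in scikit-learn.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=5000):
    """Batch gradient descent for linear regression with an MSE cost."""
    m, n = X.shape
    Xb = np.c_[np.ones(m), X]    # add a bias column
    theta = np.zeros(n + 1)      # parameters to learn
    for _ in range(n_iters):
        grad = (2 / m) * Xb.T @ (Xb @ theta - y)  # d(cost)/d(theta)
        theta -= alpha * grad                     # step against the gradient
    return theta

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, 100)
print(gradient_descent(X, y))   # converges toward approximately [2, 3]
```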

The metrics used for classification methods are chosen according to whether the data set is balanced or imbalanced. Accuracy can be used if the classes have a balanced distribution. If the data set is imbalanced, however, the stability of the model should be evaluated with Precision, Recall, or the F1 score, which is the harmonic mean of the two. Recall is computed along the actual-positive (condition positive) axis of the confusion matrix, while Precision is computed along the predicted-positive axis. These metrics are calculated from the counts of classified samples in the confusion matrix, whose cells record whether each sample was placed correctly in the expected class.

  • TP (True Positive), the hypothesis put forward was predicted to be true and it turned out to be true.
  • FP (False Positive), the hypothesis put forward was predicted to be true but turned out to be false.
  • FN (False Negative), the hypothesis put forward was predicted to be false but turned out to be true.
  • TN (True Negative), the hypothesis put forward was predicted to be false and turned out to be false.
Fig.4. Condition Matrix [created by the author in Canva]
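
A minimal sketch computing these counts and metrics with scikit-learn (the label arrays below are made up for illustration):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                                  # TP FP FN TN
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean of the two
```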

Logistic Regression: Logistic Regression is a parametric algorithm, but it is a classification algorithm. It positions the target by calculating a probability from the features for the ground truth label. It has “regression” in its name because a regression model is running in the background.

Fig.5. Linear Regression & Logistic Regression [created by the author in Canva]

The value calculated by the linear model is passed through the sigmoid function to obtain a probability between 0 and 1. Interpretation is done through the odds equation underlying the sigmoid. Each probability computed for the target lies in the range 0 to 1, and classification is performed according to whether it falls above or below 0.5. The parameters are fitted by Maximum Likelihood Estimation (equivalently, by minimizing the log-loss cost function).
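
A small sketch with scikit-learn on synthetic data; predict_proba returns the sigmoid probabilities and predict applies the 0.5 threshold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]   # sigmoid output in (0, 1)
labels = clf.predict(X_test)              # class = 1 if probability > 0.5
print(proba[:5], labels[:5])
print("test accuracy:", clf.score(X_test, y_test))
```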

Decision Tree: This is a nonparametric algorithm. Decisions are made by walking through the tree with learned thresholds until a result is reached. The threshold is the cut point chosen on a feature: the data is split at that value, the branch considered pure stops, and the algorithm keeps asking threshold questions on the impure branch until a stopping criterion is met. Because it always takes the best split available at each step, it is called a greedy algorithm. If not pruned, it will overfit.

Fig.6. Decision Tree [created by the author in Canva]

As for the concept of purity, it is measured with the Entropy and Gini functions. A value of 0 means the node has been separated in its purest form (a single class), while the maximum value marks the point where the mixing of classes is most intense.

Fig.7. Gini & Entropy [created by the author in Canva]

Since it is a nonparametric algorithm, it is not affected by multicollinearity or by the data not following a normal distribution. Scaling the data set is not needed: scaling does no harm, but it has no effect either, because only the feature values change, not the relative positions of the cut points. The feature that provides the greatest gain to the model is the most important feature. On the downside, the computational cost of this algorithm is high and it is not a stable algorithm; the smallest change in the data can change the whole model. Its biggest disadvantage is its high variance.
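
A minimal sketch of a Decision Tree in scikit-learn, limiting the depth as a simple form of pruning to avoid overfitting (the data set and settings are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" or "entropy"; max_depth acts as pre-pruning
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(tree.feature_importances_)   # which features contribute the most
print(export_text(tree))           # the learned thresholds as if/else rules
```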

Because of the Decision Tree’s high variance problem, ensemble methods were developed to obtain more accurate results with this algorithm. These methods, collectively called ensemble learning, build a meta learner by bringing together several weak learners. They are divided into two groups, homogeneous and heterogeneous. Homogeneous ensemble methods come in two flavors: Bagging and Boosting. In Bagging, a different subsample drawn from the main data set is created for each tree, so the trees do not affect each other; the meta learner combines the results obtained from these different subsamples into a joint decision. In Boosting, the data is passed through the trees sequentially: the output of one trained tree is fed as input to the next, so the trees do affect each other. As a result, the weak learners contribute equally to the meta learner in Bagging, whereas in Boosting their contributions are not equal.

The aim in ensemble learning is to obtain controllable models. Bagging methods work to reduce variance, while Boosting methods try to improve by reducing bias.

Fig.8. Bagging and Boosting at Ensemble Learning [created by the author in Canva]
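
To make the bagging idea concrete, here is a tiny hand-rolled sketch (my own toy construction, not a library API) that trains several shallow trees on different bootstrap subsamples and lets them vote with equal weight, which is exactly the “joint decision” described above; scikit-learn’s BaggingClassifier automates the same procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
rng = np.random.default_rng(1)

# Bagging by hand: each weak learner sees a different bootstrap subsample
learners = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    learners.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# Meta learner: equal-weight majority vote over the weak learners
votes = np.stack([t.predict(X) for t in learners])
y_meta = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the ensemble:", (y_meta == y).mean())
```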

Random Forest: This is a bagging ensemble method in which each tree is trained on a bootstrap sample containing roughly 2/3 of the data, while the remaining 1/3 (the out-of-bag samples) is used to measure that tree’s performance. It also does not consider all features at every split; it may use, for example, only the square root of the number of features. Through this built-in randomness, Random Forest tries to prevent the high variance problem.

Fig.9. Random Forest [created by the author in Canva]
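
A short scikit-learn sketch on synthetic data; oob_score=True uses the roughly 1/3 of samples left out of each bootstrap to estimate performance, and max_features="sqrt" matches the square-root rule mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # only sqrt(n_features) candidates per split
    oob_score=True,        # evaluate each tree on its out-of-bag samples
    random_state=0,
)
rf.fit(X, y)
print("OOB score:", rf.oob_score_)
```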

AdaBoost: As a boosting ensemble method, it reaches a conclusion by passing the data through a sequence of trees and manipulating the data at each step. It manipulates the data by re-weighting the samples according to the errors made so far.

Fig.10. AdaBoost [created by the author in Canva]
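
A minimal AdaBoost sketch in scikit-learn, with shallow trees (stumps) as the weak learners; note that the weak-learner argument is named estimator in recent scikit-learn releases (base_estimator in older ones), and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a stump as the weak learner
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X, y)
print("training accuracy:", ada.score(X, y))
```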

Gradient Boosting: In this boosting model, the Gradient Descent idea comes into play. Each new tree is fitted in the direction that moves the loss function toward its minimum, so the model’s error is reduced step by step. It is a hybrid model built on Decision Trees.

Fig.11. Gradient Boosting [created by the author in Canva]
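
A small sketch with scikit-learn’s GradientBoostingClassifier on synthetic data; learning_rate plays the role of the step size toward the minimum of the loss:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,   # step size along the loss gradient
    max_depth=3,         # each weak learner is a small decision tree
    random_state=0,
)
gb.fit(X_train, y_train)
print("test accuracy:", gb.score(X_test, y_test))
```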

XGBoost: This model is the “extreme” version of gradient boosting, extended by applying regularization to the trees. The split criterion is not entropy or Gini; a similarity score is what matters here. Thanks to parallel processing, it produces results faster than classic gradient boosting, which builds its trees strictly in sequence.

Fig.12. Extreme Gradient Boosting [created by the author in Canva]
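
A minimal sketch with the separate xgboost package (assuming it is installed, e.g. via pip install xgboost); reg_lambda is the regularization on the trees and n_jobs enables parallel split finding:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # requires the xgboost package

X, y = make_classification(n_samples=800, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    reg_lambda=1.0,   # L2 regularization on the leaf weights
    n_jobs=-1,        # parallel processing
)
xgb.fit(X_train, y_train)
print("test accuracy:", xgb.score(X_test, y_test))
```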

K-Nearest Neighbors (KNN): KNN is a distance-based algorithm that looks at the points closest to a new sample and assigns it to a class based on those neighbors. The number of neighbors is denoted by k, and the neighbors can also be weighted (for example by distance). How the distance between points is calculated matters as well; the three most commonly used measures are Euclidean, Minkowski and Manhattan.

Fig.13. K-Nearest Neighbors and some distance algorithm [created by the author in Canva]
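
A brief sketch with scikit-learn’s KNeighborsClassifier on synthetic data; the metric and p arguments switch between Euclidean, Manhattan and the more general Minkowski distance, weights="distance" weights closer neighbors more heavily, and scaling is included because distance-based methods are sensitive to feature scale:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = make_pipeline(
    StandardScaler(),                # distances need comparable feature scales
    KNeighborsClassifier(
        n_neighbors=5,               # k
        weights="distance",          # closer neighbors count more
        metric="minkowski", p=2,     # p=2 -> Euclidean, p=1 -> Manhattan
    ),
)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```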

Support Vector Machine (SVM): SVM is a distance-based algorithm that selects support vectors from the data and draws a separating boundary between them; the data is divided by this hyperplane. The region around the boundary can be widened or narrowed, and this region is called the margin. To get good results from the model, the margin width can be tuned, and the boundary can also be allowed to bend and twist to fit the data, but if it bends too much the model will overfit. SVM is a suitable method for small and medium-sized data sets.

Fig.14. Support Vector Machine [created by the author in Canva]

If the data set is complex and not easily separable in two dimensions, the kernel trick can be used. The RBF (Radial Basis Function) kernel, one of the kernel trick types, helps separate data that is hard to split in 2 dimensions by implicitly mapping it into a higher-dimensional space, for example 3 dimensions.

Fig.15. Support Vector Machine — RBF Kernel [created by the author in Canva]
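
A short sketch using scikit-learn’s SVC with the RBF kernel; the moons data set is a classic example of data that is not linearly separable in two dimensions, and C controls how much the margin may bend to fit the data (too large a C risks overfitting):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),  # RBF kernel trick; C tunes the margin
)
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```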

Finally, to review the general steps followed in machine learning: the first thing to do is ask “What is the problem?” Once the problem is defined, the exploratory data analysis (EDA) needed for the problem is performed. Preprocessing is handled according to need: impute, encode and scale operations are carried out. When these are completed, the data set for supervised machine learning is split into two groups, train and test. The model is fit on the train set, and the test set is held out to assess how well the model’s accuracy generalizes. After the split, the model is built with the chosen algorithm and the relevant metrics are computed. The stability of the model is checked through these metrics, and depending on whether it underfits or overfits, the model is either accepted or improved.
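
Putting the steps together, here is a minimal end-to-end sketch on synthetic data (the column choices and algorithm are illustrative): impute, scale, split, fit, and compare train vs. test metrics to check for over- or underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Define the problem and gather the data (synthetic here); EDA would come first
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Preprocess (impute + scale) and fit the chosen algorithm
model = make_pipeline(
    SimpleImputer(strategy="mean"),   # impute
    StandardScaler(),                 # scale
    LogisticRegression(),             # the chosen algorithm
)
model.fit(X_train, y_train)

# 4. Compare train and test metrics to judge over-/underfitting
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy :", model.score(X_test, y_test))
```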

References

[1] AI- Project. (2000, November). AI — What is this. https://dobrev.com/AI/definition.html

[2] Turing, A. M. (1980). Computing Machinery and Intelligence. Creative Computing, 6(1), 44–53.

[3] Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.

[4] Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. O’Reilly Media, Inc., Sebastopol, CA.

[5] Murphy, K. P. (2022). Probabilistic machine learning: an introduction. MIT press.
