Machine Learning Part 1: The Fundamentals

Görkem Aksoy
8 min read · Jan 4, 2023


“Can machines learn?” More than half a century has passed since this question was first asked. Since then, we have taught machines many things, and we will continue to teach them in the future. One of the things we teach them in this process falls under the concept of “Machine Learning”, which we can describe as detecting patterns in real-life data and representing those patterns in a digital environment.

Cited from https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08

In this series of articles, I will cover many topics, ranging from the basic concepts of Machine Learning to advanced models. Let’s start with the basic concepts:

  • Variable Types
  • Learning Types
  • Model Success Evaluation Methods
  • Model Validation Methods

1-Types of Variables

Numeric vs. Categorical Variable

When building a machine learning model, the variables (columns) in the dataset can be numeric or categorical. A numeric variable is one whose number of distinct values is too large to be treated as categories (temperature, age, salary, etc.). A categorical variable, on the other hand, contains a limited number of unique values (gender, marital status, etc.). In the dataset shown below, the number-of-pregnancies variable in column A can be treated as a categorical variable, and the variable in the last column is also categorical because it can only take the values 0 and 1. The other variables in this dataset are numeric.
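As a minimal sketch of this distinction (assuming the diabetes data used below is available locally as a file named diabetes.csv, which is my assumption, not something stated in this article), we could count the unique values of each column with pandas; the cutoff of 20 is purely illustrative:

```python
import pandas as pd

# Assumption: the diabetes data is stored locally as "diabetes.csv".
df = pd.read_csv("diabetes.csv")

# Columns with only a handful of unique values are candidates for being
# treated as categorical; the rest we treat as numeric. The cutoff of 20
# is an arbitrary, illustrative choice.
for col in df.columns:
    n_unique = df[col].nunique()
    kind = "categorical" if n_unique <= 20 else "numeric"
    print(f"{col}: {n_unique} unique values -> {kind}")
```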

Dependent vs. Independent Variable

In a machine learning model, the value we try to predict is called the “dependent variable”, and the other variables we use to predict it are called “independent variables”. For example, the dataset shown below was created to predict diabetes from individuals’ personal data. The variable in the last column (Outcome) is the dependent variable, and the variables in the other columns, which we use to estimate this value, are the independent variables.

There are many different terms for dependent and independent variables in the literature. Unfortunately, this causes conceptual confusion.

For the dependent variable -> target, dependent, output, and response can be used interchangeably.

For the independent variable(s) -> feature, independent, input, column, and predictor can be used as synonyms.

Diabetes dataset
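To make the dependent/independent split concrete, here is a small sketch, again assuming a local diabetes.csv file whose last column is named Outcome:

```python
import pandas as pd

df = pd.read_csv("diabetes.csv")  # assumed file name

# Dependent variable (also called target, output, or response).
y = df["Outcome"]

# Independent variables (also called features, inputs, or predictors).
X = df.drop(columns=["Outcome"])

print(X.shape, y.shape)
```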

2-Types of Learning

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

We need to decide how the model we build will learn. If there is a “dependent variable” in our model that we are trying to predict, then we are talking about Supervised Learning. If there is no dependent variable and we are instead trying to detect general patterns in the dataset and group the data, then we are dealing with Unsupervised Learning.

There is also Reinforcement Learning, a type of learning that focuses on which actions a “subject” (agent) should take, guided by a reward/punishment approach. Reinforcement Learning is outside the scope of this article series.
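As a rough sketch of the difference between the first two learning types (using scikit-learn and synthetic data of my own, not anything from this article), a supervised model is given both the features and the labels, while an unsupervised model only sees the features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: 100 observations with 2 independent variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # dependent variable (labels)

# Supervised learning: the model sees both X and the labels y.
clf = LogisticRegression().fit(X, y)

# Unsupervised learning: the model only sees X and looks for structure.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
```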

3-Types of Model Success Evaluation

The question of whether the model we built actually works has led to the methods mentioned here. Let’s look at how we can evaluate the success of our model:

If we are talking about success or failure, there must be a “label”, a value we are trying to find, so that we have something to compare against. For this reason, in this section we will evaluate Supervised Learning, that is, models built in the presence of a dependent variable.

I mentioned that variables are divided into numeric and categorical. Success evaluation is also shaped by this distinction: the evaluation of success when the dependent variable is numeric differs from when it is categorical.

If the Dependent Variable is a numeric type:

In the case where the dependent variable is numeric, we can use one of the following methods:

  • Mean Squared Error (MSE): MSE = (1/n) Σ (yᵢ − ŷᵢ)²
  • Mean Absolute Error (MAE): MAE = (1/n) Σ |yᵢ − ŷᵢ|

First, let’s take a look at the formulas. Our evaluation of success depends on the difference between the actual value of the dependent variable and the value we predicted.

Let’s say we have data on 100 people and we are trying to estimate the age of a 101st person from this data. The yᵢ value represents the actual age of the 101st person, and ŷᵢ (y-hat) represents our estimate for that person based on the data of the 100 people.

The total number of people we estimate is indicated by the letter n. The success of the model is determined by taking the average value over all observations.

Since the two formulas are different, it is very likely that they will give different results when evaluating the success of the model we have built. The important thing is not to compare MAE and MSE values within the same model, but to compare the same metric (MAE everywhere or MSE everywhere) across different models.

The main difference between these two methods is how much weight we give to the size of the difference between the prediction and the actual value. In other words, for an observation where we made a very bad prediction, MSE imposes a disproportionately large penalty because the error is squared, while MAE penalizes the error exactly in proportion to its size. Let’s visualize it with an example:

Suppose we are comparing the prediction performance of two different models. The predicted/actual values and the MAE and MSE scores for the two models are as follows:

There are 6 different observations here, and Model A gives closer results in all but one of them. In that one observation it performs very badly, so Model A appears much worse than Model B according to MSE, yet better according to MAE. Which model to choose may depend on the field of study: if large errors are especially costly, MSE can be used.
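The table above is shown as an image; the numbers below are hypothetical values I made up to reproduce the same effect (Model A close everywhere except one big miss, Model B moderately off everywhere), so they are not the article’s original data:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical actual values and predictions (illustrative only).
actual  = [30, 25, 40, 35, 50, 45]
model_a = [31, 24, 41, 36, 49, 65]   # off by 1 everywhere, except one error of 20
model_b = [36, 31, 34, 41, 44, 39]   # off by 6 everywhere

for name, pred in [("Model A", model_a), ("Model B", model_b)]:
    mse = mean_squared_error(actual, pred)
    mae = mean_absolute_error(actual, pred)
    print(f"{name}: MSE = {mse:.1f}, MAE = {mae:.1f}")
```

With these made-up numbers, Model A has the lower MAE (about 4.2 vs 6.0) but the higher MSE (67.5 vs 36.0), because MSE squares that single large error.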

If the Dependent Variable is a categorical variable:

In the case where the dependent variable is categorical, for example taking the values 0–1, the metric “accuracy” is used to evaluate model success. In its simplest form, we look at the ratio of correctly predicted observations (predicted 1 when the actual value is 1, and 0 when it is 0) to the total number of observations. Although it seems like a very simple method, it is very useful as long as we are not dealing with an imbalanced dataset. In other words, if the numbers of observations in the different categories are close to each other (for example, 53 out of 100 are 1 and 47 out of 100 are 0), then the accuracy metric can be used. But you will appreciate that this is often not the case.

For imbalanced datasets, we need to check other metrics in addition to the accuracy value. I will elaborate on this in the Logistic Regression section.
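A quick sketch of why accuracy alone can mislead on an imbalanced dataset (synthetic labels, not the article’s data): a “model” that always predicts 0 still reaches 95% accuracy when 95% of the observations are 0, even though it misses every positive case.

```python
from sklearn.metrics import accuracy_score

# 95 negatives and 5 positives: an imbalanced set of labels.
actual = [0] * 95 + [1] * 5

# A "model" that ignores the data and always predicts 0.
predicted = [0] * 100

print(accuracy_score(actual, predicted))  # 0.95, despite catching no positives
```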

4-Model Validation Methods

How we test the accuracy of the model we are trying to build is an important issue, and there are different approaches to it. This part may be difficult to grasp at first, but I will try to make it clear.

To convey this subject well, I will use a football analogy. The work we do to build the model can be compared to training with the team. We can perform very well in training, but it will not be possible to show the same performance in the most important part, namely the match. If we evaluate our performance only in training, we fall into the illusion of seeing ourselves as better than we are. Evaluating the model’s performance on the same dataset we used to build it is just like that. We may think we have built a very good model with a very high accuracy rate, but when we test our model on new data, the result is likely to be disappointing.

So how can we better understand whether we will perform well in matches and assess ourselves honestly? This is where the concept of validation comes into play. To be able to predict our performance in real matches, we need friendly matches in addition to the training sessions. So what is the equivalent of friendly matches for our model? I will talk about two different methods:

  • Hold-out Method

We divide the dataset into two parts, “train” and “test”, create our model with the “train” part, and test the success of the model with the “test” part. Thus, we have the opportunity to test the model on a part of the dataset it has never seen (just like playing a friendly match against a team we have never met).
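In scikit-learn, this split is usually done with train_test_split. The sketch below uses synthetic data and an assumed 80/20 split; it is not the article’s own setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Hold out 20% of the data as a "test" set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```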

  • Cross Validation Method

Unlike the hold-out method, in this method we divide the part we have set aside as “train” into k separate parts, train the model on k-1 of those parts, and validate it on the remaining part. It might be a little confusing; I think it will be better understood with the following image:

Cited from https://scikit-learn.org/stable/modules/cross_validation.html

In this example, the “train” set is divided into 5 parts, that is, the value of the k parameter is 5. In each split, the model is built with 4 parts and validation is performed with the remaining part. Let’s say the “train” set consists of 1000 different observations. In this case, we divide the train set into 5 separate pieces, each consisting of 200 randomly selected observations; we build the model with 4 of the pieces, and each piece is used as the validation set exactly once. In total, we build 5 different models, that is, as many models as k. Their validation scores are combined into a single overall assessment, and the final model’s performance is then evaluated on the test set.
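The same idea can be sketched with scikit-learn’s cross_val_score, which creates the k splits automatically; here k=5, and the 1000 synthetic “train” observations mirror the example above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic "train" set of 1000 observations (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# 5-fold cross-validation: each fold of 200 observations serves as the
# validation set exactly once, and 5 models are fitted in total.
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)

print("Fold scores:", scores)
print("Mean validation score:", scores.mean())
```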
