A simple introduction to supervised learning

Gerzson Boros
5 min read · Jul 25, 2023

Supervised learning, a branch of machine learning, has been a hot topic for tech enthusiasts, data scientists, and businesses alike. As one of the cornerstones of artificial intelligence (AI), it offers practical solutions to a wide array of real-world problems, from filtering spam to predicting house prices.

In this guide, we will explore the key concepts of supervised learning, the workflow for building a model, and the most commonly used algorithms.

What is Supervised Learning?

Supervised learning is a type of machine learning in which the model is trained on a labelled dataset. The supervision comes from the labels: every example is paired with the correct answer, and the model learns from these examples. This learning method is akin to a student learning under the guidance of a teacher.

Consider a scenario where a computer system is provided with a dataset containing labelled images of various fruits. The system, through supervised learning, aims to categorize the fruits based on their respective characteristics such as color, shape, and size. Once the system is trained, it can predict the category of new, unseen fruit images by applying the patterns it learned from the training data.

Supervised Learning: A Closer Look

At its core, supervised learning means learning a mapping function that connects input data with a corresponding output. This function, created by the machine learning model during training, assigns a predicted output or label to an input value. The objective of a supervised learning model is to predict the correct label for new, unseen input data.

The training data in supervised learning consists of inputs paired with the correct outputs. During the training phase, the algorithm searches for patterns in the data that correlate with the desired outputs. After training, the model can take new, unseen inputs and predict their labels based on what it learned from the training data.
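
To make the idea of a learned mapping function concrete, here is a minimal sketch in plain Python. It uses a nearest-centroid classifier, which is just one simple way to learn such a mapping and is chosen here only for illustration. The fruit measurements and the names train and predict are made up for this example.

```python
from math import dist

def train(examples):
    """Learn one centroid (mean feature vector) per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(centroids, features):
    """Map an unseen input to the label whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical training data: (weight in grams, redness from 0 to 1) paired with a label
training_data = [
    ((150, 0.9), "apple"),
    ((170, 0.8), "apple"),
    ((120, 0.1), "banana"),
    ((130, 0.2), "banana"),
]

model = train(training_data)
print(predict(model, (160, 0.85)))  # apple
```

The training step turns labelled pairs into a model, and the prediction step applies that model to an input it has never seen.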

Breaking Down Supervised Learning: Classification and Regression

Supervised learning can be divided into two subcategories: classification and regression.

Classification

During the training phase, a classification algorithm is provided with data points each associated with a specific category. The task of the classification algorithm is to assign an input value to a class or category based on the training data.

For example, consider an email classification system. The system is trained on a dataset containing spam and non-spam emails. The model identifies the features within the data that correlate to either class and creates the mapping function. When a new email arrives, the model uses this function to determine if the email is spam or not.

Regression

Regression models, on the other hand, are used when the output is a real or continuous value. For example, predicting the price of a house based on its size, location, and other factors employs a regression model. The labels in this case are numerical values, and the model predicts these values based on the input features.

The Machine Learning Workflow

The machine learning workflow involves more than just building and training a model. There are several steps that help ensure the model produces good results. These steps include:

  • Data Exploration and Wrangling
  • Data Preparation
  • Building and Training a Model
  • Evaluating and Fine-Tuning the Model
  • Evaluate the Model on a Test Set

These steps are further explained below:

Data Exploration and Wrangling

This step ensures that relevant features are used to train the model. Exploring and cleaning the dataset can reveal connections between different features and the output classes. These relevant features are then selected to train the model.

Data Preparation

Data preparation involves transforming the features so they can be effectively used to train the model. This process, often treated as part of feature engineering, can include normalizing numerical features and encoding categorical features.
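
As a rough illustration of these two transformations, the sketch below rescales a numerical feature to the range 0 to 1 (min-max normalization) and turns a categorical feature into 0/1 vectors (one-hot encoding). These are only two of several possible techniques, and the function names and sample values are invented for this example.

```python
def min_max_normalize(values):
    """Rescale numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_encode(values):
    """Turn each category into a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical house features
sizes = [50, 100, 150]
locations = ["suburb", "city", "suburb"]

print(min_max_normalize(sizes))   # [0.0, 0.5, 1.0]
print(one_hot_encode(locations))  # [[0, 1], [1, 0], [0, 1]]
```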

Building and Training a Model

This stage involves selecting an appropriate algorithm and using it to train the model on the prepared data. The model learns patterns in the training data which it can then apply to unseen data to make predictions.

Evaluating and Fine-Tuning the Model

Once the model is trained, it is evaluated on a validation set and fine-tuned based on the evaluation. Model hyperparameters, which are settings chosen by the user before training rather than learned from the data, can be adjusted to improve the model’s performance.

Evaluate the Model on a Test Set

Finally, the model is evaluated on a test set to get an unbiased estimate of its performance on new, unseen data.
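
The workflow relies on keeping three separate sets of data: one for training, one for validation, and one for the final test. A minimal sketch of such a split is shown below. The 60/20/20 proportions, the fixed seed, and the function names are assumptions made for this example, not fixed rules.

```python
import random

def split_dataset(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into train, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

train_set, val_set, test_set = split_dataset(range(10))
print(len(train_set), len(val_set), len(test_set))  # 6 2 2
```

The test set is touched only once, at the very end, so that the reported score is not influenced by the tuning decisions made on the validation set.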

Supervised Learning Algorithms: An Overview

Supervised learning uses various algorithms and computation techniques to derive insights from data. Some of the most commonly used learning methods include neural networks, Naive Bayes, linear regression, logistic regression, support vector machines (SVM), and the k-nearest neighbors algorithm.

Neural Networks

Neural networks, primarily used for deep learning algorithms, mimic the interconnectivity of the human brain through layers of nodes to process training data. Each node consists of inputs, weights, a bias, and an output. The network learns the mapping function through supervised learning: gradient descent repeatedly adjusts the weights and biases in the direction that reduces the loss function.
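
The sketch below shows this idea on the smallest possible scale: a single node with one input, one weight, and one bias, trained with gradient descent on a mean squared error loss. It is a deliberate simplification (no activation function, no layers), and the learning rate, epoch count, and data are assumptions chosen for the example.

```python
def neuron(x, weight, bias):
    """Output of a single linear node: weighted input plus bias."""
    return weight * x + bias

def train_neuron(data, learning_rate=0.05, epochs=2000):
    """Fit weight and bias by gradient descent on the mean squared error."""
    weight, bias = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to weight and bias
        grad_w = sum(2 * (neuron(x, weight, bias) - y) * x for x, y in data) / n
        grad_b = sum(2 * (neuron(x, weight, bias) - y) for x, y in data) / n
        # Step against the gradient to reduce the loss
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

# Hypothetical data generated from y = 2x + 1
data = [(0, 1), (1, 3), (2, 5), (3, 7)]
weight, bias = train_neuron(data)
print(round(weight, 3), round(bias, 3))  # 2.0 1.0
```

A real network stacks many such nodes with non-linear activations, but the update rule follows the same pattern.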

Naive Bayes

Naive Bayes is a classification approach that applies Bayes’ theorem under the assumption of class-conditional independence: given the class, each feature is treated as independent of the others. This technique is primarily used in text classification, spam identification, and recommendation systems.
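
Here is a small sketch of a Naive Bayes text classifier for the spam case. It treats every word as independent given the class and adds the log probabilities together. The add-one (Laplace) smoothing, the four sample messages, and the function names are assumptions made for this illustration.

```python
from collections import Counter
from math import log

def train_naive_bayes(documents):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in documents:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict_naive_bayes(model, text):
    """Return the label with the highest log prior plus summed word log likelihoods."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids a zero probability for unseen word/label pairs
            score += log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages
messages = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "not spam"),
    ("lunch at noon tomorrow", "not spam"),
]
nb_model = train_naive_bayes(messages)
print(predict_naive_bayes(nb_model, "free money"))    # spam
print(predict_naive_bayes(nb_model, "noon meeting"))  # not spam
```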

Linear Regression

Linear regression models the relationship between a dependent variable and one or more independent variables as a straight line (or a plane when there are several variables). It is typically used to predict numerical outcomes, for example, estimating future stock market values to anticipate upcoming fluctuations.
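
For a single independent variable, the best-fitting line can be computed directly with the least squares formulas, as in the sketch below. The house sizes and prices are invented numbers, and fit_line is a name chosen for this example.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house size in square meters and price in thousands
sizes_m2 = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(sizes_m2, prices)
predicted_price = slope * 90 + intercept  # prediction for an unseen 90 m2 house
print(slope, intercept, predicted_price)
```

With this made-up data every price is exactly three times the size, so the fitted slope comes out as 3 and the intercept as 0.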

Logistic Regression

Logistic regression is used when the dependent variable is categorical, most commonly with two possible outcomes, such as “true” and “false” or “yes” and “no”. It is mainly used to solve binary classification problems, such as spam identification.
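
A minimal sketch of binary logistic regression with one input feature follows. The model passes a weighted input through the sigmoid function to get a probability and is trained with gradient descent on the log loss. The feature (number of links in an email), the data, the learning rate, and the 0.5 decision threshold are all assumptions for illustration.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def train_logistic(data, learning_rate=0.1, epochs=5000):
    """Fit weight and bias by gradient descent on the average log loss."""
    weight, bias = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # For log loss, the gradient is (predicted probability - label) times the input
        grad_w = sum((sigmoid(weight * x + bias) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(weight * x + bias) - y) for x, y in data) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

def classify(x, weight, bias, threshold=0.5):
    """Return 1 (spam) if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(weight * x + bias) >= threshold else 0

# Hypothetical data: (number of links in the email, 1 = spam / 0 = not spam)
emails = [(0, 0), (1, 0), (2, 0), (5, 1), (6, 1), (7, 1)]
log_weight, log_bias = train_logistic(emails)
print(classify(1, log_weight, log_bias), classify(6, log_weight, log_bias))  # 0 1
```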

Support Vector Machines (SVM)

The Support Vector Machine (SVM) is a popular supervised learning model used for both classification and regression. For classification, an SVM constructs a hyperplane that maximizes the margin, which is the distance between the hyperplane and the nearest data points of each class. This hyperplane is known as the decision boundary, separating the classes of data points on either side of the plane.
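
The sketch below does not train an SVM. It only computes the margin for two hand-picked candidate boundaries, to show the quantity that an SVM tries to maximize. The points, the candidate hyperplanes, and the function name are assumptions made for this example.

```python
from math import sqrt

def geometric_margin(weights, bias, points):
    """Smallest signed distance from any labelled point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. A negative result means at least one point
    lies on the wrong side of the boundary.
    """
    norm = sqrt(sum(w * w for w in weights))
    return min(
        label * (sum(w * x for w, x in zip(weights, features)) + bias) / norm
        for features, label in points
    )

# Hypothetical 2-D points with labels -1 and +1
points = [((1, 1), -1), ((2, 1), -1), ((4, 3), 1), ((5, 3), 1)]

margin_a = geometric_margin((1, 0), -3, points)  # vertical boundary x = 3
margin_b = geometric_margin((1, 1), -5, points)  # diagonal boundary x + y = 5
print(margin_a)            # 1.0
print(round(margin_b, 3))  # 1.414
```

Both boundaries separate the two classes, but the diagonal one keeps the nearest points farther away, so a margin-maximizing model would prefer it over the vertical one.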

K-Nearest Neighbor

The k-nearest neighbors algorithm is a pattern recognition model that can be used for classification as well as regression. It classifies a new data point by finding the k training points closest to it and assigning the label that is most common among them.
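
A short sketch of k-nearest neighbors classification is shown below. It uses Euclidean distance and a simple majority vote, which are common choices but not the only ones. The data points, the labels "A" and "B", and the default k of 3 are assumptions for this example.

```python
from collections import Counter
from math import dist

def knn_predict(training_data, features, k=3):
    """Predict the most common label among the k closest training points."""
    # Sort the training examples by Euclidean distance to the new point
    neighbors = sorted(training_data, key=lambda example: dist(example[0], features))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points in two dimensions
labelled_points = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

print(knn_predict(labelled_points, (2, 2)))  # A
print(knn_predict(labelled_points, (6, 5)))  # B
```

Unlike the other methods, k-nearest neighbors has no separate training step: the labelled examples themselves serve as the model.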

Summary

To summarize, supervised learning as a subset of machine learning offers powerful tools for solving complex problems across a broad range of industries. By understanding the fundamental concepts, methodologies, and real-world applications of supervised learning, businesses can leverage this technology to drive innovation, enhance decision-making, and create a competitive advantage in the marketplace.

Whether you’re a data scientist developing novel models, or a machine learning engineer deploying these models in production, understanding the nuances of supervised learning can provide critical insights that could shape the future of your business.

Gerzson Boros

CEO of Data Science Europe, Instructor, Chief Machine Learning Engineer, Writer. Career coach, Data Science coach, helping people to learn ML.