Basic terminologies to start with Machine Learning

Saaisri
Published in featurepreneur
6 min read · Jul 31, 2022

“Machine learning” is one of the current technology buzzwords, often used in parallel with artificial intelligence, deep learning, and big data, but what does it actually mean? And what other machine learning terminology is important to understand?

Machine learning refers to systems that learn patterns from data rather than following explicitly programmed rules, so their accuracy typically improves over time as they receive more data and feedback. You probably encounter many examples of machine learning every day without realizing it. When Facebook suggests “people you might know” or when Amazon emails you recommendations of products you might like based on previous purchases, they are using machine learning algorithms to customize your results.

Machine Learning Terminology

Classification

Classification is a supervised learning task (learning with labeled data) in which data inputs are assigned to categories. Classifiers can be binary, with only two outcomes (e.g., spam, non-spam), or multi-class (e.g., types of books, animal species).
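As a rough illustration of a binary classifier, here is a minimal sketch of a nearest-centroid classifier written in plain Python. The two features (number of links and number of exclamation marks in a message), the training data, and the helper names `fit` and `predict` are all made up for this example; real spam filters use far richer features and models.

```python
# Toy binary classifier: label a message "spam" or "non-spam" by comparing it
# to the average (centroid) of each class's labeled training examples.
# Features are hypothetical: (number of links, number of exclamation marks).

def centroid(points):
    """Return the per-feature mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(samples, labels):
    """Learn one centroid per label from labeled training data."""
    grouped = {}
    for features, label in zip(samples, labels):
        grouped.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in grouped.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to the input."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))

train_x = [(8, 5), (7, 6), (9, 4), (0, 0), (1, 1), (0, 1)]
train_y = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]

model = fit(train_x, train_y)
print(predict(model, (6, 5)))  # spam
print(predict(model, (1, 0)))  # non-spam
```

Adding a third label to `train_y` would turn the same code into a multi-class classifier, since it keeps one centroid per distinct label.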

Clustering

Clustering is an unsupervised technique that groups similar data points together without using labels. It can be used to organize customer demographics and purchasing behavior into specific segments for targeting and product positioning. It can also group housing by quality and geographic location to support real estate valuations and the planning of new city developments. It can group information by topic within libraries or web pages and compile an easily accessible directory for users.
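To make the idea of segmenting customers concrete, here is a minimal sketch of k-means clustering on one-dimensional data. The spending figures, the starting centers, and the function name `k_means_1d` are assumptions made for illustration; production code would normally use a library implementation and more than one feature.

```python
def k_means_1d(values, centers, iterations=10):
    """Group 1-D values around the given starting centers (k = len(centers))."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer; no labels are provided.
spend = [12, 15, 14, 90, 95, 88, 13, 92]
centers, clusters = k_means_1d(spend, centers=[0, 100])
print(centers)   # [13.5, 91.25]
print(clusters)  # [[12, 15, 14, 13], [90, 95, 88, 92]]
```

The algorithm is never told which customers are "low" or "high" spenders; the two segments emerge from the data alone.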

Regressions

Regression models the relationship between input variables and a continuous numeric output, for example predicting a house price from its size and location. Note that the facial recognition example often given here, where a certain pixel arrangement in a profile picture corresponds to a given name (as when Facebook recommends tags for the photos you’ve just uploaded), is strictly speaking a classification task, because its output is a discrete label rather than a number.
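One common form of regression fits a straight line that relates a numeric input to a numeric output. The sketch below uses ordinary least squares in plain Python; the house sizes and prices are invented, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: size in square meters, price in thousands.
sizes = [50, 60, 80, 100]
prices = [150, 180, 240, 300]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 70 + intercept  # estimate for an unseen 70 m² house
print(slope, intercept, predicted_price)
```

Because the invented prices are exactly three times the size, the fitted slope comes out at about 3 and the intercept at about 0; real data would be noisier.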

Supervised learning

This is the training of a machine learning algorithm on data annotated with labels. The annotations are typically provided by an expert source, such as a human annotator or an external system. Classification is an example of a supervised learning task.

Unsupervised learning

Unsupervised learning works with unlabeled data. Algorithms designed for this type of learning have self-organizing characteristics built into them: they organize data based on patterns detected in the data, without labels supplied by a human or external system.

Semi-supervised learning

Semi-supervised machine learning algorithms are trained on both labeled and unlabeled data. Labeled examples usually make up a much smaller share of the training dataset than unlabeled ones.

Reinforcement learning

This is a machine learning technique built around programs referred to as agents. An agent is placed in an environment and receives rewards through its interactions with that environment. Rewards can also be negative, in the form of penalties. The agent’s task is to improve its decision-making so that, over time, it collects as much reward as possible and avoids penalties.
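The agent, environment, reward, and penalty ideas can be sketched with a deliberately simplified example. Everything here is an assumption for illustration: the environment has two actions with fixed rewards, and the agent tries each action once before always picking the one with the best estimated reward. Real reinforcement learning agents usually face noisy rewards and explore randomly.

```python
# Hypothetical environment: each action has a fixed reward; -1.0 is a penalty.
REWARDS = {"left": -1.0, "right": 1.0}

def run_agent(steps):
    """Interact with the environment and return (reward estimates, total reward)."""
    estimates = {action: 0.0 for action in REWARDS}
    counts = {action: 0 for action in REWARDS}
    total = 0.0
    for _ in range(steps):
        untried = [a for a in REWARDS if counts[a] == 0]
        # Explore any untried action first, then exploit the best estimate.
        action = untried[0] if untried else max(estimates, key=estimates.get)
        reward = REWARDS[action]
        counts[action] += 1
        # Update the running average reward observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, total

estimates, total = run_agent(10)
print(estimates)  # {'left': -1.0, 'right': 1.0}
print(total)      # 8.0
```

The agent pays the penalty once, learns that "left" is a bad choice, and collects the reward on every step after its first two.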

Model

This can be described as a mathematical representation of the generalized pattern observed in a dataset.

Dataset

This is a collection of information that contains related elements that can be treated by a machine learning algorithm as a single unit.

Training Dataset

This is the part of our dataset that is used to train the model directly; it is the data the model sees during training. For example, when a convolutional neural network is trained for image classification, the network learns the relationship between the training set’s images and their labels.

Test Dataset

We use this part of the dataset to evaluate the performance of the model after the training stage is completed. It should not overlap with the training dataset.
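A dataset is usually divided into training and test parts before any training happens. The sketch below shows one simple way to do that with the standard library; the helper name `split_dataset`, the 20% test fraction, and the fixed seed are assumptions for illustration, not a prescribed recipe.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for ten labeled examples
train, test = split_dataset(data)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting avoids accidentally putting, say, all examples of one class into the test set when the original data is sorted.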

Underfitting

This occurs when a machine learning algorithm fails to learn the patterns in a dataset, so it performs poorly even on the training data. Underfitting can be fixed by using an algorithm or model that is better suited to the task. It can also be reduced by identifying more informative features in the data and presenting them to the algorithm.

Overfitting

This occurs when the algorithm bases its predictions for new instances too closely on the specific instances it observed during training. As a result, the machine learning algorithm does not generalize accurately to unseen data. Overfitting can also occur if the training data does not accurately represent the distribution of the test data. It can be reduced by reducing the number of features in the training data and by reducing the complexity of the model through techniques such as regularization.
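A practical way to spot underfitting and overfitting is to compare a model's error on the training data with its error on the test data. The sketch below does this for three toy models on invented data; the data, the choice of mean squared error, and the three model functions are all assumptions chosen to make the contrast visible.

```python
def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: the true pattern is y = 2x; training labels carry +/-1 noise.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3, 3, 7, 7, 11, 11]
test_x = [1.4, 2.4, 3.4, 4.4, 5.4]
test_y = [2.8, 4.8, 6.8, 8.8, 10.8]

# Underfit: always predict the training mean, ignoring x entirely.
mean_y = sum(train_y) / len(train_y)
def constant_model(x):
    return mean_y

# Reasonable fit: least-squares straight line.
mean_x = sum(train_x) / len(train_x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x
def linear_model(x):
    return slope * x + intercept

# Overfit: memorize the training set and return the label of the nearest point.
def memorizing_model(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

for name, model in [("constant", constant_model),
                    ("linear", linear_model),
                    ("memorizing", memorizing_model)]:
    print(name,
          round(mse(model, train_x, train_y), 2),
          round(mse(model, test_x, test_y), 2))
```

The constant model has a high error on both sets (underfitting). The memorizing model has zero training error but a clearly higher test error than the straight line, because it has also memorized the noise (overfitting).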

To evaluate the performance of a machine learning model, we have a set of metrics:

  • Confusion matrix
  • Precision
  • Accuracy
  • Recall
  • F1 Score

Confusion matrix

A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. For a binary classifier it has four cells: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It is the basis for calculating performance metrics like accuracy, precision, recall, and F1-score. Confusion matrices are widely used because they give a better idea of a model’s performance than classification accuracy alone.
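For a binary classifier, the confusion matrix can be built by counting how actual and predicted labels pair up. This is a minimal sketch; the function name `confusion_counts`, the label encoding (1 for positive, 0 for negative), and the sample labels are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```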

Accuracy

Accuracy is the number of correctly classified data instances divided by the total number of data instances: Accuracy = (TP + TN) / (TP + TN + FP + FN). Accuracy may not be a good measure if the dataset is not balanced (that is, the negative and positive classes have different numbers of data instances).

Example: take a pregnancy classifier whose confusion matrix has TN = 55, TP = 30, FP = 5, and FN = 10. Then Accuracy = (55 + 30)/(55 + 5 + 30 + 10) = 0.85, or 85%.
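The accuracy calculation, and the reason it can mislead on imbalanced data, can be sketched in a few lines. The first call assumes the example's numbers correspond to TN = 55, TP = 30, FP = 5, and FN = 10 (inferred from the precision and recall calculations in this article); the second call uses an invented dataset with 95 negatives and 5 positives.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts assumed from the article's example.
print(accuracy(tp=30, tn=55, fp=5, fn=10))  # 0.85

# Hypothetical imbalanced case: a model that always predicts "negative"
# finds none of the 5 positives, yet still scores 95% accuracy.
print(accuracy(tp=0, tn=95, fp=0, fn=5))    # 0.95
```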

Precision

Precision is the fraction of instances predicted as positive that are actually positive: Precision = TP / (TP + FP). Precision should ideally be 1 (high) for a good classifier. It becomes 1 only when the numerator and denominator are equal, i.e., TP = TP + FP, which means FP is zero. As FP increases, the denominator becomes greater than the numerator and precision decreases (which we don’t want).

So in the pregnancy example, precision = 30/(30 + 5) = 0.857.

Recall

Recall is also known as sensitivity or true positive rate. It is the fraction of actual positive instances that the classifier correctly identifies: Recall = TP / (TP + FN).

Recall should ideally be 1 (high) for a good classifier. It becomes 1 only when the numerator and denominator are equal, i.e., TP = TP + FN, which means FN is zero. As FN increases, the denominator becomes greater than the numerator and recall decreases (which we don’t want).

So in the pregnancy example, the recall is:

Recall = 30/(30+ 10) = 0.75
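Precision and recall can be computed side by side to see how false positives affect only precision and false negatives affect only recall. This is a minimal sketch using the counts from the pregnancy example (TP = 30, FP = 5, FN = 10); returning 0.0 when the denominator is zero is an assumed convention, not something the article specifies.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 if the model made no positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

print(round(precision(30, 5), 3))  # 0.857
print(recall(30, 10))              # 0.75

# More false positives lower precision; recall is unaffected by FP.
for fp in (0, 5, 30):
    print("FP =", fp, "precision =", round(precision(30, fp), 3))

# More false negatives lower recall; precision is unaffected by FN.
for fn in (0, 10, 30):
    print("FN =", fn, "recall =", round(recall(30, fn), 3))
```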

F1 Score

F1-score is a metric that takes into account both precision and recall and is defined as follows:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

F1 score is the harmonic mean of precision and recall, so it becomes high only when both are high and equals 1 only when both are 1. On imbalanced datasets it is often a more informative measure than accuracy.

In the pregnancy example, F1 Score = 2 * (0.857 * 0.75)/(0.857 + 0.75) ≈ 0.80.
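The sketch below computes the F1 score for the pregnancy example from unrounded precision and recall, and contrasts the harmonic mean with a plain average. The function name `f1_score` and the convention of returning 0.0 when both inputs are zero are assumptions for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 30 / 35, 30 / 40          # precision and recall from the example
f1 = f1_score(p, r)
print(round(f1, 3))              # 0.8

# A classifier with perfect precision but poor recall:
# the plain average looks acceptable, the F1 score does not.
print(round((1.0 + 0.1) / 2, 3))       # 0.55
print(round(f1_score(1.0, 0.1), 3))    # 0.182
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by doing well on only one of precision or recall.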
