MACHINE LEARNING BUZZWORDS

Aminah Mardiyyah Rufai
Published in Analytics Vidhya · 8 min read · Jul 17, 2020

Accuracy: An evaluation metric for a supervised classification machine learning model. It is the ratio of the number of correct predictions to the total number of predictions made by the model.
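
As a quick illustration, assuming scikit-learn is available (the labels below are made up), accuracy can be computed directly from actual and predicted values:

```python
from sklearn.metrics import accuracy_score

# Hypothetical actual labels and model predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Accuracy = correct predictions / total predictions
print(accuracy_score(y_true, y_pred))  # 5 correct out of 6, about 0.83
```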

Algorithm: A set of steps/sequence of instructions designed to perform a particular task, for example, the steps/procedure taken to prepare a particular dish. In the machine learning space, this is simply a description of the steps taken to solve a mathematical problem. Machine learning algorithms define how models map a set of inputs to outputs.

Yeah, that’s right. A model is quite different from an algorithm. While an algorithm is a set of instructions, a model is the result of actually carrying out those instructions on data. In technical terms, it is the mathematical representation of a process.

Let’s break that down in simpler terms. Let’s say I want to make coffee. There is a set of instructions/steps I have to follow to achieve that, right? Great! I would also require some ingredients, such as coffee beans, water, et cetera. Let’s call these ingredients plus the actual output (coffee) ‘Data’. So ideally the data has a set of inputs (ingredients) as well as the desired output (coffee). Great! Now the set of steps I would follow to actually make the coffee (getting the coffee beans, placing them inside the coffee maker, adding water, brewing, et cetera) is the ‘Algorithm’. I simply used one of the multiple ways of preparing coffee (a model). Someone else, having the same set of ingredients (data), might make a better coffee using some other method (a different model).

In summary,

Model = Algorithm + data

“Training an Algorithm with data produces a Model.”
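
A minimal sketch of this idea, assuming scikit-learn (logistic regression here is an arbitrary choice of algorithm, and the toy data is made up):

```python
from sklearn.linear_model import LogisticRegression

# Toy data: inputs (the "ingredients") and desired outputs (labels)
X = [[1, 2], [2, 1], [3, 4], [4, 3]]
y = [0, 0, 1, 1]

algorithm = LogisticRegression()  # the set of instructions
model = algorithm.fit(X, y)       # training the algorithm on data produces a model
print(model.predict([[3, 3]]))    # the model can now map new inputs to outputs
```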

Bias: This is the difference between a model’s predicted value and the actual value.

Classification: This is a sub-category of supervised Machine Learning methods where data points are grouped into classes as output, using a decision boundary to separate each class.

Binary Classification

Confusion Matrix: A measure of the accuracy and precision of a model. It is generally a performance measure for classification problems. It is computed as a table of actual versus predicted values, from which the True Positives, False Positives, False Negatives, and True Negatives are counted.

A confusion matrix
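
For illustration, assuming scikit-learn (the labels below are made up), the matrix can be computed directly from actual and predicted labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```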

Cross-Validation: A resampling/shuffling technique used for evaluating machine learning models. The training set is split into k groups/folds; the model is trained on k-1 folds and validated on the remaining held-out fold, rotating until every fold has served as the validation set, and the errors are averaged. So if, for example, the chosen value for k is 5 (i.e. k = 5), the model is trained on 4 folds and validated on the fifth. It is also called “k-fold cross-validation”.
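
A short sketch of 5-fold cross-validation, assuming scikit-learn and its bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold,
# rotating until every fold has served as the validation set once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```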

Data: Collected information/facts, measured or observed, and stored in a given format (text, images, videos).

Deep Learning: A subset of Machine Learning using the approach of neural networks. This branch works with algorithms built with the aim of mimicking the structure and function of the human brain.

Exploratory Data Analysis: Just as the name suggests, it is an ‘exploratory’ approach used to carefully analyze data for insights, exploring the characteristics of the data and seeking to make sense of the information presented. Ever heard the saying ‘Interrogate your data and it will share its secrets with you’? This is just the step for that. It is such an important step in a machine learning workflow, as it helps to build a more efficient model.
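
A few typical first EDA steps, sketched with pandas (the file name ‘data.csv’ is a placeholder for whatever dataset you are exploring):

```python
import pandas as pd

# Hypothetical dataset; replace 'data.csv' with your own file
df = pd.read_csv("data.csv")

print(df.shape)           # number of rows and columns
print(df.head())          # first few records
print(df.describe())      # summary statistics for numeric features
print(df.isnull().sum())  # missing values per column
```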

False Negative: The number of times a model incorrectly predicts or outputs a negative result when it should be positive. An example is a model giving a negative diagnosis for cancer when the prediction should have been positive.

False Positive: The number of times a model incorrectly predicts or outputs a positive result when it should be negative. Back to the cancer example, a False Positive is when the model gives a positive diagnosis for cancer when it should have been negative. A false affirmative result.

Features: These are characteristics of observed data points in a dataset. Let’s go back to the “coffee making” example earlier. In that example, a set of ingredients was required to achieve an output (coffee). The ingredients (inputs) and the coffee (output) are collectively known as data. The ingredients are equally known as the features in this example. They are the characteristics that have been observed as a basic requirement for making coffee.

Feature Engineering: This is the process of combining two or more features in a dataset to form a new feature that is useful to the model. This is generally done based on some domain knowledge about the problem statement: drawing on previous knowledge or intuition, features in a dataset are combined to form a new feature that is expected to improve the performance of the model. Sometimes it is not necessarily a combination of two or more features; a new feature can be derived from just one. Examples include deriving an Age feature from Year of Birth, or splitting a date column into different features such as month, day, year, time (in minutes, hours, or seconds), weekend, weekday, holiday, et cetera, as sketched below.
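
A small pandas sketch of both cases, using made-up records: deriving Age from Year of Birth, and splitting a date column into new features:

```python
import pandas as pd

# Hypothetical records with a year of birth and a timestamp
df = pd.DataFrame({
    "year_of_birth": [1990, 1985, 2000],
    "date": pd.to_datetime(["2020-07-17", "2020-07-18", "2020-07-19"]),
})

# Derive new features from existing ones
df["age"] = 2020 - df["year_of_birth"]            # one feature from one column
df["month"] = df["date"].dt.month                 # split the date column
df["day"] = df["date"].dt.day
df["is_weekend"] = df["date"].dt.dayofweek >= 5   # Saturday/Sunday flag
print(df)
```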

Hyperparameters: Algorithm settings, similar to tuning a radio frequency for clarity of signal. These settings have defaults built into the model; however, they can be tuned/adjusted for optimum model performance. An example is the learning rate of a model.

Learning rate: An adjustable hyperparameter that determines the size of the steps a model takes during training, i.e. the pace at which it learns.
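
A minimal sketch of the learning rate at work, using plain gradient descent on a toy function (the function and values are made up for illustration):

```python
# Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
learning_rate = 0.1  # a hyperparameter: the step size for each update
w = 0.0              # initial guess

for step in range(50):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient  # step in the direction that reduces the loss

print(w)  # approaches 3, the minimiser; a larger rate takes bigger, riskier steps
```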

Loss Function: This is a computation of the error of a model on the training set. It measures the difference between the predicted values and the actual values, for example as an average of the squared differences. The general aim of training is to minimize the loss function.
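
For example, the mean squared error loss averages the squared differences between predictions and actual values (toy numbers below):

```python
# Mean squared error, a common loss function for regression
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)
print(mse)  # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
```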

Machine learning: Machine Learning is a subset of Artificial Intelligence that provides computers the ability to learn without being explicitly programmed. It is the study of computer algorithms that gives machines the ability to perform tasks, think, learn, and improve automatically through a collection of examples.

Outliers: These are data points that show a significant difference or follow a different pattern from other data points. Outliers are very useful when studying trends and are a very important concept in fraud detection. However, take note that not all outliers are useful. Imagine having a dataset with a record for an adult who weighs -20 kg. That is neither possible nor relevant and would require some preprocessing.

Overfitting: A result of high variance and low bias. It is simply an occurrence where a model memorizes the train set and performs badly on the test set. The training error is usually low, while the test error is high.

Precision: A measure of how many of a model’s positive predictions are actually positive, computed as TP / (TP + FP); it penalizes False Positives. In certain situations, a False Positive is considered a greater risk than a False Negative.

For example, consider building a model for spam/ham classification. A False Positive is the model wrongly predicting an email to be spam when it is not. Imagine a very important email going into the spam folder. Not nice, right?!

Preprocessing: The process of transforming raw data into a form suitable for a machine learning model.

Recall: A measure of how many of the actual positives a model correctly identifies, computed as TP / (TP + FN); it penalizes False Negatives. It is the counterpart of Precision, and is most useful in cases where a False Negative is a more detrimental error than a False Positive.
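
Both metrics follow directly from the confusion-matrix counts; a quick sketch with made-up counts:

```python
# Precision and recall from confusion-matrix counts (hypothetical values)
TP, FP, FN = 40, 10, 5

precision = TP / (TP + FP)  # of everything predicted positive, how much was right
recall = TP / (TP + FN)     # of everything actually positive, how much was found

print(precision)  # 40 / 50 = 0.8
print(recall)     # 40 / 45, about 0.89
```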

Regularization: The process of adding a penalty term to the loss function to prevent overfitting in a model.
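
For instance, ridge regression adds an L2 penalty on the model weights; a small scikit-learn sketch with toy data:

```python
from sklearn.linear_model import Ridge

# Ridge regression penalizes large weights to reduce overfitting;
# alpha controls the strength of the penalty (a hyperparameter)
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.1, 1.9, 3.2, 3.9]

model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_, model.intercept_)
```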

Regression: Another sub-category of supervised machine learning methods where, rather than classes, the resulting output/prediction is a continuous value.

Train Set: The part/set of the data used for training the algorithm. It contains features mapped to sample outputs/labels.

Test Set: The part of the data used for testing how well a model has learned. The model predicts from the features alone; the held-out labels are then used to score those predictions.
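
A common way to produce these splits, assuming scikit-learn (the 80/20 split below is just a typical choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as a test set; the rest is used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 120 training rows, 30 test rows
```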

True Negative: This is the number of times a model correctly predicts the negative label/class. Say, for example, a model is required to predict the survivors of the Titanic tragedy; the True Negatives are the times it correctly identifies those who truly are not survivors, i.e. those who died and did not survive.

True Positive: This is the number of times a model correctly predicts an actual label/class. Going back to the Titanic example, the True Positives are the times the model correctly predicts the survivors.

Both True Positives and True Negatives find application primarily in classification-related problems.

Validation Set: A part of the data set aside for evaluating a model during training, typically used to tune hyperparameters and compare candidate models before the final test.

Underfitting: A result of low variance and high bias. This is an occurrence where the model did not properly learn from the train set and hence cannot make meaningful predictions. It usually occurs when the model is too simple for the problem, when there are not enough data points to train the algorithm on, or when the train and test sets were not properly randomized.

Variance: This is the degree of randomness/scatter of predicted values, i.e. how far apart they are from one another on a plot.

END NOTE:

The aim of this article was to explain some of the commonly used terms in machine learning. I hope you found it resourceful.

Thanks for reading!!
