Machine Learning

Nibesh Khadka · Published in Analytics Vidhya · Oct 14, 2020 · 4 min read

Understand common terms in ML.

Well hello there, today I am trying to define common terms in machine learning. It's a rudimentary, novice-level piece, so please skip it if that's not where you fit in.

What is Machine Learning?

Machine learning (ML) is a field of Artificial Intelligence (AI) that is concerned with teaching machines to learn to perform a specific task.

How do Machines Learn?

Keep in mind that "machine" here means a mathematical algorithm or a computer system, not the Terminator.

Have you ever heard the saying "History repeats itself!"? Well, it's true: there is a pattern to how things happen around us. With an accurate problem definition and relevant data, we can use machine learning algorithms to find patterns in the data. The algorithms then predict an output based on past events. Now, you might be wondering whether we can predict with 100% accuracy. I would say no, but we can get quite close; after all, we are still predicting.

Now let's define some keywords that are quite common on Data Science / Machine Learning tongues.

Data: Data is just information in its raw, unrefined form. A machine learning team first spends a significant amount of time collecting the right data. The amount of data required is not always known from the beginning, but the more the merrier. What data you want to collect depends strictly on the problem you are trying to solve.

Data-frame: A structure that holds data in tabular form, with rows, columns and headers (sometimes absent).

Features/Variables: Features and variables are used interchangeably in ML. Basically, a feature is a column. It represents a specific property of the data. For instance, age, gender, name and address are all separate feature columns.

Observations: They are the rows in a data-frame. Each observation represents a unique sample. For instance, say we have the information of a person named Max, who is 29 years old, male and lives in Amsterdam. That can be taken as one observation, or one row in the data-frame, as in the sketch below.
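To make data-frame, feature and observation concrete, here is a minimal sketch using pandas (the second row, Eva, is made up for illustration):

import pandas as pd

# Each key is a feature/variable (a column); each row is one observation.
df = pd.DataFrame({
    "name": ["Max", "Eva"],
    "age": [29, 34],
    "gender": ["male", "female"],
    "city": ["Amsterdam", "Berlin"],
})

print(df)                # two observations (rows)
print(list(df.columns))  # the features: name, age, gender, city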

What are train, test and validation data?

Train data is the data that is separated out and used only for training the ML model.

Validation data is the data that's used to tune and optimize the trained model.

The test set is a small chunk of data that's been set aside for the final evaluation of the model. The test set is not used for optimization.

In ML, we are attempting to predict future, unseen instances; if we train, validate and test the model on the same data, our model will fail miserably in production. Hence, splitting the data is essential. It is also a good measure to use a test set that imitates future datasets, though of course that is not always possible.
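A minimal sketch of splitting with scikit-learn; the 60/20/20 ratio here is just a common choice, not a rule:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 dummy observations, 2 features
y = np.arange(50)                  # dummy targets

# First carve out the test set, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10 -> a 60/20/20 split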

Error: At a high level, error is the difference between the real value and the value predicted by the model. That plain difference is the raw error; the error that actually gets optimized is usually a modified version of it. Some common metrics for measuring error are Mean Squared Error (MSE), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
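Here is a minimal sketch of those three metrics with numpy, on made-up values:

import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])  # real values (made up)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # model predictions (made up)

raw_error = y_true - y_pred              # raw error: can be negative
mae = np.mean(np.abs(raw_error))         # Mean Absolute Error
mse = np.mean(raw_error ** 2)            # Mean Squared Error
rmse = np.sqrt(mse)                      # Root Mean Squared Error

print(mae, mse, rmse)  # 0.875 1.3125 1.1456...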

What's the difference between raw error and actual error?

During error calculation and optimization, as the names of the error metrics suggest, the errors are either squared or converted to positive values. Squaring results in a magnification of the error: big errors get bigger, while errors smaller than 1 get smaller.

For instance,
0.5² = 0.25 < 0.5, while 5² = 25 > 5

The error is magnified because doing so makes the error-reduction process focus more on the larger errors than on the smaller ones.

Bias: An error due to pre-assumptions about the data; those assumptions are the bias. It's the situation where the algorithm fails to capture the main essence of the data, and it results in underfitting.

Detect Underfitting: When the training error as well as the test error is high, the model is underfitting.

Variance: A situation where the model captures each and every bit of information, or rather the noise in this case, so that a slight change in the data results in a drastic change in the output. Variance results in overfitting. It's also known as sensitivity.

Detect Overfitting: The training error is usually lower than the test error. However, when the margin between them is really high, we should consider the possibility that the model is overfitting.

In short, if your output is inconsistent/scattered and incorrect, it's variance; if the output is consistent but wrong, it's bias. Both patterns show up in the sketch below.
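A minimal sketch, assuming scikit-learn, on noisy sine data: a degree-1 polynomial should underfit (high bias, both errors high), while a degree-15 polynomial should overfit (high variance, test error far above train error):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # Degree 1: train and test error both high -> underfitting (bias).
    # Degree 15: train error low, test error much higher -> overfitting (variance).
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),
          mean_squared_error(y_te, model.predict(X_te)))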

Bias-Variance Trade-off: In an ideal world we would reduce both bias and variance for a perfect result. In the real world that's not always possible, so we aim to reduce the test error as much as we can. Hence, we end up trading higher bias for lower variance, or vice versa.

Fixing Bias: Adding new features can help improve an underfitting model.

Fixing Variance: Decreasing the number of features or increasing the amount of training data can help fix overfitting, as in the sketch below.
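One common way to drop features is univariate selection; a minimal sketch with scikit-learn, where the synthetic dataset and the choice of k=5 are arbitrary illustrations:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# 100 observations, 20 features, only 5 of which actually carry signal.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=0)

X_small = SelectKBest(f_regression, k=5).fit_transform(X, y)
print(X.shape, "->", X_small.shape)  # (100, 20) -> (100, 5)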

Noise/Outliers: Outliers can be considered anomalies. In simple terms, they are awkward values in a list.

In ML, a prediction is the output of mathematical operations between the input features and their corresponding weights, so the values of the inputs make a huge difference. For instance:

values = [2, 3, 4, 5, 6, 10000]       # 10000 is an outlier
print(sum(values))                    # 10020
print(sum(values) / len(values))      # 1670.0

values_2 = [2, 3, 4, 5, 6]            # outlier removed
print(sum(values_2))                  # 20
print(sum(values_2) / len(values_2))  # 4.0

See the difference? A single outlier can alter the result significantly. This is why outliers should be detected and removed.
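Detection itself can be as simple as the IQR rule, sketched here; the 1.5 × IQR cutoff is a common convention, not a law:

import numpy as np

values = np.array([2, 3, 4, 5, 6, 10000])
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Keep only values within 1.5 * IQR of the quartiles.
mask = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
print(values[mask])  # [2 3 4 5 6] -- the 10000 is flagged as an outlier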
