How Is Statistics Used in Machine Learning?

Published in CodeX · 3 min read · Apr 1, 2022

I wondered why statistics is so important when building any model in Machine Learning or Data Science, so I did some research. This blog covers the role statistics plays in ML models.

What is statistics?

Statistics is the step where we extract meaningful information from raw (messy) data by performing mathematical or statistical analysis.

Definition

Statistics is a branch of science that deals with the collection, analysis, and interpretation of data in large quantities, so that you can draw conclusions, solve various use cases, and extract meaningful information that helps your predictions with ML models.

Statistics is divided into two branches:

Descriptive
Inferential

What is Descriptive Statistics?

Descriptive statistics summarize the characteristics of a data set. They fall into two basic categories of measures: measures of central tendency and measures of variability. Measures of central tendency describe the central location of a data set, while measures of variability describe how spread out the values are.
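To make the two categories concrete, here is a minimal sketch using Python's built-in `statistics` module. The `scores` list is made-up data chosen only for illustration; it does not come from any real study.

```python
import statistics

# Hypothetical data set, invented for this illustration
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Measure of central tendency: where the data is centered
center = statistics.mean(scores)

# Measures of variability: how spread out the data is
value_range = max(scores) - min(scores)
spread = statistics.pstdev(scores)  # population standard deviation

print("mean:", center)
print("range:", value_range)
print("standard deviation:", spread)
```

The mean tells you where the values sit, while the range and standard deviation tell you how far they stray from that center.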

Inferential

Inferential statistics are used to make generalizations about a population using data from samples. This means taking a statistic from your sample data (for example, the sample mean) and using it to say something about a population parameter (for example, the population mean).
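As a rough sketch of that idea, the snippet below builds a made-up population of ages, draws a sample from it, and compares the sample mean with the population mean. The population, its size, and the sample size of 500 are all assumptions for illustration; in a real study you would not have the full population to check against.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 people (made-up data)
population = [random.randint(18, 80) for _ in range(10_000)]

# In practice we can usually observe only a sample
sample = random.sample(population, 500)

sample_mean = statistics.mean(sample)          # the statistic we can compute
population_mean = statistics.mean(population)  # the parameter we want to estimate

print(f"sample mean:     {sample_mean:.2f}")
print(f"population mean: {population_mean:.2f}")
```

The two numbers should land close to each other, which is what lets a sample statistic stand in for the population parameter.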

Population

In everyday usage, population refers to the people who live in a certain area at a specific time. In statistics, however, the population is the complete set of data you are interested in studying. It can be a group of individuals, objects, events, and so on. Your conclusions are drawn about the population.

For example, in an exit poll it is not possible to count every vote before the election ends, so the poll predicts the result from a smaller group of voters. This is where sampling comes in.

Sampling

A sample is used to estimate the preferences of the whole population when we cannot collect data from everyone, so we gather opinions from a variety of different groups instead.

Sampling Techniques

Random Sampling:

Every member of the population has an equal chance of being selected. This works quite well, but it has some drawbacks:
Overlapping
It won’t work for some specific use cases
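A minimal sketch of simple random sampling with Python's standard library is shown here. The population of ID numbers from 1 to 100 is hypothetical. Note that `random.sample` draws without replacement, so no ID can be picked twice.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ID numbers 1..100
population = list(range(1, 101))

# Draw 10 IDs; every ID has the same chance of being chosen
sample = random.sample(population, 10)

print(sample)
```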

Stratified Sampling:

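This sampling is used when you want to make sure specific groups are represented. The population is divided into groups (strata) based on a shared characteristic, and a sample is drawn from each group. For example, a beauty products company that mainly targets women would group its customers accordingly and leave out categories that are irrelevant to the study when gathering the data.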
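One common way to do this is proportional stratified sampling: split the records into groups and take the same fraction from each. The sketch below assumes a made-up customer list and a hypothetical helper named `stratified_sample`; neither comes from a real library.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical customer records: (customer_id, group); 150 women and 50 men
customers = [(i, "women" if i % 4 else "men") for i in range(1, 201)]

def stratified_sample(records, key, fraction):
    """Draw the same fraction of records from every group (stratum)."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per group
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(customers, key=lambda r: r[1], fraction=0.1)
print(len(sample), "records sampled")
```

Because each group contributes in proportion to its size, a small group cannot be missed entirely the way it might be with purely random selection.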

Systematic Sampling:

Systematic sampling is a probability sampling method in which members of a larger population are selected at a fixed periodic interval, starting from a random point.
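A small sketch of the idea, assuming a hypothetical population of 100 ID numbers and a desired sample size of 10, might look like this:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ID numbers 1..100
population = list(range(1, 101))
sample_size = 10

# Fixed interval between selected members
step = len(population) // sample_size

# Random starting point within the first interval
start = random.randrange(step)

# Take every step-th member from the starting point onward
sample = population[start::step]

print(sample)
```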

Measures of central tendency

Central tendency is a summary of the data set that you calculate using the mean, median, and mode. It tells you the most typical value, and it is also called the “central location of the data”.

Let me show an example of each.

Mean:

The mean is used when the records hold numeric values. For example, age = [33, 22, 55, 44, 55, 44, 43], mean = total age / number of records = 296 / 7 ≈ 42.3
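The same calculation in Python, using the ages from the example above:

```python
ages = [33, 22, 55, 44, 55, 44, 43]

# mean = total age / number of records
mean_age = sum(ages) / len(ages)

print(sum(ages), len(ages))  # 296 7
print(round(mean_age, 2))    # 42.29
```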

Median:

The median is the center value of the sorted data. For instance, in age = [5, 4, 11, 15, 11, 9, 90] most ages fall between 4 and 15, but because of the single value 90 the mean is about 20.7, which is not representative. To get the median, sort the values and take the center one when the count is odd; when the count is even, take (sum of the two center values) / 2. Sorted, the example is [4, 5, 9, 11, 11, 15, 90], so the median = 11.
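Here is a quick check with Python's `statistics` module, using the ages from the example above. The `even_ages` list is a made-up addition to show the even-count case.

```python
import statistics

ages = [5, 4, 11, 15, 11, 9, 90]

print(sorted(ages))                     # [4, 5, 9, 11, 11, 15, 90]
print(round(statistics.mean(ages), 2))  # 20.71, pulled up by the outlier 90
print(statistics.median(ages))          # 11, the center of the sorted values

# Hypothetical even-length list: average of the two center values
even_ages = [4, 5, 9, 11]
print(statistics.median(even_ages))     # 7.0
```

The value 90 drags the mean far above most of the ages, while the median stays near the typical value.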

Mode:

When the records hold repeated values, we pick the most frequently repeated value as the mode. For example, age = [2, 3, 5, 6, 7, 3, 3], mode = 3
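In Python this can be done by counting occurrences or by calling `statistics.mode`, using the ages from the example above:

```python
import statistics
from collections import Counter

ages = [2, 3, 5, 6, 7, 3, 3]

# Count how often each value appears and take the most common one
counts = Counter(ages)
print(counts.most_common(1))  # [(3, 3)] -> the value 3 appears 3 times

# The statistics module gives the same answer directly
print(statistics.mode(ages))  # 3
```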

This article was first published by Rushikesh Chittes here.