Pic Credit: Google | Statistics and Probability in Data Science

Part 1: Statistics and Probability in Data Science | Data Science 2020

Aman Kapri
Analytics Vidhya

Statistics in Machine Learning

This article presents the important concepts and terminology of statistics and probability that we use in machine learning.

Statistics and probability are important to machine learning because they help us understand the operations we perform on an ML problem.

Let us consider a few simple examples of why statistics and probability are necessary for ML.

If a numerical attribute in a dataset has a few missing values, we can impute them with the mean or median of that attribute. To do this, we must know what the mean, median, and mode are.
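
The imputation idea can be sketched in a few lines of Python. This is only an illustration using the standard library: the `ages` list and the `impute` helper are hypothetical names, and missing entries are assumed to be marked as `None`.

```python
from statistics import mean, median

# Hypothetical numerical attribute; None marks a missing value.
ages = [25, 30, None, 22, None, 40, 35]

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages_mean = impute(ages, "mean")      # missing entries become 30.4
ages_median = impute(ages, "median")  # missing entries become 30
print(ages_mean)
print(ages_median)
```

The fill value is computed only from the observed entries, so the missing ones do not influence it.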

If we are dealing with a classification problem, such as deciding whether a mail is spam or not, the major criterion for assigning a class is the probability of occurrence of that class. Here, we must understand the basics of probability and how conditional and joint probability work.
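
As a rough illustration of joint and conditional probability in the spam example, the sketch below works from counts. All the numbers are made up for the example, and the word "offer" is just an assumed feature.

```python
# Hypothetical counts from a labeled mailbox (made-up numbers).
total_mails = 100
spam_mails = 40
mails_with_word = 25   # mails containing the word "offer"
spam_with_word = 20    # spam mails containing the word "offer"

p_spam = spam_mails / total_mails               # P(spam)
p_word = mails_with_word / total_mails          # P(word)
p_spam_and_word = spam_with_word / total_mails  # joint: P(spam and word)
p_spam_given_word = p_spam_and_word / p_word    # conditional: P(spam | word)

print(p_spam, p_spam_and_word, p_spam_given_word)
```

With these counts, a mail containing the word is spam with probability 0.8, which is higher than the overall spam rate of 0.4.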

There are many more scenarios where we use statistics and probability in solving day-to-day ML problems.

Now, let us start with the basic concepts of statistics and probability.

What is Statistics?

Statistics is the analysis of collected data using quantified models, representations, and summaries for a given set of experimental data or real-life studies. It studies methodologies to gather, review, analyze, and draw conclusions from data.

What is Measurement?

Data can be defined in terms of measurement. Measurement is a method of assigning numbers to observations. There are four ways of doing this, known as the scales of measurement.

1. Nominal: The assigned numbers are only labels and have no numerical meaning. Any number can be assigned in any order.

E.g. Male: 1, Female: 2, Transgender: 3…
Maths: 1, Physics: 2, Chemistry: 3…

2. Ordinal: The assigned numbers are meaningful only in their order; the gaps between them are not.

E.g. Jan: 1, Feb: 2, March: 3, April: 4…

Monday: 1, Tuesday: 2, Wednesday: 3…

3. Interval: The difference between the assigned numbers has meaning.

E.g. Calendar dates, clock time, temperature in °C or °F…

Zero is an arbitrary reference point and does not mean the absence of the quantity, so ratios are not meaningful.

4. Ratio: Both the differences and the ratios of the assigned numbers have meaning.

E.g. Mass/Distance, Profit/Loss, Weight…

A true zero is defined: zero means none of the quantity.

Key points to remember:

· Nominal data can be represented in a bar chart or pie chart. It is categorical, so we cannot calculate its mean; we calculate the mode instead. Percentages can also be used to analyze the data.

· Ordinal data can be represented in a bar chart, which preserves the order of the categories better than a pie chart. Calculating the mean does not make sense for this type of data either, so we calculate the mode or the median. Percentages can also be used to analyze the data.

· Interval/Ratio data: Mean, median, and mode can all be calculated. It can be represented in a bar chart or histogram, and its spread can be visualized using a box plot.
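
The key points above can be illustrated with a small sketch that picks a summary suited to each scale. The data is hypothetical, and the "Low/Medium/High" levels are an assumed ordinal scale chosen only for the example.

```python
from statistics import mean, median_low, mode

# Nominal: the values are labels, so only the mode is meaningful.
subjects = ["Maths", "Physics", "Maths", "Chemistry", "Maths"]
nominal_mode = mode(subjects)  # "Maths"

# Ordinal: the categories have an order, so a median can be taken as well.
# The levels and responses are hypothetical survey answers.
levels = ["Low", "Medium", "High"]
responses = ["Low", "High", "Medium", "Medium", "High"]
ranks = [levels.index(r) for r in responses]   # position of each answer in the order
ordinal_median = levels[median_low(ranks)]     # "Medium"

# Ratio: arithmetic on the values is meaningful, so the mean can be used.
weights_kg = [50.0, 60.0, 70.0]
ratio_mean = mean(weights_kg)                  # 60.0

print(nominal_mode, ordinal_median, ratio_mean)
```

`median_low` is used for the ordinal case because it always returns one of the observed ranks, which can then be mapped back to a category.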

Types of Data

Data can be broadly classified into two types: categorical and numerical.

Basic Statistical Terminology

Population: The entire set of data being analyzed.
Sample: A subset of the population.

Census: Gathering data from the entire population.
Survey: Gathering data from a sample in order to draw a conclusion about the population.

Parameter: A descriptive measure of the population (population mean, population standard deviation, etc.).
Statistic: A descriptive measure of the sample (sample mean, sample standard deviation, etc.).
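
To make the parameter/statistic distinction concrete, here is a minimal sketch using Python's standard library. The population is simulated (hypothetical heights in cm), and the seed, sizes, and variable names are assumptions made for the example.

```python
import random
from statistics import mean, pstdev, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]

# A sample is a subset drawn from the population.
sample = random.sample(population, 100)

mu = mean(population)       # parameter: population mean
sigma = pstdev(population)  # parameter: population standard deviation (divides by N)
x_bar = mean(sample)        # statistic: sample mean
s = stdev(sample)           # statistic: sample standard deviation (divides by n - 1)

print(round(mu, 2), round(sigma, 2))
print(round(x_bar, 2), round(s, 2))
```

The sample values are close to the population values but not identical; that gap is what inferential statistics reasons about.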

Variable: A variable is any characteristic, such as an object, event, idea, feeling, or time period, that you are trying to measure. There are two types of variables: independent and dependent.

Independent variable(s): These are stand-alone variables that do not depend on any other variable.

· They are also known as features or predictors (in the machine learning context).

· In a 2-D graphical representation, independent variables are plotted on the X-axis.

Dependent variable: As the name suggests, this type of variable is dependent on other variables.

· It is also known as the target variable.

· In the 2-D graphical representation, the dependent variable is represented on Y-axis.
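
In practice, this distinction shows up when a dataset is split into inputs and a target before modeling. The sketch below uses a hypothetical housing table; the column names and values are invented for illustration.

```python
# Hypothetical housing records (made-up values).
rows = [
    {"area_sqft": 800, "bedrooms": 2, "price": 120},
    {"area_sqft": 1200, "bedrooms": 3, "price": 180},
    {"area_sqft": 1500, "bedrooms": 4, "price": 230},
]

feature_names = ["area_sqft", "bedrooms"]  # independent variables
target_name = "price"                      # dependent (target) variable

# X holds one list of independent values per record; y holds the targets.
X = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]

print(X)
print(y)
```

The names `X` and `y` follow the common convention of plotting independent variables on the X-axis and the dependent variable on the Y-axis.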

Descriptive Statistics: Data gathered from a group is used to reach a conclusion about that same group.

Inferential/Inductive Statistics: Data gathered from a sample is analyzed to reach a conclusion about the population.

Central Tendency

Central tendency is a descriptive summary of a dataset through a single value that reflects the center of the data distribution.

Measures of Central Tendency

Generally, the central tendency of a dataset can be described using the following measures:

  • Mean (Average): Represents the sum of all values in a dataset divided by the total number of the values.
  • Median: The middle value in a dataset that is arranged in ascending order (from the smallest value to the largest value). If a dataset contains an even number of values, the median of the dataset is the mean of the two middle values.
  • Mode: Defines the most frequently occurring value in a dataset. In some cases, a dataset may contain multiple modes while some datasets may not have any mode at all.
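
These three measures can be computed with Python's built-in `statistics` module. The lists below are arbitrary example data; `multimode` (Python 3.8+) is used to show the multiple-mode and no-clear-mode cases.

```python
from statistics import mean, median, multimode

odd = [3, 1, 4, 1, 5]       # odd number of values
even = [3, 1, 4, 1, 5, 9]   # even number of values

mean_odd = mean(odd)        # 14 / 5 = 2.8
median_odd = median(odd)    # sorted: 1, 1, 3, 4, 5 -> middle value is 3
median_even = median(even)  # sorted: 1, 1, 3, 4, 5, 9 -> (3 + 4) / 2 = 3.5

one_mode = multimode(odd)               # [1]
two_modes = multimode([1, 1, 2, 2, 3])  # [1, 2]: the dataset has two modes
all_tied = multimode([1, 2, 3])         # [1, 2, 3]: every value ties, so no useful mode

print(mean_odd, median_odd, median_even)
print(one_mode, two_modes, all_tied)
```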

Key points to remember:

· Mean and median need not be present in the dataset but the mode has to be in it.

· Mode is the only central tendency statistic that works with nominal (unordered categorical) data.

· Median works best with ordinal data.

· Although the mean is often regarded as the best measure of central tendency for quantitative data, this is not always the case. For example, the mean may not work well with quantitative datasets that contain extremely large or extremely small values.

Such extreme values (outliers) may distort the mean. In that case, consider another measure of central tendency, such as the median.
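
A quick sketch shows how a single extreme value pulls the mean while the median barely moves. The salary figures are hypothetical (in thousands).

```python
from statistics import mean, median

salaries = [30, 32, 35, 38, 40]   # hypothetical salaries, in thousands
with_outlier = salaries + [1000]  # add one extremely large value

mean_before, median_before = mean(salaries), median(salaries)       # 35 and 35
mean_after, median_after = mean(with_outlier), median(with_outlier)  # about 195.83 and 36.5

print(mean_before, median_before)
print(round(mean_after, 2), median_after)
```

The mean jumps from 35 to roughly 196, while the median only moves from 35 to 36.5.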

Box and Whiskers Plot

· A box and whiskers plot allows us to visualize the spread of the data.

· It is a standard way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.

Summary of Box Plot

Note:

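· Dividing the data into 4 groups, each containing an equal proportion (25%) of the data, gives quartiles.

· Dividing the data into 100 groups, each containing an equal proportion (1%) of the data, gives percentiles. Quantile is the general term for any such equal-proportion cut points.

· Q3 - Q1 is called the interquartile range (IQR); it contains approximately the middle 50% of the data.

· In a box plot, the whiskers usually extend no further than Q1 - 1.5*IQR (lower limit) and Q3 + 1.5*IQR (upper limit). Points beyond these limits are plotted individually as outliers, so the plotted "minimum" and "maximum" are the smallest and largest values within the limits.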
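
The quantities in the notes above can be computed with the standard library as a rough sketch. The data is arbitrary, and `method="inclusive"` is one of several quartile conventions; other tools may give slightly different quartiles for the same data.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]  # arbitrary example values

# Three cut points split the data into four equal-proportion groups.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")  # 4.25, 6.5, 8.75

iqr = q3 - q1                  # 4.5
lower_limit = q1 - 1.5 * iqr   # -2.5
upper_limit = q3 + 1.5 * iqr   # 15.5

# Values outside the limits are treated as outliers.
outliers = [x for x in data if x < lower_limit or x > upper_limit]  # [30]

# Whiskers end at the most extreme values that are still inside the limits.
inside = [x for x in data if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)  # 2 and 10

print(q1, q2, q3, iqr)
print(lower_limit, upper_limit, outliers, whisker_low, whisker_high)
```

Here the value 30 lies above the upper limit of 15.5, so it would appear as a separate point beyond the upper whisker.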

I will be posting the rest of the topics that we use in statistics and probability in upcoming articles.

Thanks for reading!

Please note that the purpose of this article is just to merge all the important topics in one place.

Information Source: Google, Medium
