Machine Learning Bootcamp Series- Part2: Applied Statistics

Tavva Prudhvith
10 min read · Oct 17, 2022


In the previous article, we discussed the fundamentals of Machine Learning. Now we will move on to the basics of Statistics that any ML model builds on.

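In Part-2, you will learn,

  • What Statistics is and why we need it
  • Types of variables
  • Descriptive statistics: central tendency and variability
  • Probability and conditional probability
  • Probability distributions: discrete and continuous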

Let’s dive in🤗

What is Statistics?

Statistics is a branch of science that deals with numeric data. It is mainly used to infer knowledge about a large body of data by studying a smaller part of it.

This is where the concepts of Population & Sample come in: a population is the entire set of data that you want to draw conclusions about, whereas a sample is just a subset of that population. Let’s take an example,

Population: Advertisements for IT jobs in India.

Sample: The top 50 search results for advertisements for IT jobs in India on Sep 9th, 2022

Hence, the size of the sample is always less than the total size of the population.

In research & statistics, a population doesn’t always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, etc.
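To make the population/sample idea concrete, here is a minimal sketch. The population of 1,000 numbered job ads is made up for illustration, and drawing the sample at random is an assumption on my part (the "top 50 search results" above would be a non-random sample).

```python
import random

# Hypothetical population: 1,000 job-ad IDs standing in for real advertisements.
population = list(range(1, 1001))

random.seed(42)  # fixed seed so the draw is repeatable
sample = random.sample(population, k=50)  # 50 distinct ads, drawn without replacement

print(len(population), len(sample))    # 1000 50
print(set(sample) <= set(population))  # True: every sampled ad belongs to the population
```

Whatever we compute on `sample` (an average, a proportion) is then used as an estimate for the whole `population`.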

Why do we need Statistics?

The major reason is that it helps us understand data better. When we understand the data better, we can generate better insights from it, and the deliverables we build on top of that data turn out better too.

All of this is possible when we use statistical methods to understand the data🧐.

Types of Variables in Statistics

In statistical research, a variable is defined as an attribute of study.

Example: If you want to test whether a person has heart disease or not, some key variables you might measure include age, gender, resting blood pressure, cholesterol, etc.

You need to know which types of variables you are working with in order to choose appropriate statistical tests and interpret the results of your study.

You can usually identify the type of variable by asking, “What kind of data are we dealing with?” The breakdown below summarizes the answer.

Types of Data

Data is generally divided into two categories:

  • Quantitative data represents amounts.
  • Categorical data represents groupings.

Each of these types of variable can be broken down further,

Quantitative Variables

When you collect quantitative data, the numbers you record represent real amounts that can be added, subtracted, divided, etc. There are two types of quantitative variables:

  • Discrete (aka integer variables): Counts of individual items or values. E.g.: Number of students in a class, Number of different tree species in a forest.
  • Continuous: Measurements that can take any value within a range, not just whole numbers. E.g.: Distance, Volume, Age.

Categorical variables

Categorical variables are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things. There are three types of categorical variables:

  • Binary: “Bi” means two, so these are variables with exactly 2 outcomes, such as Yes/No. E.g.: Heads/tails in a coin flip, Win/lose in a cricket game.
  • Nominal: Groups with no rank or order between them. E.g.: The colors red, blue, and green have no inherent order (it makes no sense to say red > blue), right? Hence, we say the color attribute is Nominal.
  • Ordinal: Groups that are ranked in a specific order. E.g.: Finishing place in a race, where 1st is ahead of 2nd, which is ahead of 3rd. Hence, this is Ordinal.

Descriptive Statistics

Descriptive statistics summarize and organize characteristics of a given data set, which can be either a representation of a population or a sample.

In research, after collecting data, the first step of statistical analysis is to describe characteristics of the variables, such as the average of one variable (e.g., age), or the relation between two variables (e.g., age and gender).

Inferential statistics are the next phase, which aid in determining if your data supports or contradicts your hypothesis and whether it can be applied to a broader population.

Types of descriptive statistics

  • Measures of Central tendency
  • Measures of Variability (spread)

You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more, in bivariate and multivariate analysis.

1. Measures of Central tendency

When we say measures of central tendency, we basically are talking about 3 things, the mean, median & mode. Here we will demonstrate how to calculate the mean, median, and mode using the first 6 responses of a survey.

a. Mean (M) is nothing but the average value of our population or sample.

Mean Formula: M = ΣX / N
  • Dataset: 15, 3, 12, 0, 24, 3
  • Sum of all values(ΣX) = 15 + 3 + 12 + 0 + 24 + 3 = 57
  • Total number of responses(N) = 6
  • Mean(ΣX/N) = 57/6 = 9.5

The mean is sensitive to outliers. Why? Let’s say we add a new point, 1000, to our dataset: the mean then becomes 1057/7 ≈ 151. The mean is pulled toward the largest value even though every other value is small. Hence, 1000 is called an OUTLIER, as it lies far from the rest of our data.
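You can check the effect of an outlier on the mean yourself. This is a small sketch using Python's built-in `statistics` module on the six survey responses from above; the variable names are mine.

```python
from statistics import mean

responses = [15, 3, 12, 0, 24, 3]   # the six survey responses
with_outlier = responses + [1000]   # same data plus one extreme value

print(mean(responses))                # 9.5
print(round(mean(with_outlier), 1))   # 151.0
```

A single extreme value moves the mean from 9.5 to about 151, far away from where most of the data sits.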

b. Median is the middle observation after the observations are arranged in ascending order. Because it depends only on the middle of the ordered data, it is barely affected by outliers in either direction.

  • Dataset: 15, 3, 12, 0, 24, 3
  • Ordered data set: 0, 3, 3, 12, 15, 24
  • Median = Find the mean of the two middle numbers: (3 + 12)/2 = 7.5

c. Mode is simply the most frequent response value.

  • Dataset: 15, 3, 12, 0, 24, 3
  • Mode: Find the most frequently occurring response: 3
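The median and mode calculations above can be reproduced with the standard `statistics` module. A minimal sketch (the last line adds the value 1000 purely to illustrate how little the median reacts to an extreme value):

```python
from statistics import median, mode

responses = [15, 3, 12, 0, 24, 3]

print(sorted(responses))            # [0, 3, 3, 12, 15, 24]
print(median(responses))            # 7.5  (mean of the two middle values, 3 and 12)
print(mode(responses))              # 3    (appears twice, more than any other value)

# Adding an extreme value shifts the median only to the next middle observation.
print(median(responses + [1000]))   # 12
```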

2. Measures of Variability

Measures of variability give us an idea of how spread out the data values are. Picture two distributions centered on the same value, one narrow and one wide.

Just because both have Mean = Median = Mode, that doesn’t mean the two distributions are similar, right?

The range, standard deviation, and variance each reflect a different aspect of spread in our dataset.

a. Range gives you an idea of how far apart the most extreme response scores (minimum, maximum) are.

Range = Maximum value − Minimum value

  • Dataset: 0, 3, 3, 12, 15, 24
  • Range = 24 − 0 = 24

Why isn’t range the best way to understand spread?

As we saw with the mean, even a useful measure can give a misleading picture of the dataset.

The range uses only the two most extreme values, so two datasets can share the same mean and the same range and still be spread very differently in between.

If we rely on the range alone, we might end up with a completely different interpretation. So, let’s look at the other ways of finding out spread.
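Here is a quick sketch of why the range alone can mislead. The two small datasets are invented for illustration: they share the same mean and the same range, yet one keeps its values near the center and the other pushes them to the extremes.

```python
from statistics import mean, pvariance

def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

# Hypothetical datasets, chosen so that mean and range match.
clustered = [0, 12, 12, 12, 24]   # most values sit at the center
polarised = [0, 0, 12, 24, 24]    # most values sit at the extremes

for name, data in [("clustered", clustered), ("polarised", polarised)]:
    print(name, mean(data), value_range(data), pvariance(data))
# clustered 12 24 57.6
# polarised 12 24 115.2
```

Mean and range are identical, but the variance (covered next) is twice as large for the second dataset.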

b. Variance is a measure of how far data points are spread out from the mean. The more spread out the data, the larger the variance.

Variance formula for Population & Sample

  • Population: σ² = Σ(X − µ)² / N
  • Sample: s² = Σ(X − M)² / (n − 1)

Steps to find Variance:

  1. Find the mean (M).
  2. Subtract the mean from each score to get the deviation from the mean.
  3. Square each of these deviations.
  4. Add up all of the squared deviations.
  5. Divide the sum of the squared deviations by N (for a population) or n − 1 (for a sample).

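Applied to the dataset 15, 3, 12, 0, 24, 3 (mean 9.5), steps 2 to 4 give a sum of squared deviations of 421.5. Treating the data as a sample (n − 1 = 5):

Step 5: 421.5/5 = 84.3

This variance is large relative to the values themselves, which tells us the data is widely spread around the mean.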
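The five steps can be followed line by line in code. This is a sketch on the six survey responses used earlier (15, 3, 12, 0, 24, 3); the variable names are mine, and the result is cross-checked against `statistics.variance`, which uses the sample (n − 1) formula.

```python
from statistics import variance

scores = [15, 3, 12, 0, 24, 3]

mean_score = sum(scores) / len(scores)            # Step 1: mean
deviations = [x - mean_score for x in scores]     # Step 2: deviation of each score
squared = [d ** 2 for d in deviations]            # Step 3: square each deviation
sum_squares = sum(squared)                        # Step 4: add them up
sample_variance = sum_squares / (len(scores) - 1) # Step 5: divide by n - 1 (sample)
population_variance = sum_squares / len(scores)   # Step 5 alternative: divide by N

print(mean_score)            # 9.5
print(sum_squares)           # 421.5
print(sample_variance)       # 84.3
print(population_variance)   # 70.25
print(variance(scores))      # 84.3, same as the manual sample variance
```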

c. Standard Deviation is simply the square root of the variance.

S.D = √84.3 = 9.18. What does this mean?

An S.D of 9.18 tells you that, roughly speaking, a typical score deviates from the mean by about 9.18 points.

Why S.D & What’s the Problem with Variance?

Let’s say our dataset above holds the marks of students. The variance is then expressed in Marks² (squared marks), which is like saying Ram got 20² marks, and that doesn’t make a lot of sense🤔. Taking the square root brings the value back to Marks.

That’s the reason we prefer S.D over variance, even though S.D is just the square root of the variance.
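To see the relationship between the two in code, here is a short sketch, again on the six responses 15, 3, 12, 0, 24, 3 and treating them as a sample. `statistics.stdev` is the sample standard deviation from Python's standard library.

```python
from math import sqrt
from statistics import stdev, variance

scores = [15, 3, 12, 0, 24, 3]   # think of these as marks

var_marks_squared = variance(scores)   # unit: marks squared
sd_marks = stdev(scores)               # unit: marks, same as the data

print(round(var_marks_squared, 1))         # 84.3
print(round(sd_marks, 2))                  # 9.18
print(round(sqrt(var_marks_squared), 2))   # 9.18, S.D is the square root of the variance
```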

Let’s study a few mathematical concepts before diving further.

Probability

Basically, it tells you how likely an event is to occur. We can express it as a fraction, a decimal, or a percentage.

The value is always between 0 and 1 (inclusive).

How do we calculate it?

A = the event you want to find the probability for, e.g., rolling a particular number on a die or getting heads on a coin toss.

P(A) = probability of that event = (number of outcomes in which A occurs) / (total number of equally likely outcomes).

Example: What is the probability of getting a head when a coin is tossed?

Ans: As we know, there are only 2 outcomes when you toss a coin: Heads & Tails. Hence, Total Number of outcomes = 2.

Event A = Getting a head, which is 1 of the 2 outcomes, hence P(A) = 1/2.
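The coin example boils down to counting outcomes. Below is a minimal sketch that assumes every outcome is equally likely; the `probability` helper is a hypothetical name of mine, not a library function.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(event) = favourable outcomes / total outcomes (equally likely outcomes assumed)."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

print(probability({"H"}, coin))       # 1/2
print(probability({2, 4, 6}, die))    # 1/2  (rolling an even number)
print(probability({6}, die))          # 1/6
```

Using `Fraction` keeps the answers exact, so 1/2 prints as 1/2 rather than 0.5.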

Union & Intersection

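Let’s say we have 2 events A and B, drawn as two overlapping circles (a Venn diagram),

  • Intersection (denoted with ∩): P(A∩B) is the probability that both A and B occur, i.e., the area where the two circles overlap.
  • Union (denoted with ∪): P(A∪B) is the probability that A or B (or both) occurs, i.e., the entire area covered by A and B.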
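Python sets map neatly onto these two ideas. The sketch below uses a single fair die with two events I picked for illustration (A = even number, B = number greater than 3), and also checks the standard addition rule P(A∪B) = P(A) + P(B) − P(A∩B).

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # hypothetical event: roll is even
B = {4, 5, 6}   # hypothetical event: roll is greater than 3

def p(event):
    """Probability of an event on a fair die."""
    return Fraction(len(event), len(die))

print(A & B, p(A & B))   # {4, 6} 1/3        -> intersection: both A and B
print(A | B, p(A | B))   # {2, 4, 5, 6} 2/3  -> union: A or B (or both)

# Addition rule: subtract the overlap so it is not counted twice.
print(p(A) + p(B) - p(A & B))   # 2/3
```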

Conditional Probability

This is an extension of probability, where we try to find the probability of an event based on another event that has already occurred.

Formula: P(B|A) = P(B∩A)/P(A)

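Example: Two dice are thrown simultaneously, and the sum of the numbers obtained is found to be 7 (the given event, B). What is the probability that the number 3 has appeared at least once (event A)?

Here B is the condition, so we apply the formula with the roles of A and B swapped: P(A|B) = P(A∩B)/P(B). Six of the 36 outcomes sum to 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1), so P(B) = 6/36. Two of them contain a 3: (3,4) and (4,3), so P(A∩B) = 2/36. Therefore P(A|B) = (2/36)/(6/36) = 1/3.

So, given the event B, we have calculated the probability of event A!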
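If you would rather let the computer do the counting, the two-dice example can be enumerated directly. This sketch lists all 36 equally likely outcomes and applies P(A|B) = P(A∩B)/P(B), where B is "the sum is 7" and A is "a 3 appears at least once".

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (die1, die2) pairs

B = [o for o in outcomes if sum(o) == 7]   # given event: the sum is 7
A_and_B = [o for o in B if 3 in o]         # sum is 7 AND a 3 appears

p_B = Fraction(len(B), len(outcomes))               # 6/36 = 1/6
p_A_and_B = Fraction(len(A_and_B), len(outcomes))   # 2/36 = 1/18
p_A_given_B = p_A_and_B / p_B

print(A_and_B)        # [(3, 4), (4, 3)]
print(p_A_given_B)    # 1/3
```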

Probability Distributions(P.D): Discrete & Continuous

Before that, a quick heads-up on random variables, as they play a vital role & serve as the base for Probability distributions.

A random variable is a variable (let’s say an attribute) whose value is determined by the outcome of a random process, so it can take different values each time.

There’s special notation you can use to say that a random variable follows a specific distribution:

  • Random variables are usually denoted by X.
  • The ~ (tilde) symbol means “follows the distribution”.

For example, the notation X ~ N(µ, σ²) means “the random variable X follows a normal distribution with a mean of µ and a variance of σ².”

A Probability Distribution is a mathematical function that gives the probability of each possible outcome a random variable can take.

Ex: The P.D for a discrete random variable is a list of the possible outcomes & their respective probabilities (they add up to 1🙃).

If a random variable can take only a finite (or countable) set of values (e.g., heads/tails, 0/1/2), then its P.D is called a Probability Mass Function (PMF).
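As a small illustration of a PMF, take the random variable X = number of heads in two fair coin tosses (an example I chose; it takes the values 0/1/2). The sketch below builds the table of outcomes and probabilities by enumeration and confirms that they add up to 1.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

tosses = list(product("HT", repeat=2))            # HH, HT, TH, TT
counts = Counter(t.count("H") for t in tosses)    # how many tosses give 0, 1, 2 heads

# PMF: value of X -> probability of that value
pmf = {k: Fraction(c, len(tosses)) for k, c in sorted(counts.items())}

for k, prob in pmf.items():
    print(k, prob)
# 0 1/4
# 1 1/2
# 2 1/4

print(sum(pmf.values()))   # 1
```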

A continuous variable can have any value between its lowest and highest values. Therefore, continuous probability distributions include every number in the variable’s range.

The P.D of a continuous random variable is called a Probability Density Function (PDF).

Common discrete probability distributions

  1. Binomial Distribution: 2 Possible Outcomes

A Binomial Distribution gives the probability of getting a certain number of SUCCESS outcomes when an experiment with only two outcomes (SUCCESS or FAILURE) is repeated a fixed number of times, independently and with the same success probability each time.

Ex: A coin toss has only 2 possible outcomes (Heads/Tails), and taking a test could lead to Pass/Fail.
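For the curious, here is a sketch of the binomial probability written out with the standard formula P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The formula is not given in this article, and the numbers (10 tosses of a fair coin) are my own choice for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5   # hypothetical setup: 10 tosses of a fair coin

print(round(binomial_pmf(5, n, p), 4))    # 0.2461 -> exactly 5 heads
print(round(binomial_pmf(10, n, p), 4))   # 0.001  -> all 10 heads

# The probabilities over all possible counts (0..n) add up to 1.
print(round(sum(binomial_pmf(k, n, p) for k in range(n + 1)), 10))   # 1.0
```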

  2. Poisson Distribution

A Poisson Distribution models how many times an event is likely to occur over a fixed interval of time, given a known average rate.

Ex: The number of text messages received per day.
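A sketch of the Poisson case, using the standard formula P(X = k) = λ^k · e^(−λ) / k! (not shown in this article). The average rate of 4 text messages per day is a made-up number for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average number of events per interval is lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 4   # hypothetical average: 4 text messages per day

print(round(poisson_pmf(4, lam), 4))   # 0.1954 -> exactly 4 messages in a day
print(round(poisson_pmf(0, lam), 4))   # 0.0183 -> no messages at all

# Summing over a generous range of counts gets (numerically) to 1.
print(round(sum(poisson_pmf(k, lam) for k in range(50)), 10))   # 1.0
```

Unlike the binomial case, there is no fixed upper limit on the count, which is why the check sums over a long range instead of a fixed n.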

Common Continuous probability distributions

  1. Normal Distribution(also known as Gaussian Distribution)

A Normal Distribution is symmetric around the mean, showing that the data near the mean are more frequent in occurrence than the data far from the mean.

Normal Distribution (Bell-Shaped) Curve
  • When plotted, a Normal Distribution appears as a bell curve.
  • ~68% of the observations fall within +/- 1 standard deviation of the mean, i.e., 68% of the data lies in [µ − σ, µ + σ]; similarly, ~95% of the observations fall within +/- 2 standard deviations, and ~99.7% of the observations fall within +/- 3 standard deviations.
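The 68/95/99.7 figures can be verified with `statistics.NormalDist` from Python's standard library. This sketch uses the standard normal (mean 0, standard deviation 1) as an assumption; the percentages are the same for any mean and standard deviation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of observations within +/- k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```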

Many things closely follow a normal distribution,

  • Heights of people
  • Errors in Measurement
  • Blood Pressure, etc.

Standard Scores

Conclusion

This series has been a tutorial to demonstrate Machine Learning for beginners and I’ll try to upload the articles as frequently as possible.

I hope you enjoyed it! Feel free to contact me for questions and feedback or just to share your interesting projects.

Contact me

Need help with Data Science? Contact me at prudhvithtavva@gmail.com

Would you want to get a regular feed of fascinating Data Science resources?

Follow me on👉 Linkedin 👈


Tavva Prudhvith

NLP Data Scientist @Genpact | 2X GCP Certified | Generative AI | LLMs