Analytics Vidhya
Published in

Analytics Vidhya

Probability & Statistics : Bed Rock Of Machine Learning

P.S Don’t this to yourself

Machine Learning is an interdisciplinary field that uses statistics, probability, algorithms to learn from data and provide insights which can be used to build intelligent applications

Probability of events A and B denoted byP(A and B) or P(A ∩ B)is the probability that events A and B both occur. P(A ∩ B) = P(A). P(B) . This only applies if Aand Bare independent, which means that if Aoccurred, that doesn’t change the probability of B, and vice versa.

Prob(X=x, Y=y)

“Probability of X=x and Y=y”

p(x, y)

Let us consider A and B are not independent, because if A occurred, the probability of B is higher. When A and B are not independent, it is often useful to compute the conditional probability, P (A|B), which is the probability of A given that B occurred: P(A|B) = P(A ∩ B)/ P(B).

The probability of an event A conditioned on an event B is denoted and defined P(A|B) = P(A∩B)/P(B)

Similarly, P(B|A) = P(A ∩ B)/ P(A) . We can write the joint probability of as A and B as P(A ∩ B)= p(A).P(B|A), which means : “The chance of both things happening is the chance that the first one happens, and then the second one given the first happened.”

Prob(X=x|Y=y)

“Probability of X=x given Y=y”

p(x|y) = p(x,y)/p(y)

Bayes’ Theorem

It is one of the most important formula in probability theory.

Family of probability distributions

  • Many of the standard distributions belong to this family — Bernoulli, binomial/multinomial, Poisson, Normal (Gaussian) etc.

Populations and samples

A population is thus an aggregate of creatures, things, cases and so on. For instance population of any town or when we need to calculate average height of any city we will calculate sum of everyone’s height and then divide by number of persons.

It is not possible to get height detail of each and every person so we will calculate average height by taking few people’s height.

As we increase size of our sample, sample mean comes closer to population mean.

Random Variables

A random variable is a measurable function which can be of two types →

  • Discrete Random Variable — When there are only whole numbers not floating numbers. For example: Number of students in a class.
  • Continuous Random Variable — It can be anything within a specific range.

Outliner

An outlier is a data point in a data set that is distant from all other observations.

What is the reason for an outlier to exists in a dataset?

  1. Variability in the data
  2. An experimental measurement error

What are the impacts of having outliers in a dataset?

  1. It causes various problems during our statistical analysis
  2. It may cause a significant impact on the mean and the standard deviation

Various ways of finding the outlier.

  1. Using scatter plots
  2. Box plot
  3. using z score
  4. using the IQR interquantile range

Lets see how to use z score !!

Z score = (Observation — Mean)/Standard Deviation

z = (X — μ) / σ

def detect_outliers(data):
outliers=[]
threshold=3
mean = np.mean(data)
std =np.std(data)


for i in data:
z_score= (i - mean)/std
if np.abs(z_score) > threshold:
outliers.append(y)
return outliers

Normal Distribution

Also known as Gaussian Distribution

Most of the data in the world are in normal distribution. A normal distribution, sometimes called the bell curve, is a distribution that occurs naturally in many situations. For example, the bell curve is seen in tests like the SAT and GRE. The bulk of students will score the average while smaller numbers of students will score a B or D. An even smaller percentage of students score an F or an A.

These are some real life examples of normal distribution:

  • Heights of people.
  • Measurement errors.
  • Blood pressure.
  • Points on a test.
  • IQ scores.
  • Salaries.

The empirical rule tells you what percentage of your data falls within a certain number of standard deviations from the mean:
• 68% of the data falls within one standard deviation of the mean.
• 95% of the data falls within two standard deviations of the mean.
• 99.7% of the data falls within three standard deviations of the mean.

Log Normal Distribution

A continuous distribution in which the logarithm of a variable has a normal distribution.

In simple words if log x is normally distributed. For instance income of people.

Covariance

Covariance and Correlation are very helpful in understanding the relationship between two continuous variables. Covariance tells whether both variables vary in the same direction (positive covariance) or in the opposite direction (negative covariance).

Learn about Statistics more in

Thanks For Reading :)

--

--

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store