Introduction to Probability and Statistics

Vinay Kudari
hacking-datascience
3 min readAug 6, 2018

Why Probability and Statistics?

We live in a very uncertain world, we are unsure about our next minute. In this kind of world probability and statistics provide a rational way to deal with uncertainty.

The fastest route may not be the shortest.

It is certain to find the fastest route between point a and point b, it can be found my measuring the distance between them, while it is uncertain to find the fastest route as it depends on various factors such as traffic at that moment, time of the day and also the day.

Probability Theory

We all know that the probability of getting head and tail are equal to 1/2, So what does this actually mean? If suppose we toss a coin 1000 times we I’ll have the heads approximately about 500 times. Let me simulate this using python code.

histogram of generate_counts()

Head is counted as +1 and tail as -1 so if three times head and 2 times tail showed up then the sum is 1, if the no of tail and heads are same, the sum is o, from the histogram we can say that sum is ~0 for maximum no of times.

                  sum = x1 + x2 + x3 + ...... xk

After doing some stimulations of summing k numbers, xi = 1 with probability 1/2 and xi = -1 with probability 1/2 we see that the sum is almost always in the range [-4√k, +4√k] as k tends to a large number sum/k tends to 0.

What is Statistics?

Statistics is about analyzing the real-world data and drawing conclusions. For example if you would like to predict which party is going to win elections one way is to ask all potential voters about their opinion, but it would be extremely expensive instead we could select random unbiased group and take their opinions, now this exactly equivalent to coin problem.

Frequentist Statistics draw conclusions from repeating the event multiple number of times under same conditions and comparing the frequency of each outcome, just like flipping a coin. The result entirely depends on the number of times the event is repeated.

Bayesian Statistics comes into the play where we cannot perform multiple experiments under same set of conditions, like in diagnosing the patient. He might have unique set of medical conditions which the doctor might not have encountered in the past.

Homework Notebook can be found here. 😄

--

--