Statistics Simplified for data science

Harika kanthi P
15 min read · Jul 22, 2020


Hello everyone!!

Welcome to this session of simplified statistics. These statistical measures are used in data analysis to extract knowledge from data.

Statistics plays a crucial part in understanding data. Much of the field was developed between the 17th and 19th centuries, and statistical thinking gives data science its tools for organising and analysing data. Vast amounts of data are being generated, and it is the researcher's job to make sense of them. Statistics helps us answer the main question: "What does the data tell us?"

Statistics can be divided into three categories:

· Descriptive statistics: The statistical measures that describe the data. Often used to present results.

· Inferential statistics: The statistics that use descriptive measures to draw inferences about the data and the patterns in it.

· Predictive statistics: The statistics that help predict an event based on inferences drawn from inferential and descriptive statistics.

Main terminologies in statistics:

  • Population: The complete set of observations that belong in the study based on the criteria chosen.

e.g. people of age 20 in Hyderabad who smoke

It includes every person who is 20 years old, lives in Hyderabad and smokes.

  • Sample: The part of a population that has been extracted or selected so that it represents the population, i.e. it follows the same criteria as the population. It is selected because observing the entire population is usually impractical.

e.g. Continuing the population example, where the criteria are people of age 20 from Hyderabad who smoke.

Here we select a part of the population, say 20,000 people, who are of age 20, from Hyderabad and smoke.

  • Parameter: The characteristics of a population are called parameters.

e.g. mean of population is denoted as “μ “

standard deviation is denoted as “σ “

  • Statistic: The characteristics of a sample are called statistics.

e.g. mean of a sample is denoted as “x̄”

standard deviation is denoted as “s”

  • Variable: A named placeholder that stores a value; in statistics, a characteristic that can take different values.

e.g. x = 5 here, x is called the variable.

  • Experiment: An activity which produces a result is called an experiment or survey.

e.g. tossing a coin has a result of head or tail.

The tossing is called the experiment.

  • Outcome: The result of an experiment is called an outcome.

e.g. When a coin is tossed, and we get a head.

The head is the outcome of the experiment.

  • Frequency: The count of observations in a category or interval.
  • Outliers: Values that are inconsistent with the rest of the data are called outliers.

e.g. if manual labourers earn between 10 and 50 at most and can never earn more, but the data shows a person with an income of 1000, that value is an outlier.

  • Extreme values: Values that are very low or very high are called extreme values. They are not outliers, as they stay consistent with the rest of the data.

e.g. if manual labourers in general earn anywhere between 10 and 50, but one person earns 55, that would be an extreme value.

SAMPLING METHODS

The methods used to select the data from the population.

A sample frame is the set of elements of a population from which we extract the sample.

A sample design is the procedure used to select elements from the sample frame.

The selection methods can be of two different types in general:

1. Probabilistic sampling

The sampling where each observation has a known, non-zero probability of selection.

2. Non-probabilistic sampling

The sampling where observations do not share an equal probability of being a part of the sample.

Probabilistic Sampling

The sampling methods can be further classified into

  • Simple Random Sampling

In this method of sampling, we randomly select the sample with no procedure beyond pure chance.

Advantages: very easy, highly random.

Disadvantages: might not be a fair representation of the population.

  • Systematic Sampling

In this method of sampling, we select the sample using a system, such as selection at regular intervals.

e.g. selecting every 10th element in the data.

Advantages: easy, efficient

Disadvantages: if any underlying pattern (a cycle or repetition) exists, it is not a good method, as the sample will be biased.

For instance, if every 10th observation has diabetes, then the whole sample will consist of diabetic people.
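Both methods can be sketched with Python's standard library; the population of 100 numbered observations below is purely illustrative:

```python
import random

population = list(range(1, 101))  # a hypothetical population of 100 observations
random.seed(42)                   # fixed seed so the sketch is reproducible

# Simple random sampling: every observation has an equal chance of selection
simple_sample = random.sample(population, 10)

# Systematic sampling: every 10th element, starting from a random offset
start = random.randrange(10)
systematic_sample = population[start::10]

print(len(simple_sample), len(systematic_sample))  # 10 10
```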

  • Stratified Sampling

This method first groups the data into strata based on similar characteristics. Data is then selected randomly from each stratum.

Advantages: reduction in bias.

Disadvantages: requires prior knowledge to create the strata.

  • Quota Sampling

An extension of stratified sampling. It divides the data into strata, then selects data from each stratum in the same proportion as that stratum bears to the population.

Also called proportional sampling.

e.g. Maths, Science, Social have 30%, 45% and 25% in the population.

From the 3 strata created, of maths 30% is selected, of science 45% and of social 25%.
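A sketch of this proportional selection, using a hypothetical population of 1,000 records tagged by subject:

```python
import random

# Hypothetical population: 30% Maths, 45% Science, 25% Social
population = ["Maths"] * 300 + ["Science"] * 450 + ["Social"] * 250
sample_size = 100

random.seed(0)
sample = []
for subject in ("Maths", "Science", "Social"):
    stratum = [x for x in population if x == subject]
    share = len(stratum) / len(population)   # the stratum's share of the population
    k = round(sample_size * share)           # keep the same proportion in the sample
    sample.extend(random.sample(stratum, k))

print(sample.count("Maths"), sample.count("Science"), sample.count("Social"))  # 30 45 25
```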

  • Cluster Sampling

This method is also an extension of stratified sampling. After the groups (clusters) are created, it selects samples from only a few of them, not all.

Advantages: efficiency,

Disadvantages: Sampling and bias errors

TYPES OF DATA

The data can be numerical or categorical in general.

Quantitative (Numerical): Data which is numerical in nature is called quantitative data. It can be further categorised into discrete and continuous.

Discrete: The data can take only integer values, for instance the count of people born on a day. It cannot be a decimal and is countable.

Continuous: The data can take any value within an interval. For instance, the height of a person can lie anywhere in an interval.

Qualitative (Categorical): Data which comes in the form of levels or categories is called qualitative data.

Nominal: The data does not have an order associated with it. It can be binomial or multinomial.

Binomial is when there are only two categories in the data.

e.g. True/False, 0/1, male/female

Multinomial is when there are more than two categories in the data.

e.g. petrol/ diesel/ gas

Ordinal: The data with categories where the categories have an order associated with them.

e.g. small/ medium/ large

DESCRIPTIVE STATISTICS

The data can be described using different measures to understand the data.

They can further classify as

1. Measures of central tendency

2. Measures of dispersion

Measures of Central Tendency

The measures of central tendency describe where the central part of the data lies.

  • Mean

The mean is also called the average. It is the balance point of the data.

It is given as the sum of observations divided by the number of observations.

Mean = sum of observations / number of observations

It is a calculated field.

The mean is very sensitive to extreme values and is therefore not always the best metric for understanding the data.

13, 18, 13, 14, 13, 16, 14, 21, 13

Mean = (13+18+13+14+13+16+14+21+13) / 9 = 15
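A quick check of this arithmetic with Python's statistics module:

```python
from statistics import mean

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

print(sum(data) / len(data))  # 15.0, by the definition above
print(mean(data))             # 15
```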

  • Median

The median is the middle most value in the distribution of the data.

Sort the data in ascending order

If the number of observations is odd,

Median = (n+1)/2 th observation

If the number of observations is even,

Median = average of the (n/2)th and (n/2 + 1)th observations

13, 18, 13, 14, 13, 16, 14, 21, 13

13, 13, 13, 13, 14, 14, 16, 18, 21

n = 9

Median = (9+1)/2 = 5th observation = 14

It is a positional field.

The median is also affected by outliers and extreme values, but far less than the mean. Hence it is a better metric in situations where the mean is misleading.
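The positional rule for odd n can be checked directly; statistics.median does the same sort-and-pick internally:

```python
from statistics import median

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

ordered = sorted(data)               # sort the data in ascending order
n = len(ordered)                     # n = 9, odd
manual = ordered[(n + 1) // 2 - 1]   # the (n+1)/2-th observation; -1 for 0-based indexing

print(manual, median(data))  # 14 14
```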

  • Mode

The most frequent value in the data is called the mode. It is typically used for categorical features and less often for numerical ones.

Unimodal: The data with only one mode

Bimodal: The data with two modes

Multimodal: The data with more than two modes

13, 18, 13, 14, 13, 16, 14, 21, 13

13: 4

14: 2

16: 1

18: 1

21: 1

Mode = 13

It is neither calculated nor a positional field.

The mode is not affected by outliers or extreme values, unlike the mean or median.
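The frequency count above is exactly what collections.Counter computes:

```python
from collections import Counter

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

counts = Counter(data)                  # value -> frequency
value, freq = counts.most_common(1)[0]  # the most frequent value and its count

print(value, freq)  # 13 4
```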

Measures of dispersion

The measures of dispersion describe the spread of data that is it helps understand how the data is distributed

  • Range

The range is the difference between minimum and the maximum values in the data.

Range = Maximum - Minimum

13, 18, 13, 14, 13, 16, 14, 21, 13

Range = 21 - 13 = 8

It is affected by outliers in the data, and hence is not considered a good measure for understanding the spread.
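A one-line check of the range (note that 21 - 13 = 8):

```python
data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

data_range = max(data) - min(data)
print(data_range)  # 8
```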

  • Deviation

The deviation is the difference between an observed value and an estimate of location, usually the mean. Deviations are also called errors or residuals.

Deviation = Observed value - Estimate

13, 18, 13, 14, 13, 16, 14, 21, 13

Mean = 15

13–15 = -2

14–15 = -1

16–15 =1

18–15 =3

21–15 =6

  • Standard deviation

This measures how far the observations deviate from the mean.

Standard deviation of a population: σ = √( Σ(x − μ)² / N )

Standard deviation of a sample: s = √( Σ(x − x̄)² / (n − 1) )

The standard deviation is also affected by outliers, since it is calculated from the deviations, which are themselves affected by outliers.

We use n − 1 rather than n for sample measures to account for the degrees of freedom. Using n would underestimate the statistic; n − 1 gives a better estimate. The degrees of freedom are the number of observations that can vary freely: once the sample mean is fixed, only n − 1 observations can move freely.

  • Variance:

The variance also measures the spread of the data: it is the average squared deviation of the data points from the mean.

Variance = standard deviation ^ 2
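The n versus n - 1 distinction is visible in Python's statistics module, which provides population (p-prefixed) and sample versions of both measures:

```python
from statistics import pstdev, pvariance, stdev, variance

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

# Population versions divide by n; sample versions divide by n - 1
print(round(pvariance(data), 3), round(variance(data), 3))  # 7.111 8
print(round(pstdev(data), 3), round(stdev(data), 3))        # 2.667 2.828
```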

  • Quartile:

The data, when divided into 4 equal parts, gives us the quartiles. Quartiles are one way of detecting extreme values in the data. Every quartile has an equal number of observations: each contains 25% of them.

1st quartile = (n+1)/4 th observation

2nd quartile = Median = (n+1)/2 th observation

3rd quartile = 3(n+1)/4 th observation

  • Inter quartile range

The interquartile range is the difference between the third and the first quartiles. It tells us where the middle 50% of the data lies.

IQR = 3rd quartile - 1st quartile

It is one of the standard methods for flagging extreme values in the data.

The limits are given as

Lower limit = 1st quartile - 1.5*IQR

Upper limit = 3rd quartile + 1.5*IQR

Any value below the lower limit or above the upper limit is considered an outlier.
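A sketch using statistics.quantiles, whose default ("exclusive") method matches the (n+1)-style positions used above:

```python
from statistics import quantiles

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

q1, q2, q3 = quantiles(data, n=4)   # three cut points dividing the data into 4 parts
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(q1, q2, q3, iqr)                     # 13.0 14.0 17.0 4.0
print(lower_limit, upper_limit, outliers)  # 7.0 23.0 []
```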

  • Skewness

The skewness is the measure of the asymmetry of the data.

Any data can have different kinds of distributions like normal, binomial etc.

In the case of a normal distribution, the skewness = 0, since there is symmetry around the mean.

Skewness = 3(Mean-Median) / standard deviation

Skewness can be of two types:

Right skewness / Positive skewness:

skewness>0

Mean > Median > Mode

It happens when there are few extreme values on the right tail of the data.

Left skewness / Negative skewness:

Skewness<0

Mean < Median < Mode

It happens when there are few extreme values on the left side of the data

No skewness:

Skewness =0

Mean = Median = Mode

It happens when data is normally distributed
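Pearson's formula above can be evaluated for the running example; the data is right-skewed because of the larger values 18 and 21:

```python
from statistics import mean, median, pstdev

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

# Skewness = 3 * (Mean - Median) / standard deviation
skew = 3 * (mean(data) - median(data)) / pstdev(data)

print(round(skew, 3))  # 1.125, i.e. positive / right-skewed
```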

  • Kurtosis

It is a measure of the tailedness or peakedness of a distribution. It tells us what kind of tails the data's distribution has.

Positive kurt: Leptokurtic:

Kurt >3

Excess kurt >0

Excess kurt =kurt -3

When the data is leptokurtic, most of the data lies very near the mean; in other words, the standard deviation is low.

Lepto means skinny, and the distribution looks like a tall, narrow peak.

It indicates heavy tails: extreme values or outliers occur more often.

Mesokurtic: No kurt:

Kurt = 3

Excess kurt =0

When the data is mesokurtic then the data lies symmetric around the mean, in other words it is normally distributed.

It indicates that there are no outliers or extreme values in the data

Platykurtic: Negative kurt:

Kurt < 3

Excess kurt <0

When the data is platykurtic, the data lies farther from the mean; in other words, the standard deviation is high.

Platy means broad, and hence the distribution looks flat and wide.

It indicates light tails: extreme values or outliers occur less often.

  • Covariance

The covariance is a measure of the direction in which two variables move with respect to each other.

It only tells us the direction of movement of the variables, not its strength.

Its magnitude is not interpretable, as it depends on the units of the variables.

It can be positive or negative.

When covariance > 0, an increase in one variable accompanies an increase in the other.

When covariance < 0, an increase in one variable accompanies a decrease in the other.

  • Correlation

Also called the correlation coefficient.

This is a measure of both the direction and the strength of the movement of the variables with respect to each other.

Correlation is preferred to covariance as it is interpretable.

It always lies between -1 and 1

If correlation =0 then there is no relation between the variables

If correlation =1 then with a unit increase in one variable there is a unit increase in another variable.

If correlation =-1 then with a unit increase in one variable there is a unit decrease in another variable.

Correlation does not mean causation.

Causation means an event happens because another event happens, i.e. there is a pre-established cause-and-effect relationship.

Correlation only implies that the variables move together statistically, not that one causes the other.
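A sketch computing both measures by hand for two hypothetical variables (the lists x and y here are made-up data):

```python
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # moves with x

n = len(x)
mx, my = mean(x), mean(y)

# Covariance: average product of paired deviations (population form)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

# Correlation: covariance scaled by both standard deviations, so it lies in [-1, 1]
sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
corr = cov / (sx * sy)

print(cov)             # 4.0 -> positive: they move together
print(round(corr, 6))  # 1.0 -> perfect linear relationship
```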

PROBABILITIES

A branch of mathematics that helps us understand events and how likely they are. It is used along with statistics to gain more insight into the data.

TERMINOLOGIES

  • Experiment: An experiment is a procedure that can be conducted repeatedly.

e.g. tossing a coin, rolling a die

  • Event: Something that either happens or does not happen as a result of the experiment.

e.g. event = getting head

  • Outcome: The result of an experiment is called an outcome.

e.g. head on tossing a coin, 1 on rolling a dice

  • Sample space: The collection of all possible outcomes of an experiment is called the sample space.

e.g. sample space = {head, tail} for tossing a coin

sample space = {1,2,3,4,5,6} for rolling a dice

  • Impossible events: An outcome which is not possible in the experiment.

e.g. getting a 7 when a die is rolled

  • Success: when an outcome matches our expectation, it is a success
  • Failure: when an outcome is contradictory to our expectation, it is a failure.
  • Probability: The chance of an event happening.

It is calculated as the number of favourable outcomes over the total number of possible outcomes.

Probability = number of favourable outcomes / total number of outcomes

It always lies between 0 and 1.

Probability = 0 when the event is impossible (no successes)

Probability = 1 when the event is certain (all successes)

Area under the probability curve is 1

sum of all probabilities =1

e.g. probability of getting 2 when we roll a dice

sample space ={1,2,3,4,5,6}

expectation =2

probability = number of 2's / size of sample space = 1/6

e.g. tossing a coin

probability(head) + probability(tail) = total probability

total probability = 0.5 +0.5 = 1

Basic probability rule:

P ( A ∪ B ) = P ( A ) + P ( B ) − P ( A ∩ B )

TYPES OF PROBABILITIES

  • Independent events: Events A and B do not depend on each other. Selecting with replacement makes the events independent.

P ( A ∩ B ) = P ( A ) * P ( B )

The probability stays the same throughout; the outcome does not change depending on any earlier action.

e.g. getting a head the 5th time we toss a coin.

No matter how many times a coin is tossed, the probability of getting a head remains the same: 1/2 = 0.5

  • Mutually exclusive events: Events A and B cannot happen at the same time. For instance, getting both a head and a tail on a single coin toss is impossible; likewise, turning left and turning right cannot be done at the same time.

P ( A ∩ B ) =0

P ( A ∪ B ) = P ( A ) + P ( B )

e.g. probability of getting king and queen from the deck of cards when we select a card.

P( king and queen)=0

probability of getting queen or king from a deck of cards when we select a card

P( king or queen) = P(king) + P(queen) = 4/52 + 4/ 52 = 2/13 = 0.154
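The card computation, using exact fractions:

```python
from fractions import Fraction

# A standard 52-card deck has 4 kings and 4 queens
p_king = Fraction(4, 52)
p_queen = Fraction(4, 52)

# Mutually exclusive: a single card cannot be both, so P(A ∩ B) = 0
p_king_or_queen = p_king + p_queen

print(p_king_or_queen, round(float(p_king_or_queen), 3))  # 2/13 0.154
```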

  • Dependent events: Events A and B depend on each other; that is, each event is affected by previous events. Selection without replacement usually produces dependent events.

e.g. selecting a card from the deck of cards after selecting a card.

P(1st selection) = 1/52

P(2nd selection) = 1/51

Since a card has already been picked, the sample space shrinks, so the second selection is affected by the first.

  • Conditional probability: The probability of an event given the outcome of another event is called conditional probability. That is, the probability of the event happening provided a condition is satisfied.

P(B|A) = P(A and B) / P(A)

e.g. 80% of the students like mathematics, and 40% like both probability and mathematics. What is the probability that a student likes probability given they like mathematics?

P(p|m) = P(p and m) / P(m) = 0.4 / 0.8 = 0.5

  • Posterior probability: The probability of an event given prior evidence.

P(C|X) = P(X|C)*P(C) / P(X)

This is Bayes' theorem, and it plays a major role in the naive Bayes algorithm.

e.g. Of 100 people, 70 are male and 30 are female. Among the males, 35 wear a sweater and 35 do not. Among the females, 20 wear sweaters and 10 do not.

probability that a person is male given they wear a sweater:

P(M|S) = P(M) * P(S|M) / P(S)

P(S) = (35 + 20) / 100 = 0.55

P(M) = 70/100 = 0.7

P(S|M) = P(M and S) / P(M) = 35/70 = 0.5

P(M|S) = 0.7*0.5/0.55 = 0.636
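The sweater example as a short calculation:

```python
# Bayes' theorem on the sweater example: P(M|S) = P(S|M) * P(M) / P(S)
p_m = 70 / 100          # P(male)
p_s = (35 + 20) / 100   # P(sweater), counting both groups
p_s_given_m = 35 / 70   # P(sweater | male)

p_m_given_s = p_s_given_m * p_m / p_s
print(round(p_m_given_s, 3))  # 0.636
```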

DISTRIBUTIONS

A mathematical function that describes how the data is distributed. We can plot distributions using graphs where the x-axis is the variable and the y-axis is the frequency.

TYPES OF DISTRIBUTIONS:

The most used distributions in data science are:

  1. Binomial distribution
  2. Poisson distribution
  3. Normal distribution

Binomial distribution

This is the distribution of the number of successes in a finite series of events, where the probability of success is constant for each event and the events are independent of prior events.

  • Two outcomes: Success or Failure
  • Finite set of events
  • Constant probability of success
  • independent events

This distribution follows the binomial theorem.

b(x, n, p) = nCx * p^x * (1-p)^(n-x) for x = 0, 1, 2, …, n

f(x, n, p) = ( n! / ((n-x)! x!) ) * p^x * (1-p)^(n-x)

x= number of successes

n = total number of events

p = probability of success

Probability mass function — PMF

The probability of getting exactly x successes out of n events is called a probability mass function

b(x, n, p) = nCx * p^x * (1-p)^(n-x)

e.g. probability of getting exactly 4 heads out of 10 tosses.

b(4, 10, 0.5) = (10! / (6! 4!)) * (0.5)⁴ * (0.5)⁶

Cumulative distribution function — CDF

The probability of getting x or fewer successes, i.e. at most x successes out of n events.

e.g. probability of getting at most 3 heads out of 10 tosses.

cdf(3,10,0.5) = b(0,10,0.5)+b(1,10,0.5)+b(2,10,0.5)+b(3,10,0.5)

Survival function — SF

The probability of getting at least x successes, i.e. x or more successes out of n events.

e.g. probability of getting at least 8 heads out of 10 tosses.

sf(8,10,0.5) = b(8,10,0.5)+b(9,10,0.5)+b(10,10,0.5)
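All three binomial quantities (PMF, CDF, SF) can be sketched with math.comb; the coin examples above use n = 10 and p = 0.5:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(exactly x successes in n independent trials)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5
pmf4 = binom_pmf(4, n, p)                            # exactly 4 heads
cdf3 = sum(binom_pmf(k, n, p) for k in range(4))     # at most 3 heads (k = 0..3)
sf8 = sum(binom_pmf(k, n, p) for k in range(8, 11))  # at least 8 heads (k = 8..10)

print(round(pmf4, 4))  # 0.2051
print(round(cdf3, 4))  # 0.1719
print(round(sf8, 4))   # 0.0547
```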

POISSON DISTRIBUTION

The Poisson distribution gives the probability of a number of events occurring within a fixed time interval. For instance, the number of asthma attacks in a month, or the number of arrivals in a queue per hour.

  • No finite number of events
  • No constant probability of event happening

P(x; λ) = (e^(-λ) * λ^x) / x! for x = 0, 1, 2, ⋯

λ = average number of successes = event rate

x= number of successes we expect

e = 2.71828

Probability mass function

The probability of the number of successes that result from a Poisson experiment.

e.g. probability of selling 3 ice creams tomorrow given that average ice creams sold in a day are 2

λ = 2

x=3

e= 2.71828

P(x; λ) = (e^(-λ)) (λ^x) / x!

P(3; 2) = (2.71828^(-2)) (2^3) / 3! ≈ 0.180

Cumulative Density Function

A cumulative Poisson probability refers to the probability that the Poisson random variable is less than or equal to some specified limit.

e.g. Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that tourists will see fewer than four lions on the next 1-day safari?

λ = 5

x = 0, 1, 2, 3 since x < 4

e = 2.71828

P(x ≤ 3; 5) = P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5) = 0.2650

Survival Function

The survival function refers to the probability that the Poisson random variable is greater than some specified limit.

e.g. Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that tourists will see more than three lions on the next 1-day safari?

λ = 5

x = 4, 5, 6, … since x > 3

e = 2.71828

P(x > 3; 5) = P(4; 5) + P(5; 5) + ⋯ = 1 - cdf(3; 5)
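The same three quantities for the Poisson distribution, checked against the examples above:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(exactly x events) at average event rate lam."""
    return exp(-lam) * lam**x / factorial(x)

# Ice-cream example: average 2 per day, exactly 3 sold
print(round(poisson_pmf(3, 2), 3))  # 0.18

# Safari example: average 5 lions, fewer than 4 seen (x = 0..3)
cdf3 = sum(poisson_pmf(k, 5) for k in range(4))
print(round(cdf3, 4))  # 0.265

# More than 3 lions: survival function = 1 - CDF(3)
sf3 = 1 - cdf3
print(round(sf3, 4))  # 0.735
```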

NORMAL DISTRIBUTION

Also called as GAUSSIAN distribution or bell curve distribution.

The distribution which is symmetric about the mean. Data near the mean is more frequent than data far from the mean.

It forms a bell curve when plotted.

In reality, most of the distributions are not perfectly normal

f(x) = (1 / (σ√(2π))) * e^( -(x-μ)² / (2σ²) )

For the standard normal (μ = 0, σ = 1), this reduces to f(x) = e^(-x²/2) / √(2π)

e = 2.71828

π=3.14
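A sketch of the standard normal density (μ = 0, σ = 1):

```python
from math import exp, pi, sqrt

def standard_normal_pdf(x):
    """Density of the standard normal distribution."""
    return exp(-x**2 / 2) / sqrt(2 * pi)

print(round(standard_normal_pdf(0), 4))                   # 0.3989, the peak at the mean
print(standard_normal_pdf(-1) == standard_normal_pdf(1))  # True: symmetric about the mean
```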
