Statistics 📊 for Data Science

Published in

Analytics Vidhya

8 min read · Aug 12, 2020

Statistical knowledge helps you use the proper methods to collect the data, employ the correct analyses, and effectively present the results. Statistical concepts are used to derive meaningful insights from data by performing mathematical computations on it.

Why is it so important❓

Statistics is a crucial process behind how we make discoveries in science, make decisions based on data, and make predictions.

Types of Statistical methods

💠 Descriptive Statistics: Descriptive statistics, in short, help describe and understand the features of a specific data set by giving short summaries of the sample and its measures. An analyst looks at the frequency of each data point in the distribution and describes it using the mean, median, or mode, which measure the central tendency of the data set.

💠 Inferential Statistics: Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population. Inferential statistics are valuable when examination of each member of an entire population is not convenient or possible.

If everything is going over your head 🤯, don’t worry, I’ve got your back❗❗

Let’s understand what ‘population’ and ‘sample’ mean using a real-life use case.

We all love Maggi 😋, don’t we? So let’s understand statistics using Maggi 😄.

On August 13, 2015 the Bombay High Court struck down the nationwide ban imposed on Nestlé Maggi instant noodles by FSSAI. The Court directed Nestlé to have fresh safety tests conducted on the product before bringing it back to the market. Nestlé was asked to provide samples of each variant of Maggi instant noodles for fresh tests to three labs in Punjab, Hyderabad, and Jaipur. The High Court ruled that if, even after the fresh tests, the lead content was found to be in excess of the permissible limit, then Nestlé would not be allowed to manufacture and sell Maggi noodles in India. The results of the fresh tests conducted at the three labs went in favor of Nestlé. As a consequence, Nestlé India resumed selling Maggi noodles in November 2015.

How does statistics come into the picture in this case?

Testing every Maggi packet on the market (known as the population) for lead content is impractical, because you cannot open and test each and every packet. Instead, a data scientist, data analyst, or statistician would take only a few packets from each state, let's say 500 packets (known as a sample), and measure the lead content in each. They would then use inferential statistics (for example, a Z-test or a t-test) to generalize from the sample to the population. That is, in essence, how a conclusion about the whole product can be drawn from a limited number of tested packets, as in the tests that went in favor of Nestlé.

**A sample is a subset of the population**
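To make the population/sample idea concrete, here is a minimal Python sketch. The population of lead readings is simulated, and every number in it (100,000 packets, a mean of 1.5, a spread of 0.4) is invented purely for illustration; none of it is real Maggi test data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: lead content (ppm) of 100,000 packets.
# The mean of 1.5 and spread of 0.4 are made-up illustration values.
population = [random.gauss(1.5, 0.4) for _ in range(100_000)]

# Sample: 500 packets drawn at random, without replacement.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.3f}")
print(f"sample mean:     {sample_mean:.3f}")
```

The two means should land close to each other, which is the whole point: a well-drawn sample lets us say something about a population we could never test in full.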

Descriptive Statistics

Types of Data

Categorical data: Categorical data represents groups or categories, for example, car brands or answers to yes/no questions.
Numerical data: Numerical data represents numbers. It is divided into two groups: discrete and continuous. Discrete data can usually be counted in a finite manner, while continuous data can take any value in a range, so it can only be measured, not counted. Examples of continuous data are weight, height, and age: age can be measured ever more finely, as 25 years, 10 months, 2 days, 5 hours, 4 seconds, 4 milliseconds, 8 nanoseconds, 99 picoseconds… and so on. An example of discrete data is the amount of money in your bank account.

Levels of Measurement

Qualitative: There are two qualitative levels: nominal and ordinal. The nominal level represents categories that cannot be put in any order, while the ordinal level represents categories that can be ordered. An example of nominal data is the four seasons (winter, spring, summer, autumn); an example of ordinal data is the rating of your meal (disgusting, unappetizing, neutral, tasty, delicious).
Quantitative: There are two quantitative levels: interval and ratio. Both represent “numbers”; however, ratio data has a true zero, while interval data doesn’t. Examples of interval data are temperatures in degrees Celsius and Fahrenheit; examples of ratio data are temperature in kelvins and length.

Visualizations for Numerical and Categorical data

Visualization gives you answers to questions you didn’t know you had

Graphs and tables that are useful for representing a single categorical or numerical variable, two categorical variables, two numerical variables, or one categorical and one numerical variable include frequency distribution tables, bar charts, pie charts, Pareto diagrams, histograms, scatter plots, box plots, and correlation matrices.

Mean, Median, Mode

Mean: The mean is the most widely used measure of central tendency. It is the simple average of the dataset, and it is easily affected by outliers.

Median: The median is the midpoint of the ordered dataset. It is not as popular as the mean but is often used in academia and data science, because it is far less affected by outliers.

Mode: The mode is the value that occurs most often. A dataset can have 0 modes, 1 mode, or multiple modes. The mode is calculated simply by finding the value with the highest frequency.
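A quick way to see how differently the three measures react to an outlier is to compute them with Python's built-in `statistics` module. The salary figures below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import statistics

salaries = [30, 32, 32, 35, 38, 40]   # hypothetical values, in thousands
with_outlier = salaries + [400]       # the same data plus one extreme value

mean_before = statistics.mean(salaries)          # 34.5
mean_after = statistics.mean(with_outlier)       # about 86.7, dragged up by the outlier

median_before = statistics.median(salaries)      # 33.5
median_after = statistics.median(with_outlier)   # 35, barely moves

modes = statistics.multimode(with_outlier)       # [32], the most frequent value
two_modes = statistics.multimode([1, 1, 2, 2])   # [1, 2], a dataset with two modes

print(mean_before, mean_after)
print(median_before, median_after)
print(modes, two_modes)
```

One extreme value more than doubles the mean, while the median shifts only from 33.5 to 35.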

Inferential Statistics

In inferential statistics, we mostly talk about distributions or, more precisely, probability distributions.

A probability distribution is a mathematical function that, stated in simple terms, gives the probabilities of occurrence of the different possible outcomes of an experiment. In other words, it is a function that shows the possible values of a variable and how often they occur.

It is a common mistake to believe that the distribution is the graph. In fact, the distribution is the ‘rule’ that determines how values are positioned in relation to each other.

Very often, we use a graph to visualize a distribution, since each distribution has a characteristic graphical representation.

Normal distribution

The Normal distribution is also known as Gaussian distribution or the Bell curve. It is one of the most common distributions due to the following reasons:

It approximates a wide variety of random variables.
Distributions of sample means with large enough sample sizes can be approximated by a normal distribution.
All computable statistics are elegant.
Heavily used in regression analysis.
Good track record.

Examples (at least approximately): height, length of arms, legs, and nails, blood pressure, thickness of tree bark, IQ test scores, and stock market information.

Formula
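For reference, the probability density function of a normal distribution with mean μ and standard deviation σ is usually written as:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

Here μ controls where the curve is centred and σ controls how spread out it is.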

Keeping the standard deviation constant, the graph of a normal distribution with a smaller mean has the same shape as one with a larger mean, but is situated further to the left.

Keeping the mean constant, a normal distribution with a smaller standard deviation is situated in the same spot but has a higher peak and thinner tails, while one with a larger standard deviation is situated in the same spot but has a lower peak and fatter tails.
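These two effects can be checked numerically with `statistics.NormalDist` from the standard library (Python 3.8+). This is only a small sketch; the particular means and standard deviations are arbitrary choices for illustration.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0, sigma=1)
wide = NormalDist(mu=0, sigma=2)      # same mean, larger standard deviation
shifted = NormalDist(mu=3, sigma=1)   # same standard deviation, larger mean

# The peak of the curve is the density at the mean.
peak_narrow = narrow.pdf(0)     # about 0.399
peak_wide = wide.pdf(0)         # about 0.199, a lower peak
peak_shifted = shifted.pdf(3)   # same height as peak_narrow, just centred at 3

# Tails: probability of landing more than 3 units away from the mean.
tail_narrow = 2 * narrow.cdf(-3)
tail_wide = 2 * wide.cdf(-3)    # noticeably larger, i.e. fatter tails

print(peak_narrow, peak_wide, peak_shifted)
print(tail_narrow, tail_wide)
```

Changing the mean only slides the curve along the axis; changing the standard deviation trades peak height for tail weight.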

Standard Normal Distribution

The Standard Normal distribution is a particular case of the Normal distribution. It has a mean of 0 and a standard deviation of 1.

Every Normal distribution can be ‘standardized’ using the standardization formula: Z = (X − μ) / σ, where μ (mu) is the mean and σ (sigma) is the standard deviation.

What is the difference between the Normal distribution and the Standard Normal distribution?

Standardization allows us to:

Compare different normally distributed datasets
Detect normality
Detect outliers
Create confidence intervals
Test hypotheses
Perform regression analysis

We want to transform a random variable from X ~ N(μ, σ²) to Z ~ N(0, 1). Subtracting the mean from all observations causes a transformation from N(μ, σ²) to N(0, σ²), moving the graph to the origin. Subsequently, dividing all observations by the standard deviation causes a transformation from N(0, σ²) to N(0, 1), standardizing the peak and the tails of the graph.
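The two steps, subtract the mean and then divide by the standard deviation, can be sketched in a few lines of Python. The height values are hypothetical, and the variable names are just for this example.

```python
import statistics

heights_cm = [150, 160, 165, 170, 175, 180, 190]  # hypothetical observations

mu = statistics.fmean(heights_cm)       # mean of the data
sigma = statistics.pstdev(heights_cm)   # population standard deviation

# Standardize: shift to mean 0, then rescale to standard deviation 1.
z_scores = [(x - mu) / sigma for x in heights_cm]

print(statistics.fmean(z_scores))    # approximately 0
print(statistics.pstdev(z_scores))   # approximately 1
```

Note that standardizing only shifts and rescales the data; it does not change the shape of its distribution.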

Central Limit Theorem

The Central Limit Theorem (CLT) is one of the greatest statistical insights. It states that, no matter the underlying distribution of the dataset (as long as it has a finite variance), the sampling distribution of the mean approximates a normal distribution when the samples are large enough. Moreover, the mean of the sampling distribution equals the mean of the original distribution, and its variance is n times smaller, where n is the size of the samples. The CLT applies whenever we have a sum or an average of many independent variables (e.g. the sum of the numbers rolled when rolling several dice).

Why is it useful?

The CLT allows us to assume normality for many different variables. That is very useful for confidence intervals, hypothesis testing, and regression analysis. In fact, the Normal distribution is observed so often around us because, following the CLT, many variables converge to Normal.

The more samples we take (k → ∞) and the bigger each sample is (n → ∞), the closer the distribution of the sample means gets to Normal.
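A small dice simulation makes the CLT claims easy to check. This is only a sketch: the values of n and k are arbitrary, and the theoretical numbers in the comments come from the fact that a fair die has mean 3.5 and variance 35/12.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 30      # size of each sample (dice rolled per sample)
k = 5_000   # number of samples drawn

def mean_of_rolls(size):
    """Roll a fair die `size` times and return the average."""
    return statistics.fmean([random.randint(1, 6) for _ in range(size)])

sample_means = [mean_of_rolls(n) for _ in range(k)]

mean_of_means = statistics.fmean(sample_means)     # should be close to 3.5
var_of_means = statistics.pvariance(sample_means)  # should be close to (35/12) / n

print(f"mean of sample means:     {mean_of_means:.3f}")
print(f"variance of sample means: {var_of_means:.4f}")
print(f"theoretical variance / n: {35 / 12 / n:.4f}")
```

A single die roll is uniformly distributed, not bell-shaped, yet a histogram of `sample_means` would look close to Normal, centred near 3.5.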

Where can we see it?

Since many concepts and events are a sum or an average of different effects, the CLT applies and we observe normality all the time. For example, in regression analysis, the error term is often assumed to be Normal because it can be thought of as the sum of many small, unobserved effects.

Confidence Intervals and the Margin of Error

The margin of error and confidence intervals make it easy to relate the mean of the sampling distribution to the population mean.

A confidence interval is an interval within which we are confident (with a certain percentage of confidence) the population parameter will fall. We build a confidence interval around the point estimate. (1 − α) is the level of confidence: we are (1 − α) × 100% confident that the population parameter will fall in the specified interval. Common alphas are 0.01, 0.05, and 0.1.

[x̄ − ME, x̄ + ME], where ME is the margin of error and x̄ is the sample mean.
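As a rough sketch of how such an interval can be computed, the function below uses one common large-sample form of the margin of error, ME = z · s / √n, where z is the standard Normal critical value for the chosen α and s is the sample standard deviation. The function name and the simulated data are assumptions made for this example; for small samples a t-based interval would normally be used instead.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(data, alpha=0.05):
    """Return (lower, upper), a z-based interval for the mean (large-sample sketch)."""
    n = len(data)
    x_bar = statistics.fmean(data)            # point estimate
    s = statistics.stdev(data)                # sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    margin = z * s / math.sqrt(n)             # margin of error (ME)
    return x_bar - margin, x_bar + margin

random.seed(1)
sample = [random.gauss(1.5, 0.4) for _ in range(500)]  # simulated readings

low_95, high_95 = mean_confidence_interval(sample, alpha=0.05)
low_99, high_99 = mean_confidence_interval(sample, alpha=0.01)

print(f"95% CI: [{low_95:.3f}, {high_95:.3f}]")
print(f"99% CI: [{low_99:.3f}, {high_99:.3f}]")
```

A smaller α means a higher confidence level and therefore a wider interval, which is why the 99% interval contains the 95% one.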

Hypothesis testing

A hypothesis is a claim or an assumption that you make about one or more population parameters.

Types of Hypothesis

Null hypothesis (H₀): Makes an assumption about the status quo; it always contains the symbols ‘=’, ‘≤’ or ‘≥’.
Alternative hypothesis (H₁): Challenges and complements the null hypothesis; it always contains the symbols ‘≠’, ‘<’ or ‘>’.
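To see the two hypotheses in action, here is a minimal sketch of a right-tailed one-sample Z-test, with H₀: μ ≤ μ₀ and H₁: μ > μ₀. The function name and all the numbers (a limit of 2.5, a known standard deviation of 0.8, 500 packets) are hypothetical and only there to show the mechanics.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu_0, sigma, n):
    """Right-tailed test of H0: mu <= mu_0 against H1: mu > mu_0."""
    z = (sample_mean - mu_0) / (sigma / math.sqrt(n))  # test statistic
    p_value = 1 - NormalDist().cdf(z)                  # P(Z >= z) under H0
    return z, p_value

alpha = 0.05  # significance level

# Hypothetical case 1: the sample mean is below the limit.
z_low, p_low = one_sample_z_test(sample_mean=2.1, mu_0=2.5, sigma=0.8, n=500)
reject_low = p_low < alpha    # False: no evidence that the mean exceeds the limit

# Hypothetical case 2: the sample mean is slightly above the limit.
z_high, p_high = one_sample_z_test(sample_mean=2.6, mu_0=2.5, sigma=0.8, n=500)
reject_high = p_high < alpha  # True: reject H0 in favor of H1

print(z_low, p_low, reject_low)
print(z_high, p_high, reject_high)
```

We reject H₀ only when the p-value falls below α; otherwise we fail to reject it, which is not the same as proving it true.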

A/B testing and Click-through rate

A/B testing is a direct industry application of the two-sample proportion test.

While developing an e-commerce website, there could be different opinions about the choices of various elements, such as the shape of buttons, the text on the call-to-action buttons, the colour of various UI elements, the copy on the website, or numerous other such things.

Often, the choice of these elements is very subjective, and it is difficult to predict which option would perform better. To resolve such conflicts, you can use A/B testing. A/B testing provides a way for you to test two different versions of the same element and see which one performs better.

You can see a few more case studies and applications of A/B testing in the real world here.

The two-sample proportion test is used when you want to compare the proportions of two different samples.

For instance, did you know that there are algorithms that use A/B testing to choose the thumbnails for YouTube videos, and that the recommendations on Netflix, YouTube, Hotstar, etc. also rely on such algorithms?

Click-through rate: The proportion of visitors to a web page who follow a hypertext link to a particular site.
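Putting the pieces together, an A/B test on click-through rate can be sketched as a two-sample proportion Z-test with H₀: p_A = p_B. The function name and the click and visitor counts below are hypothetical; the pooled-proportion form of the standard error is the usual textbook version of this test.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-sided Z-test of H0: p_a = p_b, using the pooled proportion."""
    p_a = clicks_a / visitors_a   # click-through rate of version A
    p_b = clicks_b / visitors_b   # click-through rate of version B
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 5,000 visitors saw each version of a button.
z, p_value = two_proportion_z_test(clicks_a=200, visitors_a=5000,
                                   clicks_b=260, visitors_b=5000)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these made-up counts, version B's click-through rate (5.2%) is higher than version A's (4.0%), and the small p-value suggests the difference is unlikely to be due to chance alone.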

If you have enjoyed my writing then do give a clap 👏 as it motivates me to write more.

Manish Kumar Thota

Data Science Enthusiast 👨‍💻

Keep coding !!