Basics of Statistics for Data Science…

Vishal Shelar
9 min read · May 24, 2023


Table of contents :-

What is Statistics?

Types of statistics

Sample data and population data

Types of Data

Scales of measurement of data

Measure of central tendency and dispersion

Random variable

Covariance and Correlation

Skewness

What is Statistics?

Statistics is the science of collecting, organizing, and analyzing data.

Data :- Data are facts or pieces of information.

ex 1) :- Heights of the students in a class. {175cm, 180cm, 190cm, …}

ex 2) :- IQ of the students. {85, 90, 100, 95, …}

Types of Statistics :-

  1. Descriptive statistics
  2. Inferential statistics

Descriptive Statistics :-

  • It consists of organizing and summarizing the data.
  • Techniques (concepts) used to extract information from the data :-
  • 1) Measures of central tendency :- mean, median, mode.
  • 2) Measures of dispersion :- variance, standard deviation.
  • Example :- Let's say there are 50 students in a math class at a university, and we have collected the heights of the students in the class. {175cm, 180cm, 160cm, 140cm, 130cm, 140cm, 140cm, ……}
  • The following descriptive questions can be asked :-
  • 1) What is the average height of the students in the class?
  • 2) What is the most common height of the students?

Inferential statistics :-

  • It consists of using the data you have measured to draw conclusions.
  • You will rarely have access to all the data. Instead, you use a sample dataset and statistical tests to draw conclusions about the population.
  • Here we use tests such as the z-test and t-test, and concepts such as hypothesis testing, the p-value, and the significance level.
  • Example :- Let's say there are 50 students in a math class at a university, and we have collected the heights of the students in the class. {175cm, 180cm, 160cm, 140cm, 130cm, 140cm, 140cm, ……}
  • The following inferential question can be asked :-
  • 1) Is the average height of the students in the classroom similar to what you would expect for the entire college?
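As a rough sketch of how such an inferential question is approached, the one-sample t statistic compares a sample mean with a hypothesized population mean. The seven heights are the values listed in the example; the college-wide mean of 160 cm is an invented assumption for illustration only, and `t_statistic` is a hypothetical helper name.

```python
import math
import statistics

def t_statistic(sample, hypothesized_mean):
    """One-sample t statistic: (sample mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_std = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    standard_error = sample_std / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Heights (cm) listed in the example; only these seven values are known.
heights = [175, 180, 160, 140, 130, 140, 140]

# Assumed college-wide average height, invented for this sketch.
assumed_college_mean = 160

t_stat = t_statistic(heights, assumed_college_mean)
print(f"sample mean = {statistics.mean(heights):.2f} cm")
print(f"t statistic = {t_stat:.3f}")
```

A t statistic close to 0 means the sample mean is close to the hypothesized mean relative to the sampling noise. Turning it into a p-value requires the t distribution with n - 1 degrees of freedom, for example via `scipy.stats.ttest_1samp`.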

Sample Data and Population data :-

  1. A population is the entire group that you want to draw conclusions about.
  2. A sample is the specific group, drawn from the population, that you actually collect data from.
  3. The size of the sample is smaller than the total size of the population.
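The relationship between a population and a sample can be sketched with the standard library. The population here is simulated (1,000 made-up heights), so the numbers are purely illustrative.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 1,000 students, simulated here.
population = [random.gauss(165, 10) for _ in range(1000)]

# A sample is a subset drawn from the population without replacement.
sample = random.sample(population, k=50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population size = {len(population)}, mean = {population_mean:.2f}")
print(f"sample size     = {len(sample)}, mean = {sample_mean:.2f}")
```

The sample mean is usually close to, but not exactly equal to, the population mean. Quantifying that gap is the job of inferential statistics.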

Types of data :-

There are two types of data :- 1) Quantitative data. 2) Qualitative data.

  1. Quantitative data :-
  • Quantitative data refers to numerical information.
  • It is further divided into two types :- 1) Discrete. 2) Continuous.
  • Discrete :- Discrete data takes countable values, usually whole numbers, that cannot be meaningfully divided into smaller parts.
  • example :- number of bank accounts, number of members in a family.
  • Continuous :- Continuous data can take any value within a specific range, so there are infinitely many possible values between any two data points.
  • example :- weight, height, temperature, speed.

2. Qualitative data :-

  • Qualitative data refers to non-numerical information that is descriptive in nature and expressed in words or categories.
  • It is further divided into two types :- 1) Nominal. 2) Ordinal.
  • Nominal :- Nominal data is categorized into distinct groups or labels without any order.
  • example :- gender (male or female), blood group, color.
  • Ordinal :- Ordinal data is categorical data whose categories have a specific order or ranking.
  • example :- a rating scale for customer satisfaction: “Very Satisfied,” “Satisfied,” “Neutral,” “Dissatisfied,” or “Very Dissatisfied.”
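A minimal sketch of how the nominal/ordinal distinction shows up in code: nominal labels can only be counted, while ordinal labels can also be sorted once their order is stated explicitly. The blood groups and survey responses below are made-up values.

```python
from collections import Counter

# Nominal data: labels with no inherent order, so counting is the natural summary.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]  # made-up values
group_counts = Counter(blood_groups)
print(group_counts.most_common(1))  # most frequent category (the mode)

# Ordinal data: categories carry a rank, which we must state explicitly.
satisfaction_order = [
    "Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied",
]
rank = {label: position for position, label in enumerate(satisfaction_order)}

responses = ["Satisfied", "Neutral", "Very Satisfied", "Dissatisfied", "Satisfied"]
sorted_responses = sorted(responses, key=rank.__getitem__)
print(sorted_responses)

# A middle category is meaningful for ordinal data (odd number of responses here).
median_response = sorted_responses[len(sorted_responses) // 2]
print(median_response)
```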

Scales of measurement of data :-

There are four scales of measurement of data.

  1. Nominal Scale data
  2. Ordinal Scale data
  3. Interval Scale data
  4. Ratio Scale data

1) Nominal Scale data :-

  • Nominal scale data is a type of categorical data where we assign labels or names to different categories or groups.
  • example :- Colors. Imagine we have a group of fruits, and we want to categorize them based on their colors. We can assign different labels to the categories: red, green, and yellow.
  • In nominal scale data, order does not matter.

2) Ordinal Scale data :-

  • Ordinal scale data is a type of categorical data where we not only categorize items into different groups but also assign an order or ranking to those categories.
  • Example :- rating customer satisfaction for a product on a scale of 1 to 5. Here, the categories are “1 star,” “2 stars,” “3 stars,” “4 stars,” and “5 stars.”
  • In ordinal scale data, order and ranking matter, but the gaps between ranks are not necessarily equal.

3) Interval Scale data :-

  • Interval scale data is a type of quantitative data where values are ordered and the differences between them are meaningful.
  • Example :- temperature measured in degrees Celsius or Fahrenheit. The numerical values represent meaningful differences between measurements: the difference between 10°C and 20°C is the same as the difference between 30°C and 40°C.
  • In interval scale data, order and ranking matter.
  • It does not have a true “0”: 0°C does not mean “no temperature”, so ratios are not meaningful (20°C is not twice as hot as 10°C).

4) Ratio Scale data :-

  • Ratio scale data is a type of quantitative data that not only has ordered values with meaningful differences but also has a true zero point, so ratios between values are meaningful.
  • Example :- weight measured in kilograms. The numerical values not only represent meaningful differences between measurements but also have a true zero point. A weight of 0 kilograms indicates the absence of weight or mass, so 20 kg is twice as heavy as 10 kg.
  • In ratio scale data, order and ranking matter.
  • It has a true “0” as its starting value.
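One way to see the difference between interval and ratio scales is to change units and check whether a ratio survives. This is a small sketch; the conversion factor for pounds is approximate, and the helper name `celsius_to_fahrenheit` is chosen here for illustration.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

cold_c, warm_c = 10.0, 20.0

# Interval scale: the ratio depends on the arbitrary zero of the unit.
ratio_celsius = warm_c / cold_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
print(ratio_celsius, ratio_fahrenheit)  # the two ratios disagree

# Ratio scale: weight has a true zero, so the ratio survives a unit change.
KG_TO_LB = 2.20462  # approximate conversion factor
light_kg, heavy_kg = 10.0, 20.0
ratio_kg = heavy_kg / light_kg
ratio_lb = (heavy_kg * KG_TO_LB) / (light_kg * KG_TO_LB)
print(ratio_kg, ratio_lb)  # both are (approximately) 2
```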

Measure of Central Tendency :-

Measures of central tendency describe the central value around which most of the data is concentrated.

Three measures of central tendency :-

  1. Mean
  2. Median
  3. Mode

Mean :-

The mean is the average value obtained by summing all the numbers in a set and dividing by the total count of numbers.

Let's understand the mean in terms of population and sample. Consider a variable X with the set of values X: {1,1,2,2,3,3,4,5,5,6}

Population mean (μ) :-

Formula :-

μ = (Σ Xi) / N, where Xi = Data Points, N = Population size

μ = (1+1+2+2+3+3+4+5+5+6)/10 = 32/10 = 3.2

Sample mean (x̄, read “x bar”) :-

Formula :-

x̄ = (Σ Xi) / n, where Xi = Data Points in the sample, n = sample size
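A minimal sketch of both means, assuming the usual definition (sum of the values divided by their count). The population is the set X from the text; the four-value sample is a hypothetical pick from it.

```python
import statistics

# Values of X from the text, treated as the whole population.
population = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]
population_mean = sum(population) / len(population)  # mu = sum(Xi) / N
print(population_mean)  # 3.2

# A hypothetical sample of four values picked from the population.
sample = [1, 2, 3, 6]
sample_mean = sum(sample) / len(sample)  # x_bar = sum(Xi) / n
print(sample_mean)  # 3.0

# The standard library computes the same quantity.
print(statistics.mean(population))
```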

Median :-

The median is the middle value in a sorted set of numbers or the average of the two middle values if there is an even number of values.

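Formula :-

If n is odd, then median = the ((n + 1) / 2)-th value of the sorted data.

If n is even, then median = the average of the (n / 2)-th and (n / 2 + 1)-th values of the sorted data.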
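The odd/even rule can be sketched as a small function. This is an illustrative implementation (the name `median` is chosen here), checked against the weights list that the NumPy example in this article uses.

```python
def median(values):
    """Median via the odd/even rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value, which is index n // 2 (0-based).
        return ordered[middle]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[middle - 1] + ordered[middle]) / 2

weights = [45, 34, 55, 76, 45, 35, 89, 98, 75]  # n = 9 (odd)
print(median(weights))          # 55
print(median(weights + [100]))  # n = 10 (even): (55 + 75) / 2 = 65.0
```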

Mode :-

The value or values that appear most frequently in a dataset.

Where is the mode used?

Ans :- Suppose we have a dataset of flowers with features such as “type of flower” and “age”.

If the “type of flower” feature has a missing value, we can replace it with the mode of that feature, because the mean and median are not defined for categorical data.

If the “age” feature has a missing value, we can replace it with the median when outliers are present, and with the mean otherwise, because the mean is sensitive to outliers.
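A rough sketch of this imputation idea using only the standard library. The flower records are hypothetical, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical flower records; None marks a missing value.
flower_types = ["rose", "lily", "rose", None, "tulip", "rose"]
ages = [2, 3, None, 4, 3, 40]  # 40 is an outlier

# Categorical feature: fill the gap with the mode of the observed values.
observed_types = [t for t in flower_types if t is not None]
type_mode = statistics.mode(observed_types)
filled_types = [type_mode if t is None else t for t in flower_types]

# Numeric feature with an outlier: the median is the more robust fill value.
observed_ages = [a for a in ages if a is not None]
age_median = statistics.median(observed_ages)
age_mean = statistics.mean(observed_ages)
filled_ages = [age_median if a is None else a for a in ages]

print(filled_types)
print(f"median = {age_median}, mean = {age_mean}")  # the outlier pulls the mean up
print(filled_ages)
```

In a pandas workflow the same idea is usually expressed with `fillna`, but the choice between mode, median, and mean stays the same.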

Calculating mean, median and mode using Python :-

# Mean
import numpy as np
weights=[45,34,55,76,45,35,89,98,75]
np.mean(weights)

Output :- 61.333333333333336

# Median
import numpy as np
weights=[45,34,55,76,45,35,89,98,75]
np.median(weights)

Output :- 55.0

# Mode 
from scipy import stats
weights=[45,34,55,76,45,35,89,98,75]
stats.mode(weights)

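Output :- ModeResult(mode=array([45]), count=array([2]))

Note :- Recent SciPy releases return plain scalars here (mode 45, count 2) instead of one-element arrays.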

Measure of dispersion :-

Measure of dispersion talks about the spread in a dataset.

  1. Variance .
  2. Standard Deviation.

Variance :-

Variance is a statistical measure that quantifies the average squared deviation of data points from their mean.

Variance describes how spread out the data is around the mean.

As the variance decreases, the spread of the data decreases.
As the variance increases, the spread of the data increases.

Let's understand variance in terms of population and sample.

Population Variance (σ²) :-

Formula :-

σ² = Σ (Xi - μ)² / N

Where, Xi = Data Points, μ = Population mean, N = Population size

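Sample Variance (S²) :-

Formula :-

S² = Σ (Xi - x̄)² / (n - 1)

Where, Xi = Data Points, x̄ (x bar) = sample mean, n = sample size

Why do we divide by (n - 1) in the sample variance? → Dividing by (n - 1) instead of n (Bessel's correction) makes the sample variance an unbiased estimator of the population variance. The deviations are measured from the sample mean, which sits closer to the sample points than the true population mean does, so dividing by n would underestimate the variance on average.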
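A minimal sketch of both variances written out by hand, assuming the usual definitions (sum of squared deviations divided by N for a population and by n - 1 for a sample). It uses the same ages list as the NumPy example in this article; the function names are chosen here for illustration.

```python
import statistics

def population_variance(values):
    """sigma^2 = sum((Xi - mu)^2) / N"""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """S^2 = sum((Xi - x_bar)^2) / (n - 1)"""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

ages = [23, 43, 23, 56, 74, 32, 68, 98, 45, 32]

print(round(population_variance(ages), 2))  # divides by N
print(round(sample_variance(ages), 2))      # divides by n - 1, so slightly larger

# Cross-check against the standard library.
print(round(statistics.pvariance(ages), 2), round(statistics.variance(ages), 2))
```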

Standard Deviation :-

Standard deviation is the square root of the variance. It indicates how far data points typically deviate from the mean, expressed in the same units as the data.

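Population standard deviation (σ) :-

Formula :-

σ = √( Σ (Xi - μ)² / N ), where μ = Population mean, N = Population size

Sample Standard Deviation (S) :-

Formula :-

S = √( Σ (Xi - x̄)² / (n - 1) ), where x̄ (x bar) = sample mean, n = sample size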
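As a small sketch, the standard deviation can be obtained by taking the square root of the matching variance. The ages list is the one used in the NumPy example in this article.

```python
import math
import statistics

ages = [23, 43, 23, 56, 74, 32, 68, 98, 45, 32]

# Standard deviation is the square root of the matching variance.
population_std = math.sqrt(statistics.pvariance(ages))  # divides by N
sample_std = math.sqrt(statistics.variance(ages))       # divides by n - 1

print(round(population_std, 4))
print(round(sample_std, 4))

# statistics.pstdev / statistics.stdev compute the same quantities directly.
print(round(statistics.pstdev(ages), 4), round(statistics.stdev(ages), 4))
```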

Calculating Variance and Standard Deviation using Python :-

# Variance using numpy
import numpy as np
ages_lst=[23,43,23,56,74,32,68,98,45,32]
# np.var divides by N by default (population variance); pass ddof=1 for the sample variance
var=np.var(ages_lst)
var

Output :- 541.64

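# Variance using pandas
import pandas as pd
data=[[10,12,13],[34,23,45],[32,34,21]]
df=pd.DataFrame(data,columns=["A","B","C"])
df

Output :-

    A   B   C
0  10  12  13
1  34  23  45
2  32  34  21

## Row-wise (pandas divides by n-1 by default, i.e. sample variance)
df.var(axis=1)

Output :-

0      2.333333
1    121.000000
2     49.000000
dtype: float64

## Column-wise
df.var(axis=0)

Output :-

A    177.333333
B    121.000000
C    277.333333
dtype: float64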

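# Standard deviation using numpy
import numpy as np
ages_lst=[23,43,23,56,74,32,68,98,45,32]
# np.std also divides by N by default (population standard deviation)
std=np.std(ages_lst)
std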

Output :- 23.273160507331188

Random Variables :-

A random variable is a function that maps each outcome of a random process or experiment to a number.

Example :- Tossing a coin. X = 0 if the outcome is heads (H), X = 1 if the outcome is tails (T).
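The coin example can be sketched as a simulation: the experiment produces outcomes ("H" or "T"), and the random variable turns each outcome into a number. The function name `coin_random_variable` and the number of tosses are choices made for this sketch.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def coin_random_variable(outcome):
    """Map a coin outcome to a number: X = 0 for heads, X = 1 for tails."""
    return {"H": 0, "T": 1}[outcome]

# Simulate the random experiment, then apply the random variable to each outcome.
outcomes = [random.choice("HT") for _ in range(1000)]
x_values = [coin_random_variable(outcome) for outcome in outcomes]

print(outcomes[:5], x_values[:5])
print(sum(x_values) / len(x_values))  # fraction of tails, near 0.5 for a fair coin
```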

Covariance and Correlation :-

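Covariance is a statistical measure that quantifies how two variables vary together: it is positive when they tend to increase together and negative when one tends to decrease as the other increases.

Formula :-

Cov(X, Y) = Σ (Xi - x̄)(Yi - ȳ) / (n - 1), where x̄ and ȳ are the sample means of X and Y, and n is the sample size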
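A minimal sketch of sample covariance, assuming the usual definition (sum of the products of deviations from the two means, divided by n - 1). The study, score, and TV data are made up for illustration, and `sample_covariance` is a name chosen here.

```python
def sample_covariance(xs, ys):
    """Cov(X, Y) = sum((Xi - x_bar) * (Yi - y_bar)) / (n - 1)"""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

# Made-up data for illustration.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]  # tends to rise with hours studied
hours_of_tv = [9, 7, 6, 4, 2]      # tends to fall with hours studied

print(sample_covariance(hours_studied, exam_score))   # positive
print(sample_covariance(hours_studied, hours_of_tv))  # negative

# Covariance depends on the units: rescaling X rescales the covariance too.
minutes_studied = [h * 60 for h in hours_studied]
print(sample_covariance(minutes_studied, exam_score))  # 60 times larger
```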

Advantage :- We can find the direction of the relationship between X and Y. It can be either positive or negative.

Disadvantage :- The covariance value has no fixed range and depends on the units of the variables, so we cannot use it to conclude whether the variables are strongly or weakly correlated. To overcome this, we use correlation coefficients :-

Technique 1 :- Pearson Correlation Coefficient

In this technique all the values range from -1 to 1.

The closer the value is to +1, the stronger the positive linear correlation.

The closer the value is to -1, the stronger the negative linear correlation.

Formula :-

r = Cov(X, Y) / (σX · σY), where σX and σY are the standard deviations of X and Y
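A short sketch of the Pearson coefficient, assuming the standard definition (covariance divided by the product of the two standard deviations). The data is made up, and `pearson_r` is a name chosen for this example. It also shows that, unlike covariance, the result does not change when a variable is rescaled.

```python
import math

def pearson_r(xs, ys):
    """r = Cov(X, Y) / (std(X) * std(Y)); the (n - 1) factors cancel out."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    co_deviation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    x_spread = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    y_spread = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return co_deviation / (x_spread * y_spread)  # undefined if either list is constant

# Made-up data for illustration.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]

r_hours = pearson_r(hours_studied, exam_score)
r_minutes = pearson_r([h * 60 for h in hours_studied], exam_score)

print(round(r_hours, 4))
print(round(r_minutes, 4))  # same value: r does not depend on the units
print(pearson_r(hours_studied, [-2 * h for h in hours_studied]))  # close to -1
```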

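Technique 2 :- Spearman Rank Correlation

It is the Pearson correlation coefficient applied to the ranks of the data. It captures monotonic relationships even when they are not linear and is less sensitive to outliers, which makes it a better choice than Pearson in those cases.

Formula :-

rs = Cov(R(X), R(Y)) / (σR(X) · σR(Y)), where R(X) and R(Y) are the ranks of X and Y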
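A minimal sketch of Spearman correlation, assuming the standard definition (Pearson correlation computed on ranks, with tied values sharing their average rank). The helper names are chosen here, and the data is a made-up monotonic but non-linear relationship.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # average of the 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    co_deviation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    x_spread = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    y_spread = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return co_deviation / (x_spread * y_spread)

def spearman_r(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y = [value ** 3 for value in x]  # monotonic but not linear

print(round(pearson_r(x, y), 4))   # below 1: the relationship is not a straight line
print(round(spearman_r(x, y), 4))  # 1.0: the ranks agree perfectly
```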

Skewness :-

Skewness is a measure of the lack of symmetry in the distribution of a dataset.

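Symmetric (Not skewed) :-

The mean, median and mode are all at the center.

(mean = median = mode)

Q3-Q2 = Q2-Q1, where Q1, Q2 (the median) and Q3 are the quartiles

No skew

Right Skewed :-

The tail is longer on the right side, which typically pulls the mean above the median.

mean ≥ median ≥ mode

Q3-Q2 ≥ Q2-Q1

Positively skewed

Left Skewed :-

The tail is longer on the left side, which typically pulls the mean below the median.

mean ≤ median ≤ mode

Q3-Q2 ≤ Q2-Q1

Negatively skewed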
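These rules of thumb can be checked on a small dataset. The sketch below uses made-up values with one large observation in the right tail; the quartiles come from `statistics.quantiles`, whose default method is only one of several common quartile conventions.

```python
import statistics

# Made-up data with a long right tail (one large value).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (default "exclusive" method)

print(f"mean = {mean}, median = {median}, mode = {mode}")
print(f"Q3 - Q2 = {q3 - q2}, Q2 - Q1 = {q2 - q1}")

# Apply the rules of thumb: compare mean with median and the two quartile gaps.
if mean > median and (q3 - q2) > (q2 - q1):
    shape = "right skewed"
elif mean < median and (q3 - q2) < (q2 - q1):
    shape = "left skewed"
else:
    shape = "roughly symmetric or inconclusive"
print(shape)
```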

Thank you for joining me on this statistical adventure, and I hope to travel with you on similar data-driven journeys in the future.
