Statistics & Probability for Data Science: First Part

Avinaba Mukherjee
7 min read · Jul 8, 2022


Probability and statistics are the foundation of data science. Probability theory is extremely helpful for making predictions, and estimates and predictions form an essential part of data science. With the help of statistical methods, we create estimates for further analysis. Statistical methods, in turn, depend largely on the theory of probability, and both statistics and probability depend on data.

Data

Data is the collection of information (observations, facts, and figures) gathered together for analysis.

Data: a collection of evidence (words, numbers, observations, measurements, etc.) that has been translated into a form that computers are able to process.

Why does Data matter?

· Helps in understanding more about the data by identifying associations that may exist between two variables.

· Helps in predicting the future or making forecasts based on the previous trend of the data.

· Helps in identifying patterns that may exist in the data.

· Helps in detecting fraud by revealing anomalies in the data.

Data matters a great deal today because we can draw significant information from it. Now let's look at how data is categorized. Data can be of two types: categorical (examples: marital status, region, occupation class, gender) and numerical (examples: age, balance, credit score, tenure in months).

Note: Categorical data can be visualized with a Pareto chart, pie chart, or bar plot. Numerical data can be visualized with a histogram, line plot, or scatter plot.
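
As a rough sketch of what sits behind these charts, the snippet below builds the numbers a bar plot, a Pareto chart, and a histogram would display. The records and variable names are made up for illustration, and only the Python standard library is used.

```python
from collections import Counter

# Hypothetical records; the column names follow the examples in the text.
marital_status = ["married", "single", "married", "divorced", "single", "married"]
ages = [23, 35, 41, 29, 52, 38]

# Categorical data: count each category, most frequent first.
# The counts are what a bar plot shows; adding the cumulative
# percentage gives the data behind a Pareto chart.
counts = Counter(marital_status).most_common()
total = sum(count for _, count in counts)
cumulative = 0
pareto_rows = []
for category, count in counts:
    cumulative += count
    pareto_rows.append((category, count, round(100 * cumulative / total, 1)))

# Numerical data: group values into equal-width bins (here 10 years wide),
# which is what a histogram shows. Each key is the lower edge of a bin.
bin_width = 10
histogram = Counter((age // bin_width) * bin_width for age in ages)

print(pareto_rows)
print(sorted(histogram.items()))
```

Counting works for categories, while binning only makes sense for numbers, which is why the two data types call for different charts.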

Descriptive Statistics

A descriptive statistic is a summary statistic that quantitatively describes or summarizes the features of a collection of information. It helps us understand our data better and is used to describe the characteristics of the data.

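Measurement Levels of Data

Qualitative and quantitative data correspond closely to the categorical and numerical data described above. Qualitative data is measured at the nominal or ordinal level, and quantitative data at the interval or ratio level.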

Nominal: Data at this level is measured using names, labels, or qualities, with no inherent order. Examples: zip code, gender, brand name.

Ordinal: Data at this level can be arranged in order or ranked and compared, but the gaps between ranks are not necessarily equal. Examples: star reviews, grades, position in a race.

Interval: Data at this level can be ordered, and meaningful differences between data points can be calculated, but there is no true zero. Examples: temperature in Celsius, year of birth.

Ratio: Data at this level is like the interval level with the added property of a true zero, so all numerical calculations, including ratios, can be performed on these data points. Examples: height, age, weight.
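
To make the difference between the levels concrete, here is a small sketch. The temperatures and the grade ranking are illustrative assumptions, not values from a real dataset.

```python
# Temperatures in Celsius are interval-level: differences are meaningful,
# but 0 degrees Celsius does not mean "no temperature", so ratios are not.
celsius_a, celsius_b = 10.0, 20.0

difference = celsius_b - celsius_a   # meaningful: 10 degrees warmer
naive_ratio = celsius_b / celsius_a  # 2.0, but 20 C is NOT "twice as hot" as 10 C

# Kelvin has a true zero, so it is ratio-level and ratios are meaningful.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
true_ratio = kelvin_b / kelvin_a     # roughly 1.035

# Ordinal data: the order is meaningful, the spacing is not.
# This rank mapping is a hypothetical encoding used only for sorting.
grade_rank = {"C": 1, "B": 2, "A": 3}
grades = ["B", "A", "C", "A"]
sorted_grades = sorted(grades, key=grade_rank.get)

print(difference, naive_ratio, round(true_ratio, 3), sorted_grades)
```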

Population or Sample Data

Before performing any analysis, we should determine whether the data we are dealing with is a population or a sample.

Population: The collection of all items (N); it includes each and every unit of our study. It is often hard to define, and a measure of its characteristics, such as the mean or mode, is called a parameter.

Sample: A subset of the population (n); it includes only a handful of units of the population. It is selected at random, and a measure of its characteristics is called a statistic.
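
The distinction can be sketched in a few lines. The population below is simulated (1,000 made-up customer ages), so every number in it is an assumption for illustration only.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: the ages of all N = 1000 customers.
population = [random.randint(18, 80) for _ in range(1000)]
N = len(population)
mu = statistics.mean(population)     # a parameter (describes the population)

# Sample: n = 50 units drawn at random, without replacement.
sample = random.sample(population, 50)
n = len(sample)
x_bar = statistics.mean(sample)      # a statistic (describes the sample)

print(f"N={N}, population mean={mu:.2f}")
print(f"n={n}, sample mean={x_bar:.2f}")
```

The sample mean will usually land close to the population mean but not exactly on it, which is why we treat a statistic as an estimate of a parameter.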

Before looking at distributions of data, let's take a look at measures of data.

Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics.

Mean: The mean is equal to the sum of all the values in the data set divided by the number of values in the data set, i.e., the calculated average. It is susceptible to outliers: when extreme values are added, it gets skewed, i.e., it deviates from the typical central value.

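Formula of Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $x_i$ are the values and $n$ is the number of values.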

Median: The median is the middle value of a dataset that has been arranged in order of magnitude. The median is a better alternative to the mean when outliers are present, as it is less affected by outliers and skewness. In such cases the median is much closer to the typical central value than the mean.

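If the total number of values $n$ is odd: the median is the $\frac{n+1}{2}$-th value of the ordered data.
If the total number of values $n$ is even: the median is the average of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th values of the ordered data.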

Mode: The mode is the most frequently occurring value in the dataset. You can think of the mode as the most popular choice.

For example, in a dataset containing the values (5,8,6,4,7,1,2,4,5,7,5,8,2,5,6,8), the mode is 5, because it occurs four times, more often than any other value.
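
The three measures can be checked on the example dataset with Python's built-in statistics module. This is a minimal sketch; it assumes the doubled comma in the original list is a typo rather than a missing value, and the outlier value of 100 is made up to show the effect.

```python
import statistics

# Example dataset from the text (16 values).
data = [5, 8, 6, 4, 7, 1, 2, 4, 5, 7, 5, 8, 2, 5, 6, 8]

mean_value = statistics.mean(data)      # 83 / 16 = 5.1875
median_value = statistics.median(data)  # even count: average of the 8th and 9th ordered values
mode_value = statistics.mode(data)      # 5 occurs four times

# Add one extreme value (a hypothetical outlier) and recompute.
with_outlier = data + [100]
mean_out = statistics.mean(with_outlier)      # pulled up sharply by the outlier
median_out = statistics.median(with_outlier)  # stays at 5

print(mean_value, median_value, mode_value)
print(round(mean_out, 2), median_out)
```

A single extreme value roughly doubles the mean here while the median does not move, which is the sense in which the median is less affected by outliers.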

Measures of Asymmetry

Skewness: Skewness is the asymmetry in a statistical distribution, in which the curve appears distorted or skewed to the left or to the right. Skewness indicates whether the data is concentrated on one side.

Positive Skewness: Positive (+ve) skewness is when mean > median > mode. The tail extends to the right, and the outliers lie on the right.

Negative Skewness: Negative (-ve) skewness is when mean < median < mode. The tail extends to the left, and the outliers lie on the left.

Skewness is vital as it tells us where the data is concentrated.
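
One common way to put a number on skewness is the moment-based coefficient (third central moment divided by the 1.5 power of the second). The sketch below uses that definition, which is an assumption on my part since the article does not name a formula, and the datasets are invented for illustration.

```python
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (one common definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data with a long right tail.
right_skewed = [1, 2, 2, 2, 3, 4, 5, 9, 15]
# Mirroring the values flips the tail to the left.
left_skewed = [-v for v in right_skewed]
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(
        name,
        "skewness:", round(skewness(values), 3),
        "mean:", round(statistics.mean(values), 2),
        "median:", statistics.median(values),
    )
```

For the right-skewed list the mean sits above the median, which sits above the mode, and the order reverses for the mirrored list.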

Measures of Variability (Dispersion)

A measure of central tendency gives a single value that represents the entire dataset; however, it cannot describe the observations fully. A measure of dispersion helps us study the variability of the items, that is, the spread of the data.

Remember: Population data has N data points and sample data has n data points. When computing the variance of a sample, we divide by (n-1) instead of n. This is called Bessel's correction, and it is used to reduce bias.

Range: The difference between the largest and the smallest value in the data is termed the range of the distribution. The range does not consider all the values of a series, i.e., it uses only the extreme items, and the middle items are not taken into account. Example: for (5,8,6,4,7,1,2,4,5,7,5,8,2,5,6,8) the range is 7, that is, (8 - 1).

Variance: Variance measures how far the data points are spread around the mean, based on the squared distance from every point to the mean.

Variance is the mean of all squared deviations.

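Variance: population $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$; sample $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$, where $\mu$ is the population mean and $\bar{x}$ is the sample mean.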

Note: The variance is expressed in squared units, not in the units of the original values, so we use another measure of variability.
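
Here is a minimal from-scratch sketch of the variance on the example dataset, computed both ways: dividing by N (population) and by n-1 (sample, with Bessel's correction). The helper name and its `sample` flag are my own; the statistics module is used only to cross-check.

```python
import statistics

# Example dataset from the text (the doubled comma is assumed to be a typo).
data = [5, 8, 6, 4, 7, 1, 2, 4, 5, 7, 5, 8, 2, 5, 6, 8]

def variance(values, sample=True):
    """Mean of squared deviations; divides by n - 1 for a sample, n for a population."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return squared_deviations / (n - 1 if sample else n)

population_variance = variance(data, sample=False)
sample_variance = variance(data, sample=True)

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```

The sample variance always comes out slightly larger, because dividing by n-1 compensates for the fact that deviations are measured from the sample's own mean.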

Standard Deviation: As variance suffers from this unit mismatch, the standard deviation is used. The standard deviation is the square root of the variance. It tells us about the concentration of the data around the mean of the data set.

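Standard deviation: population $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$; sample $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$.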

Coefficient of Variation (CV): It is also termed the relative standard deviation. It is the ratio of the standard deviation to the mean of the dataset.

Coefficient of Variation (CV): $CV = \frac{\sigma}{\mu}$ for a population, or $CV = \frac{s}{\bar{x}}$ for a sample.

The standard deviation describes the variability of a single dataset, whereas the coefficient of variation can be used for comparing two datasets, even when they are measured in different units.
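
To see why the coefficient of variation suits comparisons, the sketch below compares two made-up datasets in different units (heights in cm, weights in kg). It treats both as samples, so it uses the sample standard deviation; the helper name is my own.

```python
import math
import statistics

# Hypothetical measurements in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 70, 78, 85]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)

sd_heights = statistics.stdev(heights_cm)  # in cm
sd_weights = statistics.stdev(weights_kg)  # in kg, not directly comparable to cm

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

# Changing the unit changes the standard deviation but not the CV.
heights_m = [h / 100 for h in heights_cm]
cv_heights_m = coefficient_of_variation(heights_m)

print(round(sd_heights, 2), round(sd_weights, 2))
print(round(cv_heights, 4), round(cv_weights, 4), round(cv_heights_m, 4))
```

Because the CV is unitless, it can say that the weights vary more, relative to their mean, than the heights do, which the two standard deviations alone cannot.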

Measures of Quartiles

Quartiles divide the ordered data into four equal parts. They give a better understanding of the spread because every data point is considered when ranking the data.
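
A quick sketch of quartiles on the example dataset, using `statistics.quantiles` from the standard library (Python 3.8+). The interquartile range (IQR) is an extra I am adding here for illustration. Note that there are several conventions for interpolating quartiles, so other tools may report slightly different cut points.

```python
import statistics

# Example dataset from the text (the doubled comma is assumed to be a typo).
data = [5, 8, 6, 4, 7, 1, 2, 4, 5, 7, 5, 8, 2, 5, 6, 8]

# With n=4, quantiles() returns the three cut points Q1, Q2, Q3.
# The default interpolation method is "exclusive".
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: the spread of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

The second quartile is the median, so it matches the value found earlier for this dataset.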

Measures of Relationship

Measures of relationship are used to find the association between two variables.

Covariance: Covariance is a measure of the relationship between the variability of two variables. It measures how the variables change together: when one variable changes, is there a similar change in the other variable? Covariance does not tell us how strong the relationship is, as it is not normalized.

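Covariance formula: $\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ for a sample (divide by $N$ instead for a population).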

Correlation: Correlation gives a better understanding than covariance. It is normalized covariance and measures how strongly the variables are related to each other. This measure is called the Pearson correlation coefficient.

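Correlation formula: $r = \frac{\operatorname{cov}(x, y)}{s_x s_y}$, where $s_x$ and $s_y$ are the standard deviations of $x$ and $y$.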

The value of correlation ranges from -1 to +1. -1 (minus one) indicates a perfect negative correlation, i.e., as one variable increases, the other decreases. 1 (one) indicates a perfect positive correlation, i.e., as one variable increases, the other also increases. 0 (zero) indicates that there is no linear relationship between the variables; they may still be related in a nonlinear way.
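
Both measures can be sketched from scratch in a few lines. The function names and the small datasets below are my own, chosen so that the three cases (+1, -1, and 0) are easy to verify by hand; the sample form (dividing by n-1) is assumed.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: study hours, test scores, and error counts.
hours = [1, 2, 3, 4, 5]
score = [52, 54, 56, 58, 60]  # rises with hours
errors = [10, 8, 6, 4, 2]     # falls as hours rise

r_pos = correlation(hours, score)
r_neg = correlation(hours, errors)

# Covariance depends on the units; correlation does not.
score_scaled = [s * 10 for s in score]
cov_original = covariance(hours, score)
cov_scaled = covariance(hours, score_scaled)
r_scaled = correlation(hours, score_scaled)

# Zero correlation does not mean independence: y is fully determined by x here.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_zero = correlation(xs, ys)

print(r_pos, r_neg, r_zero)
print(cov_original, cov_scaled, r_scaled)
```

Multiplying the scores by 10 multiplies the covariance by 10 but leaves the correlation unchanged, which is what "normalized" means in practice.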

Other Blogs: https://www.handsonsystem.com/blogs
