# Basic Statistics in Data Science

The term “Data Science” first appeared in 1996 in the title of a statistical conference of the International Federation of Classification Societies (IFCS). In early 1997, there was an even more radical proposal to rename statistics to Data Science. More recently, Donoho provided an overview of Data Science that focuses on its evolution from statistics. Statistics is one of the most important disciplines for providing tools and methods to find structure in data and gain deeper insight into it, and the most important discipline for analyzing and quantifying uncertainty.

One of the most comprehensive definitions of Data Science was recently given by Cao as the formula:

*data science = (statistics + informatics + computing + communication + sociology + management) | (data + environment + thinking).*

Now that you know how statistics relates to data science and how important it is, let’s dive into the basic statistics that will be very helpful in your data science career.

# Overview

Statistics is the science that deals with methodologies to gather, review, analyze, and draw conclusions from data. With the right statistical tools, we can derive key observations and make predictions from the data at hand. In the real world, we deal with many cases where we use statistics, knowingly or unknowingly.

Let’s talk about one such classic use of statistics in the most famous sport in India: yes, you guessed it right, cricket. What makes Virat Kohli the best batsman in ODIs or Jasprit Bumrah the best bowler in ODIs? We have all heard cricketing terms like batting average, bowler’s economy, strike rate, etc. We often see graphs like these:

Here, the ICC uses different statistical methods to compare and rank players and teams. If we learn the science behind it, we can create our own rankings and compare players and teams ourselves. Better still, when we debate with someone over who is the better player, we can argue with facts and figures, because we understand the statistics behind them and can read such graphs more easily.

# Types of Statistics

# Descriptive Statistics

Descriptive statistics is the branch of statistics that uses numbers (numerical facts, figures, or information) to describe a phenomenon. These numbers are themselves called descriptive statistics, e.g. reports of industry production, cricket batting averages, government deficits, movie ratings, etc.

# Inferential statistics

Inferential statistics is concerned with making a decision, estimate, prediction, or generalization about a population based on a sample. A **population** is a collection of all possible individuals, objects, or measurements of interest. A **sample** is a portion, or part, of the population of interest. Inferential statistics is used to make inferences from data, whereas descriptive statistics simply describes what’s going on in our data.
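To make the population/sample distinction concrete, here is a minimal Python sketch. The population is simulated with made-up numbers (the mean of 50, the spread of 10, and the sample size of 200 are all assumptions chosen for illustration), so it only shows the idea of estimating a population value from a sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 made-up scores (not real data).
population = [random.gauss(50, 10) for _ in range(10_000)]

# A sample is a small portion of the population.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# Descriptive step: summarize the sample we actually observed.
# Inferential step: use that summary as an estimate for the population.
print(f"sample mean (estimate): {sample_mean:.2f}")
print(f"population mean (usually unknown): {population_mean:.2f}")
```

In practice the population mean is usually not available, which is exactly why we rely on the sample to estimate it.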

# Measures of Central Tendency

A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.

**Mean**: The mean is the arithmetic average. To calculate it, add up all of the values and divide by the number of observations in your dataset.

**Median**: The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above it and below it. The method for locating the median varies slightly depending on whether your dataset has an even or odd number of values: with an odd number of values, the median is the single middle value; with an even number, it is the average of the two middle values.

**Mode**: The mode is the value that occurs the most frequently in your data set i.e. has the highest frequency. On a bar chart, the mode is the highest bar. If the data have multiple values that are tied for occurring the most frequently, you have a multimodal distribution. If no value repeats, the data do not have a mode.
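The three measures can be computed with Python's standard `statistics` module. The sketch below uses hypothetical innings scores (the numbers are made up for illustration) and also shows the even-length median case and a multimodal data set.

```python
import statistics

# Hypothetical scores from five innings (illustrative numbers only).
runs = [12, 45, 45, 67, 88]

mean_runs = statistics.mean(runs)      # (12 + 45 + 45 + 67 + 88) / 5 = 51.4
median_runs = statistics.median(runs)  # middle value of the sorted data: 45
mode_runs = statistics.mode(runs)      # most frequent value: 45

# Even number of values: the median is the average of the two middle values.
even_runs = [12, 45, 67, 88]
median_even = statistics.median(even_runs)  # (45 + 67) / 2 = 56.0

# Two values tied for the highest frequency: a multimodal data set.
modes = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2]

print(mean_runs, median_runs, mode_runs, median_even, modes)
```

Note that `statistics.multimode` returns every value when nothing repeats (each value is tied with a count of one), so it is up to you to treat that case as "no mode".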

# Skewness: Effects and Uses of Central Tendencies

## What is Skewness

Skewness is an asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a normal distribution.

In a **normal distribution**, the graph appears as a classical, symmetrical “bell-shaped curve.” The mean (or average), the median, and the mode (the maximum point on the curve) are all equal. In a perfect normal distribution, the tails on either side of the curve are exact mirror images of each other.

When a distribution is skewed to the left, the tail on the curve’s left-hand side is longer than the tail on the right-hand side, and the mean is typically less than the mode. This situation is also called **negative skewness**.

When a distribution is skewed to the right, the tail on the curve’s right-hand side is longer than the tail on the left-hand side, and the mean is typically greater than the mode. This situation is also called **positive skewness**.
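One common way to quantify skewness is the moment coefficient: the third central moment divided by the cube of the standard deviation. The text above does not say which measure it has in mind, so treat this choice, the helper name `skewness`, and the made-up data sets as assumptions for illustration.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: third central moment / std**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 10]        # long tail on the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail on the left

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name,
          "skewness:", round(skewness(data), 3),
          "mean:", round(statistics.mean(data), 3),
          "mode:", statistics.mode(data))
```

In the right-skewed data the single large value pulls the mean above the mode, and the sign flips for the mirrored data, matching the description of positive and negative skewness.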

# Measures of Dispersion

A measure of dispersion shows how scattered the data are. It describes how much the values vary from one another and gives a clear idea of how the data are distributed. In other words, it shows how homogeneous or heterogeneous the observations are.

**1. Range**: The range is the simplest and most easily understood measure of dispersion. It is the difference between the two extreme observations of the data set. If X max and X min are the two extreme observations, then

**Range = X max - X min**

Since it is based only on the two extreme observations, it is strongly affected by outliers and by fluctuations in those values. Thus, the range is not a reliable measure of dispersion.

**2. Standard Deviation**: In statistics, the standard deviation is a very common measure of dispersion. It measures how spread out the values in a data set are around the mean. More precisely, it is the square root of the average squared distance between the values and the mean, which makes it a measure of the typical distance of a value from the mean. If the data values are all similar, the standard deviation will be low (closer to zero). If the data values are highly variable, the standard deviation is high (further from zero).

The standard deviation is never negative (it is zero only when all values are identical) and is always measured in the same units as the original data. Squaring the deviations ensures that each point’s distance from the mean counts as a non-negative quantity, so positive and negative deviations do not cancel out; this avoids having to ignore signs explicitly, as the mean (absolute) deviation does.

**3. Variance**: The variance is defined as the average of the squared differences from the mean. The standard deviation is its square root.
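The three measures above can be sketched in a few lines of Python. The data set is made up for illustration. The population versions (`pvariance`, `pstdev`) match the "average of the squared differences" definition used here; the sample version (`variance`), which divides by n - 1 instead of n, is shown only for comparison.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, mean = 5

data_range = max(data) - min(data)        # 9 - 2 = 7
variance = statistics.pvariance(data)     # average squared deviation: 32 / 8 = 4
std_dev = statistics.pstdev(data)         # square root of the variance: 2.0

# Sample variance divides by n - 1 instead of n: 32 / 7
sample_variance = statistics.variance(data)

print(data_range, variance, std_dev, round(sample_variance, 3))
```

Which divisor to use depends on whether the data are the whole population or a sample from it; the document's definition corresponds to the population form.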

# Coefficient of Variation (CV)

The coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage, and is defined as the ratio of the standard deviation (σ) to the mean (μ). It measures variability relative to the size of the mean, which makes it possible to compare the spread of data sets with different units or scales.

**CV = Standard Deviation / Mean**
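As a rough illustration, the sketch below compares two hypothetical batsmen with the same batting average but very different consistency. The scores and the helper name `coefficient_of_variation` are made up for this example, and the population standard deviation is assumed.

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean (population standard deviation assumed)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean

# Hypothetical scores: both batsmen average 50 runs.
batsman_a = [48, 50, 52, 49, 51]    # very consistent
batsman_b = [5, 110, 20, 95, 20]    # same average, far more variable

cv_a = coefficient_of_variation(batsman_a)
cv_b = coefficient_of_variation(batsman_b)

print(f"Batsman A: CV = {cv_a:.1%}")
print(f"Batsman B: CV = {cv_b:.1%}")
```

A lower CV means less variability relative to the mean, so on these made-up numbers batsman A is the more consistent scorer even though the averages are identical.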

# Covariance and Correlation

## Covariance

Covariance measures how two variables vary together.

1. It describes the relationship between a pair of random variables, where a change in one variable is associated with a change in the other.

2. It can take any value from -infinity to +infinity, where a negative value represents a negative relationship and a positive value represents a positive relationship.

3. It measures the linear relationship between variables.

4. It gives the direction of the relationship between variables, but not its strength.

5. It has dimensions: its units are the product of the units of the two variables.
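The points above can be sketched in Python. The helper `covariance` computes the population covariance (the mean of the products of deviations from the means); the height and weight figures are made up for illustration.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical measurements for five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 56, 63, 70, 76]

cov_cm = covariance(heights_cm, weights_kg)   # 132.0, in units of cm * kg

# Same people, heights in metres: the covariance shrinks by a factor of 100.
heights_m = [h / 100 for h in heights_cm]
cov_m = covariance(heights_m, weights_kg)     # about 1.32, in units of m * kg

# Reversing one variable flips the sign (a negative relationship).
cov_negative = covariance(heights_cm, weights_kg[::-1])

print(cov_cm, round(cov_m, 4), cov_negative)
```

The sign gives the direction of the relationship, but because the value changes with the units of measurement, its size alone says little about how strong the relationship is.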

**Covariance Relationship**

## Correlation

* It shows whether and how strongly pairs of variables are related to each other.

* Correlation takes values between -1 and +1, where values close to +1 represent a strong positive correlation and values close to -1 represent a strong negative correlation.

* It describes how variables are related to each other without implying that one directly causes the other.

* It gives the direction and strength of relationship between variables.

* It is the scaled version of Covariance.

* It is dimensionless.
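A minimal sketch of the Pearson correlation coefficient, computed as the covariance divided by the product of the two standard deviations. The helper name `pearson_correlation` and the height and weight figures are assumptions made for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Hypothetical measurements for five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 56, 63, 70, 76]

r_cm = pearson_correlation(heights_cm, weights_kg)

# Changing the units of height does not change the correlation.
heights_m = [h / 100 for h in heights_cm]
r_m = pearson_correlation(heights_m, weights_kg)

# A perfectly decreasing linear relationship gives -1.
r_negative = pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2])

print(round(r_cm, 4), round(r_m, 4), round(r_negative, 4))
```

Because the scaling cancels the units, the result stays the same whether height is measured in centimetres or metres, which is what "dimensionless" means in practice. Python 3.10 and later also ship `statistics.correlation` for the same calculation.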

**Correlation Relationship**

Positive Correlation

Two variables are positively correlated when their values move in the same direction, i.e. when the value of one variable increases (decreases), the value of the other variable also increases (decreases).

Examples:

1) Height and weight of persons

2) Amount of rainfall and crop yield

3) Income and expenditure of households

4) Speed of a wind turbine and the amount of electricity generated

5) The more years of education you complete, the higher your earning potential will be

6) As the temperature goes up, ice cream sales also go up

7) The more it rains, the more umbrella sales go up

Negative Correlation

Two variables are negatively correlated when their values move in opposite directions, i.e. when the value of one variable increases (decreases), the value of the other variable decreases (increases).

Examples:

1) Price and demand of goods

2) Poverty and literacy

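3) Sine function and cosine function on the interval from 0 to π/2, where one rises as the other falls (over a full period the two are uncorrelated)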

4) If a train increases speed, the length of time to get to the final point decreases

5) The more one works out at the gym, the less body fat one may have

6) As the temperature decreases, sales of heaters increase

Zero Correlation

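When two variables are independent of each other, they have zero correlation. The converse does not hold in general: zero correlation only rules out a linear relationship, so two variables can be strongly related and still have a correlation of zero.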
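A small made-up example shows why zero correlation should be read with care. Below, `y` is completely determined by `x` (it is `x` squared), yet the Pearson correlation is zero because the relationship is not linear. The helper name and data are assumptions for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]   # y depends entirely on x, but not linearly

r = pearson_correlation(xs, ys)
print(r)   # zero: the positive and negative halves cancel out
```

So a correlation of zero tells us there is no linear relationship, not that the variables have nothing to do with each other.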

**Note: When the data are standardized (each variable rescaled to zero mean and unit standard deviation), covariance and correlation give the same value. Also, correlation and causality are not the same thing.**
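The first part of the note can be checked numerically. The sketch below assumes that "scaled" means standardized to z-scores (subtract the mean, divide by the standard deviation); the helper names and the data are made up for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def population_std(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def covariance(xs, ys):
    """Population covariance of two equally long lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

def standardize(values):
    """Rescale to z-scores: zero mean and unit standard deviation."""
    m, s = mean(values), population_std(values)
    return [(v - m) / s for v in values]

# Hypothetical measurements for five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 56, 63, 70, 76]

raw_cov = covariance(heights_cm, weights_kg)   # 132.0, depends on the units
correlation = raw_cov / (population_std(heights_cm) * population_std(weights_kg))

# Covariance of the standardized variables equals the correlation.
scaled_cov = covariance(standardize(heights_cm), standardize(weights_kg))

print(round(raw_cov, 4), round(correlation, 4), round(scaled_cov, 4))
```

The equality relies on using the same convention (population or sample) for both the standard deviation and the covariance.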