Mathematics & Statistics

Part — 1

Shreyal Gajare
Omni Data Science
6 min read · Jun 23, 2019


Mathematics deals with the complex calculations & computations necessary for accurate analysis. Statistics is an area of mathematics that is widely accepted as a prerequisite for a deeper understanding of Data Science. It deals with the science behind data, and thus helps in finding appropriate methods for data collection, employing the correct analyses and presenting the outcomes effectively.

First we will focus only on statistics. Statistics is not just numbers, facts or evidence. It provides guidance for analysis and prediction, helping you avoid common errors and traps. You do not need to be a master of statistics, but the basic theory should be clear to you. Let us start with this exciting tutorial on STATISTICS.

There are two major branches of statistics, namely Descriptive Statistics & Inferential Statistics. Both are used in the statistical analysis of data and are equally important.

A. Descriptive Statistics:

Descriptive statistics mainly deals with the collection & presentation of data, so it is the initial part of statistical analysis. It is not as easy as it sounds. Here the data needs to be described, presented, summarized and organized through numerical computations, tabular formats or graphical methods.

1. Difference between a Population and a Sample:

Population is the collection of all the items of interest in our study. Its size is denoted by “N”. The numbers obtained when working with a population are called parameters.

Sample is a subset of the population. Its size is denoted by “n”. The numbers obtained when working with a sample are called statistics.
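To make the distinction concrete, here is a minimal sketch using Python's standard library. The population of heights is made up for illustration (the values, the seed and the sizes are assumptions, not data from this tutorial).

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: heights (cm) of N = 1000 people.
population = [random.gauss(170, 10) for _ in range(1000)]
N = len(population)

# A sample is a subset of the population, here of size n = 50.
sample = random.sample(population, 50)
n = len(sample)

mu = statistics.mean(population)   # parameter: computed on the population
x_bar = statistics.mean(sample)    # statistic: computed on the sample

print(f"N = {N}, population mean (parameter) = {mu:.2f}")
print(f"n = {n}, sample mean (statistic)     = {x_bar:.2f}")
```

The sample mean will usually be close to the population mean without being exactly equal to it, which is why statistics are treated as estimates of parameters.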

2. Types of Data:

Categorical Data: Categorical data consists of categories or groups, which may not have any logical order. E.g. answers to Yes/No questions.

Numerical Data: Numerical data consists of numbers that represent a measurement or a count, such as a person’s weight, age or height. It has two subtypes, namely Continuous data and Discrete data.

Continuous Data: Continuous data can take any value within a range, so the possible values cannot be counted, but they can be described using intervals on the real number line. E.g. [0–20], [5–10] etc.

Discrete Data: Discrete data takes separate, countable values that can be listed out. E.g. 0, 1, 2…

2.1 Visualization techniques for Categorical Data:

  1. Frequency Distribution Table (FDT)
  2. Bar Charts
  3. Pie Charts
  4. Pareto diagrams
  5. Cross Tables
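As a rough sketch of the first technique in the list, a frequency distribution table for categorical data can be built with `collections.Counter`. The survey answers below are hypothetical.

```python
from collections import Counter

# Hypothetical survey answers (categorical data).
answers = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

counts = Counter(answers)
total = sum(counts.values())

# Rows are (category, frequency, relative frequency), most frequent first.
table = [(cat, freq, freq / total) for cat, freq in counts.most_common()]

print(f"{'Category':<10}{'Frequency':>10}{'Relative':>10}")
for cat, freq, rel in table:
    print(f"{cat:<10}{freq:>10}{rel:>10.2%}")
```

The same counts are what a bar chart or pie chart would display, and sorting the categories from most to least frequent is the ordering a Pareto diagram uses.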

2.2 Visualization techniques for Numerical Data:

  1. Frequency Distribution Table
  2. Histograms
  3. Scatterplots
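For numerical data, the frequency distribution table groups values into intervals, and a histogram is simply a plot of those counts. A minimal text-only sketch, assuming made-up ages and an arbitrary interval width of 10:

```python
# Hypothetical ages (numerical data).
ages = [3, 7, 12, 15, 18, 21, 24, 29, 33, 38, 41, 47]

bin_width = 10  # assumed interval width, chosen only for this illustration
bins = {}       # maps the start of each interval to its frequency
for age in ages:
    start = (age // bin_width) * bin_width
    bins[start] = bins.get(start, 0) + 1

# Print one row per interval, with a bar of '#' as a crude histogram.
for start in sorted(bins):
    label = f"[{start}, {start + bin_width})"
    print(f"{label:<10} {'#' * bins[start]}")
```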

3. Levels of Measurement:

Level of measurement is a classification which describes the nature of data.

Nominal: These are categories (labels) that do not have any definite order. E.g. car brands like Audi, BMW, Toyota etc.

Ordinal: These are categories that follow a strict, definite order, but the gaps between them are not measured. E.g. Good, better, best.

Interval: The differences between values are meaningful, but there is no true 0 value, so ratios are not meaningful. E.g. the Celsius temperature scale is defined by two reference points (the freezing and boiling points of water) with the range between them divided into 100 parts; 0 °C does not mean “no temperature”.

Ratio: Has a true (meaningful) 0 value, so ratios between values are meaningful. E.g. length, mass, angle etc.
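The difference between the interval and ratio levels can be checked with a little arithmetic. The sketch below compares two arbitrary temperatures in Celsius (interval scale) and in Kelvin (a ratio scale, since 0 K is a true zero).

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0  # arbitrary example temperatures

# Differences are meaningful on an interval scale: both scales agree.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are not: the Celsius ratio depends on where 0 was placed.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(f"difference: {diff_c:.1f} in Celsius, {diff_k:.1f} in Kelvin")
print(f"ratio: {ratio_c:.2f} in Celsius, {ratio_k:.3f} in Kelvin")
```

So 20 °C is 10 degrees warmer than 10 °C, but it is not "twice as hot": on the Kelvin scale the ratio is only about 1.035.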

4. Measures of Central Tendency:

A measure of central tendency is a statistic that describes the central point of the distribution of the available data. The three most common measures of central tendency are mean, mode and median.

Mean (a.k.a Simple Average): It is the sum of all the observations divided by the number of observations. It is best used when the distribution is symmetrical. The main disadvantage of the mean is that it is easily affected by outliers.

Mode: The mode is the value that occurs most often. It can be used for both numerical & categorical data. If no value is repeated, the data has no mode.

Median: It is the middle number in an ordered dataset. The median is at position (n+1)/2 in the ordered list, where n is the number of observations. For an odd number of observations the exact middle value is the median; for an even number, (n+1)/2 falls between two positions and the median is the average of the two central elements.
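The three measures can be computed with Python's built-in `statistics` module. The salary figures below are invented for illustration; the last value is a deliberate outlier to show how it affects the mean.

```python
import statistics

# Hypothetical salaries (in thousands); 200 is an outlier.
salaries = [30, 32, 32, 35, 38, 40, 200]

mean_all = statistics.mean(salaries)
median_all = statistics.median(salaries)   # 7 values, so the 4th one
mode_all = statistics.mode(salaries)       # most frequent value

# Recompute without the outlier to see which measure moves the most.
without_outlier = salaries[:-1]
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)  # average of 3rd and 4th

print(f"with outlier:    mean={mean_all:.1f}, median={median_all}, mode={mode_all}")
print(f"without outlier: mean={mean_trimmed:.1f}, median={median_trimmed}")
```

The outlier pulls the mean from 34.5 up to about 58.1, while the median only moves from 33.5 to 35.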

5. Measures of Asymmetry:

Skewness is the most common measure of asymmetry; it indicates the direction in which the observations in a dataset are concentrated. Measures of asymmetry like skewness are the link between central tendency measures & probability theory. The different types of skewness are given below:

In Positive skew, data points are concentrated on the left side and the tail extends towards the right side. It is also known as right skew, which indicates that the outliers are to the right. In this scenario typically mean>median, and the mode is the highest point of the distribution.

In Negative skew, data points are concentrated on the right side and the tail extends towards the left side. It is also called left skew, indicating that the outliers lie to the left. Here, typically mean<median, and the mode is again the highest point.

If the distribution is symmetrical, it has Zero skew or no skew. For a symmetrical distribution with a single peak, mean=median=mode.

Figure: Types of Skewness
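The tutorial does not give a formula for skewness, so the sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5). The datasets are made up.

```python
import statistics

def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (one common definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]    # long tail to the right
left_skewed = [-x for x in right_skewed]    # mirror image, tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(f"{name:<10} skew={skewness(data):+.2f} "
          f"mean={statistics.mean(data):.2f} median={statistics.median(data)}")
```

The sign of the result gives the direction of the tail: positive for the right-skewed data (where the mean exceeds the median), negative for the left-skewed data, and zero for the symmetric data.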

6. Measures of Variability:

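Variance: Variance measures the dispersion of a set of data points around their mean. For population variance,

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

For sample variance,

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Here $\mu$ is the population mean and $\bar{x}$ is the sample mean.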

Standard Deviation: The standard deviation is the square root of variance. Standard deviation is the most common measure of variability for a single dataset.
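As a minimal sketch, Python's `statistics` module provides both versions: `pvariance` and `pstdev` divide by N (population), while `variance` and `stdev` divide by n - 1 (sample). The data values are made up.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5

pop_var = statistics.pvariance(data)   # sum of squared deviations / N
samp_var = statistics.variance(data)   # sum of squared deviations / (n - 1)
pop_std = statistics.pstdev(data)      # square root of the population variance
samp_std = statistics.stdev(data)      # square root of the sample variance

print(f"population: variance={pop_var}, std dev={pop_std}")
print(f"sample:     variance={samp_var:.3f}, std dev={samp_std:.3f}")
```

The squared deviations here add up to 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, which is slightly larger. The standard deviation is easier to interpret because it is in the same units as the data.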

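Coefficient of Variation (CV): (a.k.a Relative Standard Deviation). The coefficient of variation is the standard deviation divided by the mean. Since it has no units, it is used for comparing the variability of two or more datasets.

The population and sample formulas are,

$$c_v = \frac{\sigma}{\mu}$$

$$\hat{c}_v = \frac{s}{\bar{x}}$$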
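To see why the coefficient of variation is suited to comparing datasets, the sketch below expresses the same hypothetical prices in two units (a currency and its sub-unit, assumed to be 1:100). The helper name `coefficient_of_variation` is introduced here for illustration.

```python
import statistics

def coefficient_of_variation(data):
    """Sample coefficient of variation: standard deviation / mean."""
    return statistics.stdev(data) / statistics.mean(data)

# Same hypothetical prices, in a major unit and in a sub-unit (x100).
prices_major = [1.0, 2.0, 3.0, 4.0, 5.0]
prices_minor = [p * 100 for p in prices_major]

std_major = statistics.stdev(prices_major)
std_minor = statistics.stdev(prices_minor)
cv_major = coefficient_of_variation(prices_major)
cv_minor = coefficient_of_variation(prices_minor)

print(f"std dev: {std_major:.3f} vs {std_minor:.3f}")
print(f"CV:      {cv_major:.3f} vs {cv_minor:.3f}")
```

The standard deviations differ by a factor of 100 only because of the unit, while the coefficient of variation is the same for both, so it reflects the relative spread of the data.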

7. Covariance:

When two variables tend to vary together, the main statistic used to measure this joint variability is called covariance. Covariance may be positive, zero or negative.

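Sample formula

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Population formula

$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$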

Covariance gives a sense of direction:

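a) If the two variables move together (same direction), it is positive (+)

b) If the two variables move in opposite directions, it is negative (-)

c) If it is equal to 0, the two variables have no linear relationship. Independent variables always have a covariance of 0, but a covariance of 0 does not by itself prove independence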
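The three cases can be checked with a small hand-written function for the sample covariance (dividing by n - 1). The function name and the data are illustrative assumptions.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations / (n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 6, 8, 10]       # y = 2x
opposite_direction = [10, 8, 6, 4, 2]   # y = 12 - 2x
curved = [4, 1, 0, 1, 4]                # y = (x - 3)**2

cov_same = sample_covariance(x, same_direction)
cov_opposite = sample_covariance(x, opposite_direction)
cov_curved = sample_covariance(x, curved)

print(cov_same, cov_opposite, cov_curved)
```

Note that the last pair is fully dependent (y is computed from x), yet its covariance is 0, so a zero covariance only rules out a linear relationship.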

8. Correlation Coefficient:

Correlation adjusts covariance by dividing it by the product of the two standard deviations, so that the relationship between two variables becomes easy and intuitive to interpret.

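A correlation coefficient of 1 means perfect positive correlation, i.e. the entire variability of one variable is explained by the other variable. A correlation coefficient of -1 implies perfect negative correlation. Imperfect negative correlation lies strictly between -1 and 0, i.e. in the range (-1,0), and imperfect positive correlation lies in the range (0,1). A coefficient of 0 means there is no linear relationship.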
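As a sketch, the (Pearson) correlation coefficient can be computed as the covariance divided by the product of the two standard deviations. The function name `pearson_r` and the data are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Correlation: covariance / (std dev of x * std dev of y).

    The 1/(n - 1) factors cancel, so plain sums of deviations are used.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_perfect_pos = pearson_r(x, [2 * v for v in x])   # y = 2x
r_perfect_neg = pearson_r(x, [-v for v in x])      # y = -x
r_imperfect = pearson_r(x, [2, 4, 5, 4, 5])        # noisy upward trend

# Changing the units of x (e.g. multiplying by 100) does not change r.
r_rescaled = pearson_r([100 * v for v in x], [2, 4, 5, 4, 5])

print(f"{r_perfect_pos:.2f} {r_perfect_neg:.2f} {r_imperfect:.3f} {r_rescaled:.3f}")
```

Unlike covariance, the result always lies between -1 and 1 and does not depend on the units of the variables, which is what makes it easy to interpret.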

Guys, I hope you have understood Part 1 statistics very well. Just practice it well and stay tuned with Omni Data Science.
