This is the third post in the blog series ‘Probability & Statistics for Data Science’. It covers the following topics related to descriptive statistics and their significance in data science:
- Introduction to Statistics
- Descriptive Statistics
- Uni-variate Analysis
- Bi-variate Analysis
- Multivariate Analysis
- Function Models
- Significance in Data Science
Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data.
Statistics, in short, is the study of data. It includes descriptive statistics (the study of methods and tools for collecting data, and mathematical models to describe and interpret data) and inferential statistics (the systems and techniques for making probability-based decisions and accurate predictions).
Population vs Sample
A population is the aggregate of all elements under study that share one or more common characteristics, while a sample is a subset of the population chosen at random for participation in the study.
A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information. Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at hand.
Types of Variable
Dependent and Independent Variables: An independent variable (experimental or predictor) is a variable that is being manipulated in an experiment in order to observe the effect on a dependent variable (outcome).
Categorical and Continuous Variables: Categorical variables (qualitative) represent types of data which may be divided into groups. Categorical variables can be further categorized as either nominal, ordinal or dichotomous. Continuous variables (quantitative) can take any value. Continuous variables can be further categorized as either interval or ratio variables.
Central tendency is a central or typical value for a distribution. It may also be called the center or location of the distribution. The most common measures of central tendency are the arithmetic mean, the median and the mode.
The mean is the numerical average of all values, the median is the middle value of the sorted data set, and the mode is the most frequent value in the data set.
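These three measures can be computed directly with Python's standard library. A minimal sketch, using made-up data:

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 10]

print(mean(data))    # 5
print(median(data))  # 4.0  (average of the two middle values, 3 and 5)
print(mode(data))    # 3
```

Note that for an even number of observations the median is the average of the two middle values, and a data set can have more than one mode.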
Spread or Variance
Spread (dispersion or variability) is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and inter-quartile range (IQR).
Inter-quartile range (IQR) is the distance between the 1st quartile and 3rd quartile and gives us the range of the middle 50% of our data. Variance is the average of the squared differences from the mean while standard deviation is the square root of the variance.
Upper outlier fence: Q3 + 1.5 · IQR
Lower outlier fence: Q1 − 1.5 · IQR
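A quick sketch of these spread measures and outlier fences using NumPy on made-up data (the exact quartile values depend on NumPy's default linear interpolation):

```python
import numpy as np

data = np.array([4, 7, 9, 11, 12, 20])

# quartiles and inter-quartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# outlier fences from the formulas above
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# variance: mean of squared deviations from the mean
variance = data.var()
# standard deviation: square root of the variance
std_dev = data.std()
```

Any observation above `upper_fence` or below `lower_fence` would be flagged as a potential outlier.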
Standard Score or Z score: For an observed value x, the Z score finds the number of standard deviations x is away from the mean.
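As a small worked example of the Z score, consider a data set whose population standard deviation comes out to a round number:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = data.mean()   # 5.0
std = data.std()     # 2.0 (population standard deviation)

x = 9
z = (x - mean) / std  # 2.0 -> x lies two standard deviations above the mean
```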
The Central Limit Theorem tells us the following, regardless of whether the population distribution is normal or not:
1. the mean of the sample means is the same as the population mean
2. the standard deviation of the sample means equals the standard error, σ/√n
3. the distribution of sample means will become increasingly more normal as the sample size increases.
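These facts are easy to see in a simulation sketch: draw repeated samples from a deliberately non-normal (uniform) population and check that the mean of the sample means lands close to the population mean. The population size, sample size, and number of samples below are arbitrary choices for illustration.

```python
import random

random.seed(0)

# A non-normal (uniform) population
population = [random.uniform(0, 100) for _ in range(100_000)]
pop_mean = sum(population) / len(population)

n = 50  # sample size
sample_means = []
for _ in range(2_000):
    sample = random.sample(population, n)
    sample_means.append(sum(sample) / n)

mean_of_means = sum(sample_means) / len(sample_means)
# mean_of_means lands very close to pop_mean, and a histogram of
# sample_means would look approximately normal
```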
In uni-variate analysis, the appropriate statistic depends on the level of measurement. For nominal variables, a frequency table and a listing of the mode(s) is sufficient. For ordinal variables the median can be calculated as a measure of central tendency and the range (and variations of it) as a measure of dispersion. For interval level variables, the arithmetic mean (average) and standard deviation are added to the toolbox and, for ratio level variables, we add the geometric mean and harmonic mean as measures of central tendency and the coefficient of variation as a measure of dispersion.
For interval and ratio level data, further descriptors include the variable’s skewness and kurtosis. Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
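Skewness and kurtosis can be computed from standardized moments. A minimal sketch using plain NumPy on made-up, right-skewed data (the single large value 9 pulls the tail to the right, so skewness comes out positive):

```python
import numpy as np

data = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 9.0])
m = data.mean()
s = data.std()  # population standard deviation

# third standardized moment: positive -> right tail, negative -> left tail
skewness = ((data - m) ** 3).mean() / s ** 3

# fourth standardized moment minus 3 ("excess" kurtosis): a normal
# distribution scores 0; positive values indicate heavier tails
kurtosis = ((data - m) ** 4).mean() / s ** 4 - 3
```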
Mainly, bar graphs, pie charts and histograms are used for uni-variate analysis.
Bivariate analysis involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.
For two continuous variables, a scatter-plot is a common graph. When one variable is categorical and the other continuous, a box-plot or violin-plot (also Z-test and t-test) is common and when both are categorical a mosaic plot is common (also chi-square test).
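Alongside a scatter-plot, the strength of a linear relationship between two continuous variables is often summarized by the Pearson correlation coefficient. A small sketch with made-up data where y is roughly twice x:

```python
import numpy as np

# Two continuous variables (illustrative data, roughly y = 2x)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson correlation coefficient, in [-1, 1]
r = np.corrcoef(x, y)[0, 1]
# r close to +1 indicates a strong positive linear relationship
```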
Multi-variate analysis involves observation and analysis of more than one statistical outcome variable at a time. Multi-variate scatter plot, grouped box-plot (or grouped violin-plot), heat-map are used for multi-variate analysis.
A function can be expressed as an equation, y = f(x), where f represents the function name, x the independent variable, and y the dependent variable.
A linear function has the same average rate of change on every interval. When a linear model is used to describe data, it assumes a constant rate of change.
In exponential functions, the variable appears as the exponent (or power) instead of the base.
The logistic function imposes an upper bound: its curve grows exponentially at first, then slows down and eventually hardly grows at all.
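The three function models can be sketched directly in Python; the parameter names and default values below are illustrative choices, not part of any standard API:

```python
import math

def linear(x, m=2.0, b=1.0):
    # constant rate of change m on every interval
    return m * x + b

def exponential(x, a=1.0, r=0.5):
    # the variable x appears in the exponent
    return a * math.exp(r * x)

def logistic(x, L=1.0, k=1.0, x0=0.0):
    # grows exponentially at first, then levels off toward the upper bound L
    return L / (1 + math.exp(-k * (x - x0)))
```

For example, `logistic(0.0)` is exactly L/2 = 0.5 at the midpoint, and for large x the value approaches but never exceeds L.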
Significance in Data Science
Descriptive statistics helps you understand your data and is an initial and very important step in data science. Data science is largely about making predictions, and you cannot predict well if you do not understand the patterns in the existing data.