Descriptive and Inferential Statistics

Aditi Kanungo
Published in Analytics Vidhya
4 min read · Feb 9, 2021

Insight is gained by analyzing data and information in order to understand the context of a particular situation and draw conclusions. Those conclusions lead to actions you can apply to your business.

Statistics is concerned with developing and studying methods for collecting, analyzing, and presenting empirical data.

Descriptive statistics and inferential statistics are the two broad categories in the field of statistics. Descriptive statistics describes data (for example, with a chart or graph), while inferential statistics allows you to make predictions (“inferences”) from that data. With inferential statistics, you take data from a sample and make generalizations about a population.

Descriptive Statistics

Descriptive statistics are important because raw data, especially in large quantities, is hard to interpret on its own. Descriptive statistics lets us present the data in a more meaningful way, which makes it simpler to interpret. For example, if we have store data, we may want to know the most profitable product or the overall profit of the store.
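As a small illustration of the store example, here is a minimal Python sketch. The product names and profit figures are made up for this example and are not from any real data set.

```python
# Hypothetical store data: product name -> profit (illustrative values only)
profits = {"notebook": 120.0, "pen": 45.5, "backpack": 310.0, "stapler": 80.0}

# Overall profit of the store: the sum of all per-product profits
total_profit = sum(profits.values())

# Most profitable product: the key whose profit is largest
best_product = max(profits, key=profits.get)

print(total_profit)   # 555.5
print(best_product)   # backpack
```

Two numbers summarize the whole table here, which is the point of a descriptive statistic.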

Approaches to descriptive statistics can be distinguished by the number of attributes (one-dimensional, multidimensional) as well as by the type of data (quantitative and qualitative).

We will look into the following variants:

  • one-dimensional tasks with numerical data.
  • one-dimensional tasks with categorical data.
  • one-dimensional tasks with mixed data.

Numerical data descriptive statistics:

We will consider the following set of characteristics as the basic descriptive statistics for a one-dimensional data array:

  • mean value
  • maximum value and minimum value
  • standard deviation
  • variance
  • range
  • quartiles
  • mode
  • kurtosis
  • skewness
  • median

Maximum and minimum are self-explanatory, so we will start with the mean.

Mean: The mean is the sum of all observations in the data divided by the total number of observations. It is also known as the average. The mean is thus a central value around which the data set is spread.
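A minimal sketch of this definition in Python, using made-up values. The manual computation follows the definition directly, and `statistics.mean` from the standard library is shown for comparison.

```python
from statistics import mean

data = [4, 8, 15, 16, 23, 42]  # illustrative values

# Definition: sum of all observations divided by the number of observations
manual_mean = sum(data) / len(data)

# The standard library gives the same result
library_mean = mean(data)

print(manual_mean)   # 18.0
print(library_mean)  # 18
```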

Median: The median is the point that divides the data into two equal halves. One half of the data is less than the median, and the other half is greater than it. The median is calculated by first arranging the data in either ascending or descending order.

  • If the number of observations is odd, the median is the middle observation in the sorted data.
  • If the number of observations is even, the median is the mean of the two middle observations in the sorted data.

Note that the sort order (ascending or descending) does not affect the median.
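The odd and even cases above can be sketched as follows. The helper name `median_manual` and the sample lists are chosen for this example, and `statistics.median` is included as a cross-check.

```python
from statistics import median


def median_manual(values):
    """Return the median by sorting and picking the middle value(s)."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle observation
        return ordered[mid]
    # Even count: the mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


odd = [7, 1, 5, 3, 9]   # sorted: 1, 3, 5, 7, 9
even = [7, 1, 5, 3]     # sorted: 1, 3, 5, 7

print(median_manual(odd))    # 5
print(median_manual(even))   # 4.0
print(median(odd), median(even))  # 5 4.0
```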

Mode: The mode is the value with the highest frequency in the data set, that is, the value that appears the most times. A data set can have one or more modes.

  • If only one number appears the maximum number of times, the data has one mode and is called uni-modal.
  • If two numbers appear the maximum number of times, the data has two modes and is called bi-modal.
  • If more than two numbers appear the maximum number of times, the data has more than two modes and is called multi-modal.
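The three cases can be checked with `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest frequency. The lists below are made up for illustration.

```python
from statistics import multimode

unimodal = [1, 2, 2, 3]               # 2 appears most often
bimodal = [1, 1, 2, 3, 3]             # 1 and 3 tie
multimodal = [1, 1, 2, 2, 3, 3, 4]    # 1, 2 and 3 tie

# multimode returns a list of all modes, in order of first appearance
print(multimode(unimodal))    # [2]
print(multimode(bimodal))     # [1, 3]
print(multimode(multimodal))  # [1, 2, 3]
```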

Variance: Variance measures how far the data points are spread out from the mean; it is the average of the squared deviations from the mean. A high variance indicates that the data points are spread widely, and a small variance indicates that the data points are close to the mean of the data set.

Standard Deviation: The standard deviation is the square root of the variance.
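Here is a minimal sketch of both quantities on made-up data. One detail the text above does not cover: dividing by n gives the population variance, while dividing by n - 1 gives the sample variance. The standard library exposes both, as `pvariance`/`pstdev` and `variance`/`stdev`.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean is 5

# Population variance by hand: average squared deviation from the mean
mu = mean(data)
squared_deviations = [(x - mu) ** 2 for x in data]
manual_variance = sum(squared_deviations) / len(data)

# Standard deviation: square root of the variance
manual_std = manual_variance ** 0.5

print(manual_variance, manual_std)        # 4.0 2.0
print(pvariance(data), pstdev(data))      # 4 2.0

# Sample versions divide by n - 1 instead of n, so they are slightly larger
print(variance(data) > pvariance(data))   # True
print(stdev(data) > pstdev(data))         # True
```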

Range: The range is the difference between the maximum value and the minimum value in the data set.

Quartiles: Quartiles are the points that divide the sorted data set into four equal parts. Q1, Q2 and Q3 are the first, second and third quartiles of the data set.

  • 25% of the data points lie below Q1 and 75% lie above it.
  • 50% of the data points lie below Q2 and 50% lie above it. Q2 is the median.
  • 75% of the data points lie below Q3 and 25% lie above it.
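The range and the quartiles can be sketched with the standard library. Several conventions exist for interpolating quartiles, so different tools can return slightly different values on the same data; this example assumes the `"inclusive"` method of `statistics.quantiles` (Python 3.8+), and the data is made up.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative values

# Range: maximum minus minimum
data_range = max(data) - min(data)

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(data_range)          # 8
print(q1, q2, q3)          # 3.0 5.0 7.0
print(q2 == median(data))  # True, Q2 is the median
```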

Skewness: Skewness measures the asymmetry of a probability distribution. It can be positive, negative, zero, or undefined.

  • Positive Skew: The tail on the right side of the curve is longer than the tail on the left side. For these distributions, the mean is typically greater than the mode.
  • Negative Skew: The tail on the left side of the curve is longer than the tail on the right side. For these distributions, the mean is typically smaller than the mode.

A symmetrical distribution has zero skewness. If the skewness is negative, the distribution is negatively skewed, and if it is positive, it is positively skewed.
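To make the sign convention concrete, here is a small sketch. It assumes the moment-based (population) definition of skewness, the third central moment divided by the variance raised to the power 1.5, which is one of several definitions in use. The function name and data are chosen for this example.

```python
from statistics import mean, mode


def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        # All values are equal, so the ratio is undefined
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 10]              # long tail on the right
left_skewed = [-x for x in right_skewed]     # mirror image, tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)   # True
print(skewness(left_skewed) < 0)    # True
print(skewness(symmetric))          # 0.0

# In the right-skewed sample the mean is pulled above the mode
print(mean(right_skewed) > mode(right_skewed))  # True
```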

Kurtosis: Kurtosis describes whether the data is light-tailed (few outliers) or heavy-tailed (outliers present) compared to a normal distribution.
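A rough sketch of the comparison with the normal distribution: this example assumes the moment-based excess kurtosis, the fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores 0. Positive values then suggest heavier tails and negative values lighter tails. The function name and data are made up for illustration.

```python
def excess_kurtosis(values):
    """Moment-based (population) excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3


light_tailed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # evenly spread, no outliers
heavy_tailed = [4, 5, 5, 5, 5, 5, 5, 5, 6, 30]    # one extreme outlier

print(excess_kurtosis(light_tailed) < 0)  # True
print(excess_kurtosis(heavy_tailed) > 0)  # True
```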

Categorical data descriptive statistics:

In the case of categorical data, only a limited number of operations are available. The following operations can be performed on non-numeric data:

  • Determining the number of unique items.
  • Determining the frequency of those items
  • Determining the most frequently occurring item (mode of distribution)
  • Determining the rarest item
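All four operations above can be sketched with `collections.Counter` from the standard library. The list of colors is invented for this example.

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "red"]  # illustrative data

# Frequency of each item
counts = Counter(colors)

# Number of unique items
n_unique = len(counts)

# Most frequently occurring item (the mode of the distribution)
most_common_item, most_common_count = counts.most_common(1)[0]

# Rarest item: the one with the smallest count
rarest_item, rarest_count = min(counts.items(), key=lambda pair: pair[1])

print(n_unique)                             # 3
print(counts["red"], counts["blue"])        # 3 2
print(most_common_item, most_common_count)  # red 3
print(rarest_item, rarest_count)            # green 1
```

If several items tie for most or least frequent, this sketch reports only one of them.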

Inferential Statistics

Inferential statistics allows us to draw conclusions about a population from sample data, including conclusions that might not be immediately obvious. It is needed because sampling naturally introduces sampling error, so a sample is not expected to reflect the population perfectly. The methods of inferential statistics are:

  • parameter estimation
  • hypothesis testing
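A minimal sketch of both methods on a made-up sample. It assumes a normal approximation (a z-interval and a z-test) via `statistics.NormalDist` from the standard library; for a sample this small, a t-distribution is usually preferred. The sample values, the hypothesized mean `mu0`, and the 5% significance level are all assumptions for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample drawn from a larger population (illustrative values)
sample = [52.1, 49.8, 51.5, 50.9, 53.2, 48.7, 51.1, 52.6, 50.3, 51.8, 49.5, 52.0]

n = len(sample)
x_bar = mean(sample)              # point estimate of the population mean
std_error = stdev(sample) / sqrt(n)

# Parameter estimation: approximate 95% confidence interval for the mean
z_crit = NormalDist().inv_cdf(0.975)
ci_low = x_bar - z_crit * std_error
ci_high = x_bar + z_crit * std_error

# Hypothesis testing: is the population mean equal to mu0? (two-sided z-test)
mu0 = 50.0                        # hypothesized mean, assumed for this example
z_stat = (x_bar - mu0) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
reject_null = p_value < 0.05

print(f"estimate = {x_bar:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"z = {z_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

The two results agree with each other: the null hypothesis is rejected at the 5% level exactly when `mu0` falls outside the 95% interval.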

Thanks for reading!!
