Descriptive Statistics (Data Science)

Published in Essential Statistics for Data Science

5 min read · Mar 11, 2020

What is descriptive statistics?

Descriptive statistics is the analysis of data that describes, shows, or summarizes it in a meaningful way, for example by revealing the pattern, spread, and range of the data.

Why is it important?

Descriptive statistics are important because raw data presented on its own is hard to visualize or interpret: it is difficult to see what features the data has and what it is trying to show. Descriptive statistics let us present the data in a more meaningful way, which makes interpretation simpler and prepares us for the next step, i.e. inferential modeling.

What do descriptive statistics tell us about a dataset?

The two main parts of descriptive statistics are measures of central tendency and measures of dispersion.

Measures of Central Tendency:

A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution.

The three common measures are the mean, the mode, and the median.

Mean: (7 + 3 + 8)/3 = 6

Mode: 1 1 1 2 2 2 2 2 -> mode 2 (unimodal)

2 2 3 4 5 5 -> modes 2 and 5 (bimodal)

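Median: the middle value of the data once it has been sorted.

1 1 1 1 1 2 2 3 4 6 13 -> median 2 (the 6th of 11 sorted values, so the count on the left equals the count on the right)

With an even number of values, such as 3 4 7 9 12 15, there is no single middle value, so the median is the average of the two middle values:

3 4 7 9 12 15

7 + 9 = 16

16/2 = 8 -> median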
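The small examples above can be checked with Python's standard `statistics` module. This is only a minimal sketch, and the variable names are illustrative rather than taken from the article.

```python
from statistics import mean, median, multimode

# Mean: the sum of the values divided by how many there are.
mean_value = mean([7, 3, 8])

# Mode: multimode returns every most-frequent value, so it also
# covers the bimodal case.
unimodal = multimode([1, 1, 1, 2, 2, 2, 2, 2])
bimodal = multimode([2, 2, 3, 4, 5, 5])

# Median: the input is sorted internally; with an even count the two
# middle values are averaged.
odd_median = median([1, 1, 1, 1, 1, 13, 2, 2, 3, 4, 6])
even_median = median([3, 4, 7, 9, 12, 15])

print(mean_value, unimodal, bimodal, odd_median, even_median)
# 6 [2] [2, 5] 2 8.0
```

Note that `median` sorts the values first, so the order in which they are written down does not matter.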

Measures of Dispersion:

Interquartile Range:

Quartiles divide a set of observations into four intervals, based on the values of the data and how they compare to the entire set of observations. The interquartile range (IQR) is the difference between the upper and lower quartiles, Q3 - Q1, and describes the spread of the middle 50% of the data.

How to find the quartiles of a data set:

65, 60, 59, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 98, 95, 90

Steps:

Put the numbers in order (sort them).
Cut the sorted list into 4 equal parts.
The cut points are the quartiles.

59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98

First, mark down the median, Q2, which in this case is the tenth of the 19 values: 75.

Q1 is the middle value between the smallest score and the median. Leaving the median out, Q1 is the middle of the first through ninth scores, i.e. the fifth score: 68. [Note that the median can also be included when calculating Q1 or Q3 for an odd number of values. If we include the median on either side of the middle point, Q1 is the middle value between the first and tenth scores, which is the average of the fifth and sixth scores: (68 + 69)/2 = 68.5.]

Q3 is the middle value between Q2 and the highest score: 84. [Or, if you include the median, Q3 = (82 + 84)/2 = 83.]

Q1 is the lower quartile (25%)

Q2 is the middle quartile, or median (50%)

Q3 is the upper quartile (75%)

For this data set, the interquartile range is Q3 - Q1 = 84 - 68 = 16 (or 83 - 68.5 = 14.5 if the median is included).
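The procedure above can be sketched in Python as a median-of-halves calculation. The function name `quartiles` and its `include_median` flag are hypothetical, introduced here only to show the two conventions side by side. Statistical libraries often use interpolation rules that can give slightly different values on other data sets.

```python
from statistics import median

def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) using the median-of-halves approach.

    include_median only matters for an odd number of values: it decides
    whether the middle value is counted in both halves or in neither.
    """
    data = sorted(values)              # step 1: sort the numbers
    n = len(data)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = data[:half + 1], data[half:]
    else:
        lower, upper = data[:half], data[n - half:]
    return median(lower), median(data), median(upper)

scores = [65, 60, 59, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 98, 95, 90]

q1, q2, q3 = quartiles(scores)
print(q1, q2, q3, q3 - q1)                      # 68 75 84 16
print(quartiles(scores, include_median=True))   # (68.5, 75, 83.0)
```

The difference `q3 - q1` is the interquartile range for the chosen convention.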

Standard deviation:

Standard deviation measures the spread, or dispersion, of a data set around its mean.

The standard deviation is never negative.
Standard deviation is sensitive to outliers. A single outlier can raise the standard deviation and, in turn, distort the picture of the spread.
For data with approximately the same mean, the greater the spread, the greater the standard deviation.
If all values of a data set are the same, the standard deviation is zero (because each value is equal to the mean).
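These properties can be illustrated with a few made-up data sets. The sketch below assumes the population standard deviation (`statistics.pstdev`). The article does not say whether the population or sample version is meant, and the numbers are invented for illustration only.

```python
from statistics import pstdev

constant = [5, 5, 5, 5]          # every value equals the mean
narrow = [4, 5, 5, 6]            # mean 5, values close together
wide = [1, 3, 7, 9]              # mean 5, values far apart
with_outlier = narrow + [50]     # one extreme value added

sd_constant = pstdev(constant)
sd_narrow = pstdev(narrow)
sd_wide = pstdev(wide)
sd_outlier = pstdev(with_outlier)

print(sd_constant)               # zero: no spread at all
print(sd_narrow < sd_wide)       # True: same mean, more spread
print(sd_outlier > sd_narrow)    # True: one outlier inflates the value
```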

Distribution of Data (Graphical Representation):

Symmetric (normal distribution/bell curve): a symmetric distribution is one where the left side of the distribution mirrors the right side. By definition, a symmetric distribution is never skewed. The normal distribution is the best-known example and is unimodal, but a symmetric distribution in general can be unimodal or bimodal.

Asymmetric (skewness): an asymmetric distribution is one in which the values occur at irregular frequencies and the mean, median, and mode fall at different points.

skew = (mean - mode) / standard deviation (Pearson's first coefficient of skewness)
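As a rough sketch, the formula above can be computed directly for data with a single mode. The helper name `mode_skewness` and the sample data are hypothetical, and the population standard deviation is assumed.

```python
from statistics import mean, mode, pstdev

def mode_skewness(values):
    """(mean - mode) / standard deviation, for data with one clear mode."""
    return (mean(values) - mode(values)) / pstdev(values)

right_skewed = [1, 2, 2, 2, 3, 4, 9]   # long tail to the right
symmetric = [1, 2, 3, 3, 3, 4, 5]      # mirror image around 3

print(mode_skewness(right_skewed) > 0)  # True: mean is pulled above the mode
print(mode_skewness(symmetric))         # 0.0: mean and mode coincide
```

A positive value indicates a tail to the right, a negative value a tail to the left, and zero no skew by this measure.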

Kurtosis: like skewness, kurtosis is a descriptor of the shape of a probability distribution. It describes how heavy the tails of the distribution are.

Advanced Descriptive Statistics:

1. The mean of a frequency table

x (score)    f (frequency)
1            2
2            4
4            7
5            7

mean = Σ(x * f) / Σf = (1*2 + 2*4 + 4*7 + 5*7) / (2 + 4 + 7 + 7) = 73 / 20 = 3.65
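The same calculation as a short Python sketch, with the frequency table stored as a dictionary mapping each score to its frequency (the variable names are illustrative):

```python
# Frequency table: score -> how many times it occurs.
freq_table = {1: 2, 2: 4, 4: 7, 5: 7}

weighted_total = sum(x * f for x, f in freq_table.items())  # Σ(x * f)
total_count = sum(freq_table.values())                      # Σf

freq_mean = weighted_total / total_count
print(weighted_total, total_count, freq_mean)  # 73 20 3.65
```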

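2. Weighted mean (expectation value)

Take the values 1, 2, 3, 4.

(1 + 2 + 3 + 4)/4 = 1*(1/4) + 2*(1/4) + 3*(1/4) + 4*(1/4) = 2.5

So the ordinary mean gives each of the values 1, 2, 3, 4 the same weight, 1/4 or 0.25.

But what if the weight for 1 is 0.3, for 2 is 0.1, for 3 is 0.1, and for 4 is 0.6?

weighted sum = 1*0.3 + 2*0.1 + 3*0.1 + 4*0.6 = 3.2

These weights add up to 1.1 rather than 1, so the weighted mean is the weighted sum divided by the total weight: 3.2 / 1.1 ≈ 2.91.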
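A minimal sketch of a weighted mean in Python follows. The function name `weighted_mean` is hypothetical. It divides by the total weight, so the weights do not have to sum to 1. With the weights 0.3, 0.1, 0.1, 0.6 (which sum to 1.1), the weighted sum is 3.2 and the normalised result is about 2.91.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

values = [1, 2, 3, 4]

# Equal weights reproduce the ordinary mean.
equal = weighted_mean(values, [0.25, 0.25, 0.25, 0.25])

# Unequal weights pull the result toward the heavily weighted value 4.
unequal = weighted_mean(values, [0.3, 0.1, 0.1, 0.6])

print(equal)              # 2.5
print(round(unequal, 2))  # 2.91
```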

The weighted mean is used when some items should count for more than others.
Example: you want to buy a camera, and a friend suggests two options, Sony and Canon, with the same features. Which one should you buy?

Solution

Suppose the features are weighted as follows:

image quality = 50%, battery life = 30%, zoom range = 20%

Feature          Sony rating    Canon rating
image quality    8              9
battery life     6              4
zoom range       7              6

Sony -> 8*0.5 + 6*0.3 + 7*0.2 = 7.2

Canon -> 9*0.5 + 4*0.3 + 6*0.2 = 6.9

So, according to the weighted ratings, you should go for the Sony.
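The comparison can be written as a small Python sketch. The dictionary layout and the `weighted_score` helper are assumptions made for illustration. The ratings and weights are the ones from the example.

```python
# Importance of each feature (weights sum to 1).
weights = {"image quality": 0.5, "battery life": 0.3, "zoom range": 0.2}

# Ratings per camera, per feature.
ratings = {
    "Sony":  {"image quality": 8, "battery life": 6, "zoom range": 7},
    "Canon": {"image quality": 9, "battery life": 4, "zoom range": 6},
}

def weighted_score(camera_ratings, feature_weights):
    """Sum of rating * weight over all weighted features."""
    return sum(camera_ratings[f] * w for f, w in feature_weights.items())

scores = {name: weighted_score(r, weights) for name, r in ratings.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 2))   # Sony 7.2, then Canon 6.9
print("Best choice:", best)        # Best choice: Sony
```

Changing the weights, for example by giving image quality 80%, can flip the result, which is the point of weighting.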


Written by Antika Das