The First Chapter of Data Analysis — Part I

Vahid Naghshin
Published in Analytics Vidhya · 6 min read · Apr 30, 2021

The first step in any data science project is exploratory data analysis (EDA)!

It’s been around 60 years since the eminent American statistician John W. Tukey (1915–2000) published a revolutionary paper, “The Future of Data Analysis”, which proposed a new scientific discipline called data analysis. In classical statistics, statisticians mostly limited their attention to inference, the procedure of drawing conclusions about a population from a limited number of samples. Tukey established a link between statistics and computational sciences such as computer science and engineering. In his now-classic book, Exploratory Data Analysis, he used simple plots along with summary statistics to give a big picture of the topic under study.

In classical statistics, the experiment is designed first and the data collected afterwards, whereas in data science already-collected data is used as the basis for knowledge discovery!

In this first part of EDA, we will mainly focus on summary statistics, the metrics used, and how they are implemented in Python. We start with data structures and types, which are the foundation of any data-driven decision.

Data Structure

Data comes from many sources, such as IoT devices that spew out torrents of data to be analysed. The first and most critical objective of data science is to harness this possibly large volume of raw data and convert it into actionable knowledge. Most of the time, raw data is collected in unstructured or semi-structured form. In order to extract knowledge, the unstructured data should be converted into structured data. A typical form of structured data is rectangular data (sometimes called a data frame), which consists of many rows, called records, and multiple columns, called features. In Python, with the Pandas library, the basic rectangular data structure is a DataFrame object.
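As a minimal sketch (the column names and values below are made up purely for illustration), a small DataFrame can be built directly from a Python dictionary, with each key becoming a feature and each row a record:

import pandas as pd

# hypothetical sensor readings: rows are records, columns are features
df = pd.DataFrame({
    "device_id": ["a1", "a2", "a3"],
    "temperature": [21.5, 19.8, 22.1],
    "status": ["ok", "ok", "fault"],
})
print(df.shape)   # (number of records, number of features)
print(df.dtypes)  # the type pandas inferred for each column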

Data Types

Once you know the structure of the data, you should know the data types at hand. But why is knowing the data types so important? In data science and predictive modelling, the data type is critical in choosing the type of visual display, data analysis, and statistical model. It also determines how software handles the computational side of a variable.


Data types are categorised into numeric and categorical types. Numeric data is data that can be expressed on a numeric scale. It can be continuous, which takes any value in an interval, or discrete, which can take only integer values. Categorical data can only take a specific set of values representing specific categories. It can be nominal, which consists of just different names, or ordinal, where there is an explicit ordering such as low, medium, high.
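As a rough sketch of how these types can be made explicit in pandas (the column names and values are invented for illustration), continuous and discrete columns stay numeric, while nominal and ordinal columns can be stored as categoricals:

import pandas as pd

df = pd.DataFrame({
    "height": [1.72, 1.65, 1.80],    # continuous numeric
    "children": [0, 2, 1],           # discrete numeric
    "city": pd.Categorical(["Sydney", "Perth", "Sydney"]),  # nominal categorical
    "risk": pd.Categorical(["low", "high", "medium"],
                           categories=["low", "medium", "high"],
                           ordered=True),  # ordinal categorical
})
print(df.dtypes)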

Statistics Summary

Location

A basic step in EDA is finding out the typical value of a given feature: an estimate of its location, or central tendency. In the following, different approaches to estimating the location are presented. This part is mainly inspired by the book “Practical Statistics for Data Scientists”.

  • Mean: The sum of all values divided by the number of values.

The Python implementation is:

import numpy as np

x = [1, 2, 2, 1, 23, 4, 5, 6, 2, 3]
np.mean(x)
  • Trimmed Mean: It is calculated by dropping a fixed number of sorted values at each end and then taking an average of the remaining values. It is used to eliminate the influence of extreme values.

The Python implementation for a 10% trimmed mean is as follows:

from scipy.stats import trim_mean

x = [1, 2, 2, 1, 23, 4, 5, 6, 2, 3]
trim_mean(x, 0.1)
  • Weighted mean: It is calculated by multiplying each data value x_i by a user-specified weight w_i and dividing their sum by the sum of the weights. The formula is:

weighted mean = (Σ w_i x_i) / (Σ w_i)

We use the weighted mean mainly in two cases: when some values are more variable (less reliable) than others, so the more variable a value is, the lower its weight; and when each sample comes from a group of a different size. The Python implementation is as follows:

import numpy as np

x = [1, 2, 2, 1, 23, 4, 5, 6, 2, 3]
w = [10, 12, 32, 34, 32, 45, 21, 56, 21, 100]
weighted_mean = np.average(x, weights=w)
  • Median: It is calculated by first sorting the values in ascending order and then taking the value that splits the samples into two equal groups. When the number of values is even, it is the average of the two middle values. Unlike the mean, which depends on all values, the median depends only on the middle values, which makes it robust to outliers. An outlier is any value that is very distant from the other values in a data set. The Python implementation is:

import numpy as np

x = [1, 2, 2, 1, 23, 4, 5, 6, 2, 3]
median = np.median(x)

Variability

Another summary statistic, called variability, measures whether the values are tightly clustered or spread out. Another name for variability is dispersion. Just like the location estimate, there are different estimators of variability.

  • Mean absolute deviation: It is calculated by averaging the absolute deviations of the values from the mean. The formula is:

mean absolute deviation = (Σ |x_i − x̄|) / n, where x̄ is the sample mean

The Python implementation is:

import numpy as np

x = np.array([1, 2, 2, 1, 23, 4, 5, 6, 2, 3])
mean_abs_dev = np.mean(np.absolute(x - np.mean(x)))
  • Variance: The best-known estimators of variability are the variance and its square root, known as the standard deviation. The formula is:

variance s² = (Σ (x_i − x̄)²) / (n − 1), and standard deviation s = √(variance)

The Python implementation is:

import numpy as np

x = [1, 2, 2, 1, 23, 4, 5, 6, 2, 3]
var = np.var(x, ddof=1)  # ddof=1 gives the sample variance (divide by n - 1)
sd = np.std(x, ddof=1)

The standard deviation is much easier to interpret than the variance, as it is on the same scale as the original values.

  • Median absolute deviation (MAD): Just like the mean among the location estimators, the variance, and hence the standard deviation, is susceptible to outliers. MAD is another variability estimator; it is based on the median and is more robust against outliers. The formula for MAD is:

MAD = median(|x_1 − m|, |x_2 − m|, …, |x_N − m|), where m is the median of the values

The Python implementation is as follows:

from statsmodels import robust

x = [1, 2, 2, 1, 23, 4, 5, 6, 2, 3]
# note: statsmodels scales the raw MAD by ~1.4826 so that it is comparable
# to the standard deviation for normally distributed data
mad = robust.scale.mad(x)

The standard deviation is always greater than the mean absolute deviation, which itself is greater than the median absolute deviation.
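As a quick sanity check of this ordering on the sample list used throughout this article (a sketch only; note that robust.scale.mad reports a scaled MAD, so the raw median absolute deviation is computed directly here for the comparison):

import numpy as np

x = np.array([1, 2, 2, 1, 23, 4, 5, 6, 2, 3])

sd = np.std(x, ddof=1)                          # ~ 6.57
mean_abs_dev = np.mean(np.abs(x - np.mean(x)))  # ~ 3.86
raw_mad = np.median(np.abs(x - np.median(x)))   # = 1.5

print(sd > mean_abs_dev > raw_mad)  # True for this data set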

  • Interquartile range (IQR): Another common measure of variability is the interquartile range, or IQR for short. It is the difference between the 75th and 25th percentiles. In a data set, the P-th percentile is a value such that at least P percent of the values take on this value or less and at least (100 − P) percent of the values take on this value or more. The Python implementation is:

import numpy as np

x = [1, 2, 2, 1, 23, 4, 5, 6, 2, 3]
IQR = np.quantile(x, 0.75) - np.quantile(x, 0.25)

Wrap-up

In this article, we laid the foundation for data science. In this first part, we talked about summary statistics, introduced several location and variability metrics, and described the cases where each is useful. In the next part, we are going to talk about the most useful distributions and visual displays in EDA. Stay tuned!

Vahid Naghshin
Analytics Vidhya

Technology enthusiast, Futuristic, Telecommunications, Machine learning and AI savvy, work at Dolby Inc.