Statistics & Probability — Exploratory Data Analysis
Exploratory Data Analysis (EDA)
This series of articles is inspired by the Statistics with R Specialization from Duke University. The full series of articles can be found here.
Data Types
Identifying the type of variable you’re working with is always the first step of the data analysis process.
Knowing the type makes it easier to determine which kind of analysis is most appropriate.
Numerical (quantitative)
These variables take numerical values for which it is sensible to add, subtract, take averages, and so on.
- Continuous. A number within a range of values, usually measured, such as height (within the range of human heights).
- Discrete. Only take certain values (can’t be decimal), usually counted, such as the count of students in a class.
Categorical (qualitative)
These variables take one of a limited number of distinct categories, on which arithmetic operations would not be sensible. Sometimes the categories are encoded as numbers, but those numbers only act as placeholders for the levels of the category.
- Ordinal. Values whose levels have an inherent ordering, such as (high, medium, low).
- Nominal (regular categorical). Values whose levels have no inherent ordering, such as gender (male or female).
Exploring Numerical Data
Next, we start looking for relationships between variables.
Scatter plot
A common plot for visualizing the relationship between two numerical variables.
We identify the explanatory variable (income), which is suspected of affecting the other, and the response variable (life expectancy). We place the explanatory variable on the x axis and the response variable on the y axis.
Line plot
Same as the scatter plot, with the addition of a line connecting the dots.
Histogram
One good way of visualizing the distribution of a numerical variable.
Data are binned into intervals, and the heights of the bars represent the number of cases that fall into each interval.
Histograms are also very useful for identifying the shapes of distributions. The distribution of life expectancies appears to be left skewed, with most values between 65 and 85 years (skewness is discussed below).
When the bin width is too wide, we might lose interesting details. When the bin width is too narrow, it might be difficult to get an overall picture of the distribution. The ideal bin width depends on the data you’re working with.
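To make the effect of bin width concrete, here is a minimal Python sketch of the binning step behind a histogram. The helper name bin_counts and the life expectancy values are made up for illustration; they are not taken from the data set used in this article.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall into each interval [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical life expectancies in years (illustrative only).
life_expectancy = [52, 58, 61, 66, 68, 71, 72, 73, 74, 76, 77, 78, 79, 80, 81, 82]

narrow = bin_counts(life_expectancy, 2)   # many sparse bins, hard to see the shape
medium = bin_counts(life_expectancy, 10)  # a readable overall picture
wide = bin_counts(life_expectancy, 20)    # few bins, details are lost

# Print a rough text histogram for the medium bin width.
for start, count in medium.items():
    print(f"{start}-{start + 10}: {'#' * count}")
```

With these made-up values, the widest setting collapses everything into three bars, while the narrowest spreads 16 observations over a dozen bars.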
Dotplot
A dotplot is especially useful when individual values are of interest such as the count of each life expectancy. However, as the sample size increases, the dotplot may get too busy.
Box plot
Another visualization highlighting outliers, max, min, median, interquartile range, and also skewness (shape).
Interquartile range (IQR). Given a list of values, sort them and then split them into four quarters.
The range from Q1 to Q3 (the middle 50% of the data) is the box in the box plot, and the line inside the box is the median.
Let’s take an example. Suppose the values of life expectancies column are:
1,2,2,2|2,3,3,3|5,8,10,15|26,30,45,50
To calculate the quartiles and the IQR:
* Quartile 1 (Q1) = 25th percentile = (2+2)/2 = 2 => 25% of numbers <= 2
* Quartile 2 (Q2) = 50th percentile = (3+5)/2 = 4 => 50% of numbers <= 4 (the median)
* Quartile 3 (Q3) = 75th percentile = (15+26)/2 = 20.5 => 75% of numbers <= 20.5
* IQR = Q3 - Q1 = 20.5 - 2 = 18.5 (not very informative on its own; it is mainly useful for comparing the variability of different distributions)
So we can say that the “middle 50%” of countries have life expectancies between 2 and 20.5 years.
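The calculation above can be sketched in Python using the median-of-halves approach: split the sorted data in half and take the median of each half. The function name quartiles is my own. Keep in mind that statistical software offers several quartile conventions, so other tools may return slightly different values for the same data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of values, leave the middle value out of both halves.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

life_expectancy = [1, 2, 2, 2, 2, 3, 3, 3, 5, 8, 10, 15, 26, 30, 45, 50]
q1, q2, q3 = quartiles(life_expectancy)
iqr = q3 - q1
print(f"Q1={q1} Q2={q2} Q3={q3} IQR={iqr}")  # Q1=2.0 Q2=4.0 Q3=20.5 IQR=18.5
```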
Measuring the outliers. Values are considered unusually low or high if they lie more than 1.5*IQR below Q1 or above Q3.
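As a small sketch of this rule, the snippet below computes the two cut-off points (often called fences) for the example values, using Q1 = 2 and Q3 = 20.5 from the calculation above. The function name outlier_fences and the parameter k are my own naming.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cut-offs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

life_expectancy = [1, 2, 2, 2, 2, 3, 3, 3, 5, 8, 10, 15, 26, 30, 45, 50]

lower, upper = outlier_fences(q1=2, q3=20.5)
outliers = [v for v in life_expectancy if v < lower or v > upper]

print(lower, upper)  # -25.75 48.25
print(outliers)      # [50]
```

Only the value 50 falls outside the fences here, so it is the one point a box plot would draw as an outlier.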
Intensity Map
For certain types of data, it might be useful to view the location distribution.
We can see that both income and life expectancy are lower in Africa, but higher in North America and Europe.
Evaluate the relationships between variables
- Direction: Is it increasing or decreasing?
- Shape: Is it linear, or does it follow some other form (like curve)?
- Strength: Is the relationship strong? Indicated by little scatter around the line (or curve). Or weak, indicated by lots of scatter; data is far away from the line (or curve).
- Outliers: Are there any potential outliers? It’s always a good idea to investigate these points carefully to make sure they’re not data entry errors.
Skewness
Distributions are said to be skewed to the side of the long tail.
In a left skewed distribution, the longer tail is on the left. And in a right skewed distribution, the longer tail is on the right.
If no skewness is apparent, then the distribution is said to be symmetric.
Modality
A distribution might be unimodal with one prominent peak, bimodal with two prominent peaks, or uniform with no prominent peaks.
With more than two prominent peaks a distribution is usually said to be multimodal.
A bimodal distribution might indicate that there are two distinct groups in your data.
For example, consider the distribution of heights of individuals at a preschool: the first peak might be the kids and the second might be the teachers.
A uniform distribution means there’s no apparent trend in the data; all values of the variable are equally likely to occur.
For example, the distribution of birth months shows no trend, since a person is about as likely to be born in any month.
Measures of Center & Spread
Measures of Center
- Mean: Arithmetic average
- Median: Mid point of the sorted data (50th percentile; 50%)
- Mode: The most frequent value. If numeric data is continuous, the mode is less useful, since we are very unlikely to observe exactly the same value multiple times.
If these measurements are calculated from a sample, they’re called sample statistics. Sample statistics are point estimates for the population parameters, provided the sample is representative of the population.
A “statistic” is any summary number, like the mean, that describes the sample, and a “parameter” is a summary number that describes the entire population.
Mean, Median, and Skewness
In a left skewed distribution, the mean is generally smaller than the median: the observations in the long left tail pull the mean toward the lower end.
In symmetric distributions, the mean and the median are roughly equal to each other.
And in right skewed distributions, the mean is generally higher than the median since the few high valued observations pulled the average up.
The shorter the left or right tail, the closer the mean to the median.
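The relationship between mean, median, and skew can be checked with a short Python sketch. The function skew_direction is a rough heuristic of my own, not a formal skewness measure, and the three small data sets are invented for illustration. Real data are rarely exactly symmetric, so in practice you would compare the two values loosely rather than test for equality.

```python
from statistics import mean, median

def skew_direction(values):
    """Guess the skew direction by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "symmetric"

# Invented examples: one long right tail, one long left tail, one symmetric.
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]
left_tail = [1, 17, 18, 18, 18, 19, 19, 20]
balanced = [1, 2, 3, 4, 5]

for name, data in [("right_tail", right_tail), ("left_tail", left_tail), ("balanced", balanced)]:
    print(name, mean(data), median(data), skew_direction(data))
```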
Measures of Spread
Variability or spread is the measure of how close or far the data lie from the center of the distribution.
The less variability, the more data are closer to the center. The more variability, the more data are spread further away from it.
— Range (max, min)
While it’s easy to calculate, this is not a very reliable measure of variability of the sample, since it depends on the two most extreme values, the end points of the distribution.
— Variance
It’s used to indicate how widely the observations vary from each other.
To calculate it, first find the difference between the mean and each observation, and square each of these differences. Then sum them all and divide by n - 1, where n is the number of observations.
— Standard deviation
This is the square root of the variance. Besides measuring variability, the standard deviation can also tell us where most of the data lie (we’ll discuss this later).
The variance is expressed in squared units. This is somewhat awkward, since a result such as 83.06 years squared is not very meaningful. Therefore, we turn to the standard deviation, which is in the same units as the data.
The lower the standard deviation, the closer the values to the mean, and so the closer the values to each other, and the lower the variability.
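The steps for the variance and standard deviation can be written out in a few lines of Python. The function names and the small data set are my own choices for illustration. The results are compared with statistics.variance and statistics.stdev from the standard library, which also divide by n - 1.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / (n - 1)

def sample_std(values):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values with mean 5

print(sample_variance(data), statistics.variance(data))
print(sample_std(data), statistics.stdev(data))
```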
— Interquartile range (IQR)
The IQR is a more reliable measure of spread than the range, because it doesn’t rely on the endpoints, which may be unusual observations or outliers.
— Robust statistics
We define robust statistics as measures on which extreme observations have little effect.
1,2,3,4,5,6 mean => 3.5, median => 3.5
1,2,3,4,5,1000 mean => 169, median => 3.5 (same)
While the mean depends on all observations in the data set, the median depends only on the middle of the sorted data, and the values at the extremes are irrelevant to its calculation.
The mean is affected by outliers (dragged to left or right).
- Center: the median is robust, the mean is not
- Spread: the IQR (based on the quartiles) is robust, the standard deviation (based on the mean) is not
Robust statistics are most useful for describing skewed distributions or those with extreme observations, while non-robust statistics like the mean and standard deviation are useful for describing symmetric distributions.
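To see robustness in action, the sketch below compares the four statistics on the values 1 through 6 and on a copy where the largest value is replaced by 1000. The iqr helper is my own and uses the median-of-halves convention; other quartile conventions would give slightly different numbers but the same conclusion.

```python
from statistics import mean, median, stdev

def iqr(values):
    """Interquartile range using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(upper) - median(lower)

base = [1, 2, 3, 4, 5, 6]
with_outlier = [1, 2, 3, 4, 5, 1000]

for name, data in [("base", base), ("with_outlier", with_outlier)]:
    print(f"{name}: mean={mean(data):.2f} median={median(data)} "
          f"sd={stdev(data):.2f} IQR={iqr(data)}")
```

The median and IQR come out identical for both lists, while the mean and standard deviation change dramatically once the extreme value is introduced.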
Transforming Data
Data transformations are useful tricks for making certain types of data easier to model. A transformation is a rescaling of the data using a function. Why?
- We might want to see the relationship a little differently (reverse the line from positive to negative)
- We might want to reduce skew to assist in modeling (for example, when there are many extreme values)
- We might want to straighten a nonlinear relationship (like a curve) in a scatterplot.
The most commonly used functions are the log (applied when all values are positive), the square root, and the inverse.
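As a rough illustration of how a log transformation reduces skew, the sketch below applies a base-10 log to a strongly right skewed list. The income figures are invented for this example. The gap between mean and median is used here as an informal indicator of skew.

```python
import math
from statistics import mean, median

# Hypothetical incomes (illustrative only): a few very large values on the right.
incomes = [500, 800, 1000, 1200, 2000, 3000, 5000, 10000, 40000, 100000]

# The log is only defined for positive values, so check before transforming.
assert all(x > 0 for x in incomes)
log_incomes = [math.log10(x) for x in incomes]

raw_ratio = mean(incomes) / median(incomes)
log_ratio = mean(log_incomes) / median(log_incomes)

print(f"raw: mean/median = {raw_ratio:.2f}")
print(f"log: mean/median = {log_ratio:.2f}")
```

On the raw scale the mean is several times the median; after the transformation the two are nearly equal, which is what we would expect from a much more symmetric distribution.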
Exploring Categorical Variables
The examples in this section come from a poll that asked respondents how difficult they think it is to save money.
Bar plot
A graphical way of representing a single categorical variable.
It shows the count of each category, though we usually look at the relative frequencies (percentages) instead.
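The counts and relative frequencies behind a bar plot can be computed with a few lines of Python. The function name relative_frequencies and the ten responses below are made up for illustration; they are not the actual poll results.

```python
from collections import Counter

def relative_frequencies(categories):
    """Map each category level to its share of all observations."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {level: count / total for level, count in counts.items()}

# Hypothetical poll answers (illustrative only).
responses = (["very difficult"] * 5
             + ["somewhat difficult"] * 3
             + ["not difficult"] * 2)

freqs = relative_frequencies(responses)

# Print a rough text bar plot, one row per level.
for level, share in freqs.items():
    print(f"{level:<20}{'#' * round(share * 20)} {share:.0%}")
```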
Bar plots are used for displaying distributions of categorical variables, while histograms are used for numerical variables.
Pie chart
Also used for a single categorical variable, but a pie chart is much less informative than a bar plot, particularly when there are many levels in the categorical variable.
Contingency table
We might wonder whether how difficult people find it to save money is related to their income.
To evaluate this, we organize these variables in a contingency table. We consider three levels of income:
- Less than 40,000 per year
- Between 40,000 and 80,000 per year
- More than 80,000 per year
The table lets us look at the distribution of the levels of one variable (the response) broken down by, or conditioned on, the levels of another, and hence evaluate whether there is a relationship between difficulty (the response) and income (the explanatory, conditioning variable).
The column variable is treated as a conditioning variable, a “treatment”, and the row variable is treated as a “response”.
We can then find the percentage of people who think it’s very difficult to save money within each income level:
<40K => 63% (128/202 * 100)
40-80K => 43%
>80K => 25%
Since the percentage varies greatly among the income categories, the data suggest that the two variables are associated (dependent).
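The conditional percentages can be computed from a contingency table as in the sketch below. Only the figure of 128 out of 202 for the lowest income level comes from the text above. All other counts are placeholders I invented so that they reproduce the stated 43% and 25%, and the "other answers" row lumps the remaining response levels together.

```python
# Columns are income levels (conditioning variable), rows are responses.
# Only 128 of 202 for "<40K" is from the article; the rest is invented.
table = {
    "<40K": {"very difficult": 128, "other answers": 74},
    "40-80K": {"very difficult": 43, "other answers": 57},
    ">80K": {"very difficult": 25, "other answers": 75},
}

def conditional_percentage(table, response):
    """Percentage of each income level that gave the chosen response."""
    return {
        level: 100 * counts[response] / sum(counts.values())
        for level, counts in table.items()
    }

pct = conditional_percentage(table, "very difficult")
for level, value in pct.items():
    print(f"{level}: {value:.0f}%")
```

If the two variables were independent, these percentages would be roughly the same across the income levels.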
Segmented bar plot
A segmented bar plot is useful for visualizing the distribution of the levels of one variable (the response) conditioned on the levels of the other (the explanatory variable).
The heights of the bars indicate the numbers of respondents in various income categories. And the bars are segmented by color to indicate the numbers of those who think it’s very difficult, somewhat, etc.
Mosaic plot
In addition to what a segmented bar plot offers, a mosaic plot also shows the marginal distribution of income.
The widths of the bars show this marginal distribution: they tell us the total number of people in each income category (for example, 202 respondents in the <40K category, 63% of whom find it very difficult to save).
Side-by-side box plot
This plot is used to visualize (and compare) the distribution of a numerical variable across the levels of a categorical variable.
In this plot:
- The medians are consistent, indicating that students belong to roughly equal numbers of clubs regardless of their year.
- The variability is higher for first-year and senior students, while lower for sophomores and juniors, as indicated by the IQRs.
- And among the sophomores and juniors, there are some students who belong to unusually low or high numbers of clubs.
Since distributions across the class years are pretty similar, the number of clubs students belong to might be independent of their class year.