Dealing with Skewed Data? A Practical Guide, Part II

Beyond Descriptives: Normality Under the Microscope

León Andrés M.
9 min read · Aug 12, 2024

Have you ever felt fear or anxiety about the skewed, non-normal nature of the data in your thesis, research, or modeling? Unsure what to do with it?

In this three-part series, I will guide you through essential steps to identify non-normality and skewness using two different approaches. We'll discuss why skews occur and whether (and how) we should transform the data, offering practical examples.

In Part II, we'll take a formal approach through hypothesis testing and other tools. In the previous part, we explored informal ways to identify non-normality and skewness by means of descriptive statistics, and we discussed potential reasons for data being skewed. Both strategies are valid and useful, and they can be applied simultaneously, depending on the study's objectives, the type of analysis, and the researcher's comfort level, among other factors.

Warning: This article provides an initial, summarized approach. We will touch upon extensive topics that require deeper study, often alongside a statistician or subject-matter expert.

All graphs and tables in this series were created by the author.

Rigorous Methods in Normality Assessment

Formal statistical methods, grounded in mathematical statistics, employ probability theory and estimation techniques to draw rigorous inferences about populations. By quantifying uncertainty, these approaches provide stronger analytical capabilities than descriptive statistics alone when applied appropriately. Let’s begin!

1. Core Statistical Tests

There are two popular formal strategies for checking if variables follow a normal distribution: hypothesis tests and graphical representations.

Hypothesis tests for normality:

Here, the null hypothesis (H0) states that the data are normally distributed, while the alternative hypothesis (H1) states that they are not. The p-value is the probability of observing a test statistic at least as extreme as the one computed from our sample, assuming the data come from a normally distributed population.

Therefore, if p > 0.05 (for simplicity, we use the standard α = 0.05; I'll dive into the details of this intriguing α, the significance level or Type I error rate, in another article), we have no grounds to reject the null hypothesis, so we assume our variable follows a normal distribution. It's crucial to note that these statistical tests are highly sensitive to sample size.
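To see that sensitivity in action, here's a minimal, illustrative sketch using simulated data (not our study's dataset): the same mildly right-skewed recipe is typically not rejected by the Shapiro-Wilk test at a small sample size but is firmly rejected once the sample grows large (exact p-values vary with the random seed).

# Illustrative sketch: how sample size drives a normality test's decision.
# The data are simulated (mildly right-skewed), not the study's dataset.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
for n in (20, 500, 5000):
    # normal variable plus a small exponential bump -> mild right skew
    sample = rng.normal(loc=10, scale=2, size=n) + rng.exponential(scale=1, size=n)
    stat, p = shapiro(sample)
    decision = "do not reject H0" if p > 0.05 else "reject H0"
    print(f"n = {n:5d}  W = {stat:.3f}  p = {p:.4f}  -> {decision}")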

Tip: If our objective requires the use of tools with strict normality assumptions, like a t-test, these formal methods are the way to go.

There are two main, widely recognized tests, which we will apply to our example shortly. For now, in summary:

  • Shapiro-Wilk: Used when the sample size is less than 30 (or considered small in the context).
  • Kolmogorov-Smirnov (Lilliefors): Used with sample sizes greater than 30.

In addition, there are modifications of these tests, such as the following (a short Python sketch showing how to call all of these tests appears right after this list):

  • Anderson-Darling: A very powerful test that detects most deviations from normality and is particularly sensitive to departures in the tails. However, it must be used carefully because it is severely affected by ties: when a considerable number of observations share the same value, the test tends to reject the null hypothesis even if the data otherwise align well with a normal distribution.
  • Ryan-Joiner: This test measures the correlation between ordered data and normal scores (Z-scores/theoretical quantiles) to assess normality. It’s similar to the Shapiro-Wilk test but may be more effective for small samples. It’s commonly used in engineering, particularly for processes that occasionally produce individual outliers.
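As promised, here's a sketch of how these tests can be run in Python with scipy and statsmodels on a generic numeric array x (here a placeholder gamma sample, an assumption for illustration; replace it with your own variable). Ryan-Joiner has no standard implementation in these libraries; the probplot correlation shown at the end is only a rough, informal stand-in for its idea (correlation between ordered data and normal quantiles), with no critical values attached.

# Sketch: calling common normality tests on a 1-D numeric array x (no NaNs assumed).
import numpy as np
from scipy.stats import shapiro, anderson, probplot
from statsmodels.stats.diagnostic import lilliefors

x = np.random.default_rng(0).gamma(shape=2.0, size=38)  # placeholder data, replace with your variable

w_stat, w_p = shapiro(x)                    # Shapiro-Wilk
ks_stat, ks_p = lilliefors(x, dist='norm')  # Kolmogorov-Smirnov with Lilliefors correction
ad = anderson(x, dist='norm')               # Anderson-Darling: compare statistic vs critical values
(osm, osr), (slope, intercept, r) = probplot(x, dist='norm')  # r: Q-Q correlation (Ryan-Joiner-like)

print(f"Shapiro-Wilk:     W = {w_stat:.3f}, p = {w_p:.4f}")
print(f"Lilliefors (KS):  D = {ks_stat:.3f}, p = {ks_p:.4f}")
print(f"Anderson-Darling: A2 = {ad.statistic:.3f}, 5% critical value = {ad.critical_values[2]:.3f}")
print(f"Q-Q correlation:  r = {r:.3f}")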

Let’s continue with our example: a dataset of 38 COVID-19 patients, including PCR data from pediatric and adult hospitalized cases, showing high variability in viral protein expression. If you don’t know what we’re talking about, click here.

But wait… which test should we choose? Some sources in the literature and online resources suggest that samples smaller than 30 (or otherwise considered small) call for the Shapiro-Wilk test. Does a sample of 38 count as large or small? We face a challenge here…

It is acknowledged that what is considered large or small size is debatable; this number is not set in stone and is recognized to be context-dependent. We will discuss this shortly. Nonetheless, we will apply both of the main tests to our variable of interest to compare and analyze the results — just to see what happens!

Note: Determining what constitutes a “large” sample size is a more complex issue than it appears and deserves detailed explanation in a separate article. This article serves as a friendly and concise guide.

Practical example:
First, we will follow the literature's recommendation and apply the Kolmogorov-Smirnov (Lilliefors) test, suggested for n > 30. Remember, I'll use the same α = 0.05 when comparing all the results in the text.

# Lilliefors-corrected Kolmogorov-Smirnov test (statsmodels)
from statsmodels.stats.diagnostic import lilliefors
statistic, p_value = lilliefors(data['relative_expression'])
print("Lilliefors Statistic: " + str(statistic))
print("P-Value: " + str(p_value))

For this test we obtained a p-value of 0.072, which is very close to, but still greater than, 0.05. For that reason, we do not reject the null hypothesis; still, the result hints that the data may not follow a normal distribution, although it is far from conclusive.

Now, we will apply the Shapiro-Wilk test for small sample sizes.

# Shapiro-Wilk test (scipy); relative_expression is the same variable tested above
from scipy.stats import shapiro
statistic, p_value = shapiro(relative_expression)
print("Statistic: " + str(statistic))
print("P-value: " + str(p_value))

For this test, the p-value is 0.006 (< 0.05), so there is sufficient statistical evidence to reject the null hypothesis in favor of the alternative. Therefore, we conclude that the data do not follow a normal distribution.

So, is the data normal or not? Our conclusion must necessarily involve a discussion of sample size, which we address in the next section, as it will help us determine which test best suits our needs. However, if we find ourselves in a situation where the two most commonly used tests leave us with more questions than answers, we can resort to other tests or graphical methods.

Sample size discussion

In the biomedical field, researchers study highly complex phenomena, not only due to the hundreds of variables involved but also because many results and findings are often generalized to larger populations. This calls for sample sizes to be as large as possible. Within the literature, case and cohort studies with sample sizes in the hundreds are considered more relevant and reliable. Therefore, in our example, a sample size of 38 does not constitute a large sample.

But in other cases, for instance, in rare disease research or highly specialized clinical trials, a sample size of 30 or less might be considered large. In a study of a rare genetic disorder affecting only 1 in 100,000 people, gathering data from 30 patients could represent a significant portion of the affected population.

In economics, if you want to carry out an annual GDP analysis, 30 data points constitute a large and robust sample size. In contrast, when analyzing financial variables such as stock prices, the same number of observations would be considered a very small sample.

Because our example is a clinical study examining a global phenomenon like COVID-19, a sample size of 38 is definitely not large, so the "correct" test was the one commonly used for n < 30: the Shapiro-Wilk test.

We must cultivate critical discernment and be careful when processing the information available to us. It's pretty easy to run into incomplete or misinterpreted information on the internet.

We also know that the challenges in determining the sample size (n) for a study are varied and are part of the daily life of researchers. However, this does not entirely prevent conducting a robust and appropriate statistical analysis. Researchers can adapt to any sample size and achieve their objectives either fully or partially.

2. Q-Q Plots: Extended Visual Diagnostics

Q-Q plots are commonly used to compare a data set with a theoretical distribution by plotting their quantiles against each other. Here we will use the normal Q-Q plot (QQ norm), which compares your data to a normal distribution.

QQ norms are pretty nifty tools for checking if your data follows a normal distribution. They’re especially useful before running statistical tests that assume normality. To read one, look at how closely your data points stick to the diagonal line. If they mostly follow the line, you’re probably dealing with normal data.

Watch out for S-shapes or curves, though — they might mean your data is skewed or has heavy tails. Remember, perfect alignment is rare, so don’t sweat small wobbles, especially with smaller sample sizes.

Be careful; interpreting these plots can be tricky, and sample size plays a big role in how they look. Let me show you some examples of typical shapes you might see — these patterns will probably look familiar when you check your own data.
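If you'd like to reproduce those typical shapes yourself, here's a small sketch with simulated data (illustrative only, not our study's variable): normal data hugs the diagonal, right-skewed data curves upward in the right tail, and heavy-tailed data forms an S that peels away at both ends.

# Sketch: typical Q-Q plot shapes on simulated data (not the study's dataset).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot

rng = np.random.default_rng(7)
samples = {
    'Normal': rng.normal(size=200),                 # points hug the diagonal
    'Right-skewed': rng.lognormal(size=200),        # upward curve in the right tail
    'Heavy tails': rng.standard_t(df=2, size=200),  # S-shape, deviations at both ends
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, sample) in zip(axes, samples.items()):
    probplot(sample, dist='norm', plot=ax)  # sample quantiles vs theoretical normal quantiles
    ax.set_title(name)
plt.tight_layout()
plt.show()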

Let’s do one with our example:

import numpy as np
import matplotlib.pyplot as plt
import pingouin as pg
ax = pg.qqplot(np.array(relative_expression), dist='norm', confidence=.95)  # Q-Q plot with 95% confidence bands
ax.set_title('QQ Plot of Relative Expression')
ax.set_xlabel('Normal distribution')
ax.set_ylabel('Data distribution')
plt.show()

Our QQ plot reveals that most points align closely with the normal distribution, with a few deviations and an outlier we previously identified. The R² value of 0.904 provides additional insight into the degree of normality. While this value indicates that our data isn’t perfectly normal, it’s remarkably close. Generally, an R² above 0.9 in a QQ plot suggests a strong alignment with normality.
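As a side note, pingouin annotates that R² directly on the figure. If you want a comparable number without the plot, a quick option is to square the correlation returned by scipy.stats.probplot between the ordered data and the theoretical normal quantiles; this is an approximation on my part, and the exact fitting details may differ slightly from what pingouin reports.

# Sketch: Q-Q correlation computed directly (should be close to the R² shown by pingouin)
import numpy as np
from scipy.stats import probplot

(osm, osr), (slope, intercept, r) = probplot(np.array(relative_expression), dist='norm')
print(f"Q-Q correlation r = {r:.3f}, r^2 = {r**2:.3f}")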

Although the alignment is not perfect, essentially all points fall within the confidence bands, which is the most crucial factor. This ambiguity reflects a common challenge in data analysis: while our data show reasonable alignment with a normal distribution, they are not a perfect fit and display a positive skew.

Note: I've deliberately chosen a variable with some uncertainty to show you how complex the decision can be. In practice, for many variables there is no doubt, and we know the answer from the first moment.

3. Now what? Is my data really skewed?

As we wrap up, it's clear we've encountered some ambiguity in our example. Through informal methods, we initially identified non-normality, with what appears to be a right-skewed distribution. Meanwhile, our formal approach revealed a different but complementary picture: while not normal, our data were surprisingly close to it.

This ambiguity isn't unique to our example; it's a common occurrence in real-world datasets, since the same behavior can be treated as normal or not depending on the setup and the objectives. Such situations require careful consideration. We must take into account the sensitivity of the statistical tools we plan to use later in our analysis (if we use them at all; sometimes the descriptive and formal insights obtained so far are enough).

Ultimately, this exercise underscores a crucial point: context is king. The interpretation and handling of data ambiguities depend heavily on the specific circumstances of each study. As researchers, it’s our responsibility to consider all these factors when making decisions about data analysis and interpretation. This holistic approach ensures we draw the most accurate and meaningful conclusions from our research.

Here's an overview of the two approaches we've used:

  • Informal (Part I): descriptive statistics pointed to a non-normal, right-skewed distribution.
  • Formal (Part II): the Lilliefors test did not reject normality (p = 0.072), the Shapiro-Wilk test did (p = 0.006), and the Q-Q plot showed most points within the 95% confidence bands (R² = 0.904), a close but imperfect fit with positive skew.

The Road to Normal: What Comes After Detection

We learned that non-normal data is quite common in real-world research. However, depending on your analytical needs, you may need to transform a variable to approximate a normal distribution.

The next step involves applying transformations and leveraging various statistical tools. We have several methods at our disposal to achieve our objectives. The choice of strategy depends on our specific goals, the nature of our data, and the context of our research. Some approaches are more flexible with departures from normality, while others require stricter adherence. We’ll select the most appropriate method based on these factors.

Should we always transform our data to serve our goals? We will discuss this. See you in Part III …

References:
1. Stine, R. A. (n.d.). Explaining Normal Quantile-Quantile Plots through Animation: The Water-Filling Analogy. Department of Statistics, The Wharton School of the University of Pennsylvania. http://www-stat.wharton.upenn.edu/~stine/shiny/quantile_plot.pdf

2. Simon, H. A. (n.d.). On a Class of Skew Distribution Functions. SNAP: Stanford Network Analysis Project. https://snap.stanford.edu/class/cs224w-readings/Simon55Skewdistribution.pdf

3. Handling Skewed Data: A Comparison of Two Popular Methods. (n.d.). MDPI. https://www.mdpi.com/2076-3417/10/18/6247


León Andrés M.

PhD in Statistics | Deep Learning Researcher | Algorithm Alchemist | Go ⚫⚪