From Noise to Knowledge: Mastering Exploratory Data Analysis and Outlier Detection

Data, data everywhere, but are you truly aware?

Ahmad Suhail
Tensor Labs
10 min read · Jul 25, 2023


Exploratory Data Analysis (EDA) is a crucial step in data analysis, allowing us to gain insights and understand the underlying patterns and relationships within our datasets. In this article, we’ll look into the important questions you need to ask of your data and help you understand it better.

Visualizing Data Efficiently and Quickly

Our first section is about how we can speed up the process of gaining insights from data, and how it can be enhanced with new tools and techniques.

One of the challenges in EDA is effectively visualizing data to uncover meaningful patterns. Traditional plotting libraries often require significant coding skills and can be time-consuming. However, there are now new tools available that simplify this process.

Sweetviz

Sweetviz is one such tool that automates the generation of comprehensive EDA reports with just a few lines of code. It provides an overview of key statistics, distributions, correlations, and more. By using Sweetviz, analysts can quickly identify trends and outliers within their datasets.

import sweetviz as sv

# df is any pandas DataFrame you want to profile
my_report = sv.analyze(df)
my_report.show_html()  # With the default arguments, the report is saved to "SWEETVIZ_REPORT.html"

Bamboolib

Another tool gaining popularity is bamboolib, which offers a user-friendly interface for interactive data exploration. It allows users to easily filter, sort, group, pivot, and visualize data without writing any code. With bamboolib’s intuitive features, even non-technical users can perform advanced EDA tasks effortlessly.

The best thing about bamboolib, and what sets it apart from the others, is that it generates the equivalent pandas code for every data manipulation you perform in its interface.

import bamboolib as bam
bam.enable()

# Displaying a DataFrame in a Jupyter notebook now opens the bamboolib UI
df

Here’s a short tutorial for you to get started with bamboolib.

Don’t forget to switch back to the standard pandas DataFrame display when you’re done exploring:

bam.disable()

ydata-profiling

YData’s profiling library (ydata-profiling, formerly known as pandas-profiling) is another powerful tool for EDA. It generates detailed reports on various aspects of the dataset such as missing values, unique values, data types, distribution histograms, correlation matrices, and more. This library provides valuable insights into dataset quality and helps identify potential issues or anomalies, letting you quickly understand the distribution of your data, detect outliers, and gain a deeper understanding of the relationships between variables.

from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report")

profile.to_file("ydata_profiling_report.html")

By utilizing these new tools — Sweetviz, bamboolib, and ydata-profiling — analysts can take their exploratory data analysis beyond the ordinary. These tools not only simplify complex tasks but also provide rich visualizations and insights, enabling analysts to navigate the data universe more effectively and unveil hidden patterns.

When to Normalize the Data

Alright, let’s talk about normalizing datasets to tame differences in scale and skewness. When we say “normalize,” we mean transforming the data in a way that makes it easier to analyze and compare.

Standard Scaler

One common method of normalization is StandardScaler, which rescales the data so that it has a mean of 0 and a standard deviation of 1. This brings all features onto a similar scale so they can be compared directly. It is most useful when the distribution of the data is roughly Gaussian and the ranges of the features vary widely. Keep in mind that the mean and standard deviation are themselves influenced by outliers, so standardization does not make outliers go away. Here’s how you can use StandardScaler in Python:

from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the data
df_standardized = scaler.fit_transform(df)
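
One practical note: fit_transform returns a plain NumPy array rather than a DataFrame. If you want to keep the column names, you can wrap the result back into a DataFrame, for example (a small sketch, assuming pandas is imported as pd and df contains only numeric columns):

import pandas as pd

# Wrap the scaled array back into a DataFrame, keeping the original columns and index
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

The same applies to the MinMaxScaler example below.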

MinMax Scaler

Another approach is MinMax normalization (MinMaxScaler), where the data is scaled to a specific range, usually between 0 and 1. This method preserves the relative relationships between the data points while ensuring they fall within a consistent range. It is useful when the distribution of the data is not Gaussian and the ranges of the features vary widely. Here’s how you can use MinMaxScaler in Python:

from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit and transform the data
df_normalized = scaler.fit_transform(df)

Box-Cox Transform

If you’re dealing with skewed data, you might consider using the Box-Cox transform. This technique applies a power transformation to make the distribution more symmetric. It can be particularly useful when dealing with highly skewed or non-normal distributions, but note that it only works on strictly positive values. The Box-Cox transform also estimates a lambda value for each data column, which you will need later if you want to convert the values back from the transformed scale to the original scale.

from scipy.stats import boxcox
from scipy.special import inv_boxcox

lambda_values_dictionary = {}

# Apply the Box-Cox transform column by column
# (filtered_data is the all-numeric, strictly positive DataFrame being transformed)
df_transformed = filtered_data.copy()
for col in filtered_data.columns:
    df_transformed[col], lambda_values_dictionary[col] = boxcox(filtered_data[col])

# Invert the transform for the column(s) you need back on the original scale
for key, lambda_value in lambda_values_dictionary.items():
    if key == "target_variable":
        df_transformed[key] = inv_boxcox(df_transformed[key], lambda_value)

By normalizing your dataset using techniques like StandardScaler, MinMax normalization, or Box-Cox Transform, you can make your data more suitable for analysis and reduce any biases introduced by variations in scale or distribution.

But be aware that normalizing your data may not always be the best approach. Normalization can be sensitive to outliers, which are data points that are significantly different from the rest of the data. If your dataset contains outliers, normalization may not be appropriate because it can distort the distribution of the data.
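To make this concrete, here’s a tiny illustration (with made-up numbers, purely for demonstration) of how a single extreme value can squash everything else when you apply MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Nine ordinary values plus one extreme outlier
values = np.array([[10], [12], [11], [13], [12], [11], [10], [13], [12], [1000]])

scaled = MinMaxScaler().fit_transform(values)
print(scaled.ravel())
# The ordinary values end up crammed into a tiny band near 0,
# because the single outlier defines the top of the [0, 1] range.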

Outlier Detection 101

A Step-by-Step Guide to Finding Influential Observations

So you’ve got a dataset and you’re ready to dive into some data analysis. But wait — have you checked for outliers yet? Outliers are those weird, wacky data points that don’t seem to fit with the overall pattern. They can seriously mess up your analysis and lead you to draw incorrect conclusions. Never fear, detecting outliers is actually pretty straightforward. In this section, I’ll walk you through how to spot influential observations in your dataset and decide whether to keep or remove them. By the end, you’ll be an outlier detection pro, ready to continue with your analysis knowing your data is squeaky clean. Let’s get started!

Detecting Outliers Using Z-Score

Z-score is a statistical measure that represents the number of standard deviations a data point is from the mean of a dataset. It is commonly used for outlier detection, where outliers are defined as data points that are significantly different from the rest of the data.

Z-score is a useful technique for outlier detection when the data is normally distributed or approximately normally distributed. In such cases, the mean and standard deviation can be used to define a range of values that are considered “normal” for the dataset. Any data point that falls outside this range can be considered an outlier.

However, it is important to note that Z-score may not be appropriate for outlier detection in all cases. For example, if the data is not normally distributed, Z-score may not accurately capture the range of “normal” values. In such cases, other techniques such as interquartile range (IQR) or Tukey’s method may be more appropriate.

import numpy as np
from scipy import stats

# Keep only the rows that lie within 3 standard deviations of the mean in every column
z_score_threshold = 3.0
df = df[(np.abs(stats.zscore(df)) < z_score_threshold).all(axis=1)]

Detecting Outliers Using the Interquartile Range

IQR (Interquartile Range) is a statistical measure that represents the range of values that cover the middle 50% of a dataset. It is commonly used for outlier detection, where outliers are defined as data points that fall outside a certain range of values.

IQR is a useful technique for outlier detection when the data is not normally distributed or when it contains extreme values (for example, skewed or heavy-tailed distributions). In such cases, the mean and standard deviation may not accurately represent the central tendency and variability of the data.

# First and third quartiles for each column
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
k = 1.5  # Tukey's conventional fence multiplier
IQR = Q3 - Q1

# Keep only the rows that fall inside [Q1 - k*IQR, Q3 + k*IQR] in every column
df = df[~((df < (Q1 - k * IQR)) | (df > (Q3 + k * IQR))).any(axis=1)]

Detecting Outliers Using Cook’s Distance

Cook’s distance is a useful metric for detecting outliers in your dataset that have a disproportionate influence on your regression model. To calculate Cook’s distance, you’ll first need to run a linear regression on your data.

Once you have your regression results, here are the steps to find influential observations:

  1. Identify the residual (error) for each data point. This is the difference between the actual y value and the predicted y value from your model.
  2. Calculate the leverage for each point. Leverage measures how far a point’s x values are from the average of the x values; points far from the centre of the predictors have higher leverage.
  3. Standardize each residual by dividing it by its estimated standard deviation, which itself depends on the point’s leverage.
  4. Combine the two: a point’s Cook’s distance is its squared standardized residual multiplied by h / (1 - h), where h is its leverage, divided by the number of parameters in the model. Points that are both poorly fit and high-leverage score highest.
  5. Look for points with a Cook’s distance greater than 4/n, where n is the number of observations. These are your influential outliers.
  6. Consider removing these outliers and re-running your analysis. See how much the results change with the outliers gone. If the results change substantially, your model may not be robust to outliers.

By checking Cook’s distance, you can identify points that are having a disproportionate pull on your model and potentially skewing your results. Removing or downweighting these outliers can lead to a more stable and generalizable model.

Detecting and managing outliers is an important step in any data analysis. Cook’s distance provides a simple way to identify the most problematic points so you can build the most accurate models possible.

Here’s the code snippet:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

# Fit an ordinary least squares regression
def perform_regression(X, y):
    X = sm.add_constant(X)  # Add a constant column for the intercept
    model = sm.OLS(y, X)
    results = model.fit()
    return results

# Calculate Cook's distance for every observation
def calculate_cooks_distance(results):
    cooks_distance = results.get_influence().cooks_distance[0]
    return cooks_distance

# Draw a scatter plot of Cook's distance against the target variable
def plot_cooks_distance(cooks_distance, y):
    plt.scatter(y, cooks_distance)
    plt.xlabel("Target Variable")
    plt.ylabel("Cook's Distance")
    plt.show()

# Perform prediction
def perform_prediction(results, X):
    X = sm.add_constant(X)
    predictions = results.predict(X)
    return predictions

# Calculate an outlier threshold (here: three times the mean Cook's distance)
def get_threshold(cooks_distance):
    mean_cooks_distance = np.mean(cooks_distance)
    threshold = mean_cooks_distance * 3
    return threshold

# Get the Mean Absolute Error
def get_mae(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    return mae

# Exclude outliers based on Cook's distance
def exclude_outliers(df, cooks_distance, threshold):
    df_filtered = df[cooks_distance <= threshold]
    return df_filtered

# Retrain on the new dataframe with the outliers removed
def retrain_model(df, target):
    X = df.drop(target, axis=1)
    y = df[target]
    results = perform_regression(X, y)
    return results
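
To tie these helpers together, here’s a minimal sketch of one possible workflow, assuming df is a pandas DataFrame whose label column is named "target" (a placeholder name) and using a quick in-sample error comparison purely for illustration:

# Fit the initial model on the full data
X, y = df.drop("target", axis=1), df["target"]
results = perform_regression(X, y)

# Compute Cook's distance, inspect it, and pick a threshold
cooks_distance = calculate_cooks_distance(results)
plot_cooks_distance(cooks_distance, y)
threshold = get_threshold(cooks_distance)

# Drop the influential points and retrain
df_filtered = exclude_outliers(df, cooks_distance, threshold)
results_filtered = retrain_model(df_filtered, "target")

# Compare in-sample errors before and after removing the outliers
mae_before = get_mae(y, perform_prediction(results, X))
mae_after = get_mae(df_filtered["target"],
                    perform_prediction(results_filtered, df_filtered.drop("target", axis=1)))
print(f"MAE before: {mae_before:.3f}, after: {mae_after:.3f}")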

Removing Outliers and Evaluating Their Impact

Removing outliers from your data is an important step to ensure your analysis and models are not skewed by extreme values. Here are the steps to detect and remove influential observations:

  • Visualize your data. Plot your data points on a graph to spot any obvious outliers. Look for points that are separated from the main group. These could be errors or truly extreme values.
  • Calculate summary statistics. Find the mean, median, standard deviation, min and max. Values that are more than 3 standard deviations from the mean may be outliers. However, in small datasets, this may be too strict a rule. Use your judgment.
  • Run regression diagnostics. If building a regression model, look at the studentized residuals, leverage, and Cook’s distance values. Points with high values on these metrics have a strong influence on the model and may warrant removal.
  • Remove outliers and re-analyze. Drop any points you identify as outliers, recalculate your summary statistics, and rebuild your models. See how the results change. If there are major differences, the outliers were significantly impacting your analysis. A small sketch of this before-and-after check follows this list.
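
Here’s that before-and-after check as a small sketch, assuming df is an all-numeric pandas DataFrame and using the simple 3-standard-deviation rule mentioned above:

import numpy as np
from scipy import stats

# Summary statistics on the full data
print(df.describe())

# Drop rows more than 3 standard deviations from the mean in any column
df_clean = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

# Summary statistics after removal: compare the mean, std, min and max with the originals
print(df_clean.describe())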

Evaluating the Impact

It’s important to evaluate how removing outliers changes your results. Some questions to ask:

  • How did summary statistics like the mean and standard deviation change? Are they more in line with expectations?
  • How did your regression model coefficients and R-squared value change? The model should improve if the outliers were detrimental.
  • Do the data visualizations look more normal? The distribution should appear more even without extreme points.
  • Were any insights or conclusions altered substantially? If so, the outliers were likely obscuring the real patterns in the data.

If you think finding all these statistics is painstaking and time-consuming, don’t worry: the fitted OLS model has a method that returns a summary of the trained model.

results = perform_regression(X, y)

results.summary()

Voilà! There you go: all the stats you need, quickly and in one place.

Conclusion

In conclusion, we’ve embarked on an exhilarating journey exploring the vast world of Exploratory Data Analysis (EDA) and its indispensable tools. With the advent of cutting-edge technologies, the data landscape is continually evolving, providing us with exciting opportunities to unravel insights like never before.

Our voyage began with an enlightening discussion on the latest EDA tools that are revolutionizing the way we explore data. From intuitive dashboards to interactive visualizations, these tools have opened doors to endless possibilities. Gone are the days of tedious and monotonous data exploration; today, we can effortlessly dive into the depths of our datasets, surfacing valuable patterns and trends with just a few clicks.

Furthermore, we delved into the critical aspect of data normalization, a technique that ensures fairness and consistency in our analysis. By eliminating biases caused by differences in units, scales, or distributions, we unlock a level playing field for our variables. This normalization process enables us to make meaningful comparisons and uncover hidden connections that may have remained concealed otherwise.

But what about the unexpected guests that occasionally gatecrash our data parties? Outliers! These notorious anomalies can wreak havoc on our analysis, leading to skewed results and misleading insights. Fear not, for we’ve also explored the art of outlier detection and removal. Armed with sophisticated algorithms and statistical methods, we can now effectively identify and handle these unruly data points, restoring integrity to our analysis.

As we conclude this article, let us reflect on the significance of these EDA tools, data normalization, and outlier detection and removal techniques. They serve as powerful instruments in the hands of data enthusiasts, allowing us to unearth fascinating narratives hidden within vast amounts of information. So, let’s embark on this exciting journey, armed with these tools, and navigate the ever-changing sea of data with confidence, curiosity, and enthusiasm. The world of exploration awaits us, ready to be deciphered one data point at a time!
