How to Explore Your Data Effectively Using Python

A Step-by-Step Ultimate Guide to Understanding Your Data Better!

Richard Warepam

Published in

ILLUMINATION

5 min read · Mar 27, 2024

“Experts often possess more data than judgement.”
– Colin Powell

Do you jump right into data analysis or model building as soon as you get the data?

And are you then disappointed when the results are nowhere near the ones you wanted?

In data science, it's all about the data. If you don’t explore and understand the patterns, trends, and anomalies within the data, you aren’t as good as you think you are.

Data exploration is the most basic and fundamental step in any data science project.

And here, you will learn how to explore your data in the most effective way.

Let’s get started!

One-Dimensional Data Exploration:

Also known as “univariate data,” it refers to a dataset that consists of a single variable. Essentially, it’s a collection of numbers.

For instance, it could be the average daily time each user spends on your website or the number of pages in each book in your data science library.

The first step to exploring this type of data is to summarize its main characteristics using descriptive statistics, followed by visualization techniques like histograms.

Descriptive statistics give you the number of data points, the smallest and largest values, the quartiles, the mean, and the standard deviation. However, this alone doesn’t provide a comprehensive understanding.
Therefore, the next step is to visualize the data using a histogram, which shows the shape of your data’s distribution. Additionally, box plots can be used to identify outliers and to get a rough sense of skewness. Kurtosis is hard to read from a box plot, so it is better judged from the histogram or computed numerically.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data
data = pd.Series(np.random.randn(1000))

# Descriptive Statistics
print(data.describe())

# Visualization - Histogram
plt.hist(data, bins=30)
plt.title('Histogram of One-Dimensional Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Visualization - Boxplot (pass the Series by keyword, as recent seaborn versions expect)
sns.boxplot(x=data)
plt.title('Box Plot of One-Dimensional Data')
plt.show()
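
The plots above give a visual impression of shape and outliers, but the same properties can also be put into numbers. The following is a minimal sketch, assuming the same kind of standard-normal sample data as above. The seed and the 1.5 × IQR outlier rule are choices made for this illustration (the rule is the one box-plot whiskers use by default), not the only valid ones.

```python
import numpy as np
import pandas as pd

# Reproducible sample data (assumption: standard normal, like the example above)
rng = np.random.default_rng(42)
data = pd.Series(rng.standard_normal(1000))

# Shape of the distribution
skewness = data.skew()   # close to 0 for symmetric data
kurtosis = data.kurt()   # excess kurtosis; close to 0 for normal data

# Outliers by the 1.5 * IQR rule
q1, q3 = data.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print(f"Skewness: {skewness:.3f}")
print(f"Excess kurtosis: {kurtosis:.3f}")
print(f"Values outside [{lower:.2f}, {upper:.2f}]: {len(outliers)}")
```

A clearly positive or negative skewness suggests a long tail on one side, and a large excess kurtosis suggests heavier tails than a normal distribution would have.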

And that’s how you perform “One-Dimensional Data Exploration”.

Two-Dimensional Data Exploration:

Also known as “bivariate data,” this involves two different variables.

Imagine you have a dataset with two dimensions. For instance, in addition to the daily minutes spent on your website, you also have data on “years of data science experience.”

Naturally, you’d want to understand each dimension individually, but you’d also want to plot one against the other to see the relationship between the two variables.

So, how do you explore this type of data?

First, visualize the data using a “Scatter Plot”. This will show you the relationship between the two quantitative variables and help you spot correlations, trends, and outliers.
After identifying the relationship, you’d want to measure its strength and direction, right? That’s where the “Correlation Coefficient” comes into play. So, calculate the correlation coefficient to quantify the relationship. Keep in mind that pandas computes Pearson’s coefficient by default, which only captures linear relationships.
Lastly, if your variables are categorical (or can be grouped into categories), a cross-tabulation will help you understand the relationship by counting how often each combination of categories occurs.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data
df = pd.DataFrame({
    'variable1': np.random.randn(1000),
    'variable2': np.random.randn(1000)
})

# Scatter Plot
sns.scatterplot(x='variable1', y='variable2', data=df)
plt.title('Scatter Plot of Two-Dimensional Data')
plt.show()

# Correlation Coefficient
print(df.corr())

# Cross-tabulation
tab = pd.crosstab(df['variable1'] > 0, df['variable2'] > 0)
print(tab)
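
Because the sample data above is pure noise, the correlation and the cross-tabulation have little to show. The sketch below uses made-up data in which the two variables are related by construction. The column names (`experience`, `minutes`, `level`, `usage`), the bin edges, and the formula linking the columns are all hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: daily minutes grow with years of experience, plus noise
experience = rng.uniform(0, 10, n)
minutes = 20 + 3 * experience + rng.normal(0, 5, n)
df = pd.DataFrame({'experience': experience, 'minutes': minutes})

# Pearson measures linear association; Spearman measures monotonic association
cols = ['experience', 'minutes']
pearson = df[cols].corr(method='pearson').loc['experience', 'minutes']
spearman = df[cols].corr(method='spearman').loc['experience', 'minutes']
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")

# Cross-tabulation needs categories, so bin each variable first
df['level'] = pd.cut(df['experience'], bins=[0, 3, 7, 10],
                     labels=['junior', 'mid', 'senior'], include_lowest=True)
df['usage'] = np.where(df['minutes'] > df['minutes'].median(), 'heavy', 'light')

# normalize='index' turns the counts into row proportions
tab = pd.crosstab(df['level'], df['usage'], normalize='index')
print(tab)
```

If Pearson and Spearman differ a lot on your own data, that is often a hint that the relationship is monotonic but not linear, or that outliers are pulling the Pearson value.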

I hope you've been able to grasp the concepts up until now. Let’s dive into another type of data exploration next.

Multi-Dimensional Data Exploration:

As the name suggests, multi-dimensional data involves three or more variables.

With multiple dimensions, you’d naturally want to understand how they are related.

Note: Keep in mind that as the number of dimensions increases, the analysis gets more complex, but there is also more insight to gain.

To explore this type of data, the first and simplest way is to examine the “correlation matrix.”

This matrix can be visualized using “heat maps,” which display the correlation between variables at a glance, highlighting areas of high and low correlation.
Alternatively, you can use a “pair plot” for visualization. It shows all the pairwise scatter plots between multiple variables in a dataset, providing a quick way to understand the relationship between all pairs of variables.
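Lastly, techniques like PCA (Principal Component Analysis) can be used. PCA reduces the dimensionality by projecting the data onto a few new axes that retain as much of the variance as possible. How much is actually retained depends on how strongly your variables are correlated, so it is worth checking.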

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Sample data
df = pd.DataFrame(np.random.randn(1000, 5), columns=['var1',
                                                     'var2',
                                                     'var3',
                                                     'var4',
                                                     'var5'])

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f")
plt.title('Heatmap of Multi-Dimensional Data')
plt.show()

# Pair Plot
sns.pairplot(df)
plt.suptitle('Pair Plot of Multi-Dimensional Data', y=1.02)  # lift the title above the grid
plt.show()

# Dimensionality Reduction with PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df)
df_pca = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])
print(df_pca.head())
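
To see how much information two components actually keep, you can inspect `explained_variance_ratio_` after fitting. The following is a minimal sketch under stated assumptions: the data is synthetic, with five columns driven by two hidden factors (the loadings are made up for this illustration), and the columns are standardized first with `StandardScaler`, a common choice that the example above does not make.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 1000

# Hypothetical data: five columns built from two hidden factors plus noise
factors = rng.standard_normal((n, 2))
loadings = np.array([[1.0, 0.8, 0.6, 0.0, 0.2],
                     [0.0, 0.3, 0.5, 1.0, 0.9]])
X = factors @ loadings + 0.3 * rng.standard_normal((n, 5))
df = pd.DataFrame(X, columns=['var1', 'var2', 'var3', 'var4', 'var5'])

# Standardize so that no column dominates because of its scale
scaled = StandardScaler().fit_transform(df)

pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled)

# Share of the total variance captured by each component
ratios = pca.explained_variance_ratio_
print("Variance explained per component:", ratios.round(3))
print("Total variance retained:", ratios.sum().round(3))
```

With correlated columns like these, two components keep most of the variance. With independent random columns, as in the sample data above, two out of five components can only keep roughly two fifths of it, so always check this number before relying on the reduced data.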

This is how multi-dimensional data exploration is typically performed.

Conclusion:

In this article, I’ve strived to provide a comprehensive understanding of how to effectively explore data in three distinct ways.

We began with exploring each variable individually (One-Dimensional Data Exploration) and gradually moved to exploring multiple variables simultaneously, including their interrelationships.

Remember, data exploration is an iterative process. Continually apply the insights from this article to your projects and keep enhancing your data science skills.
