Unleashing the Power of Exploratory Data Analysis (EDA) Techniques

Muhammad Rayyan Athar
7 min read · Jun 11, 2023


Exploratory Data Analysis (EDA) is a technique used to examine and summarize the key features of a dataset, often using visual techniques. It involves several steps to gain insights into the data.

  1. Data Collection
  2. Data Cleaning
  3. Data Preprocessing
  4. Data Visualization

Data Collection:

Data collection is the process of gathering relevant and reliable information or data from various sources. It involves systematically collecting, recording, and organizing data to support analysis, decision-making, and research purposes.

Once data is obtained, it is essential to check the data types of the features. The different types of features include:

1. Numeric: These are quantitative variables that represent numerical values, such as age, height, or temperature.

2. Categorical: These are qualitative variables that represent distinct categories or groups, such as gender, nationality, or color.

3. Ordinal: These variables have a natural order or hierarchy among their categories, such as ratings (e.g., 1-star, 2-star, 3-star) or education levels (e.g., high school, bachelor’s, master’s).

4. Datetime: These variables represent dates and/or times, allowing for temporal analysis and tracking events over time.

5. Coordinates: These variables represent geographical coordinates, such as latitude and longitude, enabling spatial analysis and mapping.

To determine the data types of the features in a dataset, you can use the “dtypes” attribute of a pandas DataFrame. Alternatively, “df.info()” shows the data types together with non-null counts and memory usage.

To examine the statistical summary of the dataset, we can use “df.describe()”.
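A minimal sketch of these commands, assuming a small hypothetical DataFrame (in practice you would load your own data, e.g. with “pd.read_csv”):

```python
import pandas as pd

# Hypothetical dataset for illustration
df = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["NY", "LA", "SF"],
})

print(df.dtypes)      # data type of each column
df.info()             # column types, non-null counts, memory usage
print(df.describe())  # statistical summary of the numeric columns
```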

After completing the data collection process, we can move on to the data cleaning phase.

Data Cleaning:

Data cleaning is the process of identifying and rectifying errors and inconsistencies in a dataset, including handling missing values, correcting formatting issues, and removing duplicates, to enhance data quality.

First, check for missing values in the dataset using “df.isnull().sum()”.

This counts the missing (null) values in each feature/column, providing a per-column summary of the missing data.
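For instance, with a toy DataFrame containing a couple of missing entries (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "age": [25, np.nan, 47],
    "city": ["NY", "LA", None],
})

print(df.isnull().sum())  # number of missing values per column
```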

How do we deal with missing data?

When dealing with missing data, consider the following strategies:

  1. Removal: Remove rows or columns with missing values using “dropna()” in pandas, if feasible and the missing data is minimal.
  2. Imputation: Fill in missing values using techniques like mean, median, or mode imputation using “fillna()” in pandas.
  3. Advanced imputation: Utilize more sophisticated methods like regression imputation or k-nearest neighbors imputation for more accurate estimates.
  4. Categorical imputation: For categorical data, replace missing values with the most frequent category or a separate category to indicate missingness.
  5. Domain-specific imputation: In certain cases, use domain knowledge or specific algorithms tailored to the data, such as interpolation for time series data.

For example:

1. Remove rows with missing values using “dropna()”.

2. Fill the missing values with a specific value: the code data.fillna(0) replaces missing values in the DataFrame data with the value 0.

Filling with a constant is just one option; we can also fill missing values with statistics computed from the data, such as the column mean or median.
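The first two strategies, plus mean imputation, can be sketched on a hypothetical DataFrame (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "age": [25, np.nan, 47, np.nan],
    "score": [80, 90, np.nan, 70],
})

dropped = data.dropna()                 # 1. removal: only complete rows remain
filled_zero = data.fillna(0)            # 2. constant imputation
filled_mean = data.fillna(data.mean())  # mean imputation, column by column
print(filled_mean)
```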

After completing the data cleaning process, we can proceed with data preprocessing.

Data Preprocessing:

Data preprocessing applies techniques to raw data, such as scaling, feature selection, handling outliers, and encoding categorical variables, to prepare it for analysis and improve modeling accuracy. It involves several key steps:

1. Data Cleaning: Handle missing values, duplicates, and formatting issues, for example by removing rows with missing values.
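As a sketch, dropping duplicates and incomplete rows on a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical raw data with a missing value and a duplicate row
data = pd.DataFrame({
    "age": [25, 30, None, 30],
    "city": ["NY", "LA", "NY", "LA"],
})

cleaned = data.drop_duplicates()  # drop the duplicated (30, "LA") row
cleaned = cleaned.dropna()        # drop the row with the missing age
print(cleaned)
```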

2. Feature Scaling: Normalize numerical features so that they are on comparable scales and contribute fairly to the analysis.
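One simple approach is min-max normalization, sketched here directly in pandas with made-up columns (scikit-learn’s “MinMaxScaler” does the same job):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [20_000, 40_000, 60_000, 100_000],
})

# Min-max scaling: rescale each column to the [0, 1] range
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```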

3. Feature Selection: Choose relevant features or transform them to reduce dimensionality, for example by keeping the top 5 features ranked by their chi-squared test scores.
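A sketch with scikit-learn’s “SelectKBest”; the classification data is synthetic, and the features are shifted to be non-negative because the chi-squared test requires it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic classification data; chi2 needs non-negative features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = np.abs(X)

# Keep the 5 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)
```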

4. Handling Outliers: Identify and address outliers that can distort analysis or modeling, for example by filtering the ‘feature’ column against specified lower and upper thresholds.
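For instance, filtering a hypothetical ‘feature’ column against percentile-based thresholds (the 5th and 95th percentiles are one common choice):

```python
import pandas as pd

data = pd.DataFrame({"feature": [10, 12, 11, 13, 95, 12, -40, 11]})

# Thresholds from the 5th and 95th percentiles
lower = data["feature"].quantile(0.05)
upper = data["feature"].quantile(0.95)

# Keep only the rows inside the thresholds
filtered = data[(data["feature"] >= lower) & (data["feature"] <= upper)]
print(filtered)
```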

5. Encoding Categorical Variables: Convert categorical features into numerical representations, for example by one-hot encoding the ‘categorical_feature’ column.
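One-hot encoding with “pd.get_dummies”, using a made-up ‘categorical_feature’ column:

```python
import pandas as pd

data = pd.DataFrame({"categorical_feature": ["red", "green", "blue", "red"]})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(data, columns=["categorical_feature"])
print(encoded.columns.tolist())
```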

These steps help ensure data quality, improve modeling accuracy, and prepare the dataset for further analysis. Now, we will proceed with data visualization on the preprocessed data.

Data Visualization:

Data visualization is the presentation of data and information in a visual format, such as charts, graphs, and maps, to effectively communicate patterns, trends, and insights. It enhances understanding, aids decision-making, and facilitates the exploration and analysis of data.

There are various types of graphs and charts used in data visualization, including:

1. Bar Chart: Displays categorical data using rectangular bars of different heights.

When To Use?

If you are comparing categorical data or displaying frequency counts of different categories, utilize a bar chart.

Code Example:
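A minimal bar chart with matplotlib; the categories and counts are made up for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical category counts
categories = ["A", "B", "C", "D"]
counts = [23, 45, 12, 30]

fig, ax = plt.subplots()
ax.bar(categories, counts, color="steelblue")
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Counts per Category")
plt.show()
```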

2. Line Chart: Shows the trend or relationship between two variables by connecting data points with lines.

When To Use?

If you are visualizing trends over time or showcasing the relationship between two continuous variables, use a line chart.

Code Example:
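A simple line chart over made-up monthly sales figures:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales data
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [100, 120, 90, 140, 160, 155]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Monthly Sales Trend")
plt.show()
```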

3. Pie Chart: Illustrates the proportion of different categories within a whole using slices of a circle.

When To Use?

If you want to represent the composition or distribution of categorical data with a limited number of categories, apply a pie chart.

Code Example:
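A pie chart over a few hypothetical genre shares:

```python
import matplotlib.pyplot as plt

# Hypothetical shares of each genre (percent of the whole)
labels = ["Action", "Comedy", "Drama"]
shares = [35, 40, 25]

fig, ax = plt.subplots()
ax.pie(shares, labels=labels, autopct="%1.1f%%")
ax.set_title("Share of Genres")
plt.show()
```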

4. Histogram: Presents the distribution of a continuous variable by grouping data into bins and displaying their frequencies.

When To Use?

If you are visualizing the distribution and frequency of continuous data, use a histogram.

Code Example:
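A histogram over synthetic, normally distributed values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic continuous data for illustration
rng = np.random.default_rng(0)
values = rng.normal(loc=35, scale=10, size=500)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=20, edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of Values")
plt.show()
```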

5. Box Plot: Displays the distribution of numerical data through quartiles, highlighting potential outliers.

When To Use?

If you want to display the distribution, skewness, and outliers of numerical data, utilize a boxplot.

Code Example:
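A box plot comparing two synthetic groups with different spreads:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two synthetic groups: similar centers, different spread
rng = np.random.default_rng(1)
groups = [rng.normal(50, 5, 200), rng.normal(60, 15, 200)]

fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticklabels(["Group A", "Group B"])
ax.set_ylabel("Value")
plt.show()
```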

6. Heatmap: Represents data values in a matrix using colors to visualize patterns and relationships.

When To Use?

If you want to showcase the correlation or relationship between multiple variables in a matrix-like form, apply a heatmap.

Code Example:
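A correlation heatmap using seaborn (a third-party library commonly paired with matplotlib), on synthetic data with made-up column names:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic numeric dataset
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()  # pairwise correlation matrix

fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Matrix")
plt.show()
```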

7. Area Chart: Depicts the cumulative magnitude of multiple variables over time, showing the overall trend.

When To Use?

If you want to demonstrate the cumulative contribution of different variables over time, utilize an area chart.

Code Example:
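A stacked area chart with “stackplot”, over hypothetical yearly revenue for two products:

```python
import matplotlib.pyplot as plt

# Hypothetical yearly revenue per product
years = [2019, 2020, 2021, 2022]
product_a = [10, 15, 20, 25]
product_b = [5, 10, 12, 18]

fig, ax = plt.subplots()
ax.stackplot(years, product_a, product_b, labels=["Product A", "Product B"])
ax.legend(loc="upper left")
ax.set_xlabel("Year")
ax.set_ylabel("Revenue")
plt.show()
```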

8. Bubble Chart: Displays three dimensions of data by representing data points as bubbles with varying sizes and colors.

When To Use?

If you want to display three dimensions of data, where the size and position of bubbles represent different variables, use a bubble chart.

Code Example:
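A bubble chart is just a scatter plot whose marker sizes (and here colors) encode a third variable; the data is synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: x, y positions plus a third dimension
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 30)
y = rng.uniform(0, 10, 30)
sizes = rng.uniform(20, 400, 30)  # third dimension, shown as bubble area

fig, ax = plt.subplots()
sc = ax.scatter(x, y, s=sizes, c=sizes, cmap="viridis", alpha=0.6)
fig.colorbar(sc, ax=ax, label="Size")
plt.show()
```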

9. TreeMap: Visualizes hierarchical data using nested rectangles to represent the proportion of each category.

When To Use?

If you want to visualize hierarchical data and the relative sizes of different categories within a whole, apply a treemap.

Code Example:

10. Scatter Plot: Represents the relationship between two numerical variables using individual data points on a Cartesian plane.

When To Use?

If you want to showcase the relationship or correlation between two continuous variables, use a scatter plot.

Code Example:
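A scatter plot over two synthetic, loosely correlated variables (the height/weight framing is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic correlated variables
rng = np.random.default_rng(4)
height = rng.normal(170, 10, 50)
weight = 0.5 * height + rng.normal(0, 5, 50)

fig, ax = plt.subplots()
ax.scatter(height, weight, alpha=0.7)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Height vs. Weight")
plt.show()
```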

In conclusion, to gain practical insights into EDA techniques, I invite you to explore my notebook on EDA with the Netflix dataset. This comprehensive notebook covers the step-by-step process of data collection, cleaning, preprocessing, and visualization. By following along, you can enhance your understanding of structured and efficient data analysis. Take advantage of this valuable resource to further develop your knowledge and skills in EDA.
