Processes Involved in Data Wrangling and Exploratory Data Analysis

Jayanta Parida
Sep 5, 2022


Data scientists are often said to spend as much as 80% of their time on data wrangling and exploratory data analysis. In this blog, we will walk through the basic processes behind both.

Data Wrangling is the process of transforming raw data into a more useful format. It involves steps that address aspects such as encoding, missing information, formatting, and duplicate data.
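For instance, fixing formatting and dropping duplicate rows with pandas might look like this (a minimal sketch; the DataFrame and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical raw records (illustrative only, not data from the post)
raw = pd.DataFrame({
    "name": ["Asha Rao", "Asha Rao", "Ravi Iyer"],
    "age": ["29", "29", "34"],               # stored as strings
    "signup_date": ["2022-01-05", "2022-01-05", "2022-02-11"],
})

# Formatting: convert columns to proper types
raw["age"] = raw["age"].astype(int)
raw["signup_date"] = pd.to_datetime(raw["signup_date"])

# Duplicate data: drop exact duplicate rows
clean = raw.drop_duplicates()
print(clean)
```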

Exploratory Data Analysis is the process of analyzing data to gain valuable insights through statistical summaries and visualizations.
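A quick statistical summary takes only a couple of pandas calls (a sketch on made-up data):

```python
import pandas as pd

# Hypothetical dataset (illustrative only)
df = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40000, 52000, 61000, 58000]})

df.info()             # column types and non-null counts
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
```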

Here is a list of key data wrangling tasks that are generally performed on data used for data science problems; a short pandas sketch follows the list.

  1. Selecting important information from raw data — Getting meaningful columns from the raw data provided.
  2. Removing unnecessary information — Discarding data that has no role in modeling.
  3. Adding domain knowledge to improve data quality — For example, applying domain concepts to derive meaningful data fields.
  4. Merging several data sources into a single dataset — When multiple data sources relate to the same analysis problem, we merge them into a single data frame so everything can be analyzed together. This can also include combining several fields into one meaningful field, such as first name and last name into a full name.
  5. Dealing with missing fields — Some rows may have missing values; we can fill them in with techniques such as imputing the mean (for example, the mean age for missing age values) or discard rows based on their missing fields.
  6. Performing Encoding — Converting categorical column values into numerical data.
  7. Performing scaling, which includes normalization and standardization — Ensuring that features are on the same scale.
  8. Filtering data — Defining criteria on one or more fields to surface meaningful insights, for example, finding that people older than 30 are married while younger people are not. Any number of criteria across different fields can be combined to reveal different insights.
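Several of these tasks map directly onto pandas operations. The sketch below is purely illustrative; the DataFrames, column names, and the mean-imputation choice are assumptions rather than anything from the post:

```python
import pandas as pd

# Hypothetical customer data from two sources (illustrative only)
customers = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Asha", "Ravi", "Meera"],
    "last_name": ["Rao", "Iyer", "Nair"],
    "age": [34, None, 28],
    "internal_code": ["x1", "x2", "x3"],   # not needed for modeling
})
orders = pd.DataFrame({"id": [1, 2, 3], "total_spend": [120.0, 85.5, 240.0]})

# Tasks 1 and 2: keep meaningful columns, drop unnecessary ones
customers = customers.drop(columns=["internal_code"])

# Task 4: merge several sources into one frame and combine fields into one
df = customers.merge(orders, on="id")
df["name"] = df["first_name"] + " " + df["last_name"]

# Task 5: deal with missing fields, e.g. fill missing age with the mean age
df["age"] = df["age"].fillna(df["age"].mean())

# Task 8: filter with a criterion, e.g. customers older than 30
older_than_30 = df[df["age"] > 30]
print(older_than_30[["name", "age", "total_spend"]])
```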

Feature Scaling is required before training machine learning models to ensure that features are on the same scale. There are two ways to achieve feature scaling.

Normalization is done to make feature values range from 0 to 1, typically computed as x' = (x − min) / (max − min).

Standardization is done to transform the data to have a mean of 0 and a standard deviation of 1, computed as z = (x − mean) / std; the resulting values are not bounded to any fixed range.
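Here is a minimal sketch with scikit-learn (the feature values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single numeric feature (illustrative only)
ages = np.array([[18.0], [25.0], [40.0], [60.0]])

# Normalization: rescale values into the range 0 to 1
normalized = MinMaxScaler().fit_transform(ages)

# Standardization: mean 0 and standard deviation 1, no fixed bounds
standardized = StandardScaler().fit_transform(ages)

print(normalized.ravel())    # values between 0.0 and 1.0
print(standardized.ravel())  # mean ~0, std ~1
```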

One-Hot Encoding works by converting a categorical column into a set of columns that contain only 1’s and 0’s. In other words, we convert categorical data into numerical data, since machine learning models work with numbers. One question arises here: can we simply replace the categories with integer values directly instead of 1/0 encoding? We can, but that introduces an ordering issue. When the model sees numeric values such as 1, 2, 3, it assumes one category is of a higher order than another purely because of those values.
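Here is a minimal sketch of one-hot encoding with pandas (the column and categories are made up):

```python
import pandas as pd

# Hypothetical categorical column (illustrative only)
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# Each category becomes its own 0/1 column, so no artificial order is implied
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
print(encoded)
```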

Finally, as part of exploratory data analysis, we can plot histograms to inspect feature distributions and compute a correlation matrix to study relationships between features. Many other visualization methods can also be used to draw insights from the data.
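For instance (a sketch with pandas and matplotlib on made-up data):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric dataset (illustrative only)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [40000, 52000, 61000, 58000, 49000],
})

# Histograms to inspect each feature's distribution
df.hist(bins=5)
plt.tight_layout()
plt.show()

# Correlation matrix to see how features relate to each other
print(df.corr())
```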
