Everything You’ve Ever Wanted to Know About Exploratory Data Analysis (EDA) Part — 1

Soukhindra Nath Basak
Published in Geek Culture · 7 min read · Apr 27, 2021

If you can’t explain it simply, you don’t understand it well enough.

— Albert Einstein, Physicist

Exploratory data analysis provides the ability to simplify data, reveal important insights, and uncover hidden patterns in a data set by analyzing and investigating the data.

It is the first and foremost step in analyzing any kind of data: we use it to come up with hypotheses that we can later verify through hypothesis testing. Statisticians use it to take a bird's-eye view of the data and try to make some sense of it.

Why Is Exploratory Data Analysis Important?

EDA helps to look at the data before making any assumptions. It helps to better understand the patterns within the data, find relations among the variables, identify obvious errors, as well as to detect outliers within the data.

Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

Steps to Perform EDA

  1. Data sourcing

Data comes from various sources; for example, large amounts of data collected by the government or other public agencies are made public for research. In data sourcing, the first task is to gather all the information from the following sources of data:

a) Public Data: This type of data is made publicly available by the government or other public agencies for research or other purposes. Such data can be fetched directly as a ready-made dataset or may have to be extracted manually and converted into a usable format.

For Example - https://data.gov.in/

b) Private Data: This type of data is sensitive to an organization or person, which is why it cannot be made publicly available. Such data is mostly customer-centric or organization-centric and helps to optimize the organization's daily processes and enhance the customer experience.
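
To make data sourcing concrete, here is a minimal sketch that loads a downloaded dataset with pandas and takes a first look at it. The file name employees.csv and its contents are hypothetical placeholders, not an actual dataset from data.gov.in.

```python
import pandas as pd

# Load a dataset downloaded from a public portal (the file name is a placeholder).
df = pd.read_csv("employees.csv")

# Take a first look: number of rows/columns, column types, and a few sample rows.
print(df.shape)
print(df.dtypes)
print(df.head())
```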

2. Data cleaning

After procuring data from various sources, in most cases the data has major quality issues such as missing values, wrong data or formats, duplicate data, and many more. These issues lead to errors or irrelevant output, so it is necessary to handle them and fix the quality. The quality check is one of the most crucial and most time-consuming parts of the whole data-analysis process.

The data-cleaning process varies with the quality of the data set and the improvements it requires. Still, while cleaning data, the following are steps that a data analyst may keep in mind for reference:

a) Fixing Rows: Delete incorrect rows that degrade the quality of the dataset. Fixing rows can involve deleting incorrect rows, deleting summary rows, and deleting extra rows that will not provide any insight into the data set.

For example, in Figure A the data in row 5 is corrupted, so it is recommended to remove this kind of row. Likewise, as mentioned above regarding summary rows, row 7 in Figure A has to be deleted to fix the rows.

Figure — A
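
The following is a small pandas sketch of the row fixes described above. The file name, the 'Employee ID' column, and the 'Total' marker for the summary row are assumptions made for illustration.

```python
import pandas as pd

df = pd.read_csv("employees.csv")  # placeholder file from the sourcing step

# Delete extra rows where every value is blank; they add no insight.
df = df.dropna(how="all")

# Delete summary rows, assumed here to be marked as "Total" in the 'Employee ID' column.
df = df[df["Employee ID"].astype(str).str.lower() != "total"]

# Delete incorrect rows, e.g. rows where the mandatory identifier is missing.
df = df[df["Employee ID"].notna()]
```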

b) Fixing Columns: Now that the row data is fixed, some columns may also need to be fixed to increase the quality of the data set. Fixing columns can involve deleting unnecessary columns, splitting columns for more insight, adding column names if they are missing, aligning misaligned columns, and merging columns to form identifiers.

For Example:

  • In Figure B, the data in row 4 is misaligned; to fix it, the columns need to be shifted left by one for that row only.
  • Again, in Figure B, the name of column B is missing. This type of issue can be fixed by assigning a name based on the column's data; in this example we can name it 'First Name', as it holds the first name of the employee.
  • Columns can be merged to form identifiers or to reduce the data; for example, we can merge 'First Name' and 'Last Name' into a new column 'Full Name'.
  • Columns can also be split to extract relevant data from a single column; for example, the 'Portfolio' column can be split to fetch the domain that each employee uses to showcase their portfolio.

Figure — B
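
Below is a short pandas sketch of these column fixes. The column names ('First Name', 'Last Name', 'Portfolio') follow the Figure B example and are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("employees.csv")  # placeholder file from the sourcing step

# Add a missing column name, assumed here to be the second column.
df = df.rename(columns={df.columns[1]: "First Name"})

# Merge columns to create an identifier.
df["Full Name"] = df["First Name"].str.strip() + " " + df["Last Name"].str.strip()

# Split a column for more insight, e.g. extract the domain from a portfolio URL.
df["Portfolio Domain"] = df["Portfolio"].str.extract(r"https?://([^/]+)", expand=False)

# Delete a column that no longer adds value.
df = df.drop(columns=["Portfolio"])
```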

3. Fix Missing Values

a) Set values as missing values: Identify all disguised missing values such as blank strings, “NA”, “XX”, “999”, etc., and mark them as missing.

b) Adding is good, exaggerating is bad: Get information from reliable external sources as much as possible, but if you can’t, then it is better to keep missing values as such rather than exaggerating the existing rows/columns.

c) Delete rows, columns: Rows or columns can be removed if the share of missing values in them is quite significant, because imputing a large amount of missing data may lead to incorrect results. Therefore it is better to delete such rows or columns if necessary.

d) Fill partial missing values using business judgment: Fill missing data using methods such as the mean, median, or mode. The filling method should be chosen based on the behaviour of the data.
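
Here is a small pandas sketch of these missing-value steps. The sentinel values, the 70% threshold, and the 'Salary' column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("employees.csv")  # placeholder file from the sourcing step

# a) Mark disguised missing values (blank strings, "NA", "XX", "999") as missing.
df = df.replace(["", "NA", "XX", "999"], np.nan)

# c) Delete columns that are mostly empty (here: more than 70% missing values).
df = df.dropna(axis=1, thresh=int(0.3 * len(df)))

# d) Fill the remaining gaps with a simple rule, e.g. the median of the column.
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
```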

4. Standardise Values

Standardising values is the next step and needs to be handled carefully by understanding the dataset. It converts the data into a common format so that users can process and analyze it.

The following are the steps that need to be taken care of to standardize the values:

a) Standardize units: Ensure all variables have a common and consistent unit, e.g. miles/hr to km/hr, etc.

b) Scale values if required: Make sure the observations under a variable share a common scale, and standardise precision for a better presentation of the data, e.g. 3.14159 to 3.14.

c) Remove outliers: Remove high and low values, if required, that would disproportionately affect the results of your analysis.
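
A brief pandas sketch of these standardisation steps follows. The 'Speed (miles/hr)' and 'Distance' columns, and the 1.5 × IQR rule for outliers, are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("trips.csv")  # placeholder file with speed and distance columns

# a) Standardise units: convert miles/hr to km/hr.
df["Speed (km/hr)"] = df["Speed (miles/hr)"] * 1.60934

# b) Standardise precision for a cleaner presentation.
df["Distance"] = df["Distance"].round(2)

# c) Remove outliers using the common 1.5 * IQR rule on the speed column.
q1, q3 = df["Speed (km/hr)"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["Speed (km/hr)"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```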

5. Fixing Invalid Values

Invalid values can cause big trouble: data may be syntactically invalid or semantically invalid, or may appear as junk because of its encoding.

There are many ways in which data can fail to provide proper values; some examples are as follows (a short code sketch of these fixes appears after the list):

  • Fixing incorrect data types: Fixing incorrect data types is also necessary for analysis; for example, numeric data saved as a string like ‘93,500’ needs to be stored as the number 93500.
  • Encode Unicode properly: A dataset may show junk characters; the solution to this type of issue is to read the data with the correct encoding.
  • Fixing incorrect values: As mentioned earlier, data might also have semantic issues. This type of error may be fixed by removing the row or by converting some values of the column.

For example, in Figure C, row 4, column E: a human body temperature cannot be 98.6°C. Either someone entered the wrong data or the unit is wrong; in this example, the value 98.6 suggests that the reading was recorded in Fahrenheit. Data that goes beyond a plausible range can be removed, or it can be converted to the desired unit, as here 98.6°F converts to 37°C.

Figure — C
  • Correct values not in the list: Remove values that don’t belong to a valid list. E.g. in a data set containing blood groups of individuals, the strings “E” or “F” are invalid values and can be removed.
  • Correct wrong structure: Values that don’t follow a defined structure can be removed. For example, in a data set containing pin codes of Indian cities, a 10-digit pin code would be an invalid value and needs to be removed.
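
The sketch below illustrates these fixes with pandas. The column names, the 45°C cut-off for detecting Fahrenheit readings, and the list of valid blood groups are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("records.csv", encoding="utf-8")  # read with the correct encoding

# Fix an incorrect data type: numeric data stored as a string like '93,500'.
df["Salary"] = pd.to_numeric(df["Salary"].astype(str).str.replace(",", ""), errors="coerce")

# Convert semantically invalid temperatures assumed to be in Fahrenheit (as in Figure C).
in_fahrenheit = df["Body Temp"] > 45
df.loc[in_fahrenheit, "Body Temp"] = (df.loc[in_fahrenheit, "Body Temp"] - 32) * 5 / 9

# Remove values that are not in the list of valid blood groups.
valid_groups = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}
df = df[df["Blood Group"].isin(valid_groups)]
```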

6. Filter Data

Filtering data helps to trim the dataset down to only the parts that are relevant to the analysis.

The steps of filtering are as follows (a brief code sketch follows the list):

  • Deduplicate data: Remove duplicate rows or columns.
  • Filter rows: Keep only the rows relevant to the analysis; for example, if the business team needs insights about 2020, all data other than 2020 has to be dropped from the dataset.
  • Filter columns: Keep only the columns that are relevant to the analysis.
  • Aggregate data: Group by the required keys and aggregate the rest.
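
To close, here is a brief pandas sketch of the filtering steps. The 'Year', 'Department', and 'Salary' columns are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("employees.csv")  # placeholder file from the earlier steps

# Deduplicate data: remove duplicate rows.
df = df.drop_duplicates()

# Filter rows: keep only the 2020 data the business team asked about.
df = df[df["Year"] == 2020]

# Filter columns: keep only the columns relevant to the analysis.
df = df[["Department", "Full Name", "Salary"]]

# Aggregate data: group by the required key and aggregate the rest.
summary = df.groupby("Department")["Salary"].agg(["mean", "count"])
print(summary)
```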

This is the end of part one, covering data sourcing, cleaning, fixing, and filtering of the data. I hope the content is sufficient to comprehend the concept and process of EDA. Part 2 will cover data visualization techniques and the rest of the Exploratory Data Analysis topics.
