All you need to know about EDA in Data Science.

Shubham Kava
7 min read · Mar 11, 2023


Exploratory Data Analysis (EDA)

Hello,

Today I'm here to talk about Exploratory Data Analysis (EDA) in Data Science. EDA is a crucial step in the data science process that helps us understand our data better and make informed decisions based on its insights.

EDA is a process of analyzing and visualizing data to understand its structure, patterns, and relationships. It involves identifying the important variables in the dataset, exploring the distribution of the data, and detecting outliers, missing values, special characters that need treatment, and other data quality issues.

In the world of data science, EDA is considered one of the most critical steps in developing a predictive model. It can reveal the characteristics of the data, which helps us determine which machine learning algorithm to use and how to preprocess the data.

Now let's discuss the key steps involved in EDA:

Data Collection: The first step in EDA is to collect the relevant data that we want to analyze. This can involve collecting data from various sources, such as databases, APIs, web scraping, surveys, experiments, observational studies, or even data collected from sensors or devices.

To assess how the data was obtained, we must examine the source of the data as well as the method used to collect it. If the data is time-related, we need to look at the period it covers to determine its relevance to our analysis.

For example, if we are analyzing data on customer behavior, data from 10 years ago may be irrelevant because consumer behavior can change quickly over time. In such circumstances, we may wish to focus on more recent data that better reflects current trends and behaviors. Similarly, if we’re analyzing data on a specific event, such as a natural disaster, we may want to focus on data from the time period surrounding the event, rather than data from a few years before or after.
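As a minimal sketch of this idea, the snippet below keeps only recent records. It assumes a DataFrame with a datetime column named 'order_date' (a hypothetical name):

import pandas as pd

# Hypothetical customer-behavior data with an 'order_date' column.
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2012-05-01', '2022-08-15', '2023-01-20']),
    'amount': [120.0, 75.5, 210.0],
})

# Keep only records from the last two years; older behavior may no
# longer reflect current trends.
cutoff = pd.Timestamp.now() - pd.DateOffset(years=2)
recent = df[df['order_date'] >= cutoff]
print(recent)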

Data Cleaning: The next step is to clean the data by removing duplicates, correcting any spelling errors, and dealing with missing data. Data cleaning is an essential step in EDA, as it helps ensure that our analysis is based on accurate and consistent data.

Check the dataset’s information:

To check the dataset’s structure and datatypes: "df.info()"

Based on this information, we may sometimes have to change a feature’s datatype.

To change the datatype of a column:

To convert to DateTime format: "df['column_name'] = pd.to_datetime(df['column_name'])"

To convert a column to float: "df['column_name'] = df['column_name'].astype(float)"

To convert a column to int: "df['column_name'] = df['column_name'].astype(int)"

To convert a string column to numeric: "df['str_column'] = pd.to_numeric(df['str_column'], errors='coerce')"
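Putting these conversions together, here is a small runnable sketch on made-up data (column names are illustrative):

import pandas as pd

# Toy data where everything was loaded as strings, as often happens with CSVs.
df = pd.DataFrame({
    'signup_date': ['2023-01-05', '2023-02-10'],
    'price': ['19.99', '24.50'],
    'quantity': ['3', '7'],
    'rating': ['4.5', 'n/a'],
})

df['signup_date'] = pd.to_datetime(df['signup_date'])        # string -> datetime64
df['price'] = df['price'].astype(float)                      # string -> float64
df['quantity'] = df['quantity'].astype(int)                  # string -> int64
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')  # unparseable -> NaN

df.info()  # confirm the new dtypes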

Duplicates:

To view duplicate records: "pd.DataFrame(df[df.duplicated()])"

To count duplicate records: "df.duplicated().sum()"

To drop duplicates: "df = df.drop_duplicates()"
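For example, on a toy DataFrame:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3], 'city': ['NY', 'LA', 'LA', 'SF']})

print(df.duplicated().sum())  # number of fully duplicated rows -> 1
print(df[df.duplicated()])    # inspect them before deleting

df = df.drop_duplicates()     # keeps the first occurrence by default
# df.drop_duplicates(subset=['id'], keep='last') would instead dedupe on 'id' only.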

Null records:

To view all records with null values: "pd.DataFrame(df[df.isnull().any(axis=1)])"

To count null values in each column: "df.isnull().sum()"

If a record is missing more than 70 percent of its information, dropping it is usually a good option, as shown in the sketch below.
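One way to apply this rule of thumb is dropna() with a thresh argument, which keeps only rows with a minimum number of non-null values. The sketch below keeps rows that are at least 30 percent populated (the 70 percent cut-off is a convention, not a fixed rule):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [np.nan, np.nan, 9.0],
    'd': [4.0, np.nan, 1.0],
})

# Keep rows with at least 30% non-null values, i.e. drop rows missing
# more than 70% of their fields.
min_non_null = int(np.ceil(0.3 * df.shape[1]))
df = df.dropna(thresh=min_non_null)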

Drop irrelevant columns and records:

To drop rows with missing values: "df.dropna(axis=0, inplace=True)"

To drop columns with missing values: "df.dropna(axis=1, inplace=True)"

To drop rows containing null values (dropna defaults to rows): "df = df.dropna()"

Imputing missing values:

To impute missing values with the mean: "df['column_name'] = df['column_name'].fillna(df['column_name'].mean())"

To impute missing values with the median: "df['column_name'] = df['column_name'].fillna(df['column_name'].median())"

To impute missing values with the mode: "df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])" (mode() returns a Series, since there can be ties, so we take its first element)

To forward-fill missing values: "df = df.ffill()" (the older "df.fillna(method='ffill')" is deprecated in recent pandas)

To backward-fill missing values: "df = df.bfill()"

To interpolate missing values based on the surrounding values: "df = df.interpolate()"
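A small combined example, assuming a numeric 'temp' column and a categorical 'city' column (hypothetical names):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'temp': [21.0, np.nan, 23.0, np.nan, 25.0],
    'city': ['NY', None, 'NY', 'LA', 'LA'],
})

df['temp'] = df['temp'].fillna(df['temp'].mean())     # numeric: mean (or median)
df['city'] = df['city'].fillna(df['city'].mode()[0])  # categorical: most frequent value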

Data Exploration: After cleaning the data, we can start exploring it by visualizing the data through graphs, charts, and other visualizations. This helps us to identify any patterns or relationships between variables in the data.

Check the statistical summary of the data:

For numeric columns only: "df.describe()"

For the whole dataset: "df.describe(include='all')"

To visualize the relationships between variables in a Pandas DataFrame, you can use the “pairplot()” function from the Seaborn library. The “pairplot()” function creates a grid of scatterplots that shows the pairwise relationships between variables in the DataFrame.

To import the libraries: "import matplotlib.pyplot as plt" and "import seaborn as sns"

To plot: "sns.pairplot(df)"

To plot, colored by a category: "sns.pairplot(df, hue='C')"

Here C is any important categorical feature in the DataFrame, such as "Survived" in the Titanic case study. (A feature with fewer categories gives a clearer view.)

Below is an example of what the plots look like, with and without a categorical variable.

[Figure: sns.pairplot(df) and sns.pairplot(df, hue='categorical_variable')]
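For a runnable example, Seaborn ships small demo datasets; the Titanic data matches the case study mentioned above:

import matplotlib.pyplot as plt
import seaborn as sns

# A few numeric columns plus the 'survived' label from Seaborn's demo data.
df = sns.load_dataset('titanic')[['survived', 'age', 'fare', 'pclass']].dropna()

sns.pairplot(df, hue='survived')  # one scatterplot per pair, colored by outcome
plt.show()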

To check the relationships between variables, we can plot a heatmap of their correlations.

For the heatmap: "sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', square=True)" (numeric_only=True avoids errors on non-numeric columns in pandas 2.0+)

For more detail, we can plot a scatterplot for each pair to see how they are correlated.

For highly correlated variables: "sns.scatterplot(x=df['column_1'], y=df['column_2'])"
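The sketch below generates correlated toy data and draws both plots:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    'column_1': x,
    'column_2': 2 * x + rng.normal(scale=0.5, size=100),  # strongly correlated with column_1
    'column_3': rng.normal(size=100),                     # independent noise
})

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', square=True)
plt.show()

sns.scatterplot(x=df['column_1'], y=df['column_2'])  # drill into a correlated pair
plt.show()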

Data Transformation: Sometimes we may need to transform the data to make it suitable for analysis. This can involve scaling or normalizing the data, converting categorical variables to numerical variables, or combining variables to create new features.

Scaling: putting features on a common scale can help improve the performance of certain models.

There are several ways to scale a dataset, such as min-max scaling and standard scaling.

For the min-max scaler, import the library: "from sklearn.preprocessing import MinMaxScaler"

To create a scaler object: "scaler = MinMaxScaler()"

To fit and transform the data: "scaled_data = scaler.fit_transform(df)"

To convert the scaled data back to a DataFrame: "df_scaled = pd.DataFrame(scaled_data, columns=df.columns)"

For the standard scaler, import the library: "from sklearn.preprocessing import StandardScaler"

To create a scaler object: "scaler = StandardScaler()"

To fit and transform the data: "scaled_data = scaler.fit_transform(df)"

To convert the scaled data back to a DataFrame: "df_scaled = pd.DataFrame(scaled_data, columns=df.columns)"
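To see what each scaler actually does, here is a tiny before/after comparison on made-up numbers:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'age': [22, 35, 58], 'income': [30000, 52000, 110000]})

# Min-max rescales each column into [0, 1].
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standard scaling rescales each column to mean 0 and unit variance.
df_standard = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(df_minmax)
print(df_standard)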

Converting categorical to numerical:

Categorical data represents characteristics or attributes, such as gender, race, occupation, color, or religion, that cannot be measured quantitatively.

There are several methods to convert categorical data into numerical data, including:

Label Encoding: In this method, each category is assigned a unique integer value. For example, if we have a categorical variable called “color” with categories “red”, “blue”, and “green”, we can assign them integer values 0, 1, and 2 respectively.

For Label Encoding, import the library: "from sklearn.preprocessing import LabelEncoder"

To encode: "df['New_Col'] = LabelEncoder().fit_transform(df['Cat_col'])"
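A runnable sketch; note that scikit-learn’s LabelEncoder assigns integers in alphabetical order of the categories, so the exact mapping may differ from the 0/1/2 example above:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

# Inspect the learned mapping, e.g. {'blue': 0, 'green': 1, 'red': 2}.
print(dict(zip(le.classes_, range(len(le.classes_)))))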

One-Hot Encoding: A new binary column is created for each category. Each row in the dataset is assigned a value of 1 in the column corresponding to its category and 0 in all other columns. For example, if we have a categorical variable called “color” with categories “red”, “blue”, and “green”, we can create three new binary columns called “color_red”, “color_blue”, and “color_green”.

For One-Hot Encoding: "one_hot_encoded = pd.get_dummies(df['cat_col'], prefix='cat_col')"

Then add the one-hot encoded table to the DataFrame: "df = pd.concat([df, one_hot_encoded], axis=1)"

And drop the original column: "df.drop('cat_col', axis=1, inplace=True)"
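Putting the three one-hot steps together on the "color" example:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green']})

one_hot_encoded = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df, one_hot_encoded], axis=1)
df.drop('color', axis=1, inplace=True)

print(df)  # columns: color_blue, color_green, color_red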

Binary Encoding: The categorical variable is first converted into numerical data using label encoding. Then each integer value is converted into binary code, and the binary digits are used as features. For example, if we have a categorical variable called “color” with categories “red”, “blue”, and “green”, we can first assign them integer values 0, 1, and 2 respectively, and then convert them into binary code: 0 = 00, 1 = 01, 2 = 10. The resulting features would be binary-digit columns such as “color_0” and “color_1”.

To install the library: "pip install category_encoders"

For Binary Encoding, import the library: "import category_encoders as ce"

To encode:

encoder = ce.BinaryEncoder(cols=['Col_nam'])

binary_encoded = encoder.fit_transform(df['Col_nam'])

To add the result to the original dataset: "df = pd.concat([df, binary_encoded], axis=1)"

And drop the original column: "df.drop('Col_nam', axis=1, inplace=True)"
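A complete sketch on the "color" example (category_encoders names the digit columns like color_0, color_1):

import category_encoders as ce  # pip install category_encoders
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

encoder = ce.BinaryEncoder(cols=['color'])
binary_encoded = encoder.fit_transform(df['color'])

df = pd.concat([df.drop('color', axis=1), binary_encoded], axis=1)
print(df)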

Create a new feature: this depends entirely on the dataset and the analysis you want to do.

Here is an example of adding a feature to our dataset. Suppose we have a real-estate dataset: to calculate the age of a house from its built date, subtract the built date from the current date.

To take the difference: "now = pd.Timestamp.now()" and "df['age_of_house'] = ((now - df['built_date']).dt.days / 365.25).astype(int)" (the older ".astype('<m8[Y]')" cast is deprecated in recent pandas)

Data Modeling: Once we have completed the above steps, we can then start building models to predict the outcome based on the data. This involves selecting the appropriate machine learning algorithm and tuning its parameters to achieve the best results.
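As a minimal, hypothetical sketch of this step (using generated data in place of a real cleaned dataset):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for a cleaned, fully numeric dataset with a binary target.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))  # hold-out accuracy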

In conclusion, EDA is a crucial step in data science that helps us understand our data and make informed decisions based on its insights. By following the steps I have outlined, you can perform EDA on your own data and use the insights to build effective predictive models.

Thank you for reading! If you found this post helpful, please share it with your friends and colleagues so they can benefit from it as well! To stay up-to-date with my latest posts, don’t forget to subscribe and follow me on YouTube, LinkedIn, Tableau Public, and GitHub. I would love to connect with you and hear your thoughts on this topic, so let’s keep the conversation going!

#Programming
#Education
#Statistics

#Data Science
#python


Shubham Kava

Enthusiastic about data science and business analysis. Able to solve business problems based on past data and predict business outcomes.