<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by TANMOY on Medium]]></title>
        <description><![CDATA[Stories by TANMOY on Medium]]></description>
        <link>https://medium.com/@paltanmoy48test?source=rss-5ee1ced60108------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*qHi57hefo4yaHIGc</url>
            <title>Stories by TANMOY on Medium</title>
            <link>https://medium.com/@paltanmoy48test?source=rss-5ee1ced60108------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 08 May 2026 00:06:44 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@paltanmoy48test/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[EDA with PYTHON]]></title>
            <link>https://medium.com/@paltanmoy48test/eda-with-python-419d757433a4?source=rss-5ee1ced60108------2</link>
            <guid isPermaLink="false">https://medium.com/p/419d757433a4</guid>
            <dc:creator><![CDATA[TANMOY]]></dc:creator>
            <pubDate>Wed, 11 Jun 2025 03:28:05 GMT</pubDate>
            <atom:updated>2025-06-11T03:28:05.008Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>What is Exploratory Data Analysis?</strong></p><p>Exploratory data analysis (EDA) is the process of understanding a dataset by summarizing its key features, often with visualizations. This stage is crucial, particularly before modeling the data for machine learning applications.</p><h3>1. Importing the required libraries for EDA</h3><pre>import pandas as pd<br>import numpy as np<br>import seaborn as sns                       # data visualisation<br>import matplotlib.pyplot as plt             # data visualisation<br>import warnings as wr<br>wr.filterwarnings(&#39;ignore&#39;)                 # silence library warnings<br># Jupyter magic: render plots inside the notebook<br>%matplotlib inline<br>sns.set(color_codes=True)</pre><h3>2. Loading the data</h3><p>Download the dataset from this <a href="https://media.geeksforgeeks.org/wp-content/uploads/20241227155317577868/Wine_Quality.zip">link</a>.</p><pre># Load the CSV into a pandas DataFrame (df)<br>df = pd.read_csv(&quot;../input/dataset/data.csv&quot;)<br>print(df.head())<br># Display the top 5 rows<br>df.head(5)<br># Display the bottom 5 rows<br>df.tail(5)<br># Check the data type of each column<br>df.dtypes<br># shape is an attribute giving the number of rows (observations) and columns (features), an overview of the dataset&#39;s size and structure.<br>df.shape<br># (1143, 13), i.e. (number of rows, number of columns)<br># info() shows the number of non-null records in each column, each column&#39;s data type, and how much memory the dataset uses.<br>df.info()<br># describe() gives a statistical summary of the DataFrame: count, mean, standard deviation, minimum, quartiles and maximum for each numerical column.
# It helps in summarizing the central tendency and spread of the data.<br>df.describe()  # count, mean, std, min, quartiles (50% is the median) and max<br><br># Convert the column names of the DataFrame into a Python list, making them easy to access and manipulate.<br>df.columns.tolist()</pre><h3>Dropping the duplicate rows</h3><pre># number of rows and columns<br>df.shape<br># rows that duplicate an earlier row<br>duplicate_rows_df = df[df.duplicated()]<br>print(&quot;number of duplicate rows: &quot;, duplicate_rows_df.shape[0])<br>df.count()      # non-null values per column, before dropping duplicates<br>df = df.drop_duplicates()<br>df.head(5)<br>df.count()      # non-null values per column, after dropping duplicates<br># nunique() tells us how many unique values exist in each column, which gives insight into the variety of data in each feature.<br>df.nunique()</pre><h3>Dropping the missing or null values</h3><pre>print(df.isnull().sum())   # missing values per column<br>df = df.dropna()           # drop rows with missing values<br>df.count()<br>print(df.isnull().sum())   # after dropping, every count should be 0</pre><h3>Detecting Outliers</h3><p>An outlier is a data point that differs markedly from the rest of the data; it can be unusually high or unusually low. It is often a good idea to detect outliers and consider removing them, because they can distort summary statistics and make a model less accurate.</p><pre># Box plots of columns from the wine dataset; points beyond the whiskers are potential outliers<br>sns.boxplot(x=df[&#39;alcohol&#39;])<br>plt.show()<br>sns.boxplot(x=df[&#39;quality&#39;])<br>plt.show()<br><br># Interquartile range (IQR) of each numeric column<br>numeric_df = df.select_dtypes(include=&quot;number&quot;)<br>Q1 = numeric_df.quantile(0.25)<br>Q3 = numeric_df.quantile(0.75)<br>IQR = Q3 - Q1<br>print(IQR)<br><br># Keep only rows with no value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]<br>is_outlier = ((numeric_df &lt; (Q1 - 1.5 * IQR)) | (numeric_df &gt; (Q3 + 1.5 * IQR))).any(axis=1)<br>df = df[~is_outlier]<br>df.shape</pre><h3>Univariate Analysis</h3><h3>Univariate data:</h3><p>Univariate data is data in which each observation or data point corresponds to a single variable. 
In other words, it involves the measurement or observation of a single characteristic or attribute for each individual or item in the dataset.</p><ol><li><strong>Bar plot of the number of wines at each quality rating.</strong></li></ol><pre># Number of wines at each quality rating<br>quality_counts = df[&#39;quality&#39;].value_counts()<br><br>plt.figure(figsize=(8, 6))<br>plt.bar(quality_counts.index, quality_counts, color=&#39;deeppink&#39;)<br>plt.title(&#39;Count Plot of Quality&#39;)<br>plt.xlabel(&#39;Quality&#39;)<br>plt.ylabel(&#39;Count&#39;)<br>plt.show()</pre><p><strong>2. </strong><a href="https://www.geeksforgeeks.org/kde-plot-visualization-with-pandas-and-seaborn/"><strong>Kernel density plot</strong></a><strong> for understanding the distribution and skewness of each feature</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*J4PCBTBgWHq7W1EqlmNlNg.png" /></figure><pre>sns.set_style(&quot;darkgrid&quot;)<br><br>numerical_columns = df.select_dtypes(include=[&quot;int64&quot;, &quot;float64&quot;]).columns<br>n_rows = (len(numerical_columns) + 1) // 2   # two plots per row<br><br>plt.figure(figsize=(14, n_rows * 3))<br>for idx, feature in enumerate(numerical_columns, 1):<br>    plt.subplot(n_rows, 2, idx)<br>    sns.histplot(df[feature], kde=True)   # histogram with a KDE curve overlaid<br>    plt.title(f&quot;{feature} | Skewness: {round(df[feature].skew(), 2)}&quot;)<br><br>plt.tight_layout()<br>plt.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/702/1*mmmlj6xptTaDj9aztEZOFA.png" /><figcaption>from GFG (GeeksforGeeks)</figcaption></figure><p><strong>3. 
Swarm plot for showing outliers in the data</strong></p><pre>plt.figure(figsize=(10, 8))<br><br># Each point is one wine; isolated points far from the bulk of a group are potential outliers<br>sns.swarmplot(x=&quot;quality&quot;, y=&quot;alcohol&quot;, data=df, palette=&#39;viridis&#39;)<br><br>plt.title(&#39;Swarm Plot for Quality and Alcohol&#39;)<br>plt.xlabel(&#39;Quality&#39;)<br>plt.ylabel(&#39;Alcohol&#39;)<br>plt.show()</pre><h3>Bivariate Analysis</h3><p>In bivariate analysis, two variables are analyzed together to identify patterns, dependencies or interactions between them. This helps in understanding how changes in one variable relate to changes in another.</p><h4>1. Pair plot for showing pairwise relationships and the distribution of each variable</h4><pre>sns.set_palette(&quot;Pastel1&quot;)<br><br># pairplot creates its own figure, so no plt.figure() call is needed<br>sns.pairplot(df)<br><br># y=1.02 lifts the title above the top row of plots<br>plt.suptitle(&#39;Pair Plot for DataFrame&#39;, y=1.02)<br>plt.show()</pre><p><strong>2. Violin plot for examining the relationship between alcohol and quality.</strong></p><pre># Plot from a copy with quality as a string so the palette keys match; df itself stays numeric<br>plot_df = df.assign(quality=df[&#39;quality&#39;].astype(str))<br><br>plt.figure(figsize=(10, 8))<br><br>sns.violinplot(x=&quot;quality&quot;, y=&quot;alcohol&quot;, data=plot_df, order=sorted(plot_df[&#39;quality&#39;].unique()), palette={<br>               &#39;3&#39;: &#39;lightcoral&#39;, &#39;4&#39;: &#39;lightblue&#39;, &#39;5&#39;: &#39;lightgreen&#39;, &#39;6&#39;: &#39;gold&#39;, &#39;7&#39;: &#39;lightskyblue&#39;, &#39;8&#39;: &#39;lightpink&#39;}, alpha=0.7)<br><br>plt.title(&#39;Violin Plot for Quality and Alcohol&#39;)<br>plt.xlabel(&#39;Quality&#39;)<br>plt.ylabel(&#39;Alcohol&#39;)<br>plt.show()</pre><p>For interpreting the <a href="https://www.geeksforgeeks.org/violin-plot-for-data-analysis/">Violin Plot</a>:</p><ul><li>A wider section indicates higher density, meaning more data points at that value.</li><li>A symmetrical plot indicates a balanced distribution.</li><li>A peak or bulge in the violin plot marks the most common value in the distribution.</li><li>Longer tails indicate greater variability.</li><li>The median line is the middle line inside the violin plot. 
It helps in understanding the central tendency.</li></ul><p><strong>3. Box plot for examining the relationship between alcohol and quality</strong></p><pre>sns.boxplot(x=&#39;quality&#39;, y=&#39;alcohol&#39;, data=df)<br>plt.show()</pre><p>The box represents the <a href="https://www.geeksforgeeks.org/interquartile-range-iqr/">IQR</a>, i.e. the longer the box, the greater the variability.</p><ul><li>The median line in the box shows the central tendency.</li><li><a href="https://www.geeksforgeeks.org/box-and-whisker-plot-meaning-uses-and-example/">Whiskers</a> extend from the box to the smallest and largest values within a specified range.</li><li>Individual points beyond the whiskers represent outliers.</li><li>A compact box shows low variability, while a stretched box shows higher variability.</li></ul><h3>Multivariate Analysis</h3><p>Multivariate analysis examines the interactions between three or more variables in a dataset at the same time. It aims to identify complex patterns and relationships, showing how multiple variables collectively behave and influence each other.</p><pre>plt.figure(figsize=(15, 10))<br><br># Pairwise correlations between the numeric columns only<br>sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=&#39;.2f&#39;, cmap=&#39;Pastel2&#39;, linewidths=2)<br><br>plt.title(&#39;Correlation Heatmap&#39;)<br>plt.show()</pre>]]></content:encoded>
        </item>
    </channel>
</rss>