<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Gayathri Gopalsami on Medium]]></title>
        <description><![CDATA[Stories by Gayathri Gopalsami on Medium]]></description>
        <link>https://medium.com/@gayathri_g21?source=rss-35e2e2a8dc09------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*yjzxsR-bFJjnURkrANqJLg.png</url>
            <title>Stories by Gayathri Gopalsami on Medium</title>
            <link>https://medium.com/@gayathri_g21?source=rss-35e2e2a8dc09------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 27 May 2026 00:56:59 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@gayathri_g21/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Central Limit Theorem: Let’s learn with an example using Python]]></title>
            <link>https://medium.com/@gayathri_g21/central-limit-theorem-lets-learn-with-an-example-using-python-4801b8d1c6b5?source=rss-35e2e2a8dc09------2</link>
            <guid isPermaLink="false">https://medium.com/p/4801b8d1c6b5</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ml-so-good]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Gayathri Gopalsami]]></dc:creator>
            <pubDate>Wed, 02 Feb 2022 17:49:18 GMT</pubDate>
            <atom:updated>2022-02-02T18:00:26.888Z</atom:updated>
            <content:encoded><![CDATA[<p>A Probability Theory</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/626/1*fA9j-SNLn-zMJc4l_ocpkQ.png" /><figcaption>A small piece of information can tell the whole story</figcaption></figure><p>The <strong>Central Limit Theorem (CLT)</strong> states that if we have a large <strong><em>population </em></strong>with a finite variance, which may or may not follow a Gaussian (Normal) Distribution, and we repeatedly take <strong><em>random samples</em></strong> of the same size from it, the distribution of the <strong><em>sample means</em></strong> approaches a Gaussian (Normal) Distribution as the sample size grows. We will try to understand this statement with the help of an example.</p><p>Before we do that, let us first answer the 3 questions below.</p><ol><li><strong><em>What is Gaussian/Normal Distribution?</em></strong></li></ol><p>It is the <strong><em>symmetric bell-shaped</em></strong> curve of a distribution in which data points occur <strong><em>more frequently near the mean</em></strong> (highlighted in grey) and <strong><em>less frequently farther away from the mean</em></strong> (highlighted in blue).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/742/1*MX_4W0BnyXGR9utQD0_lLw.png" /><figcaption>Gaussian/Normal Distribution</figcaption></figure><p><strong><em>2. What are population and population mean?</em></strong></p><p>The “<strong>population</strong>” is the entire dataset of items that share a common characteristic and about which we want to draw statistical conclusions.</p><p>For example: the dataset with the weights of all the fish in the sea.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/593/1*DEyYZrSbMiNnnTVcbHLdmw.png" /><figcaption>Population Depiction</figcaption></figure><p>The “<strong><em>population mean”</em></strong> is the average value calculated on the entire population.</p><p>For example: the average weight of all the fish in the population.</p><p><strong><em>3. 
What are sample and sample mean?</em></strong></p><p>The “<strong><em>sample” </em></strong>is a smaller subset of a population, i.e. data points selected randomly from it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/643/1*iAmeNwhjuKf6J-jblZinag.png" /><figcaption>Sample Depiction</figcaption></figure><p>The “<strong><em>sample mean”</em></strong> is the average value calculated on the sample dataset.</p><p>For example: the average weight of all the fish in the sample.</p><p>Now that we know what a <strong>population</strong>, a <strong>sample, </strong>and a <strong>Gaussian distribution</strong> are, let’s understand the <strong>Central Limit Theorem </strong>with the help of an example dataset.</p><p>The <a href="https://data.world/montgomery-county-of-maryland/da04581d-180f-4008-bb0d-70d830f9e673">dataset</a> used in this example consists of employee salaries in 2019. The base salary data in this dataset will be our population, and we will select random samples from this population to calculate the sample means.</p><p>We will use the <em>pandas, matplotlib, and seaborn</em> modules in Python to load and analyze the data.</p><p>Import the library modules and load the dataset using pandas:</p><pre>import pandas as pd<br>import matplotlib.pyplot as plt<br>import seaborn as sns<br>%matplotlib inline<br>import warnings<br>warnings.filterwarnings(&quot;ignore&quot;)<br>population= pd.read_csv(&#39;employee_salary.csv&#39;)<br>population</pre><p>Below is a snapshot of the dataset, which has 10105 rows and 8 columns.</p><p>To understand the CLT, we will concentrate only on the “Base Salary” column and treat it as the salary of the entire population.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*momC5QmOvKv4ae5nJyYjLg.png" /></figure><pre>population_base_salary = population[&#39;Base Salary&#39;]</pre><p>Let’s calculate the population mean of the Base Salary and plot the histogram to check the 
distribution</p><pre>print(&quot;Population Mean&quot;)<br>print(&quot;---------------&quot;)<br>print(population_base_salary.mean())<br># histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions<br>sns.histplot(population_base_salary, color=&#39;grey&#39;, kde=True, stat=&#39;density&#39;)<br>plt.xlabel(&#39;Population Base Salary&#39;)<br>plt.ylabel(&#39;Probability Density&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/688/1*O0vRfxx0UCELZENQmRTJgg.png" /><figcaption>Population Distribution of Base Salary</figcaption></figure><p><strong><em>The above figure shows that the distribution is slightly asymmetric, i.e. it does not exactly follow the Gaussian Distribution.</em></strong></p><p>Now we will randomly draw a sample of a fixed size from the population and calculate its mean. We repeat this process 500 times to obtain 500 sample means and then plot their distribution.</p><p>To make it easier, we can use the function <strong>calc_sample_mean</strong> below, which takes “sample_size” and “no_of_sample_means” (the number of sample means to be calculated) as input. Each time a sample is randomly selected, the function calculates its mean, and at the end it returns all the sample means in a list.</p><p>If we pass <em>sample_size=2</em> and <em>no_of_sample_means=500</em>, this function will pick 2 random “Base Salary” values from the dataset, calculate the sample mean and store it in a list. 
It will repeat this process to store 500 such sample means and return the list of 500 stored sample means.</p><pre>def calc_sample_mean(sample_size, no_of_sample_means):<br>    # Start from an empty list on every call so results from earlier calls are not mixed in<br>    mean = []<br>    for i in range(no_of_sample_means):<br>        sample_base_salary = population_base_salary.sample(n=sample_size)<br>        sample_mean=sample_base_salary.mean()<br>        mean.append(sample_mean)<br>    return mean</pre><p>Let’s use the function to calculate the sample means and plot their distribution for <strong>sample_size=2</strong></p><pre>mean_2=calc_sample_mean(sample_size=2, no_of_sample_means=500)<br>sns.histplot(mean_2, color=&#39;b&#39;, kde=True, stat=&#39;density&#39;)<br>plt.xlabel(&#39;Sample Base Salary Mean(Sample size =2)&#39;)<br>plt.ylabel(&#39;Probability Density&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/651/1*oIgquwOEDAQs_KTvf4rBsg.png" /><figcaption>Distribution Curve from Sample Mean with Sample Size 2</figcaption></figure><p>There it is !!</p><p><strong><em>We got a roughly bell-shaped curve with slight skewness.</em></strong></p><p>This illustrates the theorem: if we have a large <strong><em>population </em></strong>that may or may not follow a Gaussian (Normal) Distribution and we take <strong><em>random samples </em></strong>from it, the distribution of the <strong><em>sample means</em></strong> moves towards a Gaussian (Normal) Distribution.</p><p>Let us check this by repeating the same process with a larger sample_size.</p><p><strong>sample_size=3</strong></p><pre>mean_3=calc_sample_mean(sample_size=3, no_of_sample_means=500)<br>sns.histplot(mean_3, color=&#39;r&#39;, kde=True, stat=&#39;density&#39;)<br>plt.xlabel(&#39;Sample Base Salary Mean(Sample size =3)&#39;)<br>plt.ylabel(&#39;Probability Density&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/611/1*Q9MlbjWJ1DY1aHstTUS4mQ.png" /><figcaption>Distribution Curve from Sample Mean with Sample Size 3</figcaption></figure><p><strong>sample_size=10</strong></p><pre>mean_10=calc_sample_mean(sample_size=10, 
no_of_sample_means=500)<br>sns.histplot(mean_10, color=&#39;y&#39;, kde=True, stat=&#39;density&#39;)<br>plt.xlabel(&#39;Sample Base Salary Mean(Sample size =10)&#39;)<br>plt.ylabel(&#39;Probability Density&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*-OEXaf4xpTmb6YCF3LU8Vg.png" /><figcaption>Distribution Curve from Sample Mean with Sample Size 10</figcaption></figure><p><strong>sample_size=20</strong></p><pre>mean_20=calc_sample_mean(sample_size=20, no_of_sample_means=500)<br>sns.histplot(mean_20, color=&#39;g&#39;, kde=True, stat=&#39;density&#39;)<br>plt.xlabel(&#39;Sample Base Salary Mean(Sample size =20)&#39;)<br>plt.ylabel(&#39;Probability Density&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/618/1*BG5I4biUZd7j3hqEEzdiZg.png" /><figcaption>Distribution Curve from Sample Mean with Sample Size 20</figcaption></figure><p><strong>sample_size=30</strong></p><pre>mean_30=calc_sample_mean(sample_size=30, no_of_sample_means=500)<br>sns.histplot(mean_30, color=&#39;maroon&#39;, kde=True, stat=&#39;density&#39;)<br>plt.xlabel(&#39;Sample Base Salary Mean(Sample size =30)&#39;)<br>plt.ylabel(&#39;Probability Density&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/1*BUX_9Ch9QSVbbwPJAvnXMw.png" /><figcaption>Distribution Curve from Sample Mean with Sample Size 30</figcaption></figure><p><strong>sample_size&gt;30</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/586/1*3YIoZNu6e3fAuijt70UXxA.png" /><figcaption>Distribution Curve from Sample Mean with Sample Size 40</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/599/1*SAxceeGZgRv8jYAOOmTQUw.png" /><figcaption>Distribution Curve from Sample Mean with Sample Size 100</figcaption></figure><p>As we increased the sample size, the skewness reduced and the curve became narrower and more sharply peaked. 
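</p><p>To put a rough number on that trend, here is a minimal, self-contained sketch. It does not use the salary dataset: as an assumption for illustration, it builds a synthetic right-skewed population with Python’s <em>random</em> module, and the helper names <em>skewness</em> and <em>sample_means</em> are hypothetical. It repeats the sampling process for a few sample sizes and prints the skewness and spread of the resulting sample means.</p><pre>
```python
import random
import statistics


def skewness(values):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)


def sample_means(population, sample_size, no_of_sample_means, rng):
    """Draw random samples of a fixed size and return the list of their means."""
    return [statistics.fmean(rng.sample(population, sample_size))
            for _ in range(no_of_sample_means)]


rng = random.Random(0)
# Assumed stand-in population: exponentially distributed values, strongly right-skewed
population = [rng.expovariate(1.0) for _ in range(10000)]

results = {}
for n in (2, 10, 30, 100):
    means = sample_means(population, n, 2000, rng)
    results[n] = (skewness(means), statistics.pstdev(means))
    print("sample_size =", n,
          "skewness =", round(results[n][0], 2),
          "std =", round(results[n][1], 3))
```
</pre><p>With a population like this one, the skewness of the sample means should move towards 0 and their spread should narrow as sample_size grows, which is the trend seen in the plots above.</p><p>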
Check out the code in <a href="https://github.com/gayathrig21/Medium_Examples/tree/main/Central_Limit_Theorem">GitHub</a>.</p><p>This shows that whatever the shape of our <strong><em>population distribution curve</em></strong> (as long as its variance is finite), the distribution of the <strong><em>sample means approaches a Gaussian (Normal) Distribution as the sample size grows.</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C99s62nT-TXbnwayFegruA.png" /></figure><blockquote>As a rule of thumb, sample sizes <strong>equal to or greater than 30</strong> are often considered sufficient for the approximation to be good, but that is not always necessary. In our example, we can observe that we get a close-to-Gaussian distribution with a sample size of 10 or 20 too.</blockquote><p>So one can ask, what is the practical implication of the Central Limit Theorem?</p><p>Collecting and statistically analysing the data of an entire population is practically impossible in most cases. Thanks to the CLT, sample data can be used to draw conclusions about the overall population, because we know that the means of sufficiently large random samples are approximately normally distributed.</p><p>Happy Learning!!!!</p><p><a href="https://medium.com/mlearning-ai/mlearning-ai-submission-suggestions-b51e2b130bfb">Mlearning.ai Submission Suggestions</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Confusion Matrix: Let’s learn with example]]></title>
            <link>https://medium.com/@gayathri_g21/confusion-matrix-lets-learn-with-example-1e2de1ccdb1c?source=rss-35e2e2a8dc09------2</link>
            <guid isPermaLink="false">https://medium.com/p/1e2de1ccdb1c</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[confusion-matrix]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ml-so-good]]></category>
            <dc:creator><![CDATA[Gayathri Gopalsami]]></dc:creator>
            <pubDate>Sun, 23 Jan 2022 17:33:39 GMT</pubDate>
            <atom:updated>2022-01-24T09:15:50.919Z</atom:updated>
            <content:encoded><![CDATA[<p>Everything about the Confusion Matrix</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*87O-ezPGwliHX7_PnaTCmg.jpeg" /></figure><p>A Confusion Matrix, by definition, is a table that summarizes the performance of a classification algorithm. We will try to understand this statement with an example.</p><p>Let’s take the example of our favorite game of <em>Cricket</em>. There is a Cricket Board (ABC) that forms the teams and organizes and schedules the matches for the tournament. This time the board members take a crazy decision. They announce that <strong>Chuck, a random guy, </strong>is going to be the new umpire for this tournament. Then they add that Chuck has only one job to do as an umpire: <em>to decide if the batsman is “out” or “not-out” (</em>in short, he has to classify between 2 classes, “Not Out” or “Out”).</p><p>All the players, including Chuck, are surprised. Chuck doesn’t know anything about the game. Everyone gets worried and starts questioning the board about this decision. The board then explains that Chuck has to undergo training, where he should watch the previous matches and learn for himself the pattern of when a batsman is declared<strong> “out” or “not-out”</strong>. Chuck is now a little hopeful and agrees to watch all the matches. He keenly observes all the patterns and tries to understand when to decide if a player is “<strong>out</strong>” or “<strong>not out</strong>”. After learning the pattern, he feels that he is ready for the job.</p><p>The tournament begins and Chuck makes his decisions based on the pattern he has observed and learned.</p><p>Now, the board members want to see how Chuck performed with what he has learned from his training. They compare each of Chuck’s decisions with the decision that an experienced umpire would have made. They come up with the data as in the table below. 
In the table, <strong>Not-Out</strong> is the positive scenario, represented as <strong>1</strong>, and <strong>Out </strong>is the negative scenario, represented as <strong>0.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*MNRGIWS3P6VwFM3yJ1M06Q.png" /><figcaption>Experienced Umpire Vs Chuck’s decision</figcaption></figure><p>Out of 10 decisions made by Chuck, 3 were wrong, as highlighted in Red. <em>Not bad for the first time though!!!</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*UlWZjL5ZfkCfhmBXuke_4g.png" /><figcaption>Wrong Decisions By Chuck (Highlighted in Red)</figcaption></figure><p>Let’s check his performance and analyze the implications by counting “Actual and Predicted” values.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*9tVpLLX2wv_LWypWd0LiOA.png" /><figcaption>Experienced Umpire and Chuck’s decisions highlighted with different colors</figcaption></figure><blockquote><em>Actual “Not out” (count the number of green cells): 3</em></blockquote><blockquote><em>Actual “Out” (count the number of orange cells): 7</em></blockquote><blockquote><em>Chuck’s predicted “Not out” (count the number of yellow cells): 4</em></blockquote><blockquote><em>Chuck’s predicted “Out” (count the number of blue cells): 6</em></blockquote><p>Now we can start substituting the numbers in matrix form.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/1*oCqXiBpIyfrIkHnreRY99Q.png" /><figcaption>The initial version of Confusion Matrix</figcaption></figure><p>Let’s check how many “Outs” and “Not-Outs” Chuck predicted correctly. We know that he made 3 wrong decisions. 
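</p><p>The same tally can be sketched in a few lines of Python. The two lists are the decisions used later in this article’s scikit-learn snippet, encoded as 1 for “Not-Out” and 0 for “Out”; the lower-case variable names are only illustrative.</p><pre>
```python
# 1 = "Not-Out", 0 = "Out"
experienced_umpire = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
chuck = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# Pair each actual decision with Chuck's decision for the same ball
pairs = list(zip(experienced_umpire, chuck))

wrong = sum(1 for actual, predicted in pairs if actual != predicted)
correct = len(pairs) - wrong
correct_not_out = pairs.count((1, 1))
correct_out = pairs.count((0, 0))

print("Wrong decisions:", wrong)
print("Correct decisions:", correct)
print("Correct Not-Out:", correct_not_out, "Correct Out:", correct_out)
```
</pre><p>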
This means that he has made 7 correct decisions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/1*Hnz6CmntAhM5_tm8loNmEw.png" /><figcaption>Correct Decisions by Chuck</figcaption></figure><blockquote><em>Correctly predicted “Not out” (count the number of pink cells): 2</em></blockquote><blockquote><em>Correctly predicted “Out” (count the number of aqua green cells): 5</em></blockquote><p>After substituting in our matrix we get:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*GO_pSypWGUw69qsuP8JbHw.png" /><figcaption>The Confusion Matrix with all correctly predicted “Outs” and “Not-Outs”</figcaption></figure><p>Now we are ready to understand what <strong>True Positives and True Negatives</strong> are.</p><p><strong>True Positives (TP)</strong>: Correctly predicted positive values. In our case, the number of correct decisions by Chuck as “Not-out” (Pink Cell)</p><p><strong>True Negatives (TN)</strong>: Correctly predicted negative values. In our case, the number of correct decisions by Chuck as “Out” (Aqua Green Cell)</p><p>Let’s move forward with the wrong predictions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1012/1*oi4433rnTFzieh0sm4AZUQ.png" /><figcaption>Wrong Decisions By Chuck</figcaption></figure><blockquote><em>Wrongly predicted “Out” as “Not-Out” (count the number of red cells): 2</em></blockquote><blockquote><em>Wrongly predicted “Not out” as “Out” (count the number of blue cells): 1</em></blockquote><p>This completes our matrix:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/494/1*TMr0Ulhx8e6UL2Njm7HgYA.png" /><figcaption>The complete confusion matrix of Chuck’s decision</figcaption></figure><p>Here we will try to understand <strong>False Positives and False Negatives</strong></p><p><strong>False Positives (FP)</strong>: Wrongly predicted as positive values. In our case, the number of wrong decisions by Chuck where the player was actually “Out” but Chuck decided he was 
“Not-out” (Red Cell).</p><p><strong>False Negatives (FN)</strong>: Wrongly predicted as negative values. In our case, the number of wrong decisions by Chuck where the player was actually “Not-Out” but Chuck decided he was “Out” (Blue Cell).</p><p>When we do something wrong, we call it an error, and each kind of wrong prediction has its own name:</p><p><strong>Type 1 Errors</strong>: False Positives are called Type 1 Errors</p><p><strong>Type 2 Errors</strong>: False Negatives are called Type 2 Errors</p><p>The figure below shows the Confusion Matrix representing <strong>Chuck’s decision </strong>on the left side and the <strong>generic confusion matrix representation </strong>on the right side:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CAV9nrveQpZ8fnQTvTrDaw.png" /><figcaption>The Generic Confusion Matrix</figcaption></figure><p>Now that we have the confusion matrix, we can calculate Chuck’s performance scores by answering the questions below.</p><p><em>1) How accurate was Chuck in deciding whether a player was “Not-Out” or “Out”?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/972/1*f3SdAFChXKpu_EAG1vrJ3g.png" /><figcaption>Accuracy</figcaption></figure><p>He was accurate 7 out of 10 times. Hence his <strong><em>Accuracy </em></strong><em>score is 7/10 = 0.7 or 70%</em></p><p><strong>Accuracy = (TP+TN) / Total</strong></p><p><em>2) How inaccurate was Chuck in deciding whether a player was “Not-Out” or “Out”?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1003/1*S1wNomTI-LH9tB_Z7_9wkA.png" /><figcaption>Misclassification</figcaption></figure><p>He was inaccurate 3 out of 10 times. 
Hence his <strong><em>Misclassification </em></strong><em>score is 3/10 = 0.3 or 30%</em></p><p><strong>Misclassification = (FP+FN) / Total</strong></p><p><em>3) How many times did Chuck correctly decide that a player is “Not-Out” out of all the “Not-Outs” he declared?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/942/1*y-Mc5hSmRkwut5-Lpp8ZPA.png" /><figcaption>Precision</figcaption></figure><p>Chuck was correct 2 times out of the total 4 times that he decided that a player is “Not-Out”. Hence his <strong>Precision </strong>score is 2/4 = 0.5 or 50%</p><p><strong>Precision = TP / (TP+FP)</strong></p><p>4) <em>How many times did Chuck correctly decide that a player is “Not-Out” out of all the actual “Not-Outs”?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/967/1*N6e8DCmU3FndMNsvJGYaKQ.png" /><figcaption>Recall</figcaption></figure><p>Chuck was correct 2 times out of a total of 3 times that a player was actually “Not-Out”. Hence his <strong>Sensitivity/Recall </strong>score is 2/3 ≈ 0.67 or 67%</p><p><strong>Recall = TP / (TP+FN)</strong></p><p>5) <em>How many times did Chuck correctly decide that a player is “Out” out of all the actual “Outs”?</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/947/1*IyM09aYXxsIlG-rco7UYHg.png" /><figcaption>Specificity</figcaption></figure><p>Chuck correctly decided 5 times out of a total of 7 times that a player was actually “Out”. 
Hence his <strong>Specificity </strong>score is 5/7 ≈ 0.71 or 71%</p><p><strong>Specificity = TN / (FP+TN)</strong></p><p>Let us use Python <a href="https://scikit-learn.org/stable/index.html">scikit-learn</a> to generate a “confusion matrix” and calculate performance scores:</p><pre>#Code snippet used for this example :</pre><pre>from sklearn.metrics import accuracy_score ,confusion_matrix ,precision_score , recall_score , ConfusionMatrixDisplay<br>import matplotlib.pyplot as plt</pre><pre>#Data collected by the board (1 = Not-out, 0 = Out)<br>Experienced_umpire = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]<br>Chuck = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]</pre><pre>#Plotting Confusion Matrix<br>fig, ax = plt.subplots(1,1,figsize=(7,4))<br>#Rows follow the first argument (Chuck) and columns the second (experienced umpire), to match the matrix built by hand above<br>results = confusion_matrix(Chuck, Experienced_umpire , labels=[1,0])<br>cm_display = ConfusionMatrixDisplay(results, display_labels=[&#39;Not-out&#39;,&#39;Out&#39;]).plot(values_format=&quot;.0f&quot;,ax=ax)<br>ax.set_xlabel(&quot;Experienced Umpire&#39;s Decision&quot;)<br>ax.set_ylabel(&quot;Chuck&#39;s Decision&quot;)<br>plt.show()</pre><pre>#Printing Performance Metrics (actual values first, predicted values second)<br>print(&quot;Accuracy Score  : &quot;, accuracy_score(Experienced_umpire, Chuck))<br>print(&quot;Precision Score : &quot;, precision_score(Experienced_umpire, Chuck))<br>print(&quot;Recall Score    : &quot;, recall_score(Experienced_umpire, Chuck))<br>#Specificity is the recall of the negative class (Out = 0)<br>print(&quot;Specificity     : &quot;, recall_score(Experienced_umpire, Chuck, pos_label=0))</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/995/1*aPqdWVVCc0mf8Td7ViSw7w.png" /><figcaption>Chuck’s decision confusion matrix in Python</figcaption></figure><p>The scores have been calculated. What should Chuck do now? He should try to improve his performance. The 3 wrong predictions that he made are going to affect the teams very badly. He should try to <strong>increase </strong>the <strong>Accuracy</strong>. Accuracy will increase by <strong>increasing </strong>either the <strong>True Positives or True Negatives. 
</strong>This means that either <strong>False Positives or False Negatives </strong>should be <strong>reduced.</strong></p><p><em>Reducing False Positives will increase Precision and reducing False Negatives will increase Recall</em><strong><em>. </em></strong><em>Hence, Chuck needs to focus on increasing either his Recall or his Precision Score.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/625/1*y-T1k-hF2KvW_5CefqCWXQ.png" /></figure><p>In practice, reducing both False Positives and False Negatives at the same time might NOT be possible. In those cases, we need to analyze whether it is more important to reduce False Positives or False Negatives, i.e.</p><p><em>whether Chuck declaring a “Not-Out” as “Out” should be reduced, or vice versa</em></p><p>In our scenario, reducing both might be important. But Chuck declaring a “Not-Out” as “Out” could have more impact. <em>So, Chuck should focus more on reducing False Negatives, i.e. increasing his Recall Score.</em></p><p><strong><em>Jargon at a glance:</em></strong></p><p><strong>True Positives (TP)</strong>: Correctly predicted positive values.</p><p><strong>True Negatives (TN)</strong>: Correctly predicted negative values.</p><p><strong>False Positives (FP)</strong>: Wrongly predicted positive values.</p><p><strong>False Negatives (FN)</strong>: Wrongly predicted negative values.</p><p><strong>Accuracy = (TP+TN) / Total</strong></p><p><strong>Misclassification = (FP+FN) / Total</strong></p><p><strong>Precision = TP / (TP+FP)</strong></p><p><strong>Sensitivity/Recall = TP / (TP+FN)</strong></p><p><strong>Specificity = TN / (FP+TN)</strong></p><p>Here we go…. 
that’s the explanation for the statement -</p><p><em>“confusion matrix is a table that summarizes the performance of classification algorithm”.</em></p><p>Happy Learning!!!!!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reusable Python Functions in my repo to quickly develop any Machine Learning Models]]></title>
            <link>https://medium.com/@gayathri_g21/reusable-python-functions-in-my-repo-to-quickly-develop-any-machine-learning-models-7b9a3db0aef3?source=rss-35e2e2a8dc09------2</link>
            <guid isPermaLink="false">https://medium.com/p/7b9a3db0aef3</guid>
            <category><![CDATA[ml-so-good]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[sklearn]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Gayathri Gopalsami]]></dc:creator>
            <pubDate>Tue, 18 Jan 2022 04:28:35 GMT</pubDate>
            <atom:updated>2022-02-05T07:35:07.800Z</atom:updated>
            <content:encoded><![CDATA[<p>Build once use many</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*vxhk6aPz0806HAarOIAqqw.png" /><figcaption>Build Once Use Many</figcaption></figure><p>While executing any end-to-end data science project, a <em>data science professional or a student </em>has to focus mainly on problem definition, data collection, data investigation, cleansing, statistical and visual analysis, feature engineering, decision making, and model building. In the entire <em>Lifecycle of a Data Science</em> project, one step where one should not invest much time is writing the code for building ML models.</p><p>This article assumes that the reader has a good understanding of ML models and how to build/implement them using the “scikit-learn (<a href="https://scikit-learn.org/stable/"><em>sklearn</em></a><em>)</em>” library in Python. It is about making our code reusable so that we can develop any model without spending much time on coding and without writing the same code repeatedly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/225/1*g_gq3URorsYZpVZlbRRl8A.png" /><figcaption>Always try to write reusable code wherever possible</figcaption></figure><p>After a long and necessary <em>process of </em><a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis#:~:text=In+statistics%2C+exploratory+data+analysis,and+other+data+visualization+methods."><em>Exploratory Data Analysis</em></a><em> on a given dataset</em>, we split the dataset (<em>X: independent variables, y: target variable) </em>into training and test sets using <strong>train_test_split. 
</strong>Now we have 4 variables returned by the function: <em>X_train, X_test, y_train and y_test.</em></p><pre>from sklearn.model_selection import train_test_split</pre><pre>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)</pre><pre>print(X_train.shape)<br>print(y_train.shape)</pre><pre>print(X_test.shape)<br>print(y_test.shape)</pre><p><strong>Note</strong>: All the <em>feature engineering and feature selection</em> processes should be performed before the steps given below.</p><p>To implement any ML model, I keep the functions below in my code repository and reuse them to develop any <em>classifier </em>or <em>regressor </em>model.</p><h4><strong><em>Code Snippet 1</em></strong><em>:</em></h4><p><strong><em>Initialize the data frames to store and compare the performance metrics of the models.</em></strong></p><p>I use the code snippet below to store the model performance scores in a data frame, so that the different models can be compared after the predictions.</p><pre><strong><em> # For Classifier</em></strong></pre><pre>import pandas as pd<br>import numpy as np</pre><pre>#This dataframe stores the scores from classifier models<br>df_model=pd.DataFrame(columns=[&#39;Model&#39;,&#39;Accuracy Score&#39; ,&#39;F1 Score&#39;, &#39;Precision Score&#39; , &#39;Recall Score&#39; ,&#39;ROC AUC&#39;])<br><strong>df_model_performance </strong>=df_model</pre><pre>#This dataframe stores the train and test accuracy from classifier models to compare at the end of the model building. 
This can also be further modified to compare the other scores such as F1 score etc<br>df_model_test_train_acc = pd.DataFrame(columns=[&#39;Model&#39; , &#39;Train Accuracy Score&#39; ,&#39;Test Accuracy Score&#39;])<br><strong>df_model_accuracy </strong>=df_model_test_train_acc</pre><pre><strong><em># For Regressor</em></strong></pre><pre>import pandas as pd<br>import numpy as np</pre><pre>#This dataframe stores the scores from regressor models<br>df_model=pd.DataFrame(columns=[&#39;Model&#39;, &#39;MAE&#39; ,&#39;RMSE&#39;, &#39;R2 Score&#39; , &#39;Adjusted R2 Score&#39;])<br><strong>df_model_performance </strong>=df_model</pre><pre>#This data frame stores the train and test &quot;adjusted R2 scores&quot; from regressor models to compare at the end of the model building. This can also be further modified to compare the other scores such as MSE , RMSE  etc<br>df_model_test_train_r2 = pd.DataFrame(columns=[&#39;Model&#39; , &#39;Train Adjusted R2 Score&#39; ,&#39;Test Adjusted R2 Score&#39;])<br><strong>df_model_r2 </strong>=df_model_test_train_r2</pre><h4>Code Snippet <em>2 :</em></h4><p><strong><em>Function to obtain the best model by performing hyperparameter tuning using </em></strong><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html"><strong><em>GridSearchCV </em></strong></a><strong><em>.</em></strong></p><p>I have defined a function “<strong>get_best_hyperparameters” </strong>which does the hyperparameter tuning using GridSearchCV. It takes a classifier or regressor model, the parameter grid to search, the number of cross-validation folds, and the training data as input. This function returns the best model, which can then be used to fit and predict. 
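</p><p>To give a feel for the inputs, here is a small plain-Python sketch of how a parameter grid expands into candidates. It does not call scikit-learn; the grid values and the name <em>params</em> are assumptions chosen for illustration. The arithmetic mirrors what an exhaustive grid search has to fit: every combination of values, once per cross-validation fold.</p><pre>
```python
from itertools import product

# Hypothetical grid in the dict-of-lists shape that GridSearchCV accepts as param_grid
params = {"max_depth": [3, 5, None], "n_estimators": [100, 200]}
cv_value = 5  # number of cross-validation folds

# Every combination of the listed values is one candidate model
names = sorted(params)
candidates = [dict(zip(names, values))
              for values in product(*(params[name] for name in names))]

total_fits = len(candidates) * cv_value
print("Candidates:", len(candidates))
print("Total fits:", total_fits)
for candidate in candidates:
    print(candidate)
```
</pre><p>The number of fits is the product of the grid size and the fold count, so the cost grows quickly as more values are added to the grid.</p><p>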
This step can be skipped if one just wants to build a basic model without performing any hyperparameter tuning.</p><pre><strong><em># For both Classifier and Regressor</em></strong></pre><pre>from sklearn.model_selection import GridSearchCV <br><strong>def get_best_hyperparameters</strong>(model, params, cv_value , X_train, y_train ): <br>    search = GridSearchCV(estimator=model, param_grid=params, n_jobs=-1, verbose=1,cv=cv_value) <br>    search.fit(X_train, y_train)  <br>    # best_score_ is the mean cross-validated score of the best candidate (accuracy for classifiers, R2 for regressors by default)<br>    print(&quot;Best CV Score    :&quot;,  search.best_score_) <br>    print(&quot;Best Parameters  : &quot;, search.best_params_)<br>    print(&quot;Best Estimator  : &quot;,  search.best_estimator_)  <br>    best_grid = search.best_estimator_<br>    <strong>return </strong>best_grid</pre><h4>Code Snippet 3:</h4><p><strong><em>Function to fit and predict the model:</em></strong></p><p>This function (for classifier and regressor) <strong>get_classifier_predictions / get_regressor_predictions </strong>takes in the model as input, fits it on the training data, and returns the predicted train and test results. 
For a classifier, it also returns the predicted train and test probabilities.</p><pre><strong><em>#For Classifier</em></strong></pre><pre><strong>def get_classifier_predictions</strong>(classifier, X_train, y_train, X_test): <br>    classifier.fit(X_train,y_train)<br>    y_pred_train =classifier.predict(X_train)<br>    y_pred_test = classifier.predict(X_test)<br>    # predict_proba returns one probability column per class<br>    y_pred_prob_train = classifier.predict_proba(X_train)<br>    y_pred_prob_test = classifier.predict_proba(X_test)<br>    <strong>return </strong>y_pred_train, y_pred_test, y_pred_prob_train,y_pred_prob_test</pre><pre><strong><em>#For Regressor</em></strong></pre><pre><strong>def get_regressor_predictions</strong>(regressor, X_train, y_train, X_test):  <br>    regressor.fit(X_train,y_train)<br>    y_pred_train =regressor.predict(X_train)<br>    y_pred_test = regressor.predict(X_test)<br>    <strong>return </strong>y_pred_train, y_pred_test</pre><h4>Code Snippet 4:</h4><p><strong><em>Function to calculate and print the performance metrics of train and test dataset</em></strong></p><p>The function <strong>print_classifier_scores / print_regressor_scores </strong>calculates and prints all the performance metrics for a classification / regression algorithm respectively, and returns them as dictionaries that can be appended to the comparison dataframes.</p><pre><strong><em># For Classifier</em></strong></pre><pre>from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, ConfusionMatrixDisplay, roc_auc_score<br>import matplotlib.pyplot as plt                                     # Importing pyplot interface to use matplotlib<br>%matplotlib inline</pre><pre><strong>def print_classifier_scores</strong>(classifier, X_train, X_test, y_train ,y_test,y_pred_train, y_pred_test,y_pred_prob_train, y_pred_prob_test,algorithm):<br>    # store classifier scores for Training Dataset<br>    v_recall_score_train =  recall_score(y_train,y_pred_train)<br>    v_precision_score_train = precision_score(y_train,y_pred_train)<br>    
v_f1_score_train =  f1_score(y_train,y_pred_train)<br>    v_accuracy_score_train = accuracy_score(y_train,y_pred_train)<br>    # column 1 holds the probability of the positive class (binary target assumed)<br>    v_roc_auc_train = roc_auc_score(y_train, y_pred_prob_train[:,1])<br>    <br>    # print classifier scores for Training Dataset<br>    print(&#39;Train-Set Confusion Matrix:\n&#39;, confusion_matrix(y_train,y_pred_train)) <br>    print(&quot;Recall Score    : &quot;, v_recall_score_train)<br>    print(&quot;Precision Score : &quot;, v_precision_score_train)<br>    print(&quot;F1 Score        : &quot;, v_f1_score_train)<br>    print(&quot;Accuracy Score  : &quot;, v_accuracy_score_train)<br>    print(&quot;ROC AUC         :  {}&quot;.format(v_roc_auc_train))<br>    print(&quot;Predict Probability  :&quot; , y_pred_prob_train)<br>    # ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2; axis labels come from the class labels<br>    ConfusionMatrixDisplay.from_estimator(classifier, X_train, y_train)<br>    plt.grid(False)<br>    # store classifier scores for Testing Dataset <br>   <br>    v_recall_score_test =  recall_score(y_test,y_pred_test)<br>    v_precision_score_test = precision_score(y_test,y_pred_test)<br>    v_f1_score_test =  f1_score(y_test,y_pred_test)<br>    v_accuracy_score_test = accuracy_score(y_test,y_pred_test)<br>    v_roc_auc_test = roc_auc_score(y_test, y_pred_prob_test[:,1])<br>    # print classifier scores for Testing Dataset    <br>    print(&#39;Test-Set Confusion Matrix:\n&#39;, confusion_matrix(y_test,y_pred_test)) <br>    print(&quot;Recall Score    : &quot;, v_recall_score_test)<br>    print(&quot;Precision Score : &quot;, v_precision_score_test)<br>    print(&quot;F1 Score        : &quot;, v_f1_score_test)<br>    print(&quot;Accuracy Score  : &quot;, v_accuracy_score_test)<br>    print(&quot;ROC AUC         :  {}&quot;.format(v_roc_auc_test))<br>    print(&quot;Predict Probability  :&quot; , y_pred_prob_test)<br>    ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test)<br>    plt.grid(False)<br># store to append the results in dataframe for 
final comparison of performance <br>    df_model_test_train_acc = dict({&#39;Model&#39; : algorithm, &#39;Train Accuracy Score&#39; :v_accuracy_score_train,&#39;Test Accuracy Score&#39; :v_accuracy_score_test })<br>    df_model_performance = dict({&#39;Model&#39; : algorithm, &#39;Accuracy Score&#39; :v_accuracy_score_test, &#39;F1 Score&#39; : v_f1_score_test, &#39;Precision Score&#39; : v_precision_score_test, &#39;Recall Score&#39; :v_recall_score_test, &#39;ROC AUC&#39; : v_roc_auc_test})<br>    <br>    <strong>return </strong>df_model_test_train_acc , df_model_performance</pre><pre><strong># For regressor </strong></pre><pre>from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score<br><strong>def print_regressor_scores</strong>(regressor, X_train, X_test, y_train ,y_test,y_pred_train, y_pred_test,algorithm):<br>    <br>    # store regressor scores for Training Dataset<br>    MAE_train = mean_absolute_error(y_train, y_pred_train)<br>    RMSE_train = np.sqrt( mean_squared_error(y_train, y_pred_train))<br>    r2_score_train = r2_score(y_train, y_pred_train)<br>    # Calculating Adjusted R2 for training set<br>    SS_Residual_train = sum((y_train-y_pred_train)**2)<br>    SS_Total_train = sum((y_train-np.mean(y_train))**2)<br>    r_squared_train = 1 - (float(SS_Residual_train))/SS_Total_train<br>    adj_r_sq_train = 1 - (1-r_squared_train)*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1)<br>    <br>    # print regressor scores for Training Dataset<br>    print(&#39;MAE for training set is {}&#39;.format(MAE_train))<br>    print(&#39;RMSE for training set is {}&#39;.format(RMSE_train))<br>    print(&#39;R squared score for training set is {}&#39;.format(r2_score_train))<br>    print(&#39;Adjusted R squared score for training set is {}&#39;.format(adj_r_sq_train))<br>    <br>    # store regressor scores for Test Dataset<br>    MAE_test = mean_absolute_error(y_test, y_pred_test)<br>    RMSE_test = np.sqrt(mean_squared_error(y_test, 
y_pred_test))<br>    r2_score_test = r2_score(y_test, y_pred_test)<br>    # Calculating Adjusted R2 for test set<br>    SS_Residual_test = sum((y_test-y_pred_test)**2)<br>    SS_Total_test = sum((y_test-np.mean(y_test))**2)<br>    r_squared_test = 1 - (float(SS_Residual_test))/SS_Total_test<br>    adj_r_sq_test = 1 - (1-r_squared_test)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)<br>    <br>    # print regressor scores for Test Dataset <br>    print(&#39;MAE for test set is {}&#39;.format(MAE_test))<br>    print(&#39;RMSE for test set is {}&#39;.format(RMSE_test))<br>    print(&#39;R squared score for test set is {}&#39;.format(r2_score_test))<br>    print(&#39;Adjusted R squared score for test set is {}&#39;.format(adj_r_sq_test))<br>    <br>    # store to append the results in dataframe for final comparison of performance<br>    df_model_test_train_r2= dict({&#39;Model&#39; : algorithm, &#39;Train Adjusted R2 Score&#39; :adj_r_sq_train,&#39;Test Adjusted R2 Score&#39; :adj_r_sq_test })<br>    df_model_performance = dict({&#39;Model&#39; : algorithm, &#39;MAE&#39; : MAE_test, &#39;RMSE&#39; : RMSE_test, &#39;R2 Score&#39; : r2_score_test, &#39;Adjusted R2 Score&#39; :adj_r_sq_test})<br>    <strong>return</strong> df_model_test_train_r2 , df_model_performance</pre><p>There it is!</p><p>Now I can develop any ML model, make predictions, calculate the scores and compare model performance just by passing the right model and parameters to the functions above.</p><h4>Classifier Example:</h4><p>The example below shows how to use these functions to build a <strong>Logistic Regression Model</strong> (GitHub link <a href="https://github.com/gayathrig21/Medium_Examples/tree/main/ML_Code_Repo_Article">here</a>):</p><ol><li>Set up the parameters for hyperparameter tuning and pass the initialized model to the function <strong>get_best_hyperparameters </strong>to obtain the best grid. 
This step is optional; an empty parameter dictionary can be passed instead, in which case the model is evaluated with its default settings.</li></ol><pre>from sklearn.linear_model import LogisticRegression<br>logreg_params = {&#39;penalty&#39; : [&#39;l2&#39;],<br>                 &#39;C&#39; : np.logspace(-1, 2, 100),<br>                 &#39;solver&#39; :[&#39;liblinear&#39;],<br>                 &#39;random_state&#39; :[42,99]<br>                 }<br>lr_best_grid= <strong>get_best_hyperparameters</strong>(LogisticRegression(), logreg_params, 5, X_train, y_train)</pre><p>2. Pass the best model to the function <strong>get_classifier_predictions </strong>to get the predicted results and probabilities.</p><pre>y_pred_train, y_pred_test, y_pred_prob_train, y_pred_prob_test = <strong>get_classifier_predictions</strong>(lr_best_grid, X_train, y_train, X_test )</pre><p>3. Pass the predicted results to the function <strong>print_classifier_scores </strong>to calculate the performance metric scores and print the results.</p><pre>df_model_test_train_acc1, df_model_performance1=<strong>print_classifier_scores</strong>(lr_best_grid, X_train, X_test, y_train , y_test, y_pred_train, y_pred_test, y_pred_prob_train, y_pred_prob_test , &#39;Logistic Regression&#39;)</pre><p>4. Append the results to the dataframes to compare the performance of all the models built.</p><pre># DataFrame.append was removed in pandas 2.0, so each result dictionary is concatenated as a one-row dataframe<br>df_model = pd.concat([df_model, pd.DataFrame([df_model_performance1])], ignore_index=True)<br>df_model_test_train_acc = pd.concat([df_model_test_train_acc, pd.DataFrame([df_model_test_train_acc1])], ignore_index=True)</pre><h4>Regressor Example:</h4><p>The example below shows how to use these functions to build a <strong>Linear Regression Model</strong> (GitHub link <a href="https://github.com/gayathrig21/Medium_Examples/tree/main/ML_Code_Repo_Article">here</a>):</p><ol><li>Set up the parameters for hyperparameter tuning and pass the initialized model to the function <strong>get_best_hyperparameters </strong>to obtain the best grid. 
This step is optional; an empty parameter dictionary can be passed instead, in which case the model is evaluated with its default settings.</li></ol><pre>from sklearn.linear_model import LinearRegression<br>parameters = {&#39;fit_intercept&#39;:[True,False],  &#39;copy_X&#39;:[True, False]}<br>lr_best_grid= <strong>get_best_hyperparameters</strong>(LinearRegression(), parameters, 5, X_train, y_train)</pre><p>2. Pass the best model to the function <strong>get_regressor_predictions </strong>to get the predicted results.</p><pre>y_pred_train, y_pred_test = <strong>get_regressor_predictions</strong>(lr_best_grid, X_train, y_train, X_test )</pre><p>3. Pass the predicted results to the function <strong>print_regressor_scores </strong>to calculate the performance metric scores and print the results.</p><pre>df_model_test_train_r2_1, df_model_performance1=<strong>print_regressor_scores</strong>(lr_best_grid, X_train, X_test, y_train , y_test, y_pred_train, y_pred_test , &#39;Linear Regression&#39;)</pre><p>4. Append the results to the dataframes to compare the performance of all the models built.</p><pre># DataFrame.append was removed in pandas 2.0, so each result dictionary is concatenated as a one-row dataframe<br>df_model = pd.concat([df_model, pd.DataFrame([df_model_performance1])], ignore_index=True)<br>df_model_r2 = pd.concat([df_model_r2, pd.DataFrame([df_model_test_train_r2_1])], ignore_index=True)</pre><p>Now we can use these functions to develop any model by passing the model-specific parameters, as done for <em>Linear Regression</em> and <em>Logistic Regression</em> in the examples above.</p><p><em>Yay….Happy model building !!!</em></p><p><a href="https://medium.com/mlearning-ai/mlearning-ai-submission-suggestions-b51e2b130bfb">Mlearning.ai Submission Suggestions</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Six Basic Features about any ML Algorithm a Data Scientist should definitely know ..]]></title>
            <link>https://medium.com/@gayathri_g21/six-basic-features-about-any-ml-algorithm-a-data-scientist-should-definitely-know-777d3d73e28b?source=rss-35e2e2a8dc09------2</link>
            <guid isPermaLink="false">https://medium.com/p/777d3d73e28b</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[ml-so-good]]></category>
            <category><![CDATA[data-scientist]]></category>
            <dc:creator><![CDATA[Gayathri Gopalsami]]></dc:creator>
            <pubDate>Tue, 11 Jan 2022 09:32:02 GMT</pubDate>
            <atom:updated>2022-02-04T07:28:38.901Z</atom:updated>
            <content:encoded><![CDATA[<h3>Six Basic Features about any ML Algorithm a Data Scientist should definitely know ..</h3><blockquote>Any algorithm, let alone an ML algorithm, has its own <em>purpose</em> and <em>specialized features</em>. That’s what makes an algorithm unique in its own way!</blockquote><blockquote>An improperly implemented ML algorithm on a dataset can lead to disaster no matter how advanced and powerful it may be.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xEOLl2-qU5fQCzkzEybNag.jpeg" /></figure><blockquote>The result of “<em>blind coding</em>” is always disappointing. <em>A proper and in-depth understanding of any ML algorithm is very much necessary. Even without that depth, knowing these basics before implementing will save the day…</em></blockquote><h3>1. Know the “Assumptions”</h3><p>Let’s take the simple example of “<em>Linear Regression</em>”, which assumes a linear relationship between the features and the target. If this basic assumption does not hold for your dataset, the algorithm might fail.</p><p>An ML algorithm might or might not be based on assumptions. It is very important for us to know and understand them properly. Verify all the assumptions before implementing an algorithm on the dataset. If the required conditions are not met, your algorithm just might not work!!</p><h3>2. Know the “Pros and Cons”</h3><p>One of the advantages of “<em>Support Vector Machines</em>” is that they work really well on high-dimensional data. If you have a dataset with more dimensions than samples, then SVM might be your answer.</p><p>Every algorithm has its pros and cons. This should be one of the basic factors in deciding when to use an algorithm and when not to. Know all of them before you start programming with any ML algorithm on your dataset.</p><h3>3. 
Know if “Missing Data Handling” is required:</h3><p>The “<em>K-Nearest Neighbors (KNN)</em>” algorithm cannot work when data is missing. For this kind of algorithm, missing values need to be imputed before the model is built.</p><p>Data is never clean. Handling missing data is a very important step of the “EDA” process before building ML models and is always recommended. Some algorithms handle missing values on their own. Whether missing data needs to be handled should therefore be an informed decision based on the ML algorithm we are going to use.</p><h3>4. Know if “Feature Scaling” is required:</h3><p>There is no need to scale or normalize data before building your model with “<em>XGBoost</em>”.</p><p>Not all algorithms require feature scaling. Distance-based algorithms, which are affected by the range of the features, do require it. Identify the algorithms that require feature scaling and apply it only for those.</p><h3>5. Know “Outliers” impact:</h3><p>Tree-based algorithms like “Random Forest” are generally robust to outliers, so outliers usually do not need special handling when building models with such algorithms.</p><p>Outliers can mislead an algorithm, which can hurt performance and lead to poor results. Understanding the impact of outliers on the ML algorithm is a must.</p><h3>6. Know the type of “Problem Statement” the ML algorithm can solve</h3><p>For problems such as sentiment analysis and text classification, “Naïve Bayes” tends to be a good solution.</p><p>Most algorithms are built with a specific purpose. Before implementing, understand the problem statement and be thorough with the dataset. 
Handpick an algorithm that specializes in solving that kind of problem statement and start implementing it.</p><p><a href="https://medium.com/mlearning-ai/mlearning-ai-submission-suggestions-b51e2b130bfb">Mlearning.ai Submission Suggestions</a></p>]]></content:encoded>
        </item>
    </channel>
</rss>