<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by narcis teodoroiu on Medium]]></title>
        <description><![CDATA[Stories by narcis teodoroiu on Medium]]></description>
        <link>https://medium.com/@narcis.teodoroiu?source=rss-b66498019f16------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*_cjnTjabSb-o0s5v9wrGbg.jpeg</url>
            <title>Stories by narcis teodoroiu on Medium</title>
            <link>https://medium.com/@narcis.teodoroiu?source=rss-b66498019f16------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 24 May 2026 02:12:09 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@narcis.teodoroiu/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Regression: Goodness-of-Fit Measures]]></title>
            <link>https://medium.com/@narcis.teodoroiu/regression-goodness-of-fit-measures-6b31fa7d7cbd?source=rss-b66498019f16------2</link>
            <guid isPermaLink="false">https://medium.com/p/6b31fa7d7cbd</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[measurement]]></category>
            <category><![CDATA[regression-metric]]></category>
            <category><![CDATA[regression]]></category>
            <dc:creator><![CDATA[narcis teodoroiu]]></dc:creator>
            <pubDate>Wed, 29 Sep 2021 14:36:21 GMT</pubDate>
            <atom:updated>2021-09-30T10:17:59.294Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>We technically <em>can</em> inspect all of the residuals to judge the model’s accuracy, but unsurprisingly, this does not scale if we have thousands or millions of data points. Thus, statisticians have developed summary measurements that take our collection of residuals and condense them into a <em>single</em> value that represents the predictive ability of our model.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OW4jsdnYRcCqis_p4MNTCQ.jpeg" /><figcaption>Source: CHUTTERSNAP from Unsplash</figcaption></figure><p>None of the following measurements is completely sufficient on its own, so <strong>to measure our model we will need several measurements</strong>. Although there are many more, we will focus on the following:</p><p><strong>Index</strong></p><ol><li><strong>Standard deviation</strong></li><li><strong>SSR</strong></li><li><strong>SSE</strong></li><li><strong>SST</strong></li><li><strong>R²</strong></li><li><strong>Adjusted-R²</strong></li><li><strong>MAE</strong></li><li><strong>MSE</strong></li><li><strong>RMSE</strong></li><li><strong>MAPE</strong></li><li><strong>MPE</strong></li><li><strong>WMAPE</strong></li></ol><h3>Standard Deviation (<a href="https://en.wikipedia.org/wiki/Sigma"><strong>σ</strong></a>)</h3><p>In statistics, the <strong>standard deviation</strong> is a measure of the amount of variation or dispersion of a set of values. 
A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.</p><p><strong>High standard deviation</strong> → data are widely spread (<strong>less reliable</strong>)</p><p><strong>Low standard deviation</strong> → data are clustered closely around the mean (<strong>more reliable</strong>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/306/1*K0wEW-QpPqPIl9S_rl2TOg.png" /><figcaption>Source: <a href="https://vitalflux.com/standard-deviation-sample-population-python-code/">https://vitalflux.com/standard-deviation-sample-population-python-code/</a></figcaption></figure><h3>SSE (Sum of squared errors, or residuals)</h3><p>The error (residual) is the <strong>difference </strong>between the <strong><em>observed</em> value (yi)</strong> and the <strong><em>predicted</em> value (ŷi)</strong>. SSE is the sum of the squares of these errors.</p><p><strong>It is the <em>unexplained variation</em></strong> and represents the portion of the total variation that is not explained by the regression line.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1faMfopVIpmMJA782SGCQw.png" /><figcaption>Source: Image uploaded by the Author</figcaption></figure><p>Formula comparison:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/409/1*EhpcogW7GS7J05edN4m1OA.png" /><figcaption>Source: <a href="https://towardsdatascience.com/explain-linear-regression-with-manual-calculation-1622affdce6b">https://towardsdatascience.com/explain-linear-regression-with-manual-calculation-1622affdce6b</a></figcaption></figure><h3>SSR (Sum of squares regression)</h3><p>It is the sum of the squared <strong>differences </strong>between the <strong><em>predicted value</em></strong> and the <strong>mean</strong> <strong>of the <em>dependent variable</em>.</strong></p><p>It is the <em>explained variation</em> and represents the portion of the total variation that is explained by the regression 
line.</p><p>Think of it as a <strong>measure that describes how well our line fits the data.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M42VGktxyMYflXa0z_6CpA.png" /><figcaption>Source: Image uploaded by the Author</figcaption></figure><h3>SST (Sum of squares total)</h3><p>It is the sum of the squared <strong>differences between the observed <em>dependent variable</em> and its mean</strong>.</p><p>Total variation in the data:</p><p><em>Total Variation (SST) = Explained Variation (SSR) + Unexplained Variation (SSE)</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TglyPSDdLfDrbIqC9lET4w.png" /><figcaption>Source: Image uploaded by the Author</figcaption></figure><h3><strong><em>R² or Coefficient of Determination</em></strong></h3><p><strong>Shows how well terms (data points) fit a curve or line.</strong></p><p>It is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.</p><p><strong>Cons</strong>:</p><ul><li>R² <strong>never decreases when terms are added, even if the model is not actually improving.</strong></li><li><strong>It cannot determine whether the coefficient estimates and predictions are biased, which is why the residual plots must be examined.</strong></li></ul><p>Range: between -∞ and 1.</p><p>If R² is negative, the model is worse than predicting the mean (SSE is larger than SST).</p><p><em>Example: if the R² of a model is 0.50, then approximately half of the observed variation can be explained by the model’s inputs.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/207/0*BfGR7k7dkeiC8Nda" /></figure><h3><strong><em>Adjusted R²</em></strong></h3><p>It also indicates how well terms fit a curve or line, but adjusts for the number of terms in a model.</p><p><strong>Pro</strong>:</p><ul><li><strong>It will increase if we add useful terms and it will decrease if we add less useful 
predictors.</strong></li></ul><p>Range: between -∞ and 1.</p><p>If adjusted R² is negative, the model is worse than predicting the mean.</p><p><strong>Tells us how good/bad a model is.</strong></p><p><em>Example: if a model has adjusted R² equal to 0.05, then it explains almost none of the variation and is a poor fit.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/0*eJviXt0-9TcWAqiX" /></figure><h3><strong><em>MAE (Mean Absolute Error, L1 Loss)</em></strong></h3><p><strong>Average of the absolute difference</strong> between the original values and the predicted values.</p><p>If we want a metric just to compare two models from an interpretation point of view, then MAE may be a better choice.</p><p><strong>The smaller the MAE, the better the model.</strong></p><p>Range: [0, +∞)</p><p>Minimizing the absolute error (𝐿1) over a set of numbers results in finding its <strong>median</strong>.</p><p>We take the absolute value of each residual so that negative and positive residuals do not cancel out.</p><p><strong>Pros</strong>:</p><ul><li><strong>Robust to outliers. 
</strong>Averaging absolute values, rather than squared values, makes MAE more robust to outliers.</li><li><strong>MAE has the same units as target values.</strong></li><li><strong>Easily interpretable.</strong></li></ul><p><strong>Cons</strong>:</p><ul><li>Because we use the absolute value of the residual, the MAE does not indicate whether the model <strong>undershoots</strong> or <strong>overshoots</strong> the actual data.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/951/0*DavqytPjczDhXQ7R" /><figcaption>Source: <a href="https://www.dataquest.io/blog/understanding-regression-error-metrics/">https://www.dataquest.io/blog/understanding-regression-error-metrics/</a></figcaption></figure><h3><strong><em>MSE (Mean Square Error, L2 Loss)</em></strong></h3><p><strong>The most commonly used regression loss function.</strong></p><p><strong>When the errors are larger than 1 in magnitude, MSE will be bigger than the MAE.</strong> The two are not directly comparable, because MSE is expressed in squared units.</p><p>It is the <strong>average of the square of the difference</strong> between the original values and the predicted values.</p><p>While each residual in MAE contributes <strong>proportionally</strong> to the total error, the error grows <strong>quadratically</strong> in MSE. 
This ultimately means that <strong>outliers in our data will contribute to much higher total error</strong> in the MSE than they would in the MAE.</p><p>Minimizing the squared error (𝐿2) over a set of numbers results in finding its <strong>mean</strong>.</p><p><strong>Use MSE when large errors (outliers) are something we should care about.</strong></p><p>Range: [0, +∞)</p><p><strong>Cons</strong>:</p><ul><li>MSE does <strong>not have the same units as target values</strong>.</li><li><strong>Not robust to outliers</strong>.</li><li>Before applying MSE, <strong>we must eliminate all null/infinite values</strong> from the input.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/959/0*nN01DV2ZWuudybo8" /><figcaption>Source: <a href="https://www.i2tutorials.com/differences-between-mse-and-rmse/">https://www.i2tutorials.com/differences-between-mse-and-rmse/</a></figcaption></figure><h3><strong><em>RMSE (Root Mean Square Error)</em></strong></h3><p>The RMSE is<strong> analogous to the standard deviation and is a measure of how widely the residuals are spread out.</strong></p><p>RMSE has the same units as target values.</p><p>RMSE is always greater than or equal to MAE.</p><p>Range: [0, +∞)</p><p><strong>Pros</strong>:</p><ul><li>RMSE <strong>has the same units as target values</strong>.</li><li><strong>Tells us how good/bad the prediction accuracy is, </strong>not how good/bad the model is (unlike Adjusted R²).</li></ul><p><strong>Cons:</strong></p><ul><li>Since the MSE and RMSE both square the residual, they are similarly <strong>affected by outliers</strong>. 
RMSE gives a relatively high weight to large errors because the residuals are squared before averaging.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/402/0*W3soyVFTVLxF6pK8" /><figcaption>Source: <a href="https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e">https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e</a></figcaption></figure><h3><strong><em>MAPE (Mean Absolute Percentage Error)</em></strong></h3><p><strong>It is the percentage equivalent of MAE.</strong> The equation looks just like that of MAE, but with adjustments to convert everything into percentages.</p><p>It is a <strong>method of forecast error calculation that removes negatives from the equation.</strong></p><p>MAPE measures how far, on average and in percentage terms, the model’s predictions are from the corresponding actual values.</p><p><strong>Pros</strong>:</p><ul><li><strong>Has a clear interpretation</strong> since percentages are easier for people to conceptualize.</li><li>Like MAE, <strong>robust to the effects of outliers</strong> thanks to the use of absolute value.</li></ul><p><strong>Cons</strong>:</p><ul><li>MAPE can be used in fewer situations than MAE.</li><li>Many of MAPE’s<strong> weaknesses actually stem from its use of the division operation.</strong></li><li>Because every error is scaled by the actual value, MAPE is <strong>undefined for data points where the actual value is 0.</strong></li></ul><p><em>Example:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/395/1*9dLodQWeLUJ6HEn96URQWw.png" /><figcaption>Source: <a href="https://www.aindhae.com/2019/12/cara-menghitung-mean-absolute.html">https://www.aindhae.com/2019/12/cara-menghitung-mean-absolute.html</a></figcaption></figure><h3><strong><em>MPE (Mean Percentage Error)</em></strong></h3><p>The mean percentage error (MPE) equation is<strong> exactly like that of MAPE</strong>. 
The <strong>only difference</strong> is that it <strong>lacks the absolute value operation</strong>.</p><p><strong>Tells us whether there are more positive errors than negative ones, or vice versa.</strong></p><p>You can’t use MPE in the same way as MAPE: positive and negative errors cancel out, so MPE does not measure the size of the error.</p><p>Pros:</p><ul><li>If there are more negative or positive errors, this bias will show up in the MPE.</li><li>Unlike MAE and MAPE, MPE is useful to us because it allows us to see if our model systematically <strong>underestimates</strong> (more negative error) or <strong>overestimates</strong> (positive error).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/873/0*rKWAYaEvWCFRvGdq" /><figcaption>Source: <a href="https://www.dataquest.io/blog/understanding-regression-error-metrics/">https://www.dataquest.io/blog/understanding-regression-error-metrics/</a></figcaption></figure><h3><strong><em>WMAPE (Weighted Mean Absolute Percentage Error)</em></strong></h3><p><strong>It is a measure of prediction accuracy of a forecasting method.</strong></p><p>This metric is very popular in forecasting and is often recommended.</p><p><strong>Pro</strong>:</p><ul><li>The advantage of this metric over MAPE is that it <strong>overcomes the ‘infinite error’ issue</strong>: in its common form, the total absolute error is divided by the total of the actual values instead of dividing point by point.</li></ul><h3>Thanks for reading this far!</h3><p>I hope you found this insightful and that it helps you in your data science career :) If you enjoyed the content, be sure to follow me on Medium. As always, I wish you the best in your learning endeavors!</p><p><strong>Not sure what to read next? I’ve picked another article for you:</strong></p><p><a href="https://medium.com/analytics-vidhya/data-science-statistical-basics-b32a5e842325">Data Science: Statistical Basics</a></p><p><strong>Narcis Teodoroiu</strong></p><ul><li><strong><em>Did you find the article interesting? 
FOLLOW me on </em></strong><a href="https://medium.com/@narcis.teodoroiu"><strong><em>Medium</em></strong></a><strong><em>.</em></strong></li><li><strong><em>If you are interested in networking, let’s CONNECT on </em></strong><a href="https://www.linkedin.com/narcis-gabriel-teodoroiu/"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Science: Statistical Basics]]></title>
            <link>https://medium.com/analytics-vidhya/data-science-statistical-basics-b32a5e842325?source=rss-b66498019f16------2</link>
            <guid isPermaLink="false">https://medium.com/p/b32a5e842325</guid>
            <category><![CDATA[measures-of-spread]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[outliers]]></category>
            <category><![CDATA[normal-distribution]]></category>
            <dc:creator><![CDATA[narcis teodoroiu]]></dc:creator>
            <pubDate>Sun, 22 Aug 2021 09:43:15 GMT</pubDate>
            <atom:updated>2021-09-30T10:19:56.589Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>According to Wikipedia: “<em>Data science is a “concept to unify </em><a href="https://en.wikipedia.org/wiki/Statistics"><em>statistics</em></a><em>, </em><a href="https://en.wikipedia.org/wiki/Data_analysis"><em>data analysis</em></a><em>, </em><a href="https://en.wikipedia.org/wiki/Informatics"><em>informatics</em></a><em>, and their related methods” in order to “understand and analyze actual phenomena” with data.</em>”</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XbTtvyFTs9qceqD-eT5MHQ.jpeg" /><figcaption>Source: pexels.com</figcaption></figure><p>If you have ever heard of Data Science, I am sure you already know that statistics is an important foundation of this beautiful field. Therefore I have decided to write this blog to present a series of basic concepts.</p><p>My mathematician mind makes me think in a structured way and I want my blogs to follow a similar pattern, in which you can find a lot of images and examples and understand the concepts without having to read too much text. That said, let’s start…</p><p><strong>Index</strong>:</p><ol><li><strong>Population and sample</strong></li></ol><p><strong>2. Mean, Median, Mode and Range</strong></p><p><strong>3. Distributions</strong></p><ul><li><em>Normal Distribution</em></li><li><em>Standardized Normal Distribution</em></li></ul><p><strong>4. Central Limit Theorem</strong></p><p><strong>5. Variability measures</strong></p><ul><li><em>Variance</em></li><li><em>Standard Deviation</em></li><li><em>Covariance</em></li><li><em>Coefficient of correlation</em></li></ul><p><strong>6. Outliers measures</strong></p><ul><li><em>Skewness</em></li><li><em>Kurtosis</em></li><li><em>IQR Method</em></li></ul><h3><strong>Population and Sample</strong></h3><p>A <strong>population</strong> is the entire group that you want to draw conclusions about. A <strong>sample </strong>is the specific group that you will collect data from. 
It is a subset of the population.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7IV-AEckvESm5UHWRXMHmw.png" /><figcaption>Source: Omniconvert.com</figcaption></figure><h3>Mean, Median, Mode and Range</h3><p>The mean, median and mode are measures of central tendency: in different ways, each tells us what value in a data set is typical or representative of the data set. The range, by contrast, measures spread.</p><p>The <strong>mean </strong>is the same as the <strong>average value of a dataset</strong>.</p><p>The <strong>median </strong>is the <strong>central number</strong> of the ordered dataset.</p><p>The <strong>mode </strong>is the<strong> number that occurs most frequently</strong> in a dataset.</p><p>The <strong>range </strong>is the <strong>difference between the lowest value and the highest value</strong>.</p><p><em>Example: 7, 3, 4, 1, 7, 6</em></p><ul><li><em>Mean: (7+3+4+1+7+6)/6 = 28/6 → 4.67</em></li><li><em>Median: 1, 3, 4, 6, 7, 7 → (4+6)/2=5</em></li><li><em>Mode: 7, 3, 4, 1, 7, 6 → 7</em></li><li><em>Range: 7–1 → 6</em></li></ul><blockquote><strong>Distributions</strong></blockquote><h3>Normal/Gaussian Distribution</h3><p>It is a type of <strong>continuous probability distribution</strong> for a real random variable.</p><p>It can be described with just two parameters, <strong>mean </strong>and <strong>standard deviation</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*qr0_xb-XTVT7qLZh2lpOjw.jpeg" /><figcaption>Source: Michael Galarnyk</figcaption></figure><p><strong>Properties</strong>:</p><ul><li>The mean, mode and median are all equal.</li><li>The curve is symmetric around its center (i.e. around the mean).</li><li>Exactly half of the values are to the left of center and exactly half of the values are to the right.</li><li>The total area under the curve is 1.</li><li>Skewness is 0 and kurtosis is 3.</li></ul><p><strong>Application in Machine Learning:</strong></p><ul><li>Normally distributed data is beneficial for model building. 
It makes math easier.</li><li>Several algorithms, such as Linear Regression, are derived under the assumption that a distribution (for linear regression, that of the errors) is normal. So it can help to transform the data towards a normal shape before applying some machine learning algorithms.</li></ul><p><strong>Why is it important?</strong></p><ul><li><strong>Found in natural phenomena</strong>: It is the most important probability distribution in statistics because it fits many natural phenomena like age, height, test scores, IQ scores, sum of the rolls of two dice and so on.</li><li>Mathematical reason: <strong>Central Limit Theorem</strong>.</li><li><strong>Simplicity in mathematics</strong>. Namely, its mean, median and mode are all the same. The entire distribution can be specified using just two parameters: mean and standard deviation.</li><li>Unlike many other distributions that change their nature on transformation, <strong>a Gaussian tends to remain a Gaussian</strong> (the product of two Gaussians is a Gaussian, and the convolution of a Gaussian with another Gaussian is a Gaussian).</li></ul><p><em>Normal distribution in real life:</em></p><ul><li><strong><em>Height</em></strong><em>. Most of the people in a specific population are of average height. The number of people taller and shorter than the average height is almost equal, and a very small number of people are either extremely tall or extremely short.</em></li><li><strong><em>Rolling a die</em></strong><em>. In an experiment, it has been found that when a die is rolled 100 times, the chance of getting ‘1’ is 15–18%, and if we roll the die 1000 times, the chance of getting ‘1’ is, again, about the same.</em></li><li><strong><em>IQ</em></strong><em>. The intelligence quotient of a majority of the people in the population lies in the normal range whereas the IQ of the rest of the population lies in the deviated range.</em></li><li><strong><em>Technical stock market</em></strong><em>. 
The changes in the log values of Forex rates, price indices and stock prices often form a bell-shaped curve. For stock returns, the standard deviation is often called volatility. If returns are normally distributed, more than 99 percent of the returns are expected to fall within three standard deviations of the mean value.</em></li><li><em>And many more (Shoe size, birth weight, income distribution in economy, etc.)</em></li></ul><h3>Standard Normal Distribution</h3><p>The standard normal distribution is a special case of the normal distribution where <strong>the mean is 0</strong> and the <strong>standard deviation is 1</strong>. Converting a normal distribution into this form is called <strong>standardization</strong>.</p><p>The normal distribution can take on any value as its mean and standard deviation. In the standard normal distribution, <strong>the mean and standard deviation are always fixed</strong>.</p><p>Every normal distribution can be converted to the standard normal distribution by turning the individual values into <em>z</em>-scores: z = (x − μ) / σ.</p><p>N(μ, σ) → Standard Normal Z ∼ N(0, 1)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/1*Z329RhefOiERCU3f6PorCA.jpeg" /><figcaption>Source: mathisfun.com</figcaption></figure><p><strong>Empirical rule</strong>: 68/95/99.7</p><ul><li>68% of observations are within ±1 standard deviation from the mean.</li><li>95% of observations are within ±2 standard deviations from the mean.</li><li>99.7% of observations are within ±3 standard deviations from the mean.</li><li>Values outside of ±3 standard deviations account for less than 0.3% of observations and, depending on the situation, could be considered outliers or signal noise.</li></ul><p>We convert normal distributions into the standard normal distribution for several reasons:</p><ul><li>To find the probability of observations in a distribution falling above or below a given value.</li><li>To find the probability that a sample mean significantly differs from a known population mean.</li><li>To compare scores on different distributions 
with different means and standard deviations.</li></ul><h3>Central Limit Theorem</h3><p>Introduction in context: “<em>Suppose we want to study the average age of the whole population of China. As the population of China is very high, it will be a tedious job to get everyone’s age data and will take a lot of time for the survey. So instead of doing that we can collect samples from different parts of China and try to make an inference. </em><strong><em>To work with samples we need an approximation theory which can simplify the process of calculating mean age of the whole population</em></strong><em>. Here </em><strong><em>the Central Limit Theorem</em></strong><em> comes into the picture.</em>”</p><p><strong>Definition</strong>: If you sample batches of data from any distribution with finite variance and take the mean of each batch, then the distribution of the means will resemble a Gaussian distribution as the batch size grows, <em>no matter what the shape of the population distribution.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*zPVSlwU2oSsDuargnF68kA.png" /><figcaption>Source: Wikipedia</figcaption></figure><blockquote><strong>Variability measures</strong></blockquote><h3>Variance (<strong><em>σ²</em></strong>)</h3><p><strong>Definition</strong>: The average of the squared differences from the mean.</p><p>Disadvantage: It is expressed in squared units (e.g., meters squared), which makes it harder to interpret.</p><h3>Standard Deviation (<strong><em>σ</em></strong>)</h3><p><strong>Definition</strong>: Measure of how spread out numbers are. 
It indicates how much the values in the dataset deviate from the mean, and it is the square root of the variance.</p><p>Advantage: It is expressed in the same units as the original values (e.g., meters).</p><h3>Covariance</h3><p><strong>Definition</strong>: Measures the <strong>directional relationship between two variables</strong>.</p><p>Covariance is zero for independent variables because they do not tend to move together.</p><p>Disadvantages:</p><ul><li>Range: from <strong>-∞ to +∞</strong></li><li>It is affected by changes in scale.</li></ul><h3>Coefficient of correlation</h3><p><strong>Definition</strong>: Measures the <strong>strength of the relationship between two variables</strong>. It is the normalized measurement of the covariance.</p><p>Independent movements do not contribute to the total correlation. Completely independent variables have a zero correlation.</p><p>Advantages:</p><ul><li>Range: from -1 to +1</li><li>It is not influenced by scaling.</li></ul><blockquote><strong>Outliers measures</strong></blockquote><h3><strong>Skewness</strong></h3><p>Skewness is the measure of how much the probability distribution of a random variable deviates from the symmetry of the normal distribution. It is useful for checking for outliers. It <strong>measures the lack of symmetry</strong> in a data distribution.</p><p>There are two types of skewness:</p><ul><li><strong>Positive skewness</strong>. The tail on the right side of the distribution is longer or fatter. <strong>Mode &lt; Median &lt; Mean</strong>.</li><li><strong>Negative skewness</strong>. The tail on the left side of the distribution is longer or fatter. <strong>Mean &lt; Median &lt; Mode</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4B6SL0yX65G1M7ggWVSrfg.jpeg" /><figcaption>Image: Sigmamagic.com</figcaption></figure><p>Why is it important?</p><p>The tail region may act as an outlier for the statistical model and we know that outliers adversely affect the model’s performance, especially regression-based models. 
So it is often necessary to transform the skewed data so that it is close enough to a Gaussian distribution.</p><h3>Kurtosis</h3><p>Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, it identifies whether <strong>the tails contain extreme values</strong> in a given distribution.</p><p>There are three types of kurtosis:</p><ul><li><strong>Normal Kurtosis.</strong> A normal distribution has a kurtosis of 3.</li><li><strong>High Kurtosis (&gt;3).</strong> Distribution is longer, tails are fatter. It is an indicator that the data has outliers. If there is a high kurtosis, we need to investigate why we have so many outliers.</li><li><strong>Low Kurtosis (&lt; 3). </strong>Distribution is shorter, tails are thinner than the normal distribution. It is an indicator that the data lacks outliers. If we get low kurtosis (too good to be true), we also need to investigate and trim the dataset of unwanted results.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kN9tzq9IRa1mANrySp6DwQ.png" /><figcaption>Source: Analystprep.com</figcaption></figure><h3>IQR Method</h3><p><strong>Interquartile range is the difference between Q3 and Q1.</strong></p><p>Properties:</p><ul><li>The<strong> median is the center point</strong>, also called second quartile, of the ordered data.</li><li>Q1 is the first quartile of the data, i.e., <strong>25% of the data lies between minimum and Q1</strong>.</li><li>Q3 is the third quartile of the data, i.e., <strong>75% of the data lies between minimum and Q3</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jMA2zpOAEPbNPcQgdiDs9A.png" /><figcaption>Source: Wikipedia</figcaption></figure><p>To detect the outliers using this method, we define a new range and any data point lying outside this range is considered as outlier and is accordingly dealt 
with. The range is as given below:</p><ul><li><strong>Lower Bound</strong>: Q1 - 1.5 * IQR</li><li><strong>Upper Bound</strong>: Q3 + 1.5 * IQR</li></ul><p><strong>Why ‘1.5’?</strong></p><p>For a normal distribution, the remaining 0.27% of the data lies more than three standard deviations (<strong>&gt;3σ</strong>) from the mean (μ). <strong>This part of the data is considered as outliers</strong>. The first and the third quartiles, <em>Q1 </em>and <em>Q3</em>, lie at <strong>-0.675σ </strong>and <strong>+0.675σ </strong>from the mean, respectively, so the IQR is 1.35σ and a scale of 1.5 places the bounds at about ±2.7σ. <strong>To get exactly 3σ, we need to take the <em>scale </em></strong><em>= </em>1.7,<em> </em>but 1.5 is more “symmetrical” than 1.7 and we’ve always been a little more inclined towards symmetry.</p><h3>Thanks for reading this far!</h3><p>I hope you found this insightful and that it helps you in your data science career :) If you enjoyed the content, be sure to follow me on Medium. As always, I wish you the best in your learning endeavors!</p><p><strong>Narcis Teodoroiu</strong></p><ul><li><strong><em>Did you find the article interesting? FOLLOW me on </em></strong><a href="https://medium.com/@narcis.teodoroiu"><strong><em>Medium</em></strong></a><strong><em>.</em></strong></li><li><strong><em>If you are interested in networking, let’s CONNECT on </em></strong><a href="https://www.linkedin.com/narcis-gabriel-teodoroiu/"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></li></ul><hr><p><a href="https://medium.com/analytics-vidhya/data-science-statistical-basics-b32a5e842325">Data Science: Statistical Basics</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>