Building Blocks of Data Science Part — II

Shihab KK
Data Science - With Live Case Studies
11 min read · Sep 13, 2018

This is a continuation of Part — I

MEASURES OF SHAPE

Now that we have data on hand and have calculated a measure of central tendency, we also know how each value in the data series varies from the central value.

We now need to see a graphical representation of the data and analyze its shape, which describes the rough distribution of the data around its central value. The two Measures of Shape are:

1) SKEWNESS

2) KURTOSIS

Skewness and Kurtosis help you find the pattern of your data distribution, identify extreme or outlier values, and tell you about the symmetry of your data distribution.

When we find the central tendency value for a data series, we expect the MEAN/MEDIAN/MODE to divide the values into halves, with the right half and left half of the graph looking almost alike. A distribution in which the right half is a mirror image of the left half is said to be SYMMETRICAL.

1) SKEWNESS — When a distribution is asymmetrical, or lacks symmetry, the data are said to be skewed, either to the right or to the left. When you draw a curve of the data, if the bell-shaped curve leans to the left or to the right, the distribution is skewed.

The skewed portion is the long, thin tail of the curve. If the bulk of the curve is to the right and the tail stretches to the left, the distribution is NEGATIVELY SKEWED. If the tail stretches to the right and the bulk of the curve is to the left, it is POSITIVELY SKEWED. Skewed distributions indicate that the data are sparse at one end of the distribution and piled up at the other end.

Skewness shows that more of the data values lie on one particular side of the central value.

Ex: skewness shows that more students have scores in a particular (lower or higher) range of the data.

· In a unimodal distribution, the distribution has a single peak or mode. The mode is the apex (highest point of the curve) and the median is the middle value.

· The mean tends to be located towards the tail of the curve, because the mean is affected by all values including the extremes.

· A bell shaped or normal distribution with the mean, median and mode all at the center of the distribution has no skewness.

· Checking skewness helps ensure your sample data has low, mid and high values, and that the sample taken has a good mix of all kinds of data values.

COEFFICIENT OF SKEWNESS:-

Karl Pearson is said to have developed at least two coefficients of skewness that can be used to measure the degree of skewness in a distribution. One of them "compares the Mean and Median in light of the magnitude of the Standard Deviation". If the distribution is symmetrical, the Mean and Median are the same value, and hence the coefficient of skewness is zero.

Sk = 3(µ − Md) / σ (where Md is the Median)

Ex: The Mean score of a class is 29, the Median is 26, and the SD is 12.3. The coefficient of skewness is Sk = 3(29 − 26)/12.3 ≈ +0.73

Since the skewness is positive, the distribution is said to be positively skewed, which means many values lie to the right of the central value; here, many students have scored more than the Mean/Median. If the Sk value were negative, more students would have scored less than the average and the distribution would be negatively skewed. The greater the magnitude of Sk, the more skewed the distribution. Generally, skewness values between -0.5 and +0.5 indicate a lightly skewed, almost normally distributed data set. Skewness below -0.5 indicates a strongly negatively skewed distribution, and skewness above +0.5 a strongly positively skewed one. The outliers always lie in the tails.

Generally, in skewed distributions, the Mode is the highest peak (the most frequently occurring value), the Median sits between the Mode and the Mean, and the Mean lies closest to the tail, because it is pulled by extreme values and outliers.
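Pearson's coefficient is straightforward to compute. Here is a minimal sketch using Python's standard statistics module; the worked example's summary figures are from above, while the small score lists are made up for illustration:

```python
import statistics

def pearson_skew(data):
    """Pearson's second coefficient of skewness: Sk = 3 * (mean - median) / stdev."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)  # sample standard deviation
    return 3 * (mean - median) / sd

# the worked example, computed directly from its summary statistics:
sk = 3 * (29 - 26) / 12.3
print(round(sk, 2))  # 0.73 -> positively skewed

# a small made-up sample where a few high scores drag the mean above the median
print(pearson_skew([55, 58, 60, 61, 62, 95]) > 0)  # True (positively skewed)
```

A positive result means the mean exceeds the median (right tail), a negative result means the opposite.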

2) KURTOSIS:- Describes the peakedness of the distribution curve. Distributions that are tall and thin are referred to as Leptokurtic, distributions that are flat and spread out are called Platykurtic, and distributions that are close to the normal bell shape are Mesokurtic.

a. If a distribution is Leptokurtic (K > 3), values cluster tightly around frequently occurring values, and the mode forms a sharp peak.

b. If a distribution is Mesokurtic (K ≈ 3), the mode is the highest peak, but other values spread moderately around the Mean/Median, ideally all lying close to the central point.

c. If a distribution is Platykurtic (K < 3), the data are spread out rather than concentrated at the center.
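The three cases above can be checked numerically. A minimal sketch in pure Python, computing kurtosis as the fourth central moment over the squared variance (normal ≈ 3); the ±0.5 tolerance and the sample data are arbitrary choices for illustration:

```python
import statistics

def kurtosis(data):
    """Kurtosis: fourth central moment divided by the squared variance (normal ~= 3)."""
    n = len(data)
    mu = statistics.fmean(data)
    m2 = sum((x - mu) ** 2 for x in data) / n
    m4 = sum((x - mu) ** 4 for x in data) / n
    return m4 / m2 ** 2

def classify(k, tol=0.5):
    """Label a distribution by its kurtosis relative to the normal value of 3."""
    if k > 3 + tol:
        return "leptokurtic"
    if k < 3 - tol:
        return "platykurtic"
    return "mesokurtic"

# flat, spread-out (uniform) data is platykurtic; heavy-tailed data is leptokurtic
print(classify(kurtosis(list(range(1, 101)))))   # platykurtic
print(classify(kurtosis([0] * 98 + [10, -10])))  # leptokurtic
```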

APPLICATION OF VARIANCE/STD DEVIATION

1) EMPIRICAL RULE

2) CHEBYSHEV’s THEOREM

1) EMPIRICAL RULE: — This rule states the approximate percentage of values that lie within a given number of standard deviations of the mean, if the data are normally distributed. The normality requirement has some tolerance, and the empirical rule generally applies as long as the data are mound-shaped.

µ±1σ 68%

µ±2σ 95%

µ±3σ 99.7%

If data are normally distributed, approximately 68% of the values are within one standard deviation of the mean, 95% of values are within two standard deviations, and almost all the data (99.7%) are within three standard deviations.

Ex:- Suppose the average gasoline price across the state is $3.12 with a standard deviation of $0.08. If the data are assumed to be normally distributed, then according to the empirical rule:

a) Approximately 68% of the prices should fall within µ±1σ, ie, 3.12±0.08

b) 95% should fall within µ±2σ, ie, 3.12±2×0.08

c) 99.7% of the data should fall within µ±3σ, ie, 3.12±3×0.08
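The rule is easy to verify by simulation. A sketch using only the Python standard library; the simulated prices reuse the gasoline example's mean and standard deviation, and the sample size and seed are arbitrary:

```python
import random
import statistics

random.seed(42)
# simulate gasoline prices: normally distributed, mean $3.12, sd $0.08
prices = [random.gauss(3.12, 0.08) for _ in range(100_000)]
mu = statistics.fmean(prices)
sigma = statistics.pstdev(prices)

def within_k_sigma(k):
    """Fraction of simulated prices lying within mu +/- k*sigma."""
    return sum(mu - k * sigma <= p <= mu + k * sigma for p in prices) / len(prices)

for k, rule in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    print(f"within {k} sd: {within_k_sigma(k):.3f} (empirical rule: ~{rule})")
```

With 100,000 draws the observed fractions land very close to 68%, 95% and 99.7%.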

2) CHEBYSHEV’s THEOREM

The Empirical rule applies only when data are known to be approximately normally distributed. Chebyshev's theorem applies to all distributions, including those that are non-normal or of unknown shape, irrespective of their shape. Because it is a formula rather than a rule of thumb, it is widely applicable. It gives the minimum percentage of values that lie within ±k standard deviations of the mean.

CHEBYSHEV's Theorem states that at least 1 − (1/k²) of the values will fall within ±k standard deviations of the mean, regardless of the shape of the distribution.

Within k standard deviations of the mean, µ±kσ, lie at least 1 − (1/k²) of the values (assuming k > 1).

Ex: if k = 2.5, then 1 − 1/(2.5)² = 0.84. Hence at least 84% of the data should fall within ±2.5σ of the mean.

For k = 2, the theorem says that at least 75% of all values are within ±2σ of the mean, whereas the EMPIRICAL rule says 95% of values are within ±2σ of the mean (for normal data).
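The bound itself is a one-line formula. A minimal sketch:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within +/- k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

print(chebyshev_lower_bound(2))    # 0.75
print(chebyshev_lower_bound(2.5))  # 0.84
```

Note that these are lower bounds: the actual proportion for any particular distribution can be higher, as the empirical rule's 95% versus Chebyshev's 75% at k = 2 shows.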

ESTIMATION OF POPULATION PARAMETERS

The overall objective of descriptive statistics is to give you a detailed description of the data you have on hand. As we have only limited data or sample data on hand, we are mostly required to estimate the population parameters from the sample data. The parameter calculated from the sample data is not 100% accurate and might result in small errors while estimating the parameters for the population.

Estimates can be of 2 types:-

a) POINT Estimate

b) INTERVAL Estimate

a) POINT ESTIMATION

It is a statistic taken from a sample that is used to estimate a population parameter, and it is only as good as the representativeness of its sample. If other random samples are taken from the population, the point estimates derived from those samples are likely to vary. The variation or error likely to arise from different samples is called the STANDARD ERROR.

b) INTERVAL ESTIMATION

Because of the variation in sample statistics, estimating a population parameter with an interval estimate is often preferable to using a point estimate. An Interval estimate (Confidence interval) is a range of values within which the analyst can declare, with some confidence, the population parameter lies. Confidence intervals can be two-sided or one-sided. In simple words, Confidence Intervals are a range of values within which the estimates can fall.

Definition: A confidence interval for a parameter is an interval of numbers within which we expect the true value of the population parameter to be contained. The endpoints of the interval are computed based on sample information.

Certain factors may affect the confidence interval size including size of sample, level of confidence, and population variability. A larger sample size normally will lead to a better estimate of the population parameter.

Most of the population parameters can be estimated based on sample statistics.

ESTIMATING THE POPULATION MEAN

The confidence interval is built from z, the area under the normal distribution that the analyst takes into consideration to arrive at the population parameter estimates. This z represents the percentage of data values considered significant for the analysis and estimate. Alpha (α) is the area under the normal curve in the tails of the distribution, outside the area defined by the confidence interval.

The CI yields a range within which we feel, with some confidence, the population mean is located. The interpretation is this: it is not certain that the population mean is in the interval unless we have a 100% confidence interval that is infinitely wide. If we construct 95% CIs, the analyst's level of confidence is 95% or 0.95, meaning that out of 100 such intervals, about 95 would include the population mean and five would not.

In reality, a CI with 100% confidence would be meaningless, so researchers go with 90%, 95%, 98% or 99% at most. The reason is that there is a trade-off between sample size, interval width and level of confidence. For a fixed sample size, as the level of confidence is increased, the interval gets wider, not narrower; confidence level and precision have an inverse relationship. To gain both a high level of confidence and a narrow, more precise interval, the analyst must increase the sample size.

How confident are we that the true population average lies in this interval? We are 95% confident; this is the level of confidence. How many standard errors away from the mean must we go to be 95% confident? From -z to +z there is 95% of the area under the normal curve.

There are 4 typical levels of confidence: 99%, 98%, 95% and 90%. Each level of confidence has a different number of standard errors associated with it. We denote this by z(α/2),

where α is the total amount of area in the tails of the normal curve. Thus, for a 95% level of confidence, α = 0.05 and the z value from the table is z(0.025) = 1.96.

After selecting (or being told) the level of confidence, for a large (n > 30) sample we use the formula: CI = x̄ ± z(α/2) · s/√n
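The large-sample interval can be sketched in Python using the standard library's NormalDist for the z value; the sample figures (mean 29, SD 12.3, n = 50) are made up for illustration:

```python
import math
from statistics import NormalDist

def mean_confidence_interval(xbar, s, n, confidence=0.95):
    """Large-sample (n > 30) z-interval for the population mean: xbar +/- z * s / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    margin = z * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# hypothetical sample: mean 29, standard deviation 12.3, n = 50
low, high = mean_confidence_interval(29, 12.3, 50)
print(f"({low:.2f}, {high:.2f})")  # (25.59, 32.41)
```

Notice that asking for 99% confidence instead of 95% increases z and therefore widens the interval, which is the trade-off described above.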

ESTIMATING POPULATION PROPORTION

We just saw the estimation of the Mean (ex: estimating the average score of a cricketer, or the mean scores of all male students in a management course, with intervals and confidence levels). Business decision makers often need to estimate a population proportion instead. Estimating market share (their proportion of the market) is important for them, and market segmentation opportunities come from knowledge of the proportion of various demographic characteristics among potential clients. More examples: the proportion of female students completing the course on the first attempt, the proportion of students coming back to the same university for a higher degree after completing the basic course, or the ratio of students passing the IAS entrance after coaching. All of these can be estimated from a sample of data taken from the population, and the estimates projected to the whole population. They are expressed not in terms of Means or Averages, but in terms of proportions and percentages. Here, for example, the CI will read like: 92.3% to 94.5% of students would pass the exam after taking up the course.

Ex: if the sample proportion for telebrand marketing is p̂ = 0.39 or 39%, we estimate the proportion of telebrand marketing for the population, based on a sample size of 87 observations, at a confidence level of 95%, using the analogous formula: p̂ ± z(α/2) · √(p̂(1 − p̂)/n)
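A sketch of this computation using the example's figures (0.39 and n = 87 are from the text above; everything else follows the standard large-sample z-interval for a proportion):

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Large-sample z-interval for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

# the telebrand example: sample proportion 0.39, n = 87, 95% confidence
low, high = proportion_confidence_interval(0.39, 87)
print(f"({low:.3f}, {high:.3f})")  # (0.288, 0.492)
```

So we would be 95% confident that the population proportion lies roughly between 29% and 49%; a wide interval, reflecting the modest sample size.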

ESTIMATING POPULATION VARIANCE

At times the researcher is more interested in the population variance than in the population mean or population proportion. For example, in total quality checks, suppliers who want to earn world-class supplier status, or even those who want to maintain customer contracts, are often asked to show continual reduction of variation in supplied parts. Essentially, to minimize variation in production and to maintain consistency in quality, tests are conducted on samples to determine lot variation and whether consistency goals are being met. For example, variation in airplane altimeter readings needs to be minimal; it is not enough to know that, on average, a particular brand of altimeter produces the correct altitude. Measuring the variation of altimeters is critical. Variation here means deviation from the strict quality specifications prescribed; in such situations quality must be precise, and any deviation can have a drastic effect.

The relationship of the sample variance to the population variance is captured by the CHI-SQUARE distribution (χ²). That is, the ratio of the sample variance (s²) multiplied by (n − 1) to the population variance (σ²), i.e., (n − 1)s²/σ², is approximately chi-square distributed if the population is normally distributed. This does not apply where the data are not normally distributed.

Degrees of Freedom:

Degrees of Freedom refers to the number of independent observations for a source of variation minus the number of independent parameters estimated in computing the variation. So, if you have 50 observations and estimate 2 parameters, your DF will be n − parameters = 50 − 2 = 48.
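As a small illustration of the chi-square statistic and its degrees of freedom (the sample readings and the assumed population variance σ² = 2.0 below are hypothetical):

```python
import statistics

def chi_square_statistic(sample, sigma_sq):
    """Return (n - 1) * s^2 / sigma^2 and its degrees of freedom (n - 1).
    The statistic is chi-square distributed when the population is normal."""
    n = len(sample)
    s_sq = statistics.variance(sample)  # sample variance (n - 1 in the denominator)
    return (n - 1) * s_sq / sigma_sq, n - 1

# hypothetical altimeter-style readings against an assumed population variance of 2.0
stat, df = chi_square_statistic([25, 27, 26, 24, 28], sigma_sq=2.0)
print(stat, df)  # 5.0 4
```

Comparing the statistic against chi-square table values for df = n − 1 is what lets a quality analyst judge whether lot variation is within specification.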

To be continued in Part — III, where we will look at Testing of Hypothesis.
