Building Blocks of Data Science Part — I
INTRODUCTION TO STATISTICS
The word 'Statistics' is derived from the Latin 'Status' and the Italian 'Statista', referring to 'the state/government' and to 'useful facts/information' about it. It is a branch of mathematics and deals with:
· Collection of facts/information/data
· Arrangement, classification and tabulation of data
· Analysis and detailed study of the collected data
· Interpretation, conclusions and understanding of the data
· Support for important business decision making.
TYPES OF STATISTICS
a) Descriptive statistics
b) Inferential Statistics
Descriptive Statistics, as the name suggests, describes the data collected for a specific business purpose.
Inferential Statistics involves detailed study, analysis and interpretation of data to draw conclusions, building on descriptive statistics.
Collection of Data/useful facts
All relevant information required for a particular business purpose is collected from the public/population by research companies through various methods.
· Personal enquiry or interviews
· Through observation and study of a particular pattern or behavior
· In the form of questionnaires
While doing so, various factors determine the amount and scale of information to be collected.
· The purpose for which the information/data is collected.
· Target audience or the particular section of people from whom information needs to be collected.
· The number of people from whom data needs to be collected (population /sample size)
· Age group, geographic location, areas of interest etc
· Income groups and various sections of the economy to be targeted
· The kind of analysis that needs to be done with the collected information
· The level of accuracy to be maintained at every stage of data collection/analysis
· Business decisions to be arrived at based on statistical analysis and interpretation of the information.
Population: the collection of objects/people that forms the larger group under analysis.
Sample: also a collection of objects/people, usually a subset of the larger population; sometimes it is the only data available for analysis.
TYPES OF DATA
Data is nothing but a collection of related facts/information in a particular form as required by the business. We classify data based on:
· value of the attribute
· characteristics of the object
· level of analysis that can be done with that value collected
A broader or simpler classification is
a) Qualitative data: this data suggests only the quality of the variable/attribute. Also called Categorical data, it is useful for classifying data into groups or classes. It is most applicable when only a limited set of values is possible for that attribute. Qualitative data cannot be measured numerically.
Ex: Region — South, North, East, West
Gender — Male, Female
Nationality — Indian, American, African
b) Quantitative Data: all kinds of information/data that can be quantified by assigning numeric values. The values can be negative, zero or positive, with no upper limit. In simple terms, data that can be measured is called quantitative data. Unlike qualitative data, the set of possible values for this attribute is unlimited.
Ex: Height, Weight, Salary, temperature etc.
Here, the well-known NOIR classification (Nominal, Ordinal, Interval, Ratio) should be recalled.
When a research study or survey is done, it is advisable to collect a combination of all data types, as this gives an in-depth view of the purpose of the market research. It is always better to stay focused on the purpose of the research, yet collect a little more information than strictly required, so that important data does not get omitted from the analysis.
a) NOMINAL Data: assigns a name as a value; the value merely represents which category an observation belongs to. This is usually character data, not numeric. No analysis beyond classification/categorization is possible. Nominal values carry a kind of identity.
Ex: Company worked for, Region, Name of a person etc.
b) ORDINAL data: similar to nominal data, but ordinal values can be ranked in relation to the other values of the same variable. Ordinal data values carry some level of ordering within themselves.
Ex: Small, Medium, Large
Bad, Average, Good, Very Good, Excellent
Scale of 1 to 5 (where the numbers 1–5 represent only ranking)
c) INTERVAL SCALE DATA: quantitative data that can be used for numerical analysis. The intervals between successive values are equal, and the data can be continuous. Interval data has no true ZERO reference point, which means the values can be positive or negative. The difference between any two values gives a meaningful result, but dividing one value by another does not.
Ex: temperature in Celsius, calendar dates (a difference can be taken, but a division cannot).
d) RATIO SCALE DATA: this is measurable data and supports all kinds of numerical and statistical analysis. It has a true zero reference point, which means the values are usually positive and ratios between them are meaningful. Ex: height, weight, salary.
DESCRIPTIVE STATISTICS
Descriptive statistics describe a lot about your data. They are very important, as they determine the quality of the data and the level of accuracy of all inferential statistics and of the ultimate business decision making. All inferential statistics are built on descriptive statistics.
When you receive data, you generally expect the values to follow a normal distribution: the data should revolve around some central value, and the average/mean should be representative of the entire range of values. Sometimes there are values noticeably off the expected normal values, on either the lower or the higher side. Values that lie outside the normal range are called OUTLIERS. Outliers can affect your analysis, so whether to handle or ignore them is a decision each company has to make.
Descriptive statistics give insight not only into the central values, but also into how much the values vary or deviate from those central values, ways to handle negative values, categorization of subsets of populations, simple comparisons, and the representation of the values in graphs and the shapes of those plots.
Descriptive statistics are broadly classified into:
1) Measures of Central Tendency
2) Measures of Dispersion/Variability
3) Measures of shape
MEASURES OF CENTRAL TENDENCY
Measures of central tendency explain how the values of a particular variable/attribute are centered around a central value. When a central value is calculated for a data range, it is expected to be representative of all the values of that variable, not far off from them. The objective is to find an approximate central value for the entire range of data. For normally distributed data, this central value divides the data into two halves, with 50% of the values lying on either side of the mean.
There are 3 ways of calculating Measures of Central Tendency: — (Mean, Median & Mode)
- MEAN: the most common measure of central tendency. It is calculated as the total of all the values divided by the number of values (the frequency). The mean uses all the data and every value influences it, which makes it extremely sensitive to outliers: even one or a few extreme values can shift the center value. The mean does not take missing values into its calculation, though zero is still counted as a value in the data range. 'Average' is similar to the mean, but one caveat is that some tools' average computations count missing values as well, which makes the result less accurate than required.
MEAN = Sum of all values / total no. of values
The mean of a population is represented by µ (mu), and a sample mean by x̄ (x bar).
Ex: 2, 3, 5, 4, 5, 6, 8, 20: the mean is 6.625, but the outlier 20 inflates it.
If 20 is removed, the mean is about 4.71, which is a more representative central value.
How the outlier values (low or high) are treated depends on the individual company and the level of accuracy expected from its analysis. If there are too many outliers, the data first has to be normalized and then analyzed. Or, if there are only a few outlier values, a company may choose to ignore them or treat them separately in its analysis.
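These figures can be checked with Python's standard library statistics module:

```python
from statistics import mean

scores = [2, 3, 5, 4, 5, 6, 8, 20]
print(mean(scores))        # 6.625, pulled upward by the outlier 20
print(mean(scores[:-1]))   # ~4.714, a more representative center once 20 is dropped
```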
- MEDIAN: the exact central value of the data range, when the values are arranged in either ascending or descending order. It is the central point of the data values. When there is an even number of values in the data range, the median is the average of the two middle values in the arranged series. Hence the median is also called the Positional Average.
Ex: 4, 5, 3, 2, 7 (arranged in ascending 2, 3, 4, 5, 7) the MEDIAN is 4.
4, 5, 3, 2, 7, 5 (arranged in descending 7, 5, 5, 4, 3, 2) the MEDIAN is (5+4)/2 = 4.5
The main advantage of the median over the mean: the median is less affected by outliers.
Ex: 20, 30, 35 (arranged 20, 30, 35) — Median is 30
20, 30, 35, 140 — Median is (30+35)/2 = 32.5 (not affected much by the outlier)
The disadvantage of the median is that it does not take all the values into account. It is based only on the middle value, by virtue of its position at the center.
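The median examples above, reproduced with the standard library:

```python
from statistics import median

print(median([4, 5, 3, 2, 7]))       # 4: middle value of the sorted series
print(median([4, 5, 3, 2, 7, 5]))    # 4.5: average of the two middle values
print(median([20, 30, 35, 140]))     # 32.5: barely moved by the outlier 140
```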
- MODE: the most repeated value in the data series. Repeat values usually occur only when the data has a limited set of possible values; i.e., when only limited/pre-defined values are entered into the data series, the MODE represents the most popular or preferred value. It is used mostly for categorical/nominal data and not much for quantitative data. When data is highly repetitive, the mode is the measure to use.
Ex: — 2, 3, 6, 3, 3, 3, 2, 5 = Mode is 3 because 3 is most repeated.
2, 3, 4, 2, 3, 3, 2, 5 = both 2 and 3 are repeated an equal number of times; this kind of data is called bi-modal.
2, 3, 4, 5, 6: there are no repeat values, so the data has no mode (non-modal).
Classic example of Mode usage is categorical data — where there are limited possible values.
Ex: Bad, Poor, Average, Good, Excellent: the number of times each value is repeated represents the popularity of the product/service, and hence the mode becomes important.
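In Python, `statistics.mode` returns the single most repeated value, and `statistics.multimode` (Python 3.8+) handles the bi-modal and no-repeat cases:

```python
from statistics import mode, multimode

print(mode([2, 3, 6, 3, 3, 3, 2, 5]))       # 3, the most repeated value
print(multimode([2, 3, 4, 2, 3, 3, 2, 5]))  # [2, 3]: bi-modal data
print(multimode([2, 3, 4, 5, 6]))           # every value ties, so no useful mode
```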
There are other measures of central tendency, which divide the data into more than two parts. They are:
a) QUARTILES b) PERCENTILES c) Deciles
a) QUARTILES: the quartiles divide an arranged series into 4 parts. Essentially, the first 25% of the data holds the lower outliers and low values, the second 25% lies just below the central value, the third 25% lies just above it, and the last 25% holds the higher values and high outliers. The central 50% of the data lies between the quartile values Q1 and Q3, with the outliers in the first and last quarters.
Quartile values are calculated as follows:-
Q1 = position N × 25% (the value at that position in the arranged series is taken)
Q2 = position N × 50%
Q3 = position N × 75%
When the computed position is a whole number, the quartile is the average of the value at that position and the next value: if Q3 falls at position 7, then (7th value + 8th value) / 2 is the value of Q3.
When the computed position is a decimal, round it up to the next whole number and take the value at that position: if Q1 falls at position 4.2, take the 5th value in the series.
Ex: In a series of arranged data 106, 109, 114, 116, 121, 122, 125, 129 (N = 8)
Q1 = 25% of 8 = 2; since 2 is a whole number, Q1 = (2nd value + 3rd value)/2 = (109 + 114)/2 = 111.5. This means the values up to 111.5 form the first 25% of the series.
Q2 = 50% of 8 = 4; since 4 is a whole number, Q2 = (116 + 121)/2 = 118.5.
Q2 will essentially be the MEDIAN value, as Q2 divides the data into exact half.
Q3 = 75% of 8 = 6; since 6 is a whole number, Q3 = (122 + 125)/2 = 123.5.
If the total number of observations is 7, then Q1 = 25% of 7 = 1.75, so the 2nd value should be taken.
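The positional rule above can be sketched as a small Python helper (`quartile` is a hypothetical name for illustration; note that library functions such as `numpy.percentile` use different interpolation rules and may give slightly different answers):

```python
import math

def quartile(values, q):
    """Quartile by the positional rule: position = N * q/4 (1-based).
    Whole-number position: average that value with the next one.
    Fractional position: round up and take the value at that position."""
    data = sorted(values)
    pos = len(data) * q / 4
    if pos == int(pos):                    # whole-number position
        pos = int(pos)
        return (data[pos - 1] + data[pos]) / 2
    return data[math.ceil(pos) - 1]        # round up to the next position

data = [106, 109, 114, 116, 121, 122, 125, 129]
print(quartile(data, 1))  # 111.5
print(quartile(data, 2))  # 118.5, which is also the median
print(quartile(data, 3))  # 123.5
```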
b) PERCENTILES — They divide the data into 100 parts. There are 99 percentiles as it takes 99 dividers to divide the data into 100 parts. They are represented by P1 up to P99.
i = (P/100) X N
Where P represents the number of the Percentile you want to calculate
N is the total no. of values in your data.
Ex: If there are 25 observations, P30 = (30/100) × 25 = 7.5; since this is a decimal, we round up and take the 8th value as P30.
Note: P50 is equal to Q2, which in turn equals the MEDIAN.
c) DECILES: these divide the data into 10 parts; accordingly there are 9 deciles, represented by D1 to D9. The logic is the same as above.
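The same positional logic extends to percentiles and deciles (decile Dk is simply the 10k-th percentile); `percentile` here is an illustrative helper, not a library function:

```python
import math

def percentile(values, p):
    """Percentile by the positional rule: position = (p/100) * N (1-based)."""
    data = sorted(values)
    pos = len(data) * p / 100
    if pos == int(pos):                    # whole number: average with next value
        pos = int(pos)
        return (data[pos - 1] + data[pos]) / 2
    return data[math.ceil(pos) - 1]        # fractional: round up

obs = list(range(1, 26))       # 25 observations: 1..25
print(percentile(obs, 30))     # position 7.5 -> 8th value = 8
print(percentile(obs, 50))     # P50 equals the median (13 here)
```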
MEASURES OF DISPERSION / VARIABILITY
While the measures of central tendency give insight into the central values of your data, the measures of dispersion tell you how much the values deviate from those central values. In reality, data is rarely consistent; it always varies. Measuring these differences makes the analysis more meaningful than talking about the middle/central values alone. Measures of dispersion describe the spread of the data set.
Ex: suppose the scores of students in Univ1 and Univ2 both have a mean of 7.5, but the scores are distributed differently. In such cases a study of the central value alone is insufficient, and we need to look at how the data are dispersed around the mean, because the data are not distributed identically. The measures of variability help us find how the data values are scattered.
From such data, we could infer that Univ 1 is better because:
1) Dispersion of values is less,
2) Spread is less
3) Range of scores is less
4) Variations are less
3 main measures of Dispersion are — RANGE, Standard Deviation and Variance.
Other measures are Interquartile Range, Mean Absolute Deviation (MAD), z Scores and Co-Efficient of Variation.
1) RANGE: Range is the difference between the highest and the lowest value in the data. It takes no other values, especially the central values, into account, and hence is not widely used for statistical analysis. Range is affected by outliers.
It is mostly used to find the band within which the data values fall.
Ex: salary ranges, age groups; these are often used as intervals too.
2) INTERQUARTILE RANGE: the difference Q3 - Q1 (the 75th minus the 25th percentile), which covers the middle 50% of the data around the central values and leaves out the outliers. The interquartile range is used where data users are more interested in the values towards the center and less interested in the extremes.
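A quick sketch, reusing the arranged series from the quartile example (Q1 = 111.5, Q3 = 123.5):

```python
data = [106, 109, 114, 116, 121, 122, 125, 129]

rng = max(data) - min(data)   # range: highest value minus lowest value
q1, q3 = 111.5, 123.5         # quartile values from the worked example
iqr = q3 - q1                 # interquartile range: spread of the middle 50%
print(rng, iqr)               # 23 12.0
```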
SUM OF DEVIATIONS FROM THE ARITHMETIC MEAN IS ALWAYS ZERO.
If you have a data set of 12 values with a mean of 4.83, the deviation of each value from the mean is calculated as (x - µ), where x is each data value and µ is the mean. The sum ∑(x - µ) is always zero (up to rounding of the mean).
3) MEAN ABSOLUTE DEVIATION: the deviation of each value from the MEAN is first calculated, the negative signs are dropped, and the absolute values are summed. In other words, we find how far each value in the data set lies from the mean, and total those distances. MAD is the average of the absolute deviations around the mean for a set of values. Absolute values are used, with signs ignored, so that the deviations do not cancel out and a meaningful analysis is possible. MAD is also useful in the field of forecasting, where it serves as a measure of error.
A MAD of 1.36, for instance, states that each value in the data set differs from the mean by 1.36 on average.
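Both facts can be demonstrated on a small illustrative data set (the 12-value set the text describes is not shown, so the MAD below differs from 1.36):

```python
from statistics import mean

def mad(values):
    """Mean Absolute Deviation: average distance of each value from the mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)

data = [2, 3, 5, 4, 5, 6, 8]
m = mean(data)
print(sum(x - m for x in data))   # signed deviations sum to (effectively) zero
print(mad(data))                  # absolute deviations do not cancel out
```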
4) VARIANCE: since MAD discards the signs and treats all deviations as positive, variance uses a different calculation. Variance is the sum of the squared deviations from the mean (negative and positive alike), divided by the number of observations.
Population Variance σ² = ∑(x - µ)² / N
Sample Variance s² = ∑(x - x̄)² / (n - 1)
Population variance is denoted by σ²
Sample variance is denoted by s²
For a data set of 12 values whose squared deviations from the mean sum to 33.668, the variance is 33.668/12 = 2.8056.
5) STANDARD DEVIATION: SD is the most popular measure of variability, used mostly in computing confidence intervals and in hypothesis testing. Because variance squares the deviations, it is expressed in the squared units of the data and is harder to interpret directly. SD, denoted by sigma (σ), is the square root of the variance. SD is expressed in the same units as the data, while variance is expressed in the squared units of the data.
In the above example, SD is the square root of 2.8056 = 1.675.
While MAD simply discards the negative signs, SD accounts for both negative and positive deviations through squaring, and then removes the squaring effect of the variance by taking the root. Hence SD is the most effective measure.
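Python's statistics module provides both the population and sample versions directly; a sketch on an illustrative data set:

```python
from statistics import pvariance, pstdev, variance, stdev

data = [2, 3, 5, 4, 5, 6, 8]
print(pvariance(data))   # population variance: sum of squared deviations / N
print(pstdev(data))      # population SD: square root of the variance
print(variance(data))    # sample variance: divides by n - 1 instead of N
print(stdev(data))       # sample SD, in the same units as the data
```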
CO-EFFICIENT OF VARIATION: this is a relative measure of dispersion that is free of all units of measurement. It can be considered an extension of the standard deviation. It is calculated by dividing the standard deviation by the mean and multiplying by 100; because the units cancel, it is a unit-free number, commonly read as a percentage. In the above example, the mean is 4.8 and the standard deviation is 1.68, hence:
Co-efficient of Variation = (σ / µ) × 100
(1.68/4.8) × 100 = 35
The main application of the co-efficient of variation is comparing two or more variables measured in different units. Ex: height in centimetres vs. height in inches.
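A sketch showing why the units cancel; the heights data here is invented purely for illustration:

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, a unit-free measure."""
    return pstdev(values) / mean(values) * 100

heights_cm = [150.0, 160.0, 170.0, 180.0]    # illustrative heights in cm
heights_in = [x / 2.54 for x in heights_cm]  # the same heights in inches
print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(heights_in))  # same value: the units cancel
```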
To be continued in Part — 2… Happy Learning!