Rajesh S. Brid
Sep 20, 2018 · 20 min read

To take the next step toward understanding Data Science, this second post in the series briefly introduces Statistics and its main categories.

Introduction to Statistics :

“Sta-tis-tics”: the only science wherein two recognized experts, using exactly the same set of data, may come to exactly opposite conclusions.


Statistics is a set of mathematical methods that we use to analyze data. It keeps us informed about what is happening in the world around us. Statistics is important because we live in an information age, and much of that information is derived mathematically with the help of statistics.

Unlike much of mathematics, statistics is not deterministic; it is probabilistic. You cannot arrive at a single certain conclusion from the experiments or studies you conduct. Instead, every conclusion is stated with a certain probability: at a chosen level of significance, the result of your experiment is expected to lie within a confidence interval.

To get there, you first collect the data your experiment requires, using a method suited to your question (often one already described in the literature). You then clean the data and select a suitable analysis for your study.

When used correctly, statistics reveal trends in what happened in the past and can be useful in predicting what may happen in the future.

Statistics make it possible for us to make fairly accurate predictions with small groups of data. It is not possible to predict individual events but statistics will give insight to the overall results. Statistics let us make estimates about the future without knowing all the possible results. Statistics deals with two areas: the past and the future. We use statistics to summarize past events so we can understand them. We then use this summary to make predictions about the future.

“Facts are stubborn, but statistics are more pliable.” — Mark Twain

Statistics plays a main role in the field of research. It helps us in the collection, analysis and presentation of data. Statistics is concerned with developing and studying different methods for collecting, analyzing and presenting the empirical data.

The study of statistics can be categorized into two main branches. These branches are descriptive statistics and inferential statistics. Both of them give us different insights about the data. Either one alone does not give us the complete picture of our data, but using both of them together gives us a powerful tool for description and prediction.

Some of the main terms used regularly in Statistics:

Population: It is the group from which the data is to be collected. Our data is the information collected from the population. The population is always defined first, before starting the data collection process for any statistical study. A population need not be people; it could be a batch of batteries, measurements of rainfall in an area or a group of people.

Sample: It is the part of population which is selected randomly for the study. The sample should be selected such that it represents all the characteristics of the population. The process of selecting the subset from the population is called sampling and the subset selected is called the sample.

When working with a large data set, it can be useful to represent the entire data set with a single value that describes the “middle” or “average” value of the entire set. In statistics, that single value is called the central tendency and mean, median and mode are all ways to describe it. To find the mean, add up the values in the data set and then divide by the number of values that you added. To find the median, list the values of the data set in numerical order and identify which value appears in the middle of the list. To find the mode, identify which value in the data set occurs most often. Range, which is the difference between the largest and smallest value in the data set, describes how well the central tendency represents the data. If the range is large, the central tendency is not as representative of the data as it would be if the range was small.
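The four calculations described above can be sketched with Python's standard `statistics` module. The `marks` list is a made-up data set used only for illustration; it does not come from the article.

```python
from statistics import mean, median, multimode

# Hypothetical data set: marks of nine students (illustrative values only).
marks = [55, 60, 60, 65, 70, 70, 70, 85, 95]

central_mean = mean(marks)        # sum of the values divided by their count
central_median = median(marks)    # middle value of the sorted list
central_modes = multimode(marks)  # most frequent value(s); a list, because ties are possible
value_range = max(marks) - min(marks)  # largest value minus smallest value

print(central_mean, central_median, central_modes, value_range)
```

`multimode` is used rather than `mode` so that a data set with several equally frequent values is reported in full instead of as a single value.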

Probability: It is the measure of the likelihood that an event will occur. A probability distribution is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence. Probability is quantified as a number between 0 and 1, where, loosely speaking, 0 indicates impossibility and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur. A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes (“heads” and “tails”) are both equally probable; the probability of “heads” equals the probability of “tails”; and since no other outcomes are possible, the probability of each of “heads” and “tails” is 1/2 (which could also be written as 0.5 or 50%).
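As a small sketch of a probability distribution written as a table, the snippet below enumerates the outcomes of a fair coin. Extending the example to two tosses and counting heads is an assumption made here for illustration; the article itself only discusses a single toss.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every equally likely outcome of two tosses of a fair coin:
# ('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give 0, 1 or 2 heads.
heads_counts = Counter(outcome.count("H") for outcome in outcomes)

# The probability distribution: number of heads -> probability.
distribution = {
    heads: Fraction(count, len(outcomes))
    for heads, count in sorted(heads_counts.items())
}

for heads, probability in distribution.items():
    print(heads, probability)
```

Using `Fraction` keeps the probabilities exact, so it is easy to confirm that they add up to 1.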

Statistic: A single measure of some attribute of a sample. For example, the mean, median or mode of a sample of Data Scientists in Bangalore.

Population Statistic: The statistic of the entire population in context, more commonly called a parameter. For example, the population mean for the salary of the entire population of Data Scientists across India.

Sample Statistic: The statistic of a group taken from a population. For example, the mean of salaries of all Data Scientists in Bangalore.

Standard Deviation: It is the amount of variation in the population data. It is denoted by σ.

Standard Error: It is the amount of variation in a sample statistic (such as the sample mean) from one sample to another. It is related to the Standard Deviation as σ/√n, where n is the sample size.
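The σ/√n relationship can be sketched in a few lines. The `population` values and the sample size `n` below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import pstdev

# Hypothetical population of eight measurements (illustrative values only).
population = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = pstdev(population)        # population standard deviation, σ
n = 4                             # hypothetical sample size
standard_error = sigma / sqrt(n)  # σ/√n

print(sigma, standard_error)
```

Quadrupling the sample size halves the standard error, which is why larger samples pin down the population mean more tightly.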

Descriptive Statistics

Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.

Descriptive statistics are very important because if we simply presented our raw data it would be hard to visualize what the data was showing, especially if there was a lot of it. Descriptive statistics therefore enable us to present the data in a more meaningful way, which allows simpler interpretation of the data. For example, if we had the results of 100 pieces of students’ coursework, we may be interested in the overall performance of those students. We would also be interested in the distribution or spread of the marks. Descriptive statistics allow us to do this. Typically, there are two general types of statistic that are used to describe data:

  • Measures of central tendency: these are ways of describing the central position of a frequency distribution for a group of data. In this case, the frequency distribution is simply the distribution and pattern of marks scored by the 100 students from the lowest to the highest. We can describe this central position using a number of statistics, including the mode, median, and mean.
  • Measures of spread: these are ways of summarizing a group of data by describing how spread out the scores are. For example, the mean score of our 100 students may be 65 out of 100. However, not all students will have scored 65 marks. Rather, their scores will be spread out. Some will be lower and others higher. Measures of spread help us to summarize how spread out these scores are. To describe this spread, a number of statistics are available to us, including the range, quartiles, absolute deviation, variance and standard deviation.
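The measures of spread listed above can be sketched as follows. The `scores` list is a hypothetical set of eight marks with a mean of 65, chosen to echo the example; it is not data from the article. Note that `quantiles` uses Python's default "exclusive" method, and other tools may compute quartiles slightly differently.

```python
from statistics import mean, pstdev, pvariance, quantiles

# Hypothetical marks out of 100 for eight students (illustrative values only).
scores = [45, 55, 60, 65, 65, 70, 75, 85]

score_range = max(scores) - min(scores)  # largest minus smallest score
q1, q2, q3 = quantiles(scores, n=4)      # the three quartile cut points

average = mean(scores)
# Mean absolute deviation: average distance of each score from the mean.
mean_abs_dev = mean(abs(score - average) for score in scores)

variance = pvariance(scores)  # average squared distance from the mean
std_dev = pstdev(scores)      # square root of the variance

print(score_range, (q1, q2, q3), mean_abs_dev, variance, std_dev)
```

`pvariance` and `pstdev` treat the eight scores as the whole group being described; `variance` and `stdev` would be the choice if the scores were a sample used to estimate a larger population.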

When we use descriptive statistics it is useful to summarize our group of data using a combination of tabulated description (i.e., tables), graphical description (i.e., graphs and charts) and statistical commentary (i.e., a discussion of the results).

Descriptive statistics give information that describes the data in some manner. For example, suppose a pet shop sells cats, dogs, birds and fish. If 100 pets are sold, and 40 out of the 100 were dogs, then one description of the data on the pets sold would be that 40% were dogs.

This same pet shop may conduct a study on the number of fish sold each day for one month and determine that an average of 10 fish were sold each day. The average is an example of descriptive statistics.

Some other measurements in descriptive statistics answer questions such as ‘How widely dispersed is this data?’, ‘Are there a lot of different values?’ or ‘Are many of the values the same?’, ‘What value is in the middle of this data?’, ‘Where does a particular data value stand with respect to the other values in the data set?’

A graphical representation of data is another method of descriptive statistics. Examples of this visual representation are histograms, bar graphs and pie graphs, to name a few. Using these methods, the data is described by compiling it into a graph, table or other visual representation.

This provides a quick method to make comparisons between different data sets and to spot the smallest and largest values and trends or changes over a period of time. If the pet shop owner wanted to know what type of pet was purchased most in the summer, a graph might be a good medium to compare the number of each type of pet sold and the months of the year.

Inferential Statistics

Now, suppose you need to collect data on a very large population. For example, suppose you want to know the average height of all the men in a city with several million residents. It isn’t very practical to try and get the height of each man.

This is where inferential statistics comes into play. Inferential statistics makes inferences about populations using data drawn from the population. Instead of using the entire population to gather the data, the statistician will collect a sample or samples from the millions of residents and make inferences about the entire population using the sample.

The sample is a set of data taken from the population to represent the population. Probability distributions, hypothesis testing, correlation testing and regression analysis all fall under the category of inferential statistics.

In simple language, Inferential Statistics is used to draw inferences beyond the immediate data available.

With the help of inferential statistics, we can answer the following questions:

  • Making inferences about the population from the sample.
  • Concluding whether a sample is significantly different from the population. For example, let’s say you collected the salary details of Data Science professionals in Bangalore. And you observed that the average salary of Bangalore’s data scientists is more than the average salary across India. Now, we can conclude if the difference is statistically significant.
  • If adding or removing a feature from a model will really help to improve the model.
  • If one model is significantly better than the other?
  • Hypothesis testing in general.

Difference between Descriptive and Inferential Statistics :

  1. Descriptive Statistics: describes the data; it focuses on the collection, presentation and characterization of a sample.
     Inferential Statistics: helps to predict and estimate the possible characteristics of the population from the sample data drawn from the population.

  2. Descriptive Statistics: only describes certain characteristics of the data.
     Inferential Statistics: analyzes the statistical data and observations more deeply.

  3. Descriptive Statistics: deals with the central tendency and spread of the frequency distribution.
     Inferential Statistics: deals with hypothesis tests and confidence levels.

  4. Descriptive Statistics: can be measured either numerically (mean, median, mode) or graphically.
     Inferential Statistics: cannot always be expressed in exact numbers.

  5. Descriptive Statistics: helps to produce error-free results, as it deals only with the small group of data actually collected.
     Inferential Statistics: may not produce error-free results, as it draws conclusions about the whole population from a sample.

  6. Descriptive Statistics: conclusions are limited to the given data, i.e. we cannot make conclusions beyond the given data.
     Inferential Statistics: conclusions are not limited to the given data, i.e. educated predictions and guesses can be made about the parameters of the given population.

  7. Descriptive Statistics, examples:
       • Information in newspapers, magazines, company reports, etc.
       • Frequency of the variables.
       • Population of the particular country.
     Inferential Statistics, examples:
       • Grades or percentile of the scores.
       • Average score in cricket.
       • Prediction about cancer by the doctor.

Sampling Distribution :

A sampling distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population. In other words, it is the distribution of the values that the statistic could take across all possible samples of a given size. For large samples, the sampling distribution of the mean behaves much like a normal curve and has some interesting properties like :

  • The shape of the Sampling Distribution does not reveal anything about the shape of the population.
  • Sampling Distribution helps to estimate the population statistic.

Central Limit Theorem

It states that when plotting a sampling distribution of means, the mean of the sample means will be equal to the population mean, and the sampling distribution will approach a normal distribution with standard deviation equal to σ/√n (that is, variance σ²/n), where σ is the standard deviation of the population and n is the sample size.

  1. Central Limit Theorem holds true irrespective of the type of distribution of the population, provided the population has a finite variance.
  2. Now, we have a way to estimate the population mean by just making repeated observations of samples of a fixed size.
  3. Greater the sample size, lower the standard error and greater the accuracy in determining the population mean from the sample mean.
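
The three points above can be checked with a small simulation. This is only a sketch under stated assumptions: the exponential population, the sample size of 50, the 2,000 repetitions and the fixed random seed are all choices made here for illustration, not values from the article.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so that repeated runs give the same numbers

# Hypothetical, clearly non-normal population: exponential with mean 1.
population = [random.expovariate(1.0) for _ in range(50_000)]
mu = fmean(population)      # population mean
sigma = pstdev(population)  # population standard deviation

n = 50               # size of each sample
num_samples = 2_000  # how many samples are drawn

# Draw many samples (with replacement) and record each sample's mean.
sample_means = [
    fmean(random.choices(population, k=n)) for _ in range(num_samples)
]

mean_of_means = fmean(sample_means)  # should be close to mu
observed_se = pstdev(sample_means)   # spread of the sample means
expected_se = sigma / sqrt(n)        # what the theorem predicts

print(mu, mean_of_means)
print(expected_se, observed_se)
```

Plotting a histogram of `sample_means` would show the bell shape even though the population itself is heavily skewed.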

Confidence Interval

The confidence interval is a type of interval estimate from the sampling distribution which gives a range of values in which the population statistic may lie. Take an example:

We know that 95% of the values of a normal distribution lie within 2 (1.96 to be more accurate) standard deviations of the mean. So, for the above curve, the blue shaded portion represents the confidence interval for a sample mean of 0.

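Formally, Confidence Interval is defined as,

C.I. = x̄ ± z·σ/√n

where

x̄ = the sample mean

z = the critical value for the chosen confidence level

n = the sample size

σ = the population standard deviation

For a confidence level of 0.95 (alpha = 0.05), i.e. a 95% confidence interval, z = 1.96.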

Margin of Error: It is given as (z·σ)/√n and is the sampling error allowed for by the surveyor or the person who collected the samples. That means, if a sample mean lies within the margin of error range, its actual value may well be equal to the population mean, with the difference occurring by chance. Anything outside the margin of error is considered statistically significant.

It is easy to infer that the error can be on both the positive and the negative side. The whole margin of error on both sides of the sample statistic constitutes the Confidence Interval. Numerically, the width of the C.I. is twice the Margin of Error.

The below image will help you better visualize Margin of Error and Confidence Interval.

The shaded portion on horizontal axis represents the Confidence Interval and half of it is Margin of Error which can be in either direction of x (bar).
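
A minimal sketch of the calculation, assuming the population standard deviation is known. The sample mean, σ and n below are hypothetical numbers picked for illustration; `NormalDist` from Python's standard library supplies the z value.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs (illustrative values only).
sample_mean = 50.0  # x̄
sigma = 10.0        # population standard deviation, assumed known
n = 100             # sample size
confidence = 0.95   # confidence level

# Two-sided critical value: leaves (1 - confidence) / 2 in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin_of_error = z * sigma / sqrt(n)
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(round(z, 2), round(margin_of_error, 2))
print(round(ci_lower, 2), round(ci_upper, 2))
```

The interval is symmetric around the sample mean, so its width is exactly twice the margin of error.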


Hypothesis Testing

Consider the following example :

Class 9th has a mean score of 40 marks out of 100. The principal of the school decided that extra classes are necessary in order to improve the performance of the class. The class scored an average of 45 marks out of 100 after taking extra classes. Can we be sure whether the increase in marks is a result of extra classes or is it just random?

Hypothesis testing lets us identify that. It lets a sample statistic be checked against a population statistic, or against the statistic of another sample, to study the effect of an intervention; extra classes are the intervention in the above example.

Hypothesis testing is defined in two terms — Null Hypothesis and Alternate Hypothesis.

The Null Hypothesis states that the sample statistic is equal to the population statistic. For example, the Null Hypothesis for the above example would be that the average marks after the extra classes are the same as before the classes.

Alternate Hypothesis for this example would be that the marks after extra class are significantly different from that before the class.

Hypothesis Testing is done at different levels of confidence and makes use of the z-score to calculate the probability. So for a 95% confidence level, a z-score beyond the threshold for 95% (|z| > 1.96 in a two-sided test) would reject the null hypothesis.
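
As a rough sketch, the class example can be run as a one-sample, two-sided z-test. The means of 40 and 45 come from the example; the population standard deviation of 10 marks and the class size of 30 are assumptions made here, because the article does not give them.

```python
from math import sqrt
from statistics import NormalDist

mean_before = 40.0  # from the example: mean score before extra classes
mean_after = 45.0   # from the example: mean score after extra classes
sigma = 10.0        # ASSUMED population standard deviation (not in the article)
n = 30              # ASSUMED number of students (not in the article)
alpha = 0.05        # significance level for a 95% confidence level

# z-score: how many standard errors the new mean lies from the old one.
z = (mean_after - mean_before) / (sigma / sqrt(n))

# Two-sided p-value: probability of a z-score at least this extreme
# if the null hypothesis (no change in the mean) were true.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_null = p_value < alpha
print(round(z, 3), round(p_value, 4), reject_null)
```

With different assumed values of `sigma` or `n` the conclusion can change, which is the point of the test: a 5-mark gain only counts as significant relative to how much scores vary by chance.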

There are two types of errors that are generally encountered while conducting Hypothesis Testing.

Type I error: Look at the following scenario — A male human tested positive for being pregnant. Is it even possible? This surely looks like a case of False Positive. More formally, it is defined as the incorrect rejection of a True Null Hypothesis. The Null Hypothesis, in this case, would be — Male Human is not pregnant.

Type II error: Look at another scenario where our Null Hypothesis is: a male human is pregnant, and the test supports the Null Hypothesis. This looks like a case of False Negative. More formally, it is defined as the failure to reject (loosely, the acceptance of) a false Null Hypothesis.

T-test

A t-test is an analysis framework used to determine the difference between two sample means from two normally distributed populations with unknown variances. A t-test is an analysis of two population means through the use of statistical examination; analysts commonly use a t-test with two samples with small sample sizes, testing the difference between the samples when they do not know the variances of two normal distributions.

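T-tests use the Sample Standard Deviation. The Sample Standard Deviation is given as:

s = √( Σ (xᵢ − x̄)² / (n − 1) )

where xᵢ are the sample values, x̄ is the sample mean, n is the sample size, and dividing by n − 1 instead of n is Bessel’s correction for estimating the population parameter.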

T-values are dependent on Degree of Freedom of a sample.

The Degree of Freedom: It is the number of values in a calculation that are free to vary. For a single sample of size n, it is n − 1.

A t-test looks at the t-statistic, the t-distribution and degrees of freedom to determine the probability of difference between populations; the test statistic in the test is the t-statistic. To conduct a test with three or more variables, one must use an analysis of variance.

BREAKING DOWN ‘T-Test’

A form of hypothesis testing, the t-test is just one of many tests used for this purpose. Statisticians must use tests other than the t-test to examine more variables and tests with larger sample sizes. For a large sample size, statisticians use a z-test. Other testing options include the chi-square test and the f-test.

Example of a T-Test

For example, consider that an analyst wants to study the amount that people in Maharashtra and Tamil Nadu spend on clothing per month. It would not be practical to record the spending habits of every individual or family in both states. Thus, a sample of spending habits is taken from a selected group of individuals from each state. The group may be of any small to moderate size — for this example, assume that the sample group is 200 individuals.

The average amount for Maharashtra’s people comes out to Rs.500, and the average amount for Tamil Nadu is Rs.1,000. The t-test questions whether the difference between the groups represents a true difference between people in Maharashtra and people in Tamil Nadu or if it is likely a meaningless statistical difference. Individual spending varies from person to person, so the two sample averages alone do not settle the question: the t-test weighs the difference between the means against the variability within each sample and the sample sizes. With a gap as large as Rs.500 and samples of this size, the difference would indicate a significant difference between the populations of the two states unless spending within each state varied enormously.
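
The comparison can be sketched with Welch's t-statistic, the unequal-variances form of the two-sample t-test. The two lists below are hypothetical spending figures, built so that their means match the Rs.500 and Rs.1,000 of the example; six values per state are used only to keep the arithmetic visible.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical monthly clothing spend in Rs. (illustrative values only).
maharashtra = [450, 520, 480, 510, 540, 500]
tamil_nadu = [900, 1100, 950, 1050, 1000, 1000]

n1, n2 = len(maharashtra), len(tamil_nadu)
mean1, mean2 = mean(maharashtra), mean(tamil_nadu)

# Sample variances (statistics.variance divides by n - 1, Bessel's correction).
var1, var2 = variance(maharashtra), variance(tamil_nadu)

# Welch's t-statistic: difference in means over its estimated standard error.
standard_error = sqrt(var1 / n1 + var2 / n2)
t_stat = (mean1 - mean2) / standard_error

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (var1 / n1 + var2 / n2) ** 2 / (
    (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
)

print(round(t_stat, 3), round(df, 3))
```

The resulting t-statistic would then be compared with a t-distribution table (or a statistics package) at the computed degrees of freedom to obtain the p-value.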

What is a Chi Square Test?

There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:

· A chi-square goodness of fit test determines if a sample data matches a population.

· A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests to see whether distributions of categorical variables differ from each another.

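· A very small chi square test statistic means that your observed data fits your expected data extremely well. In a goodness of fit test, that supports the claimed distribution; in a test for independence, it means there is no evidence of a relationship between the variables.

· A very large chi square test statistic means that the data does not fit the expected values very well. In a goodness of fit test, that contradicts the claimed distribution; in a test for independence, it suggests that the variables are related.

The chi-square goodness of fit test is used when we have one single categorical variable from the population.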

Let us understand this with help of an example. Suppose a company that manufactures chocolates, states that they manufacture 30% dairy milk, 60% temptation and 10% kit-kat. Now suppose a random sample of 100 chocolates has 50 dairy milk, 45 temptation and 5 kitkats. Does this support the claim made by the company?

Let us state our Hypothesis first.

Null Hypothesis : The claims are True

Alternate Hypothesis : The claims are False.

Chi-Square Test is given by:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where,

Oᵢ = sample or observed values

Eᵢ = population (expected) values

The summation is taken over all the levels of a categorical variable.

Eᵢ = n × pᵢ, i.e. the expected value of a level (i) is equal to the product of the sample size (n) and the percentage of it in the population (pᵢ).

Let us now calculate the Expected values of all the levels.

E (dairy milk)= 100 * 30% = 30

E (temptation) = 100 * 60% =60

E (kitkat) = 100 * 10% = 10

Calculating chi-square = [(50−30)²/30 + (45−60)²/60 + (5−10)²/10] = 19.58

Now, checking for p (chi-square > 19.58) with 2 degrees of freedom (three levels minus one) using a chi-square calculator, we get p ≈ 0.0001. This is significantly lower than the alpha (0.05).

So we reject the Null Hypothesis.
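
The chocolate calculation can be reproduced with a short sketch. The counts and claimed shares are the ones from the example. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the chi-square tail probability has the closed form exp(−x/2); for any other number of levels a chi-square table or a statistics library would be needed instead.

```python
from math import exp

# Observed counts in the sample of 100 chocolates (from the example).
observed = {"dairy milk": 50, "temptation": 45, "kitkat": 5}
# Shares claimed by the company (from the example).
claimed_share = {"dairy milk": 0.30, "temptation": 0.60, "kitkat": 0.10}

n = sum(observed.values())

# Expected count for each level: sample size times claimed share.
expected = {level: n * share for level, share in claimed_share.items()}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[level] - expected[level]) ** 2 / expected[level]
    for level in observed
)

dof = len(observed) - 1  # three levels give 2 degrees of freedom

# Closed-form tail probability, valid only because dof == 2.
p_value = exp(-chi_square / 2)

alpha = 0.05
reject_null = p_value < alpha
print(round(chi_square, 2), p_value, reject_null)
```

The p-value comes out well below 0.05, in line with the decision to reject the Null Hypothesis.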

Coefficient of Determination (R-Square)

It is defined as the ratio of the amount of variance explained by the regression model to the total variation in the data. It represents the strength of correlation between two variables.

R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable. In investing, R-squared is generally considered the percentage of a fund or security’s movements that can be explained by movements in a benchmark index.

For example, an R-squared for a fixed-income security versus a bond index identifies the security’s proportion of price movement that is predictable based on a price movement of the index. The same can be applied to a stock versus the S&P 500 index, or any other relevant index.

R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%. An R-squared of 100% means all movements of a security (dependent variable) are completely explained by movements in the index (independent variable). A high R-squared, between 85% and 100%, indicates the stock or fund’s performance moves relatively in line with the index. A fund with a low R-squared, at 70% or less, indicates the security does not generally follow the movements of the index. A higher R-squared value will indicate a more useful beta figure. For example, if a stock or fund has an R-squared value of close to 100%, but has a beta below 1, it is most likely offering higher risk-adjusted returns.

The actual R-squared equation is calculated as:

R-Squared = 1 − (Unexplained Variation / Total Variation) = Explained Variation / Total Variation

Correlation Coefficient

This is another useful statistic which is used to determine the correlation between two variables. For a simple linear regression with one independent variable, its magnitude is the square root of the coefficient of Determination and its sign follows the direction of the relationship. It ranges from -1 to 1, where 0 represents no linear correlation, 1 represents a perfect positive correlation and -1 represents a perfect negative correlation.
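
The link between the two statistics can be sketched for a simple least-squares line with one independent variable. The `x` and `y` values are hypothetical; the snippet computes R-squared from the unexplained and total variation, and the correlation coefficient separately, so the two can be compared.

```python
from math import sqrt
from statistics import mean

# Hypothetical paired observations (illustrative values only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)

# Sums of squares and cross-products around the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # total variation in y

# Least-squares line y = intercept + slope * x.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Unexplained variation: squared residuals around the fitted line.
unexplained = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - unexplained / s_yy  # coefficient of determination
r = s_xy / sqrt(s_xx * s_yy)        # Pearson correlation coefficient

print(round(r_squared, 4), round(r, 4))
```

Here `r` squared equals `r_squared`, and `r` is positive because the fitted slope is positive; with more than one independent variable this simple square-root relationship no longer applies directly.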

The Real Life Applications of Statistics :

“Most people use statistics like a drunk man uses a lamp-post; more for support than illumination” (commonly attributed to Andrew Lang)

Statistics are all around us. Every piece of knowledge we claim to know is backed up by statistics at some point. Statistics is the practice of collecting and analysing quantities of data, and can be used across many disciplines other than maths. Every time a survey is conducted, statistics will be used to organise the data collection, analyse the data, and interpret it in order to draw a conclusion. This could be a scientific study, a survey within the workplace, or any other situation in which data will be collected. Statistical skills and knowledge are very important in our daily lives, so it is important to make sure children are confident on this topic. Statistics worksheets are an excellent resource to help students master this important topic.

Knowing about statistics enables us to think critically about the world around us. A friend might claim that women cause 80% of car accidents, but knowledge of statistics will tell you not to trust that fact until you know more about the sample size, whether there was a bias in the sample, and where the data came from. Those who know about statistics know that correlation does not necessarily equal causality, and that tight analysis is needed to establish whether there is a causal link between two or more concepts.

Some of the main real life application of Statistics are :

1. Weather Forecasts

Weather forecasts use statistical models for predicting. The weather computer models are built using statistics that compare prior weather conditions with current weather to predict future weather.

2. Emergency Preparedness

What happens if the forecast indicates that a hurricane is imminent or that tsunamis or cyclones are likely to occur? Emergency management agencies move into high gear to be ready to rescue people. Emergency teams rely on statistics to tell them when danger may occur.

3. Predicting Disease

Lots of times on the news reports, statistics about a disease are reported. If the reporter simply reports the number of people who either have the disease or who have died from it, it’s an interesting fact but it might not mean much to your life. But when statistics become involved, you have a better idea of how that disease may affect you.

For example, studies have shown that 85 to 95% of lung cancers are smoking related. The statistic should tell you that almost all lung cancers are related to smoking and that if you want to have a good chance of avoiding lung cancer, you shouldn’t smoke.

4. Medical Studies

Scientists must show a statistically valid rate of effectiveness before any drug can be prescribed. Statistics are behind every medical study you hear about.

5. Genetics

Many people are afflicted with diseases that come from their genetic make-up and these diseases can potentially be passed on to their children. Statistics are critical in determining the chances of a new baby being affected by the disease.

6. Political Campaigns

Whenever there’s an election, the news organizations consult their models when they try to predict who the winner is. Candidates consult voter polls to determine where and how they campaign. Statistics play a part in who your elected government officials will be.

7. Insurance

You know that in order to drive your car you are required by law to have car insurance. If you have a mortgage on your house, you must have it insured as well. The rate that an insurance company charges you is based upon statistics from all drivers or home-owners in your area.

8. Consumer Goods

Wal-Mart, a worldwide leading retailer, keeps track of everything it sells and uses statistics to calculate what to ship to each store and when. From analyzing its vast store of information, for example, Wal-Mart found that people buy strawberry Pop Tarts when a hurricane is predicted in Florida! So it ships this product to Florida stores based upon the weather forecast.

9. Quality Testing

Companies make thousands of products every day and each company must make sure that a good quality item is sold. But a company can’t test each and every item that they ship to you, the consumer. So the company uses statistics to test just a few, called a sample, of what they make. If the sample passes quality tests, then the company assumes that all the items made in the group, called a batch, are good.

10. Stock Market

Stock analysts use statistical computer models to forecast what is happening in the economy.

Summary :

We have now explored the two main categories of Statistics, viz. Descriptive and Inferential Statistics. We have also seen the major theory along with practical implementations of various Inferential Statistics concepts.

It is rightfully said, “Statistics is a science, not a branch of mathematics, but uses mathematical models as essential tools.”

GreyAtom

GreyAtom is committed to building an educational ecosystem for learners to upskill & help them make a career in data science.
