Statistics for Data Science

Irfan Rahman
Beginner’s Guide for Data Science
11 min read · Jan 22, 2019

Before we start, let me tell you that this is my first blog post ever on the internet, so I am excited as well as quite nervous, obviously :). Over the last few months I have been doing a lot of digging and research on the internet, and I found various blogs and posts which helped me a lot and inspired me. I wanted to share my knowledge with those who, like me, want to start their career in Data Science.

The idea behind this blog, “Statistics for Data Science”, is to cover most of the concepts of statistics required by anyone who wants to start their career in Data Science.

This post covers the topics below-:

  • Data Types (Quantitative, Qualitative)
  • Statistics (Descriptive, Inferential)
  • Moment of Business Decision (Central Tendency, Dispersion, Skewness, Kurtosis)
  • Sampling Funnel (Population, Sample Frame, SRS, Sample)
  • Normal Distribution
  • Confidence Interval- Redirect to below link.
  • Hypothesis Testing (Null Hypothesis, Alternate Hypothesis)

Data Types

Data types play a vital role in statistics. To apply the correct statistical measurement, we should understand what type of data we are dealing with; otherwise, we may be led to wrong conclusions. If you have a good understanding of data types, you will also know which kind of visualization or graph fits better. For example, a bar graph needs categorical data on the x-axis, whereas a histogram needs continuous or discrete numeric data on the x-axis. So far I have only spoken about how visualization changes depending on the data type, but the same goes for statistical measurement. I hope this gives you some idea of why data types are so important.

While doing Exploratory Data Analysis (EDA, i.e. performing an initial investigation of the data), it’s important to know the data types. There are many statistical measurements you will compute during EDA, but you can’t apply every measurement to every data type, because each measurement is specific to certain data types.

Let’s have a look at the data types-:

Data Types

1. Quantitative-: Numeric values that arise when observations are counts or measurements. In simple terms, data which can be measured or counted.

a. Continuous-: Between any two values, infinitely many intermediate values are possible. Example-: height, weight.

b. Discrete-: Between any two values, only a finite number of intermediate values is possible, or none at all. Example-: number of cars etc.

2. Qualitative-: It arises when observations fall into separate, distinct categories, or you can say data which can’t be measured. Example-: Color of eyes (blue, black, green etc.), Gender (male, female etc.).

a. Nominal-: There is no natural order between the categories, so they only give names or labels to the categories. Example-: Gender.

b. Ordinal-: A natural order exists between the categories. Example-: Results, Rating.
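To make the difference between qualitative and quantitative data a bit more concrete, here is a minimal Python sketch. The eye colors and heights are made-up values for illustration only, and the bin width of 10 cm is an arbitrary assumption. It shows that categorical data is summarized by counting per category (what a bar graph plots), while continuous data is first grouped into intervals (what a histogram plots).

```python
from collections import Counter

# Hypothetical observations, made up for illustration.
eye_colors = ["blue", "black", "green", "black", "blue", "black"]  # qualitative, nominal
heights_cm = [151.2, 158.7, 162.4, 165.0, 171.9, 178.3]            # quantitative, continuous

# Categorical data: count observations per category (the bar heights of a bar graph).
bar_heights = Counter(eye_colors)

# Continuous data: group values into equal-width intervals (the bars of a histogram).
bin_width = 10  # assumed interval width in cm
histogram = Counter(int(h // bin_width) * bin_width for h in heights_cm)

print(dict(bar_heights))
print(dict(sorted(histogram.items())))
```

Note that averaging the eye colors would make no sense, while averaging the heights would; that is the kind of restriction the data type puts on statistical measurement.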

Statistics

As you all know, statistics is the branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation. That is the definition you already know; in this post the focus is on why statistics is important for Data Science. Statistics is one of the major pillars of Data Science and a very powerful tool when we want to gain insight into data. There are many graphs you can plot to get information about data, but if you really want insight into the data, you need statistical measurements.

Two major branches or broad categories-:

  • Descriptive Statistics-: It is used to summarize and graph the data that we choose. This process allows us to understand a specific set of observations. It’s pretty straightforward: you take a group you are interested in, record data about the group members, then use summary statistics and graphs to present the group’s properties. With descriptive statistics there is no uncertainty, because we are describing only the people and items that we actually measured. We are not trying to infer properties of a larger population.
  • Common tools used in descriptive analysis-:
      • Central Tendency (Mean, Median, Mode)
      • Dispersion (Variance, SD, Range)
      • Skewness
  • Inferential Statistics-: It takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population.
  • Standard analysis tools-:
      • Hypothesis Test
      • Confidence Interval
      • Regression Analysis
Major branches of Statistics

Moment of Business Decision

  1. Measurement of Central Tendency

It is a summary statistic that represents the center point or typical value of a data set. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution.

Mean/Average-: The mean is the sum of the values of all observations in a data set divided by the number of observations.

  • Advantage of mean-: It can be used for both continuous and discrete data.
  • Disadvantage of mean-: The mean can’t be calculated for categorical data, as the values can’t be summed. Also, the mean includes each and every value in the distribution, so it is influenced by outliers and skewed distributions.
  • When to use mean-: Symmetric distribution, Continuous data.
  • Bonus-: If the mean and median are close to each other, it doesn’t imply that there are no outliers.

Median-: The median is the middle value in the distribution when the values are arranged in ascending or descending order.

  • Advantage of median-: The median is less affected by outliers and skewed data than the mean, and is usually the preferred measure of central tendency when the distribution is not symmetrical, i.e. skewed, because the median is more robust to outliers.
  • Disadvantage of median-: The median can’t be identified for categorical nominal data, as it can’t be logically ordered.
  • When to use median-: Skewed distribution, Continuous data, Ordinal data.
  • Bonus-: When you have a skewed distribution, the median is a better measure of central tendency.
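As a quick illustration of why the median is called robust, here is a small sketch using Python’s built-in `statistics` module. The salary figures are invented for this example; the point is only to compare how much the mean and the median move when a single outlier is added.

```python
import statistics

# Hypothetical salaries (in thousands), made up for illustration.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]  # one extreme value is added

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

The mean jumps from 35 to roughly 95.83, while the median only moves from 35 to 36.5.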

Mode-: The mode is the most commonly occurring value in the distribution.

  • Advantage of mode-: The mode has an advantage over the median and the mean, as it can be found for both numerical and categorical data.
  • Disadvantage of mode-: There are some limitations to using the mode. In some distributions, the mode may not reflect the center of the distribution very well. When the distribution of retirement age is ordered from lowest to highest value, it is easy to see that the center of the distribution is 57 years, but the mode is lower, at 54 years: 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60. It is also possible for there to be more than one mode for the same distribution of data (bi-modal or multi-modal). The presence of more than one mode can limit the ability of the mode to describe the center or typical value of the distribution, because a single value to describe the center cannot be identified. In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e. if all values are different). In cases such as these, it may be better to consider using the median or mean, or to group the data into appropriate intervals and find the modal class.
  • When to use mode-: Categorical data, Count data, Ordinal data.

Bonus, which one is best: Mean, Mode or Median-: When you have a symmetrical, single-peaked distribution for continuous data, the mean, median, and mode are equal. In this case, analysts tend to use the mean because it includes all of the data in the calculation. However, if you have a skewed distribution, the median is often the best measure of central tendency. When you have ordinal data, the median or mode is usually the best choice. For categorical data, you have to use the mode.
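The three measures can be computed with Python’s built-in `statistics` module. The sketch below reuses the retirement-age values from the mode discussion above; the second, bi-modal list is a made-up example to show what happens when there is more than one mode.

```python
import statistics

# Retirement ages from the mode example above.
retirement_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

mean_age = statistics.mean(retirement_ages)      # sum of values / number of values
median_age = statistics.median(retirement_ages)  # middle value of the sorted data
mode_age = statistics.mode(retirement_ages)      # most frequent value

print(f"mean={mean_age:.2f}, median={median_age}, mode={mode_age}")

# A made-up bi-modal data set: multimode() returns every most-frequent value.
modes = statistics.multimode([1, 1, 2, 2, 3])
print(modes)
```

Here the median is 57 while the mode is 54, which matches the observation that the mode may sit away from the center of the distribution.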

2. Measurement of Dispersion-: Dispersion in statistics is a way of describing how spread out a set of data is. When the dispersion is large, the values in the set are widely scattered; when it is small, the items in the set are tightly clustered. In other words, how far away are the individual values from the measure of central tendency?

  • Variance-: It measures how far a data set is spread out. The technical definition is “the average of the squared differences from the mean”. Because the differences are squared, it is expressed in squared units and only gives you a general idea of the spread of your data.
  • Standard Deviation-: It also measures how far a data set is spread out. The technical definition is “the square root of the variance”. The standard deviation is more concrete because it is in the same units as the data and describes the typical distance of the values from the mean.
  • Range-: The difference between the largest and the smallest observation in the data. Represented by a capital “R”.
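A minimal sketch of the three dispersion measures, using Python’s built-in `statistics` module on a small made-up data set. It uses the population versions (`pvariance`, `pstdev`), which follow the “average of the squared differences” definition above; for a sample you would typically use `statistics.variance` and `statistics.stdev`, which divide by n - 1 instead of n.

```python
import statistics

# Made-up data set for illustration; its mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)  # average squared difference from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
data_range = max(data) - min(data)     # largest minus smallest observation

print(f"variance={variance}, standard deviation={std_dev}, range={data_range}")
```

For this data the variance is 4, the standard deviation is 2 and the range is 7.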

3. Measurement of Skewness-: A measurement of asymmetry in the distribution. If the data is concentrated to the left, with a long tail to the right, it is called positive/right skewness; if the data is concentrated to the right, with a long tail to the left, it is called negative/left skewness. It tells you whether the distribution leans to the left or to the right.

Measurement of Skewness
  • Normal/Symmetric Distribution-: Here most of the values are clustered in the middle of the range and the rest taper off symmetrically towards either extreme. In a symmetric, single-peaked distribution such as the normal distribution, mean = median = mode.
  • Positively/Right Skewed-: If the distribution is skewed to the right, most of the data falls on the left side of the distribution. Don’t get confused by the definition; yes, it seems opposite, but you only have to look at the tail. In a positively skewed distribution the tail stretches to the right, and that’s the reason it is called positively/right skewed. In right skewed data, typically mean > median > mode.
  • Negatively Skewed/Left Skewed-: It is the reverse of right skewed. As I said, you just have to look at the tail. Here the tail stretches to the left, thus it is called negatively/left skewed. In left skewed data, typically mean < median < mode.
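The sign of the skewness can be checked numerically. The sketch below is one common way to compute it (the moment-based coefficient: third central moment divided by the cubed population standard deviation); the helper name `skewness` and the data sets are my own, made up for illustration.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]  # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirrored: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3),
          "mean:", round(statistics.mean(data), 2),
          "median:", statistics.median(data))
```

The right-skewed data gives a positive value with the mean above the median, the mirrored data gives a negative value with the mean below the median, and the symmetric data gives zero.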

4. Measurement of Kurtosis-: A measure of the peakedness or degree of tailedness of the distribution. Remember, a positive excess kurtosis tells you that you have heavy tails (i.e. a lot of data in your tails) and a negative excess kurtosis tells you that you have light tails (i.e. little data in your tails).

Types of Kurtosis
  • Mesokurtic Curve-: These distributions are technically defined as having an excess kurtosis of zero (a kurtosis of 3), although the value doesn’t have to be exactly zero for a distribution to be classified as mesokurtic. The most common mesokurtic distributions are-:

a. The normal distribution.

b. Any distribution with a Gaussian (normal) shape and zero probability at other places on the real line.

  • Leptokurtic Curve-: This distribution has positive excess kurtosis. The tails are fatter than those of the normal distribution.
  • Platykurtic Curve-: This distribution has negative excess kurtosis. The tails are very thin compared to the normal distribution.
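To see the three cases in numbers, here is a small sketch that computes excess kurtosis as the fourth central moment divided by the squared variance, minus 3 (so that the normal distribution scores zero). The helper name `excess_kurtosis` and both data sets are made up for this illustration.

```python
def excess_kurtosis(values):
    """Fourth central moment / variance squared, minus 3 (normal distribution -> 0)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

# Evenly spread values: thin tails, so we expect a negative value (platykurtic).
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Mostly zeros with two extreme values: heavy tails, so a positive value (leptokurtic).
heavy_tailed = [0] * 18 + [-10, 10]

print(round(excess_kurtosis(flat), 3))
print(round(excess_kurtosis(heavy_tailed), 3))
```

A value close to zero would indicate a mesokurtic shape.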

Sampling Funnel

In a real scenario you will almost never get access to the total population to perform analysis. If you do, you are the luckiest human on earth, because you can play with the data as you want. Since it is usually not possible to get the entire population, the sampling funnel technique was introduced to deal with this.

Let’s have a look at the sampling funnel technique.

  • Population-: The population is the entire set of data in the universe that satisfies specific criteria.
  • Sampling Frame-: The source of information from which the sample is drawn. It sounds simple, but it is very important: the information shouldn’t be biased, and it matters how and from whom the information was collected.
  • Simple Random Sampling (SRS)/Blindfolded sampling-: A way of choosing a subset of a statistical population in which each member of the population has an equal probability of being chosen.

Advantage-: It is considered a fair way of selecting a sample because every member of the population has an equal chance of getting selected.

Disadvantage-: A sampling error can occur with simple random sampling if the sample does not end up accurately reflecting the population it is supposed to represent.

  • Sample-: It is a subset containing the characteristics of the larger population. The sample should speak for the population. Always remember that a sample should be random and unbiased.
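The last two steps of the funnel can be sketched in a few lines of Python. The sampling frame below is a hypothetical list of 100 member IDs, and the seed is fixed only so the example is repeatable; `random.sample` draws without replacement, so every member of the frame has the same chance of being picked.

```python
import random

# Hypothetical sampling frame: an ID for each of 100 members we can reach.
sampling_frame = [f"member_{i}" for i in range(1, 101)]

rng = random.Random(42)                     # fixed seed, only for repeatability
sample = rng.sample(sampling_frame, k=10)   # simple random sample of 10 members

print(sample)
```

If the frame itself is biased (for example, it lists only some kinds of members), random selection from it will not fix that.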

Normal Distribution

The normal distribution describes the behavior of many situations in the universe. It is used as a model for the population. It is the most commonly used distribution and appears frequently in finance, investing, science, and engineering. It is fully characterized by its mean and standard deviation; it is not skewed and has no excess kurtosis. This makes the distribution symmetric, and it is depicted as a bell-shaped curve when plotted. Every normal distribution has a skewness of zero and a kurtosis of 3. A normal distribution with a mean (average) of zero and a standard deviation of 1.0 is called the standard normal distribution.

In a normal distribution, approximately 68% of the data will fall within +/- one standard deviation (σ) of the mean, approximately 95% within +/- two standard deviations (σ) and approximately 99.7% within +/- three standard deviations (σ).
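These percentages can be checked with `statistics.NormalDist` from Python’s standard library (available from Python 3.8). The helper name `share_within` is my own; it simply takes the difference of the cumulative distribution function at +k and -k standard deviations.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within +/- k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")
```

The printed shares are about 0.6827, 0.9545 and 0.9973, which is where the 68%, 95% and 99.7% figures come from.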

A normal distribution has the following characteristics.

  • Values range from -infinity to +infinity.
  • The area under the curve is always 1.
  • The probability of any single exact value is always 0.
  • It applies to continuous data.
  • It appears as a bell shape.
  • It is defined by its mean and standard deviation.
  • Standard normal distribution-: a normal distribution can be converted to the standard normal distribution using (x - mu) / sigma, called the z score.
  • Exactly half of the values are to the left of the center and the other half to the right.
  • Mean, median and mode are equal.
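The z score conversion, (x - mu) / sigma, can be sketched as follows. The data values are made up, and the population standard deviation (`pstdev`) is used here as sigma; with a sample you might prefer `statistics.stdev`.

```python
import statistics

# Made-up data set for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(data)       # 5
sigma = statistics.pstdev(data)  # 2.0

# z score: how many standard deviations each value lies from the mean.
z_scores = [(x - mu) / sigma for x in data]

print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

After the conversion the values have a mean of 0 and a standard deviation of 1, which is what makes different data sets comparable on the same scale.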

Now we have reached the end of this post, but I wanted to share with you a quote by one of the greatest economists ever, Ronald Coase-:

If you torture the data long enough, it will confess to anything.

I hope my post helps you. Stay tuned for upcoming topics.
