Elementary Statistical Terms for Data Science Interviews

Random Nerd
6 min read · Apr 21, 2018


Mean, Median & Mode:

There are 3 measures of Central Tendency

  • Mean (average): the sum of all the values divided by the number of values. Least robust to outliers.
  • Median (‘middle’ value): the value that sits in the middle when all of the values are arranged in ascending order. If there is an even number of values, there is no single middle value, so we take the arithmetic average of the two middle values. Its robustness is between that of the mean and the mode.
  • Mode (most frequent value): the value that appears the highest number of times. Most robust to outliers.

Suppose we are given the list of values 1, 3, 3, 5, 7, 10 and want to find the mean, median, and mode. Mean = (1+3+3+5+7+10) / 6 = 4.83; median = the average of the two middle values, since there is an even number of values, so (3+5) / 2 = 4; mode = the most frequent value = 3.

If we take the list of values from the previous example but add an additional value of 100, how do the mean, median, and mode change? Mean = (1+3+3+5+7+10+100) / 7 = 18.43, median = the middle value = 5, and mode = the most frequent value = 3. Notice that the single outlier pulls the mean up sharply, nudges the median only slightly, and leaves the mode untouched, which matches the ordering of robustness described above.
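
The same numbers can be checked with a quick sketch using Python's built-in statistics module; the lists below are the ones from the example:

```python
# Mean, median, and mode for the two example lists.
from statistics import mean, median, mode

values = [1, 3, 3, 5, 7, 10]
print(mean(values))    # 4.833... (approximately 4.83)
print(median(values))  # 4.0 (average of the two middle values, 3 and 5)
print(mode(values))    # 3

values_with_outlier = values + [100]
print(mean(values_with_outlier))    # ~18.43 -- pulled up by the outlier
print(median(values_with_outlier))  # 5      -- barely moves
print(mode(values_with_outlier))    # 3      -- unchanged
```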

Standard Deviation:

Standard deviation (σ) measures how much the values in a data set differ from the mean. In other words, standard deviation measures dispersion or variability in a set of values. A data set with mostly similar values has a small standard deviation, while a data set with very different values has a large standard deviation. The standard deviation we compute also depends on sample size (the number of values or participants): with small samples, random chance has a bigger impact, so the estimated standard deviation is unstable and can differ noticeably from the true value. Studies with more values give more reliable estimates because chance plays less of a role.
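
A minimal sketch of the idea, using Python's statistics module and two made-up lists, one tightly clustered and one spread out:

```python
# Sample standard deviation for two data sets with the same mean (25):
# larger spread -> larger standard deviation.
from statistics import stdev, pstdev

similar_values   = [24, 25, 25, 26, 25]
different_values = [5, 40, 12, 55, 13]

print(stdev(similar_values))     # ~0.71 (small dispersion)
print(stdev(different_values))   # ~21.4 (large dispersion)

# stdev() uses the sample formula (divides by n - 1);
# pstdev() is the population version (divides by n).
print(pstdev(similar_values))    # ~0.63
```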

Classification on 2 by 2 Tables (TP, FP, TN & FN):

Situations involving TP, FP, TN, and FN will usually come with a two-by-two table. Sometimes we are given the actual table; other times all of the data is given in sentence form and we have to build the table ourselves. The top-left box in a two-by-two table may not always represent TP (True Positive), as the order of the columns and rows is sometimes jumbled in presentation. The suggested order to work through is TP, TN, FP, and FN. Let's assume a medical scan scenario where a computer-vision model classifies each person as diseased or healthy (a small code sketch follows the list below):

  • True Positive (TP) : A diseased person who is correctly identified as having a disease by the test.
  • False Positive (FP) : A healthy person who is incorrectly identified as having the disease by the test.
  • True Negative (TN) : A healthy person who is correctly identified as healthy by the test.
  • False Negative (FN) : A diseased person who is incorrectly identified as healthy by the test.
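
Here is a minimal sketch of how the four counts might be tallied from paired lists of actual status and test results; the labels and data are made up purely for illustration:

```python
# Build the 2x2 counts (TP, FP, TN, FN) from paired lists of
# actual disease status and test predictions.
actual    = ["sick", "sick", "healthy", "healthy", "sick", "healthy", "healthy", "sick"]
predicted = ["sick", "healthy", "healthy", "sick", "sick", "healthy", "healthy", "sick"]

tp = sum(a == "sick"    and p == "sick"    for a, p in zip(actual, predicted))
fn = sum(a == "sick"    and p == "healthy" for a, p in zip(actual, predicted))
fp = sum(a == "healthy" and p == "sick"    for a, p in zip(actual, predicted))
tn = sum(a == "healthy" and p == "healthy" for a, p in zip(actual, predicted))

print(f"TP={tp}  FP={fp}")   # TP=3  FP=1
print(f"FN={fn}  TN={tn}")   # FN=1  TN=3
```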

Sensitivity and Specificity:

We use sensitivity and specificity to decide whether or not to use a certain test, or to determine in which situations a certain test works best. It is important to note that sensitivity and specificity are fixed for a given test as long as we don't change the cutoff point. Therefore, sensitivity and specificity are not affected by changing prevalence. Both are given as a percentage ranging from 0% to 100%.

Sensitivity is the percentage of patients with the disease that receive a positive result or the percentage chance that the test will correctly identify a person who actually has the disease: Sensitivity = TP / (TP + FN) or Sensitivity = TP / Diseased.

Specificity is the percentage of patients without the disease that receive a negative result: Specificity = TN / (TN+FP) or Specificity = TN / Not Diseased.
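A small sketch of both formulas in Python; the cell counts below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# Sensitivity and specificity from the four cells of a 2x2 table.
def sensitivity(tp, fn):
    """TP / (TP + FN): share of diseased people the test flags as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): share of healthy people the test flags as negative."""
    return tn / (tn + fp)

tp, fp, tn, fn = 90, 40, 160, 10   # hypothetical screening results
print(f"Sensitivity: {sensitivity(tp, fn):.0%}")  # 90%
print(f"Specificity: {specificity(tn, fp):.0%}")  # 80%
```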

Imagine we have 2 very different guns. The first gun fires when we barely touch the trigger. A strong gust of wind could set it off. The first gun has high sensitivity and low specificity. It is sensitive to the smallest of signals to fire while not being very specific to an intentional pull of the trigger. We never miss a possible chance to shoot our gun (~ Low FN), but we often accidentally fire when we shouldn’t (~ High FP). The second gun only fires if we pull the trigger really hard. This gun has high specificity and low sensitivity. It is very specific to firing only when we intentionally pull the trigger (~Low FP), but it isn’t very sensitive to a weak pull of the trigger (~High FN).

Validity and Bias:

Validity is how well the test/study answers the question it was supposed to answer. With regard to laboratory test results, we would use sensitivity and specificity to measure validity. However, the term validity is more commonly used when referring to research: it is basically how well the conclusions of a study are supported by its design and results. There is internal validity, which measures how well the results represent what is going on in the sample being studied, and external validity, which measures how well the results can be applied to other situations (or the overall population).

Bias is a non-random (directional) deviation from the truth. High bias in a study means low validity, and vice versa. Bias is a problem that causes us to consistently get distorted results; these results are non-random because they are consistently skewed in the same direction. In most cases this means we are showing a stronger association between the factor being studied and the health outcome than actually exists. Bias is different from the random error we might observe with a small sample size: it means something is fundamentally wrong, causing results that are consistently different from the truth. We can't correct bias by having a larger sample size.
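
A short simulation can illustrate that last point; the "+2" shift below is a purely hypothetical measurement bias, not from the original article:

```python
# Random error shrinks as the sample grows, but a biased measurement
# stays off by the same amount no matter how large the sample is.
import random

random.seed(0)
true_mean = 25.0

for n in (10, 100, 10_000):
    unbiased = [random.gauss(true_mean, 5) for _ in range(n)]
    biased   = [x + 2 for x in unbiased]   # every reading skewed upward by 2
    print(f"n={n:>6}  unbiased mean={sum(unbiased)/n:6.2f}  "
          f"biased mean={sum(biased)/n:6.2f}")
# The unbiased mean converges toward 25; the biased mean converges toward 27.
```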

Confidence Interval :

A Confidence Interval (CI) is the range of values within which the true value in the population is expected to fall, based on the study results. The results we obtain in any study do not perfectly mirror the overall population, and the confidence interval gives us a better idea of what the result in the overall population might be. The interval is reported at a chosen level of confidence, most commonly 95%.

Be careful not to confuse this with the value measured in the sample itself! If we measure the BMI (Body Mass Index) of the 100 people in our study population and the mean is 25, then we know the mean BMI of that group is exactly 25. The confidence interval only comes into play when we try to extrapolate the study results to other situations (like the population overall).
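
Here is a minimal sketch of a 95% CI for a mean, using the normal approximation (mean ± 1.96 × standard error); the BMI values are simulated stand-ins for the 100 measurements described above:

```python
# 95% confidence interval for a mean via the normal approximation.
import random
from statistics import mean, stdev
from math import sqrt

random.seed(1)
bmi = [random.gauss(25, 4) for _ in range(100)]   # hypothetical sample of 100 people

m = mean(bmi)
se = stdev(bmi) / sqrt(len(bmi))                  # standard error of the mean
ci_low, ci_high = m - 1.96 * se, m + 1.96 * se

print(f"Sample mean: {m:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```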

Null Hypothesis & Alternative Hypothesis:

When looking at 2 or more groups that differ based on a treatment or risk factor, there are two possibilities:

  • Null Hypothesis (Ho) = No difference between the groups. The different groups are the same with regard to what is being studied, and there is no relationship between the risk factor/treatment and the occurrence of the health outcome. By default, we assume the null hypothesis is true unless we have enough evidence to reject it.
  • Alternative Hypothesis (Ha) = A difference is observed between the groups. The groups are different with regard to what is being studied, and there is a relationship between the risk factor/treatment and the occurrence of the health outcome.

p-Value:

The p-value is the probability of obtaining a result at least as extreme as the current one, assuming that the null hypothesis is true. Imagine we did a study comparing a placebo group to a group that received a new blood pressure medication, and the mean blood pressure in the treatment group was 20 mm Hg lower than in the placebo group. The p-value is then the probability of observing a difference between the group averages of at least 20 mm Hg purely by chance, if the null hypothesis (no real effect of the medication) were true.

We may wonder what determines whether a p-value is ‘low’ or ‘high’. That is where the selected ‘Level of Significance’, or alpha (α), comes in. Alpha is the probability of making a Type I error (incorrectly rejecting a true null hypothesis). It is a chosen cutoff point that determines whether we consider a p-value acceptably low. If our p-value is lower than alpha, we conclude that there is a statistically significant difference between the groups. When the p-value is higher than our significance level, we conclude that the observed difference between the groups is not statistically significant.
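
A small sketch of this workflow using a two-sample t-test, mirroring the blood-pressure example; the data are simulated and SciPy is assumed to be installed:

```python
# Compare a placebo group to a treatment group and check the p-value
# against a chosen significance level (alpha).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
placebo   = rng.normal(loc=140, scale=12, size=50)   # systolic BP, placebo group
treatment = rng.normal(loc=120, scale=12, size=50)   # drug lowers BP by ~20 mm Hg

t_stat, p_value = stats.ttest_ind(treatment, placebo)

alpha = 0.05   # probability of a Type I error we are willing to accept
print(f"p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the difference is statistically significant.")
else:
    print("Fail to reject the null hypothesis.")
```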

Hope that helped and thank You for your time! :)
