Essential Probability & Statistics concepts before Data Science

Rahul Bhatia
3 min read · Oct 19, 2019

Statistics can be confusing when starting out; there’s so much to learn!

Hi folks, a very common question people have before starting out on a Data Science journey is: how much math do I need? In general, you need to be well versed in four domains of mathematics, which I like to think of as the pillars that form a concrete base for a Data Scientist:

  1. Linear Algebra: used frequently in traditional Machine Learning algorithms such as Logistic Regression and SVM, and in unsupervised techniques such as PCA.
  2. Calculus: the backbone of Deep Learning, particularly the driving force behind backpropagation, which is used to train almost all of the neural networks that work like magic for us today.
  3. Probability and Statistics: form the basis of Data Science and Data Analysis.
  4. Matrices (which can also be included in Linear Algebra): widely used in Recommender Systems.

The aim of this article is to briefly list the topics you should know, at a minimum, before starting out with Data Science. This is by no means an exhaustive list, and it is not even possible to make one, because Statistics is a huge subject in itself and people spend many years gaining expertise in it. I personally think all of the topics below are very important to learn before getting your hands dirty, because they help you understand your data much better. So without further ado, let’s get started.

Statistics at a very broad level can be divided into two categories:

  • Descriptive Statistics: calculation and analysis of summary statistics of the underlying data, such as the mean, variance, and quantiles. In other words, they are brief coefficients that summarize your data.
  • Inferential Statistics: making inferences about a population by examining a sample of that population.
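
To make the difference between the two categories concrete, here is a minimal sketch using only Python's standard library. The population (heights in cm), the sample size, and the helper names `describe` and `infer_mean` are all made up for illustration, and the interval uses a simple normal approximation rather than anything more careful.

```python
import random
import statistics


def describe(data):
    """Descriptive step: summarize the data you actually have."""
    return {"mean": statistics.mean(data), "stdev": statistics.stdev(data)}


def infer_mean(sample, z=1.96):
    """Inferential step: approximate 95% confidence interval for the
    population mean, using the normal approximation (fine for large samples)."""
    m = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    return (m - z * standard_error, m + z * standard_error)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 heights in cm (assumed values).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# In practice we only ever see a sample of the population.
sample = rng.sample(population, 200)

summary = describe(sample)      # describes the sample itself
low, high = infer_mean(sample)  # says something about the population

print(f"sample mean = {summary['mean']:.2f}, stdev = {summary['stdev']:.2f}")
print(f"approx. 95% CI for the population mean: ({low:.2f}, {high:.2f})")
```

The descriptive step only talks about the 200 values in hand, while the inferential step uses those same values to make a hedged statement about all 100,000.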

In Descriptive Statistics and Probability, one should know at least the following in depth:

  • Measures of Central Tendency — Mean, Median, Mode
  • Measures of Variability — Variance, Standard Deviation
  • Percentiles and Quantiles — Interquartile Range and Mean/Median absolute deviation
  • Scatter Plot
  • Histograms
  • PDF (Probability Density Function)
  • CDF (Cumulative Distribution Function)
  • Box Plots
  • Pair Plots (just an extension of scatter plots; check Seaborn)
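
Several of the descriptive measures listed above can be computed with Python's built-in `statistics` module. The following is only a sketch on a small made-up dataset; the `ecdf` helper is a hypothetical name introduced here, and in real projects you would more likely reach for NumPy, pandas, or Seaborn.

```python
import statistics

# Made-up values with one obvious outlier (40).
data = [2, 4, 4, 5, 7, 9, 11, 12, 40]

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of variability (sample versions, dividing by n - 1)
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Quartiles and the interquartile range
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Median absolute deviation: median distance from the median
mad = statistics.median([abs(x - median) for x in data])


def ecdf(values, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f}")
print(f"IQR={iqr} MAD={mad} ECDF(7)={ecdf(data, 7):.2f}")
```

Notice that the single outlier pulls the mean (about 10.4) well above the median (7), while the IQR and the median absolute deviation barely react to it. This is why it helps to look at more than one measure.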

As far as Inferential Statistics is concerned, it is essential to have a thorough knowledge of the following topics:

  • Population and Sample
  • Bias in Sampling
  • Gaussian/Normal Distribution (this is very important)
  • Standard Normal Distribution
  • Z-Statistic and Standardization
  • T-Statistic and its application in Confidence Intervals (Student’s T-Distribution)
  • Skewness and Kurtosis of a distribution
  • Central Limit Theorem (one of the most fundamental theorems in probability and statistics)
  • Discrete and Continuous Distributions, listed below:
  1. Binomial and Bernoulli Distribution
  2. Poisson Distribution
  3. Uniform Distribution (both Discrete and Continuous)
  4. Normal Distribution
  5. Exponential Distribution
  6. Log-Normal Distribution
  7. Power Law Distribution
  • Box-Cox transform
  • Kernel Density Estimation
  • Chebyshev’s inequality
  • Correlation Coefficients (Pearson, Spearman) and the Chi-Squared test of independence for categorical variables
  • Point Estimates and Confidence Intervals
  • Hypothesis testing
  • p-values
  • Q-Q Plot
  • KS-Test
  • Chi-Squared Analysis
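
A few of the topics above (the Central Limit Theorem, standardization with z-scores, and p-values) can be tried out with nothing but the standard library. This is a rough sketch under assumed settings: the exponential distribution, the sample size, the number of trials, and the helper names `standardize` and `z_test_p_value` are all choices made for this illustration, and the test assumes a known population standard deviation.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(0)  # fixed seed so the run is repeatable

# Central Limit Theorem: take many samples from a skewed (exponential)
# distribution with mean 1 and look at the distribution of their means.
n = 50          # observations per sample (assumed)
trials = 5_000  # number of samples drawn (assumed)
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

clt_mean = statistics.mean(sample_means)  # should be close to 1.0
clt_sd = statistics.stdev(sample_means)   # should be close to 1 / sqrt(n)


def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the stdev."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


z_scores = standardize(sample_means)


def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical observation: a sample of size n with mean 1.3.
p_value = z_test_p_value(sample_mean=1.3, mu0=1.0, sigma=1.0, n=n)

print(f"mean of sample means = {clt_mean:.3f}")
print(f"stdev of sample means = {clt_sd:.3f} (theory: {1 / n ** 0.5:.3f})")
print(f"two-sided p-value = {p_value:.4f}")
```

Even though the exponential distribution is heavily skewed, the sample means cluster around the true mean with a spread close to 1/sqrt(n), which is what the Central Limit Theorem leads you to expect. When the population standard deviation is unknown and the sample is small, the T-Statistic is the usual replacement for the Z-Statistic used here.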

For now, these are the most essential concepts I can recall. They should be enough to get you started and help you make sense of data statistically, using both Descriptive and Inferential Statistics. I will update the list if I remember anything else. Before departing, I would like to repeat that this is by no means an exhaustive list. It will get you started, and it is sufficient for learning what comes next, since all of this forms the basis of much more advanced concepts.

Thanks for reading! If I missed something, please drop a response and I will update this list. I wanted to keep it minimal and include only what is truly necessary; statistics is a vast subject, and everything listed above is only a very small part of it.

Rahul Bhatia

Data Science @ Fidelity Investments | ex - CRED, Rakuten