What should I learn in statistics to work with machine learning and data analysis?

Zhaojun Zhang
2 min read · Oct 23, 2016

Cross-posted from my Quora answer: https://www.quora.com/What-should-I-learn-statistics-to-cooperate-machine-learning-and-data-analysis

Here is a list of topics that I would recommend. Without systematic training in statistics, you are unlikely to appreciate all of the material below right away.

  • Understand basic concepts in statistics: probability, density, pdf, and common distributions like the normal and the binomial. Much of the literature is written in standard statistical language; if you don’t understand the basics, you won’t be able to read papers or communicate with other people in the community.
  • Understand the trade-off between bias and variance. A solid understanding of bias and variance helps you figure out the right action to reduce errors. For example, if you believe bias is the dominant factor in the errors, you might switch to a different model; if you believe variance is, you might focus on collecting more data.
  • Understand the central limit theorem. It is a great theorem in its own right, not to mention how it underpins many hypothesis testing methods.
  • Understand linear regression and logistic regression. Lots of problems in ML or data science can be addressed by these two methods or their variants.
  • Understand Maximum likelihood estimation. This is a common approach for point estimation.
  • Understand optimization methods like gradient descent. Optimization is commonly used for point estimation in statistics, and it is also widely used in many machine learning algorithms.
  • Understand Bayes’ theorem. Another fundamental theorem in statistics.
  • Understand multilevel models. They help you develop a taste for building complex models instead of simply applying linear regression or logistic regression.
  • Understand sampling approaches like Gibbs sampling. They are commonly used in Bayesian analysis to sample from the posterior distributions of latent variables. If you are building complex models on your data, the chances that you will need a sampling approach are very high.
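
To make a few of the items above concrete, here is a minimal sketch in Python that ties together logistic regression, maximum likelihood estimation, and gradient descent. It is only an illustration under stated assumptions: the synthetic data, the "true" parameters, the learning rate, the number of steps, and the function names (`sigmoid`, `neg_log_likelihood`, `fit_logistic`) are all made up for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def neg_log_likelihood(w, b, xs, ys):
    # Average negative log-likelihood of a one-feature logistic regression.
    # Minimizing this is the same as maximizing the likelihood.
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def fit_logistic(xs, ys, lr=0.1, steps=1000):
    # Plain batch gradient descent on the average negative log-likelihood.
    # lr and steps are arbitrary choices for this toy problem.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative w.r.t. the linear score
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Synthetic data drawn from a known (assumed) model, so the estimate can be
# compared against the parameters that generated it.
random.seed(0)
true_w, true_b = 2.0, -1.0
xs = [random.uniform(-3, 3) for _ in range(300)]
ys = [1 if random.random() < sigmoid(true_w * x + true_b) else 0 for x in xs]

w_hat, b_hat = fit_logistic(xs, ys)
print("estimated w, b:", w_hat, b_hat)
print("NLL at start:  ", neg_log_likelihood(0.0, 0.0, xs, ys))
print("NLL at the fit:", neg_log_likelihood(w_hat, b_hat, xs, ys))
```

The estimates should land in the neighborhood of the parameters used to generate the data, and the negative log-likelihood should drop from its starting value. In practice you would more likely reach for a library implementation, but writing the loop out once makes it easier to see how the three ideas fit together.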
