How to be a Data Scientist

The other day I was talking with my friend Connor, and he was curious about an optimization class he wanted to take. He was wondering what sort of background he would need to do well in optimization, and how optimization fits into the machine learning community as a whole.

So why listen to me?

I’m just some guy on the internet. But I have a degree in math and another in stats, I’ve worked on experimentation software in some form or another for 2.5 years, and now I’m doing AI research. That being said, my background is more academic, and I love structure and theory, so that definitely colors my opinion on the subject.

What is a data scientist?
A data scientist is simply a person who is good at statistics as well as programming and does something that requires both. That’s it.

So what topics are there, and how do they relate?

Some justification for weird things on there:

Real analysis: Honestly, if there were one subject that I think is systematically undervalued for data scientists, it’s real analysis. I think it gets undervalued because most data scientists don’t come from academia, and if they do, it’s from the CS realm. Real analysis is a pretty theory-heavy branch of math, but it is the theory that we care about: it’s the theory that tries to describe all the quantitative things you do in data science. It is essentially the backbone of all applied math (optimization, linear algebra, stats, probability).

In real analysis you study what it means for two objects to be close (how do we know our model’s estimated parameters are close to the ‘true’ parameters?) and what it means for a function to converge to another function (how do we know our estimated distribution gets close to the real distribution? How far off from the real distribution is our estimate?). You learn why the distinction between discrete and continuous random variables is stupid (it’s because you study intro probability theory with the wrong type of integral! It turns out most of probability and statistics boils down to integrating things, and estimating those integrals). And you study function approximation: how do I approximate my function with a polynomial, with sines and cosines, with complex exponentials, with exploding polynomials [okay, that’s actually complex analysis], or in Fourier space? How do I know which types of functions I can use to approximate another function, and what will the error be?
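To make the "how far off is our estimated distribution?" question a bit more concrete, here is a minimal sketch in Python. It is my own illustration, not something from a course: it assumes the "real" distribution is a standard normal, draws samples from it, and measures the largest gap between the empirical CDF and the true CDF. The function names (normal_cdf, sup_distance, demo) are made up for this example.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal (the assumed 'real' distribution)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_distance(samples):
    """Largest gap between the empirical CDF of `samples` and normal_cdf.

    The empirical CDF is a step function, so the largest gap occurs at a
    sample point, either just before or just after the jump there.
    """
    xs = sorted(samples)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x)
        worst = max(worst, abs((i + 1) / n - f), abs(i / n - f))
    return worst


def demo(sizes=(100, 1_000, 10_000), seed=0):
    """Return {sample size: sup distance} for a few sample sizes."""
    rng = random.Random(seed)
    return {
        n: sup_distance([rng.gauss(0.0, 1.0) for _ in range(n)])
        for n in sizes
    }


if __name__ == "__main__":
    for n, d in demo().items():
        print(f"n={n:>6}  sup distance={d:.4f}")
```

The gap should generally shrink as the sample size grows. Real analysis is what tells you which notion of "gap" you are using here (the sup norm) and why it goes to zero.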

OS/compilers: OS is on there because I think people who like data science would like OS, and it’s a ton of fun. It’s basically the question of how to manage all the resources on a computer in a coherent way. I would describe it as operations research for a computer, and operations research is just data science for business. Compilers basically teaches you how to write awesome code, and what you as a programmer need to do yourself and what you don’t, which matters for people who want to do high-performance stats or scientific computing.

Programming Languages: Most of the time you will be using Python, R, SAS, or MATLAB. And the fact is, while those are fine for the occasional homework project, they suck for writing large projects. PL gives you exposure to the wide variety of CS stuff out there, and teaches you how to pick the right language for each project. It also exposes you to hidden gems you wouldn’t necessarily come across out in the wild, like Prolog or Church [and probabilistic programming seems pretty awesome].

Some nice resources I’ve found:

Real analysis:
Hard, but good — http://www.maa.org/publications/maa-reviews/principles-of-mathematical-analysis

OS — http://pages.cs.wisc.edu/~remzi/OSTEP/

Deep Learning — https://www.coursera.org/course/neuralnets

Robotics (I haven’t actually tried this one, but it seems cool) — http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-412j-cognitive-robotics-spring-2005/index.htm

AI — http://aima.cs.berkeley.edu/

ML — http://research.microsoft.com/en-us/um/people/cmbishop/prml/

Blogs/papers —
http://matt.might.net/
http://colah.github.io/
http://worrydream.com/refs/Shannon%20-%20A%20Mathematical%20Theory%20of%20Communication.pdf