A low-cost & high-quality guide to begin your own journey.
First, what is Data Science? Wikipedia defines it as:
“Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.”
Within this domain, several job roles and subfields are forming as it evolves. I imagine that over time, saying you ‘do data science’ will be equivalent to saying you ‘do science’ or ‘do programming’: it will take more nuance and specification to mean anything precise. For now, I think it is imperative that we leave the definition a little broad and provide as many entry ramps into this world as possible. The Information Age has made learning from data, and making that data as useful as possible in the space you operate in, one of the best skills you can adopt.
And the best part? You can adopt it for free. Or quite close.
Here’s how I would recommend going about your learning journey:
Find a Domain of Interest -> Programming -> Application in Domain -> Math/Statistics -> Application in Domain -> Iterate with Feedback
Wherever you’re coming into this from, if you can’t yet write SQL and Python/R self-sufficiently at an intermediate level or above, I’d recommend starting with these foundations. Whether in academia or industry, the work is applied through programming, often on very large datasets (ones that would make Excel crash). Being unable to code would be like trying to drive a car without being able to use the gas pedal and gear stick. My expertise is in Python and I absolutely love it, so I’ll only recommend Python resources here:
Courses (not all free, but lowest-cost high-quality options I’ve found):
- Introduction to Python Programming — Udacity: Free
- Python 3 Specialization — University of Michigan on Coursera (5 month-long courses): $50/month, ~$250 total, free to audit
- Computational Thinking using Python XSeries Program — MITx on edX (5 months long est.): $150 total, $135 current discounted price
- Applied Data Science with Python — University of Michigan on Coursera (5 month-long courses): $50/month, ~$250 total, free to audit
- Intro to SQL: Querying and managing data — KhanAcademy: Free
Books:
- Automate the Boring Stuff with Python by Al Sweigart
- Effective Python: 90 Specific Ways to Write Better Python, 2nd Edition by Brett Slatkin
- Fluent Python, 2nd Edition by Luciano Ramalho
- Python for Data Analysis, 2nd Edition by Wes McKinney
- Learning SQL, 3rd Edition by Alan Beaulieu
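To give a feel for what "SQL and Python together" looks like in practice, here is a minimal sketch using only Python's built-in `sqlite3` module. The `orders` table, the `top_customers` helper, and the sample rows are all hypothetical and made up for illustration; real work would usually run against an existing database rather than an in-memory one.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Load (customer, amount) rows into an in-memory SQLite table and
    return the customers with the highest total spend."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        # Parameterized inserts: Python supplies the data, SQL stores it.
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        query = """
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
            ORDER BY total DESC, customer ASC
            LIMIT ?
        """
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


# Hypothetical sample data, not taken from any real dataset.
sample_orders = [
    ("ana", 20.0),
    ("ben", 35.0),
    ("ana", 30.0),
    ("cy", 5.0),
]

for customer, total in top_customers(sample_orders):
    print(f"{customer}: {total:.2f}")
```

The aggregation happens in SQL and the surrounding logic in Python, which is roughly the division of labor most of the resources above will teach you.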
Now let’s get to the math/stats stuff. Firstly, how much math/stats is really used on the job? Well, it depends on the job, the company, and how interested you are in using those skills. For the most part, a lot of linear algebra concepts are used when manipulating and transforming large datasets. Statistics and probability come up a lot as well. To name a few of the “beginner” concepts: descriptive statistics, imputing missing data, and A/B tests. The deeper you dive into both of these worlds, the more easily you can find yourself doing linear programming, optimization, working with eigenvalues/eigenvectors, causal inference, probabilistic programming, and more, all of which require a strong grasp of math and statistics. Essentially, working with data usually doesn’t require you to write proofs or invent new theorems, but it does require you to know what works, how to use it correctly, when it goes wrong, and how to explain the why of what you did to someone who has never heard of it before. Since this guide is aimed at the person looking to jump-start their journey, I’ll recommend content at the beginner/intermediate level:
Courses (not all free, but lowest-cost high-quality options I’ve found):
- Statistics with Python Specialization — University of Michigan on Coursera (5 month-long courses): $50/month, ~$250 total, free to audit
- Fundamentals of Statistics — MITx on edX: Free
- Introduction to Statistics: Probability — Berkeley on edX: Free
- Essence of linear algebra — 3Blue1Brown on YouTube: Free
- Mathematics for Machine Learning: Linear Algebra — Imperial College London on Coursera (1-month-long course): $50/month, free to audit
Books:
- Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
- Practical Statistics for Data Scientists, 2nd Edition by Peter Bruce, Andrew Bruce, and Peter Gedeck
- A Common-Sense Guide to Data Structures and Algorithms by Jay Wengrow
- Bayesian Analysis with Python, 2nd Edition by Osvaldo Martin
- Math for Programmers by Paul Orland
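The “beginner” statistics concepts mentioned earlier (descriptive statistics, imputing missing data, and A/B tests) can be sketched with nothing but the Python standard library. The helper names, the session data, and the conversion counts below are hypothetical. Median imputation and a pooled two-proportion z-test are just one common choice for each task, not the only or necessarily the best one.

```python
import math
from statistics import NormalDist, mean, median, stdev


def impute_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Uses the pooled standard error; returns (z statistic, p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical session lengths in minutes, with two missing readings.
session_minutes = [12.0, None, 7.5, 9.0, None, 30.0]
filled = impute_with_median(session_minutes)

# Descriptive statistics on the imputed data.
print(f"mean={mean(filled):.2f} median={median(filled):.2f} sd={stdev(filled):.2f}")

# Hypothetical A/B test: 100/1000 conversions for A, 130/1000 for B.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z={z:.2f} p={p:.4f}")
```

Knowing when each step is appropriate (for example, when median imputation distorts your data, or when sample sizes are too small for a z-test) is exactly the "know when it goes wrong" skill these courses and books build.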
There is an overwhelming number of resources out there to learn from and level yourself up with in the data space. This can be a big positive, but it can also be debilitating when it comes time to make a decision. I purposely chose only 5 courses and 5 books per topic in the hope that it helps you decide. My criteria were low-cost, high-quality options that you can start from wherever you are and that teach applied skills used often in academia and industry. I left out courses that dive too deep into theory or the absolute cutting edge of this field (although there are some exciting things happening there) to ensure you are equipped with skills that are readily applicable in your world soon after completion. I’ve completed or casually audited nearly all the options I’ve shared, so I can personally vouch that they are fantastically taught and supported. As long as you do your part by working through the exercises alongside the courses and actually practicing the concepts you’re learning, I’m sure you’ll be quite skilled by the end of it.
If you just want to take a deep dive into the more advanced concepts of Machine Learning, here are some excellent courses to start with:
- Machine Learning — Stanford on YouTube: Free
- Probabilistic Machine Learning — Philipp Hennig on YouTube: Free
- Deep Learning Specialization — DeepLearning.AI on Coursera (5 month-long courses): $50/month, ~$250 total, free to audit
- Causal Inference — Columbia University on Coursera (6-week-long course): $50/month, free to audit
- PyTorch Basics for Machine Learning — IBM on edX: Free
One thing to remember as you embark on your journey into this field is that ‘beginner’ probably isn’t what you think it is. The low bar for excellence can still be quite high, but as long as you keep iteratively raising that bar of ‘making data useful’ in your domain, you’re on the right track. This article doesn’t cover getting a graduate degree in this field (Computer Science, Statistics, Mathematics, Data Science, etc.), but as a graduate of the Master of Applied Data Science program at the University of Michigan, I can personally vouch for the structure and speed that option can provide for your journey if you do it right. It’s not necessary, but it has given me what I was looking for: a strong grasp of foundations, an advanced set of data skills to use in projects I’m interested in, and an inspiring, supportive community to be a part of. However you’re able to do it, if you consistently focus on getting and reinforcing those 3 things, you’ll advance your data journey significantly.