The MBA Data Science Toolkit: 8 resources to go from the spreadsheet to the command line
The future of business belongs to people who can make
sense of large quantities of data — Hal Varian, Chief Economist at Google
I recently had the pleasure of speaking on a few panels about analytics to my fellow MBA students and alumni, as well as many Penn undergrads. After these talks, I’ve been asked for my advice on what the best resources are for someone coming from the business world (i.e., non-technical) who wants to develop the skills to become an effective data scientist. This post is an attempt to codify the advice I give and the general resources I point people towards. Hopefully, this will make what I have learned accessible to more people and provide some guidance for those who realize that the future belongs to the empirically inclined (see the quote above) but don’t know where to start their journey to becoming part of the club.
However, I would caution the reader that what I propose here is only a starting point on a journey towards really understanding the power of good data science. And, as Sean Taylor once told me, learn only what you need to accomplish your goal. If there are things on this list that you know you don’t need, skip them; you won’t hurt my feelings. At its core, data science is really about curiosity, optimism, and continual learning, all of which are ongoing habits rather than boxes to be checked. Therefore, I expect this list to evolve as the tools themselves change and as I continue to discover more about data science itself.
1. Linear Algebra
Linear algebra is a topic that underlies a lot of the statistical techniques and machine learning algorithms that you will employ as a data scientist. I like to recommend a MOOC I took through Coursera years ago, Coding the Matrix: Linear Algebra through Computer Science Applications. As the name implies, the course teaches linear algebra in the context of computer science (specifically using Python, which lends itself well to data science). There is also an optional companion textbook that makes a great reference manual.
Coding the Matrix: Linear Algebra through Computer Science Applications - Brown University …
Coding the Matrix: Linear Algebra through Computer Science Applications from Brown University. Learn the concepts and…
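To make the link between linear algebra and statistics a little more concrete, here is a minimal sketch that is not taken from the course itself. It fits a straight line by solving the least-squares normal equations (XᵀX)β = Xᵀy using nothing but small matrix helpers written in plain Python. The helper names (`transpose`, `matmul`, `solve_2x2`, `fit_line`) and the data are made up for illustration.

```python
def transpose(m):
    """Swap the rows and columns of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def solve_2x2(a, b):
    """Solve the 2x2 system a @ x = b with Cramer's rule."""
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular")
    return [(b[0] * s - q * b[1]) / det, (p * b[1] - r * b[0]) / det]


def fit_line(xs, ys):
    """Return (intercept, slope) minimising squared error via the normal equations."""
    X = [[1.0, x] for x in xs]          # design matrix with a column of ones
    Xt = transpose(X)
    XtX = matmul(Xt, X)                 # 2x2 matrix
    Xty = [row[0] for row in matmul(Xt, [[y] for y in ys])]
    return solve_2x2(XtX, Xty)


# Hypothetical data that lies exactly on y = 1 + 2x
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)
```

In practice a library would do this work for you, but seeing a regression reduce to a transpose, two matrix products, and a small linear solve is a good reason to care about the underlying algebra.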
2. R
Given that we use R at Wealthfront, I have a few resources that I think are important here. The first is R for Data Science, written by Garrett Grolemund and Hadley Wickham, which will be published in physical form in July 2016 but is available for free online now. Rather than explain what the book is about in my own words, here are a few from the authors directly:
This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
If you only read one data science book, it should be this
Next up, our friend Hadley has also written Advanced R, which covers functional programming, metaprogramming, and performant code as well as the quirks of R.
Welcome · Advanced R.
This is the companion website for "Advanced R", a book in Chapman & Hall's R Series. The book is designed primarily for…
Hadley is also responsible for some of the packages I use every day that make 90% of common data science tasks quicker and less verbose. I recommend checking out the following libraries; they will change the way you write code in R:
- ggplot2 — An implementation of the Grammar of Graphics in R
- devtools — Tools to make an R developer’s life easier
- dplyr — Plyr specialized for data frames: faster & with remote data stores
- purrr — Make your pure R function purrr with functional programming
- tidyr — Easily tidy data with spread and gather functions
- lubridate — Make working with dates in R just that little bit easier
- testthat — An R package to make testing fun
For extra credit, check out yet another of Hadley’s books: R Packages. This is a great follow-up resource for those of you who want to write reproducible, well-documented R code that other people can easily use (other people includes your future self!).
Welcome · R packages
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that…
3. SQL
This is probably the easiest section of the guide, as you can teach yourself most of SQL in a few hours. Code School has both introductory and intermediate courses that you can get through in an afternoon.
Try SQL - Code School
Learn basic database manipulation skills using the SQL programming language.
The Sequel to SQL covers everything from aggregate functions and joins to normalization and subqueries. And while mastering these skills takes practice, you can still get an idea of what SQL can and cannot do without too much work.
SQL Tutorial - Code School
Learn the most important parts of the SQL language so you can create tables with constraints, use relationships, and…
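To give a flavor of the topics these courses cover, here is a small sketch of an aggregate with a join and a subquery. It runs SQL through Python's built-in sqlite3 module so it needs no database server. The `customers` and `orders` tables and their contents are invented for illustration and are not taken from the Code School material.

```python
import sqlite3

# An in-memory database with two small, made-up tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL NOT NULL
);
INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO orders (customer_id, amount) VALUES (1, 120.0), (1, 80.0), (2, 50.0);
""")

# Join + aggregate: order count and total spend per customer.
# LEFT JOIN keeps customers with no orders; COALESCE turns their NULL total into 0.
totals = conn.execute("""
SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY total DESC, c.name
""").fetchall()

# Subquery: orders larger than the average order amount
above_avg = conn.execute("""
SELECT id, amount
FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders)
ORDER BY amount DESC
""").fetchall()

for row in totals:
    print(row)
print(above_avg)
conn.close()
```

The same queries should work with little or no change on most other relational databases, which is a large part of why SQL is such a good return on a few hours of study.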
4. Bayesian Reasoning
Without wading into the age-old Frequentist vs. Bayesian debate (or non-debate), I think that a solid foundation in Bayesian reasoning and statistics is a crucial part of any data scientist’s repertoire. For example, Bayesian reasoning underpins much of modern A/B testing and Bayesian methods are applied in many other areas of data science (and are generally covered less in introductory statistics courses).
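As one hedged illustration of the A/B testing use case, the sketch below treats each variant's conversion rate as a Beta distribution (starting from a uniform Beta(1, 1) prior, which is an assumption on my part) and estimates the probability that variant B truly converts better than variant A by sampling from the two posteriors. The function name `prob_b_beats_a` and the conversion counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors.

    With a Beta(1, 1) prior and binomial data, the posterior for a
    conversion rate is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Made-up experiment: 100/1000 conversions for A, 130/1000 for B
p = prob_b_beats_a(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"Estimated probability that B beats A: {p:.3f}")
```

The appeal of this framing is that the output is a direct statement about the question being asked (how likely is it that B is better?) rather than a p-value that has to be interpreted.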
John K. Kruschke has a great ability to break down complex material and convey it in a way that is intuitive and practical. Along with R for Data Science, his textbook, Doing Bayesian Data Analysis, is probably one of the best all-around resources for learning how to do data science in the R programming language.
Additionally, Kruschke’s blog makes a great companion resource to the textbook if you’re looking for more examples of problems to solve or answers to questions you still have after reading the book. And if a textbook isn’t exactly what you’re looking for, then Rasmus Bååth’s research blog, Publishable Stuff, is another great resource for learning about Bayesian approaches to problem-solving.
5. Machine Learning
While most data scientists use far less machine learning than most people would think, there are plenty of tools from this domain that can be applied to answer questions that less exotic approaches might struggle with. In fact, the most important lessons to take away from courses such as Andrew Ng’s Machine Learning course on Coursera are the strengths and weaknesses of various algorithms. Knowing the limitations of different approaches can save hours, or even days, of frustration by allowing you to avoid using the wrong tool to solve a particular problem. Andrew Ng is an example of another academic who has a gift for making the complex seem simple. This is my favorite MOOC of all time and is worth taking even if becoming a data scientist is low on your list of priorities.
Machine Learning - Stanford University | Coursera
Machine Learning from Stanford University. Machine learning is the science of getting computers to act without being…
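To illustrate what knowing an algorithm's limitations can look like, here is a toy sketch of my own (it is not drawn from the course). It fits two very simple models to data generated from y = x²: a straight line, which cannot represent the curve at all, and a 1-nearest-neighbor predictor, which reproduces the training points but cannot extrapolate beyond them. The function names and data are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def nearest_neighbor(xs, ys, x):
    """Predict using the y value of the closest training x (1-nearest neighbor)."""
    return min(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[1]


# Training data from a curve that is symmetric around zero
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

intercept, slope = fit_line(xs, ys)
line_at_3 = intercept + slope * 3        # the line is flat, so it misses y = 9
nn_at_3 = nearest_neighbor(xs, ys, 3)    # exact on a training point
nn_at_10 = nearest_neighbor(xs, ys, 10)  # far outside the data: stuck at the edge value

print("line:", intercept, slope, line_at_3)
print("nearest neighbor:", nn_at_3, nn_at_10, "(true value at 10 is 100)")
```

Neither model is wrong in general; each is simply the wrong tool for part of this problem, which is the kind of judgment the course helps you develop.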
6. Git
Much of what you will build as a data scientist will be code, and code needs to be stored, tracked and deployed. Learning how to use a Distributed Version Control System (DVCS) such as Git will allow you to do all of these things. More importantly, it will allow you to easily collaborate on code with your team and, in the context of the right engineering infrastructure, provide a level of protection from deploying irreversibly broken code.
Interactive Git Tutorial - Code School
Learn more advanced Git by practicing the concepts of Git version control. Increase your Git knowledge by learning more…
If you are new to the world of Git, it can be confusing at first, but once you get it, it seems super simple. The best courses I found for learning Git were, again, from the team at Code School. There’s probably a solid weekend’s worth of work here but trust me, it is worth the investment.
Advanced Git - Code School
Learn advanced Git by practicing Git version control with Git Real 2. Continue to increase your Git skills by learning…
Then there’s GitHub, a web-based Git repo hosting service. Understanding the typical workflows associated with using a remote repository structure is critical. It makes everything that you’ll learn in Git Real 1 and 2 significantly more useful. By the time you’ve taken these three courses you’ll know more than you’ll probably ever need to about Git and GitHub.
GitHub - Code School
Learn advanced tips, tricks, and proven best practices for collaborating more effectively with GitHub.
7. Functional Programming
I’m using Haskell here as a stand-in for functional programming, and Learn You a Haskell for Great Good! is one way to learn it. In the words of Roberto Medri:
I think it’s important to have a functional language understanding in order to use R functionally in a more conscious way. Learn You a Haskell is the best investment I’ve ever made in terms of reading a programming book. And I wrote exactly zero Haskell programs in my life beside its exercises.
While there are many languages out there that are well-suited to the functional paradigm, Haskell has a book that makes the language and functional programming incredibly simple. Learn You a Haskell is really entertaining to read and the exercises really help you understand what you are doing.
Learn You a Haskell for Great Good!
Hey yo! This is Learn You a Haskell, the funkiest way to learn Haskell, which is the best functional programming…
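For readers who have not met the functional style before, here is a rough sketch of its core ideas: small pure functions, function composition, and map/filter/reduce in place of loops that mutate state. It is written in Python rather than Haskell or R purely for illustration, and the `compose` helper and sample data are my own inventions rather than anything from the book.

```python
from functools import reduce


def compose(*funcs):
    """Compose functions right to left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs, lambda x: x)


def double(x):
    return x * 2


def increment(x):
    return x + 1


# Build a new function out of existing ones instead of writing a new body
double_then_increment = compose(increment, double)

numbers = [1, 2, 3, 4, 5, 6]

# Each step returns a new value; nothing is modified in place
evens = list(filter(lambda n: n % 2 == 0, numbers))
squares = list(map(lambda n: n * n, evens))
total = reduce(lambda acc, n: acc + n, squares, 0)

print(double_then_increment(5))
print(evens, squares, total)
```

The same pattern of passing functions to other functions is what packages such as purrr bring to R, which is why time spent on a functional language tends to pay off in everyday R code.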
8. Data Visualization
I think most data scientists would agree this is one of the most important skills in the toolkit. All the maths, statistics, modeling and coding that go into learning something new about the world or generating some novel insight can be wasted if you aren’t able to communicate it effectively to others. The most powerful tool we have to convey information is visualization, without which data scientists would be somewhat useless.
There are many great writers on this topic and, therefore, many great books, so I don’t mean to claim that this particular recommendation is either superlative or exhaustive. That said, Now You See It, by Stephen Few, is a fairly comprehensive overview of the theory behind, and practical application of, conveying quantitative information through visual media. It’s a resource that I have found myself coming back to time and time again when deciding how to display data or communicate information.
I hope these resources can provide a roadmap to help other people bridge the gap between the technical and business domains that data science sits within. However, while knowing maths and statistics and being able to write code are all crucial to being a data scientist, these things are just tools that enable the deep work that makes up much of the typical day in data science land.
In fact, developing these skills is not the hardest part of becoming proficient in data science. Learning to define feasible problems, coming up with reasonable ways of measuring solutions and, believe it or not, storytelling, are some of the less concrete, but certainly more challenging, aspects of data science that I’ve had to get better at. These skills come from practice, from making mistakes, and from learning from them as you progress in your career. For more insight, Yanir Seroussi has a great blog post that I think sums this up well.
The hardest parts of data science
Contrary to common belief, the hardest part of data science isn't building an accurate model or obtaining good, clean…
Lastly, there are three traits that people in data science seem to possess in varying but significant proportions: genuine curiosity, optimism in the face of uncertainty, and a desire to learn. I don’t think there is a book or a MOOC to teach these things, but if you have them then you can learn the rest. I hope this guide can be a starting point for others who choose to do so.