Data Science Guide

Rationale behind curriculum selections

P1xt
P1xt’s Blog
5 min read · Sep 29, 2017


First, an apology to those of you who waited patiently for the past 5 months while I refined the list down to just the resources that best support the knowledge and skills I wanted the Data Science Guide to convey. It is a huge field, with more available resources than even I imagined, and winnowing them down to a list that supports a thorough understanding of the material without too much redundancy or “time wasters” ended up being a pretty monumental task. Finding enough material wasn’t tough; weeding through everything and paring it down to just what’s needed was the challenge.

My paramount goal wasn’t “write a Data Science Guide”; it was “put together a sequence that would prepare someone to confidently understand and use data the way Data Science professionals typically do”. To that end, I picked three common areas of interest (Bioinformatics, Machine Learning, and Artificial Intelligence) and then built up the foundation needed to support understanding of, and facility in, those areas.

Yes — there’s Math

Of course, there’s plenty of Math: some Calculus, Linear Algebra and Differential Equations, and even more Statistics and Probability — and I believe every last bit of it is warranted. It’s the difference between randomly plugging data into functions that you don’t really understand, and confidently manipulating data to support the way you want to use it.

There is also Science

What may be more surprising to some is the inclusion of science in the path. You’ll find a fair amount of Biology, plus some Chemistry and Physics. All of the science directly supports the three main arenas of Data Science I chose to explore: Biology and Chemistry support an understanding of the data useful to Bioinformatics, Physics supports an understanding of the data useful in some of the most fascinating work being done in Artificial Intelligence, and being well-rounded with regard to math and science supports applying Machine Learning in a variety of applications. TL;DR: the math and science are important. Developing insight about data is the first and most important part of Data Science, and that insight comes from understanding the math and science relevant to your data.

Supporting the Data Science specific technologies

Python and R — covered thoroughly

Likely far less surprising will be the rigorous and progressively more advanced coverage of two languages commonly used in Data Science today: Python and R. Each is introduced initially in a course or tutorial, then built upon by later courses and books at a more advanced level, then studied further in the specific context of Data Science. (Incidentally, that’s why you’ll see Machine Learning show up several times. The first appearance is a short intro, another illustrates Python for ML, another illustrates R for ML, and yet another delves deeper into more advanced topics. This affords a chance to begin practicing Data Science after a brief introduction, then gradually add more knowledge, skill, and proficiency throughout.)

I considered including Scala as well but opted against it. Though it’s powerful and in wide use, I really needed to narrow the focus to achieve the overarching goal of comprehensive coverage of Data Science, rather than getting lost in the weeds of covering every language that could possibly be used for it. Python and R made the cut; Scala is awesome, but it didn’t.

Thorough introduction to database technologies

Lastly, the guide covers various data storage mechanisms, both SQL and NoSQL, along with Hadoop and MapReduce, because knowing how to store and retrieve data is essential to working with data. It also gives a brief nod to D3, via a general web development course which touches on it and another which focuses on D3 exclusively, because presenting results is essential as well.

Courses with a book purchase, or not

For the first time, I included in one of my guides two courses which require a book purchase (both of the algorithms courses for Python rely on Cormen’s excellent text). I included them because they best fit my goal of tying every resource into a coherent plan in which each and every item on the list reinforces everything else. However, in keeping with my OTHER goal of ensuring that all of my guides are completely accessible to anyone, I also included alternative courses which cover the same material (but using Java rather than Python).

Hands-on projects are handled differently from my other guides

For this particular guide, I chose to place all the projects at the end of each Tier instead of spreading them throughout, along with a note specifying that the projects are assigned at the beginning of the Tier and due at the end. My intention is to give learners greater flexibility to work actively on Data Science problems alongside the courses and books, consistently reinforcing what they’re learning. The projects cover a wide variety of data science applications, from solving algorithmic challenges, to handling bioinformatics data, to participating in data science competitions, to exploring AI applications.

I mention it in the notes on the guide, but it bears repeating: if you decide to begin the journey and learn Data Science, start saving your Data Science projects to GitHub from day one and, as you learn, give back to the community by writing posts about topics you now understand and others would find helpful. By the end of the guide, you want to have publicly demonstrated what you know.

The Data Science Guide is available on GitHub.
