MOOCs and Books for Data Analysts

A list of free online courses and books for those interested in working with data

Carlos Andrade
Jun 25, 2013


This is a list of free online courses and books I either browsed or took as an undergraduate. I have divided them by field below and, in case you are new to data analysis, added a very informal description of where I applied what I learned during my undergraduate days.

As with any MOOC, I did not become an expert in any of these sub-fields (in fact, picking one to go deeper into was one of my reasons for browsing them all). However, knowing what these sub-fields are about helped me interface with researchers who knew what they were doing, and actually apply the methods to a real problem under their supervision, rather than just delegating the work of the research group to them.

As one statistician once told me:

You won’t be able to reason about a solution to a problem if you don’t even know what a given method, its constraints and its solution are.

Although the following Data Science diagram does not include all the sub-fields below, you should find it a good overall picture of the courses that follow.

Drew Conway Data Science Venn Diagram

Machine Learning

Stanford Machine Learning

This course gave me a practical start in machine learning using Octave/MATLAB. By the time I took it I needed to apply machine learning to my own problems, so it helped me considerably. It also offered a lot of practical advice that is usually left out of more theoretically focused courses. It did not delve much into theory.

Caltech Learning From Data

Caltech’s course offers by far the best available lectures to get you started in machine learning. The lecturer’s book is remarkable for how easily it lets you understand very complicated concepts. For me it was a very good complement to the Stanford machine learning course.

Neural Networks for Machine Learning

This course covers neural nets in much more depth; the other machine learning courses give them only a week. In particular, it even covers the different types of RNN.

Probabilistic Graphical Models

Probabilistic Graphical Models

PGMs stand their own ground apart from machine learning. The course introduction mentions that they do better in certain scenarios where machine learning is weak. I have mostly seen them applied to computer vision, but I have not delved deep into this one (yet). I think of this sub-field as an intersection between statistics and graph theory.

Data Science

Introduction to Data Science

As the name suggests, this course covers, very broadly, a bit of each part of what a data scientist does. I enjoyed the hands-on parts, such as the sentiment analysis of Twitter and the focus on MapReduce. Its assignments were the ‘closest to reality’ I have seen.

Harvard Data Science

This was a data science class given at Harvard, with all the material made freely available online. They suggest a framework called VMC (Visualizing, Modelling and Computing) and also include content I haven’t seen in the other broad data science courses, like HMMs and Monte Carlo. They also cover interactive visualization, which is a different take from other courses, since they use D3 instead of ggplot or lattice.

Berkeley Data Science

Another data science course that made its slides available (note that the lecture slides from a previous semester are also available). They enforce the use of R’s ddply for pre-processing.

Columbia Data Science

This was another course whose material was made available. The professor of this course is publishing a book entirely about data science, which is only available for pre-order at this point. Their take that

Data scientists should be able to craft production code

is also very interesting, and they point to a collection of free video lectures and slides that helps in doing so: Software Carpentry.

Statistics

Computing for Data Analysis

This was the second course I took when trying to make sense of data analysis. It covers a lot of ground in R. The focus on statistical analysis comes in a subsequent course. Note that I am not a biostatistician, yet sometimes you learn more by seeing how statistics is applied in other fields than by trying to learn only from your own.

Data Analysis

The follow-up to Computing for Data Analysis. Here the focus is stronger on statistics and weaker on R. I have benefited immensely from this course. It uses many data sets from across the internet that are not related to biostatistics, which gives a very broad picture of what statistics can do for you.

Statistics One

This was the very first course I took in statistics, before even considering data analysis in general. The majority of the examples were from psychology, but they were easy to understand. It has a strong focus on validity. Even though it targets a broad audience, this course helped me notice many common slips people make when applying statistics. It is focused more on theory than on application, but it provides R scripts for each week’s theoretical material.

Latent Variable Models

Latent Variable Models

This is a very advanced course that has yet to start. It falls within Statistics, but I am listing it separately for reference. Latent variable models cover a lot of models from statistics, not to mention the idea of hidden variables, which I have only seen briefly in probabilistic graphical models. I have been learning about this subject mostly from a local research group at my university.

Text Mining

Stanford Natural Language Processing

Not much to say about this course, as I have yet to explore text mining. In most cases I have only come across the need to perform exploratory clustering of words by frequency, and what is also called context mining.
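
Any frequency-based exploration of words starts with counting them. The snippet below is a minimal sketch of that first step only, not of clustering or context mining. It uses just the Python standard library; the function name `word_frequencies`, the simple regex tokenizer and the sample sentence are my own assumptions for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=3):
    """Return the top_n most common lowercase words in text.

    Hypothetical helper: the tokenizer is a plain regex, which is
    an assumption and far cruder than a real NLP tokenizer.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample text, only for illustration.
sample = "data about data is metadata, and data about courses is course data"
top_words = word_frequencies(sample)
print(top_words)
```

From counts like these you could go on to group words with similar frequencies, which is where an actual clustering method would come in.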

Toronto Natural Language Processing

Not much to say about this course, other than that it seems harder than the Stanford version on Coursera.

Social Network Analysis

Michigan Social Network Analysis

A good introduction to SNA that helped me get some idea of what it is about. After taking this course I have mostly used Gephi to analyze some patterns in one of my projects related to education.

Stanford Social Network Analysis

This is a more advanced SNA course, although it covers the basics as well.

Visualization

Visualization courses are very hard to find. I found these two by accident, and their focus is not quite exploratory data analysis but the creation of infographics using a myriad of tools.

Information Visualization

I didn’t get very far in this course, but you should expect to learn, among other things, how infographics are made.

Data Visualization

This course was offered once in a Moodle environment that is no longer available. I found the video lectures on YouTube. The link leads to the first video, but the other lectures should also be available on YouTube.

Databases

Stanford Database

Covers a lot of ground on all kinds of databases. I took parts of it as a complement to my undergraduate database classes.

10gen MongoDB

Covers pretty much everything I needed for handling JSON data and structuring it in a MongoDB database. Given that a lot of websites provide data as JSON files, knowing how to store it in MongoDB makes life much easier when exploring it and preparing .csv files for tools like R and Octave.
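
To give an idea of the JSON-to-.csv preparation step, here is a minimal sketch in Python. It deliberately leaves MongoDB out and works on a JSON string directly, using only the standard library. The function name `json_records_to_csv`, the field names and the sample records are assumptions made up for illustration.

```python
import csv
import io
import json


def json_records_to_csv(json_text, fields):
    """Flatten a JSON array of objects into CSV text.

    Hypothetical helper: keeps only the columns named in `fields`
    and writes an empty cell when a record lacks one of them.
    """
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Drop keys we did not ask for (e.g. nested lists).
        writer.writerow({f: record.get(f, "") for f in fields})
    return out.getvalue()


# Made-up records, standing in for data downloaded from a website.
raw = '[{"user": "ana", "score": 3, "tags": ["a"]}, {"user": "bo"}]'
csv_text = json_records_to_csv(raw, ["user", "score"])
print(csv_text)
```

The resulting text can be saved to a file and read by R or Octave. With records stored in a database, the same flattening would apply to whatever the query returns.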

Graph Database

For the sake of including all the databases I have used for data analysis, this freely available book covers a lot of ground on Neo4j graph databases. They have been particularly useful to me for social network analysis.

New courses are made available every day on different platforms, so I may well have missed one or more (feel free to point them out and I will add them here). A full list of MOOCs that I usually check is available at the link below.

http://www.mooc-list.com
