Journey into data science and machine learning

Like a kid in a candy store, I’m happy to announce that I completed Professor Andrew Ng’s Introduction to Machine Learning course on Coursera. In this 11-week course, I sought to bridge my statistical knowledge and background in predictive analytics into artificial intelligence and machine learning (AI/ML). Having spent more than 11 weeks on it, I can say that this is a good starter course on AI/ML, and I wanted to share my thoughts on getting started with machine learning courses such as this one:

  • Introductory statistics and a bit of calculus knowledge are helpful (although not required). As the course goes over gradients and the sum of squared errors, it helps to understand why gradients are derivatives and why we sum squared errors to build the cost function (see the short sketch after this list). Sure, it’s possible to get through the course without understanding the underlying mathematical concepts, but it’s not as easy.
  • If I had more of a programming background, I think it would have been slightly easier to get started. I began with fairly limited coding experience, and the assignments challenged me not just on the statistics but also on the implementation. Specifically, I had some inertia when I started the assignments because I was not sure where to begin. Now I’m much more comfortable not just reading code but also writing functions myself. As a result, I appreciated that we started with Octave, an open-source alternative to MATLAB.
  • Use the resources, community, and tutorials, as they can be extremely helpful for completing the course. Don’t feel you have to solve the assignments by strictly following the assignment tips and recommendations. Try vectorization and even your own approach.
  • Late in the course, I figured out how to move through it faster. For example, I’d work on the assignment for Week 9 and watch the lecture videos for Week 10 at the same time. By staggering them, I was able to double my weekly velocity. Also, I’d watch the videos at 1.5–2x speed. Taking notes really helps as well.
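To make the point above about gradients and the sum of squared errors concrete, here is a minimal NumPy sketch of the squared-error cost function and a single gradient-descent update for linear regression. The toy data and variable names are my own, not from the course materials.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Sum-of-squared-errors cost: J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_step(X, y, theta, alpha):
    """One gradient-descent update: theta := theta - alpha * dJ/dtheta."""
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m   # derivative of the cost w.r.t. theta
    return theta - alpha * gradient

# Toy usage: fit y ≈ 2x, with a column of ones acting as the bias term.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_step(X, y, theta, alpha=0.1)
print(compute_cost(X, y, theta), theta)  # cost shrinks, theta approaches [0, 2]
```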

Comparing my predictive analytics background with what I learned in this course, I can definitely say that there is quite a bit of conceptual overlap in the area of statistics. My experience had been with linear/multivariate/logistic regression, decision trees, and K-means in Excel-type applications. For example, my context had been customer purchase decision support, such as looking at a set of customer demographic or historical inputs to predict their likelihood to purchase in the future.

As in machine learning, a dataset would be split into training and validation sets. The training set contains the independent input variables along with the actual outcome for each record (e.g., 1 for purchase or 0 for not). Based on the training set, the betas (called thetas in the course) would be “trained” through different approaches, such as various regressions. The best approach is then applied to a “blind” validation set to see whether the results hold up on data the model has not seen. A rough sketch of that workflow follows.
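Here is one way that train/validate workflow could look in NumPy. This is a toy example of my own (synthetic purchase data, least squares standing in for whichever regression approach is being compared), not code from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 100 customers, 3 input features, binary purchase outcome.
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100) > 0).astype(float)

# Shuffle, then hold out 30% of the rows as the "blind" validation set.
idx = rng.permutation(len(y))
split = int(0.7 * len(y))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_val, y_val = X[idx[split:]], y[idx[split:]]

# "Train" the thetas on the training rows only.
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Apply the trained thetas to the validation rows and compare effectiveness.
train_acc = np.mean((X_train @ theta > 0.5) == y_train)
val_acc = np.mean((X_val @ theta > 0.5) == y_val)
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```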

Beyond my experience in predictive analytics, this machine learning course taught me many key concepts and tips:

  • Matrix algebra and vectorization are the coolest concepts to me. It’s not that for-loops are obsolete, but there are real efficiency savings in using vectorized approaches to compute costs and gradients (see the comparison after this list). Reaching way back to high school algebra, I had little appreciation for matrices until now.
  • Holy moly! 3-way split on the dataset. Training, validation, AND test! Enough said.
  • Learning curves are a very useful method for diagnosing whether a model suffers from high bias or high variance. A learning curve plots the training and validation errors (from their respective cost functions) as the training set size grows. If both errors converge on a fairly high value, work on reducing bias; if there’s a significant gap between them, work on reducing variance. There’s a sketch after this list.
  • Regularization (controlled by the parameter lambda) can be used to reduce overfitting, as sketched below.
  • The F-score, (2 × Precision × Recall) / (Precision + Recall), lets one measure the effectiveness of an algorithm with a single number rather than looking at precision or recall by itself (a small helper appears after this list).
  • As a visual learner, I appreciate that three dimensions afford the ability to plot and visualize a dataset. In machine learning, however, there are often far more features (and dimensions) than a human mind can comprehend. Principal component analysis (PCA) reduces those dimensions down to a plane or a line. More importantly, the real point is to improve the learning algorithm’s performance by working with fewer dimensions (sketched after this list).
  • When you have a very large dataset, gradient descent can take a long time and be very costly with the batch method, which sums the partial derivatives of the cost function over the entire training set for every update. Enter methods such as stochastic gradient descent, which updates theta after each single example, and mini-batch, a hybrid of batch and stochastic (see the sketch below).
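On the vectorization bullet: the savings come from replacing an explicit loop over training examples with a single matrix expression. A minimal illustration on toy data (variable names are mine):

```python
import numpy as np

X = np.random.rand(10000, 20)      # 10,000 examples, 20 features
y = np.random.rand(10000)
theta = np.random.rand(20)
m = len(y)

# Loop version: accumulate the squared error one example at a time.
cost_loop = 0.0
for i in range(m):
    cost_loop += (X[i] @ theta - y[i]) ** 2
cost_loop /= 2 * m

# Vectorized version: one matrix-vector product does the same work.
errors = X @ theta - y
cost_vec = (errors @ errors) / (2 * m)

assert np.isclose(cost_loop, cost_vec)   # same answer, far less Python overhead
```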
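For the learning-curve bullet, the recipe is simply to retrain on increasingly large slices of the training set and record both errors at each size. A self-contained sketch on synthetic data (my own example, not the course's):

```python
import numpy as np

def squared_error(X, y, theta):
    errors = X @ theta - y
    return (errors @ errors) / (2 * len(y))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

# Retrain on growing slices of the training set and record both errors.
for m in range(10, 151, 20):
    theta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    print(m, squared_error(X_train[:m], y_train[:m], theta),
          squared_error(X_val, y_val, theta))

# Both errors high and close together -> high bias.
# Training error low, validation error high -> high variance.
```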
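For the regularization bullet, lambda adds a penalty on the size of the thetas to the same squared-error cost (the bias term is conventionally left unpenalized). A small sketch:

```python
import numpy as np

def regularized_cost(X, y, theta, lam):
    """Squared-error cost plus a lambda-weighted penalty on theta's magnitude."""
    m = len(y)
    errors = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # don't penalize the bias term theta[0]
    return (errors @ errors + penalty) / (2 * m)
```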
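The F-score bullet as a small helper, computed from true/false positive and false negative counts (an illustrative function of my own):

```python
def f_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f_score(tp=80, fp=20, fn=40))  # precision 0.80, recall ~0.67 -> F ~0.73
```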
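For the PCA bullet, the recipe as I remember it from the course is to normalize the features, take the SVD of the covariance matrix, and keep the top k components. A rough NumPy version (names are mine):

```python
import numpy as np

def pca_project(X, k):
    """Project X (m examples x n features) down to k dimensions."""
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # feature scaling
    sigma = X_norm.T @ X_norm / len(X_norm)          # covariance matrix
    U, S, _ = np.linalg.svd(sigma)                   # principal directions
    return X_norm @ U[:, :k]                         # reduced representation

X = np.random.rand(500, 50)      # 50 features...
Z = pca_project(X, k=3)          # ...reduced to 3, for speed or plotting
print(Z.shape)                   # (500, 3)
```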
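Finally, for the bullet on stochastic and mini-batch gradient descent: the only difference from batch is how many examples feed each theta update. A hedged sketch of the mini-batch variant (my own toy implementation):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=32, epochs=10):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)                # shuffle examples each epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= alpha * grad                 # update after each small batch
    return theta

# batch_size=1 gives stochastic gradient descent; batch_size=m gives plain batch.
```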
“It’s not who has the best algorithm that wins, it’s who has the most data.” — Andrew Ng

While this was an introductory course, it sparked my interest in continuing to learn about MapReduce, Python, and deep learning. MapReduce is used to accelerate a learning algorithm by splitting the work across multiple machines. From talking to colleagues, Python appears to be the most practical and common environment for machine learning work, so I’m extending my Python experience by working through Jose Portilla’s course, which covers everything from NumPy and data analysis to data visualization in pandas.