More Machine Learning for Hackers

Jaidev Deshpande
May 3, 2016

The most common questions about data science are about getting started. A lack of exposure to data science usually means one or both of the following:

  1. a lack of programming and hacking skills
  2. ignorance of the theoretical principles in machine learning

I have even found myself vacillating between these two states. To limit the sense of panic and doom that normally accompanies this, I maintain a list of resources whose instructional material strikes a healthy balance between the two. In other words, these are resources that help you learn machine learning and its underlying disciplines, as well as the nuances of the software tools available to implement them.

At the top of my list is Machine Learning for Hackers. One of the reasons I love this book is that it really is a textbook on machine learning meant for hackers. It assumes only the most basic programming skills in R (which novices can pick up within a few weeks) and makes the reader comfortable with the basics of machine learning, enough for them to eke out basic insights from small datasets. The rest of my list contains resources in much the same spirit as Machine Learning for Hackers. For a resource to qualify for this list, it must:

  1. be about machine learning or programming,
  2. be a learning resource, i.e. it must contain instructional material, especially code and theory. So research papers and API documentation are out.

Since I’ve worked almost exclusively in the scientific Python ecosystem, this list is heavily biased towards it. Also, a few of these entries might not fall strictly within the domain of machine learning. Additions are welcome.

Learning Machine Learning and its Branches

  1. Will It Python?: A translation of Machine Learning for Hackers into Python. Ideal for people who want to see Python scripts and IPython notebooks for the material in the book.
  2. Understanding Random Forests: This is actually Gilles Louppe’s PhD thesis. He has very magnanimously made a GitHub repo containing Python scripts accompanying his thesis. The scripts are very well organized by chapter of the thesis, and are very well annotated. The thesis itself can be found here.
  3. Think Bayes: A very readable book on Bayesian methods by Allen Downey. Quoting from Andrew Gelman’s review of the book:
    It’s super readable and, amazingly, has approximately zero overlap with Bayesian Data Analysis. Downey discusses lots of little problems in a conversational way. In some ways it’s like an old-style math stat textbook (although with a programming rather than mathematical flavor) in that the examples are designed for simplicity rather than realism. All the code samples are in Python too.
  4. Bayesian Methods for Hackers: This is a wonderful book/repository on Bayesian methods in Python, second only to the previous entry. Studying this resource will make sure that you have all the practical knowledge of Bayesian methods you might need. These notebooks also make wonderful use of the PyMC library.
  5. PyLearn2 Notebook Tutorials: These are a set of very limited, but very well written tutorials on some of the fundamentals of deep learning and neural networks, particularly softmax regression, MLPs, CNNs and autoencoders.
  6. Stanford CS231n: This is the Stanford University course on convolutional networks for visual recognition, and they have a lot of public material — on GitHub and YouTube.
  7. Deep Learning Code Tutorials: This is a super comprehensive list of articles and essays on deep learning and the accompanying code. This “course” starts with simple things like logistic regression, and ends up with many popular deep learning architectures.
  8. Machine Learning — CS — Oxford University: This is a good course for learning machine learning via Lua and Torch.
  9. Peekaboo: Andreas Mueller’s blog, where he writes about computer vision and scikit-learn.
  10. YHat blog: yHat is a provider of a platform for powering data science applications. They have their own API, which they keep demonstrating in their blog posts. This hides a lot of complexity in the underlying analytical processes, but this is what ends up making it ideal for learning. You’re not likely to get bogged down by the technicalities of implementation when you want to get an analysis done.
  11. Python for Signal Processing by Jose Unpingco: This repository contains IPython notebooks accompanying the Python for Signal Processing textbook. Thankfully you don’t need the textbook to go through the notebooks, as they are fairly self-contained, though also mathematically involved.
  12. johnmyleswhite.com: The website of John Myles White, one of the authors of Machine Learning for Hackers and a Julia developer. He writes about statistics and mostly uses R and Julia in his posts. He’s a very good writer and his posts are refreshingly insightful.
  13. Neural Networks for Machine Learning: A Coursera course by Geoffrey Hinton. This is a very nice introduction to neural networks. Unencumbered by too much mathematics, the course gives a very high-level overview of neural networks.

Learning Machine Learning Software and Libraries

  1. SciPy Lecture Notes: As the page says, “One document to learn numerics, science, and data with Python”. Going through these notes diligently enough will leave the reader with a very good understanding of NumPy, SciPy and some scikits that help you tame your data.
  2. Scikit-Learn Tutorials: I’m not too sure if this entry belongs in this section. It’s definitely much more than just a scikit-learn tutorial, just as Machine Learning for Hackers is more than an R tutorial. Scikit-learn is probably the most actively developed and maintained machine learning library that I know of. It is definitely the state of the art in machine learning within the Python ecosystem, and its documentation is rich enough to satisfy almost all practical machine learning use cases (other than deep learning).
  3. Scikit-Image User Guide: This doesn’t strictly fall under the umbrella of machine learning, but given the ubiquitous nature of computer vision applications, it’s worthwhile developing a good understanding of the fundamentals of image processing. This user guide starts with NumPy and covers a variety of the most common concepts in image processing.
  4. Getting Started with Statsmodels: This is a nice hands-on tutorial on fitting a statistical model to a dataset, and it extends to a lot of sophisticated methods in statistical modelling.
  5. Theano Tutorials: These tutorials are very closely linked with the 7th entry in the previous section.
  6. Donne Martin’s IPython Notebooks: This is a huge GitHub repository which contains notebook examples on a number of different data science libraries, API services and deep learning packages.

Miscellaneous Resources

  1. The Glowing Python: As the blog itself says, “A collection of sloppy snippets for scientific computing and data visualization in Python.” I think the data visualization component of this collection is very nice. I’ve learnt a lot about matplotlib just from this blog.
  2. The Endeavour: This is John D Cook’s blog. John is an applied mathematician who consults for clients in a wide variety of disciplines, and he writes most of his code in Python. There’s always something interesting going on on this blog, and you can be sure that you’ll be able to reproduce it in Python. John also runs many interesting Twitter accounts: http://www.johndcook.com/blog/twitter_page/
  3. Cleve Moler’s Fourier Analysis tutorial for MATLAB: A very practical and easy-to-understand introduction to basic Fourier analysis in MATLAB.
  4. Linear Algebra from MIT OCW: This course is a classic — taught by the legendary Gilbert Strang.

Jaidev Deshpande

Data Scientist at Juxt-Smart Mandate. Electrical engineer, signal processing, machine learning, continuous deployment, amateur math, history.