My quest for machine learning

I jumped into the field of machine learning 18 months ago. It was overwhelming at first: where to start, what to read. And the math, the math. I wanted to get a feel for the field, but how do you do that?

While I’m still at the beginning of the journey, looking back at those 18 months I can discern some dots that I’m trying to connect in this post. Along the way I also found some interesting reads, which are included here as well.

This blog is not meant as an in-depth analysis, but as a retro of sorts on my journey. Maybe it helps someone find a path, or some interesting sites and pointers. Maybe it triggers someone to help me with the areas I’m overlooking (I hope for the latter ;-). Anyway, let me know what you think!


The basics

What’s the fuss about machine learning? And what’s happening under the hood? While it’s easy to fire up a complex algorithm on TensorFlow without prior knowledge, learning the basics helped me understand the rest.
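
To make “under the hood” a bit more concrete: a lot of the basics boil down to minimising an error by nudging parameters. Here is a minimal sketch, in plain Python with made-up data, of fitting a line with gradient descent, the workhorse behind many of the course’s algorithms:

    # A minimal sketch of what's under the hood: fitting a line y = w*x + b
    # with gradient descent. Data and learning rate are made up for illustration.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 4.2, 5.9, 8.1]  # roughly y = 2x

    w, b, lr = 0.0, 0.0, 0.01
    for _ in range(1000):
        # gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b

    print(w, b)  # w ends up close to 2, b close to 0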

Really triggered after these basics? Andrew Ng created one of the best-rated MOOCs out there. You need some grit to keep going, but it’s worth your while. Andrew explains the math in easy-to-understand terms and gives examples of how to apply it in real life. Every week has a test and a programming case, and it’s cool to see how you can make things work. The programming part is kept easy; most of it is filling in the blanks. His new book (draft here) is a hands-on guide on how to structure such a project and covers his experiences and tips & tricks in more depth.

Looking for something more generic? Elements of AI is a Finnish experiment to get all citizens of Finland up to speed on AI. It’s a nice and varied course, mostly on the impact, possibilities and ramifications of AI. It’s also nicely designed, and it takes some time, but at a leisurely pace.


A dataset and an idea

I don’t know why, but I put off the logical next step for a while.

Maybe learning from and listening to Andrew Ng in the course was too much, or maybe I liked staying in the theoretical. Anyway, when I got to it, I saw problems and datasets everywhere. Maybe that’s also why the field is so big ;-)

Seeing other people’s datasets and coming up with sets of my own really helped in the end. Even if I did not pursue many of the ideas, they helped me understand why the algorithms are used.

Why? In the end, it’s about the data and the insights you want to find. Starting with the why also helped me in the learning process: the pieces of information glued together around a goal and a dataset. Once I saw a few, more and more ideas came up, and I tried to build small projects around them. Most took no more than a few evenings, but they helped me gain insight.

So, what did I learn?

  • It’s everywhere: once you start thinking about it, you see datasets and problem statements everywhere. It’s fun to think about how you could crack them.
  • Getting the data, and getting the data right, is the hardest part: How do you transform a website into distinct words? How do you filter the tags? How do you remove stopwords? How do you read it all into the right data format? It’s a combination of technical skills and experience with data transformation (see the sketch after this list). Luckily, once you dive into a specific problem, there are a lot of blogs and articles on this.
  • It’s easy to run a model once you’ve got the data and the idea sorted out.
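
To give an idea of that wrangling, here is a minimal sketch of turning a web page into a clean list of words with requests, Beautiful Soup and NLTK. The URL is a placeholder, and the stopword list is just NLTK’s standard English one:

    # A minimal sketch: turn a web page into a cleaned-up list of words.
    # Assumes: pip install requests beautifulsoup4 nltk; the URL is a placeholder.
    import requests
    from bs4 import BeautifulSoup
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    html = requests.get("https://example.com/jobs").text  # hypothetical page
    soup = BeautifulSoup(html, "html.parser")

    # strip script/style tags, then pull out the visible text
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")

    # tokenise crudely and drop stopwords
    words = [w.lower() for w in text.split() if w.isalpha()]
    stops = set(stopwords.words("english"))
    words = [w for w in words if w not in stops]

    print(words[:20])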

Some of my sources:

  • Textio wrote a great blog on the language in tech companies, especially in recruitment texts. It was fun to see if I could do a similar thing on a much smaller scale with our own recruitment site. I used Beautiful Soup to get the data from our site and Databricks to analyse it.
  • I’m working on a dataset from our Prometheus setup and want to evolve it into a near-real-time predictive algorithm. For now, I’m bugging engineers to help me fix the basics ;-) The hard part here is getting the data out (see the sketch after this list).
  • Kaggle has some great datasets and challenges. I haven’t gotten far enough to compete, but it’s nice to see the sets and challenges, if only for inspiration.
  • We’re heavy Slack users and have automated quite a lot of our stuff with Hubot. One of the most-used features is the creation of a channel when something is up. Can we use NLP on these channels to discern patterns? I started out with Jupyter on Anaconda for that project.
  • FiveThirtyEight has a nice and broad collection of data-analysis stories, ranging from politics to sports. Their datasets and code can be found on GitHub.
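
For the Prometheus project above: Prometheus exposes an HTTP API (query_range) that makes pulling out a time series fairly painless. A minimal sketch, with the server address and metric name as placeholders for our actual setup:

    # A minimal sketch of pulling a metric out of Prometheus via its HTTP API.
    # The server address and metric name below are placeholders.
    import time
    import pandas as pd
    import requests

    end = time.time()
    start = end - 3600  # the last hour

    resp = requests.get(
        "http://localhost:9090/api/v1/query_range",  # hypothetical server
        params={
            "query": "node_cpu_seconds_total",  # hypothetical metric
            "start": start,
            "end": end,
            "step": "60s",
        },
    )
    series = resp.json()["data"]["result"]

    # flatten the first time series into a timestamp/value DataFrame
    df = pd.DataFrame(series[0]["values"], columns=["timestamp", "value"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")
    df["value"] = df["value"].astype(float)
    print(df.head())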

Tools and tech: notebooks


Notebooks seem to be the way to go for hands-on work. They are a mix of code and notes. There are many examples out there that take you step by step through a problem. If you build them yourself, they’re a great help for keeping your notes and solutions together. Sharing them is also easy: through GitHub or just a direct export. Platforms such as Databricks also allow notebook sharing.

  • Notebooks with examples from Amazon’s SageMaker on GitHub, ranging from basic setup to more advanced and specific SageMaker examples.
  • Anaconda is a great tool for starting different notebook environments. Jupyter is the one I use.

Next steps

I’m still tinkering with notebooks, and at the moment I’m figuring out how to use Databricks and Apache Spark, which let you run your analysis much faster by distributing the work, and which ship with machine learning libraries (MLlib) for doing this at scale.
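
As a first Spark experiment, here is a minimal sketch of the classic word count, assuming a plain local PySpark install (rather than Databricks) and a placeholder input file:

    # A minimal sketch: the classic word count, run locally with PySpark.
    # Assumes pip install pyspark; the input file name is a placeholder.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    lines = spark.read.text("recruitment_texts.txt")  # hypothetical file
    words = lines.select(
        F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word")
    )
    counts = words.groupBy("word").count().orderBy(F.desc("count"))
    counts.show(20)

    spark.stop()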

But I’m not sure what the next step or lesson will be. Most of it has been about getting the basics down; maybe it’s about getting something live. Andrew Ng’s new book, Machine Learning Yearning, is a nice one. It’s also fun that the draft chapters come out bit by bit.

But tips in other fields are also appreciated! Please let me know if I missed something, where I should dive deeper to learn more, or if you just have questions.