Getting started in machine learning
Someone at work recently asked me about getting started in the field of machine learning. Embarrassing realisation: I should have noted this stuff down and shared it more widely. I am by no means finished with my journey, but there are some things that have helped me already.
These are some things that worked for me, so please go ahead and adapt them to your own context and learning style. Most importantly, experiment and find what works for you, and let me know if I can help in any way!
To be clear, this is not a detailed blog about how to do machine learning (e.g. linear regression, neural networks etc.) — there are other people who can do this better than me. This is a post about what tools, resources and strategies can be used to learn those things.
The only assumption made is that you have some software development experience — at minimum a familiarity with an object-oriented programming language. Moving from software engineering into the world of data science was a huge challenge and required an accelerated learning curve, though having basic computer science, programming and SQL skills helped a lot. Where possible, I will try to include resources for those without that base.
Timeline of learning
My journey sort of looked like the diagram below. Truth be told, this is the ideal timeline if I had the foresight to do things in the right order — instead I jumped into things too early, stopped and revisited strategies when most appropriate.
Python essentials — Development environment
Set up a basic environment upfront for experimenting with Python and machine learning. Use Conda to isolate Python environments: it makes experimenting with different setups much easier, e.g. a different version of Python or different pip libraries.
1. Install Conda: https://conda.io/docs/user-guide/install/index.html
2. Create a new conda environment
conda create -n py36 python=3.6
conda activate py36
3. Install some dependencies via pip
pip install pandas scikit-learn
4. Install Jupyter Notebook: https://jupyter.readthedocs.io/en/latest/install.html
pip install jupyter
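Once the steps above are done, a quick sanity check can confirm that the new environment actually sees the installed libraries. This is only a minimal sketch: `check_packages` is a hypothetical helper name of my own, and the import names are assumptions (scikit-learn is imported as `sklearn`, and I am assuming `pip install jupyter` pulls in the `notebook` package).

```python
import importlib.util
import sys


def check_packages(names):
    """Return {import_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names (not pip names) for the packages installed above.
# Assumption: "notebook" is installed as part of the jupyter metapackage.
required = ["pandas", "sklearn", "notebook"]

print("Python", sys.version.split()[0])
for name, found in check_packages(required).items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it with the conda environment activated. If anything shows as MISSING, the most likely cause is that pip installed into a different environment from the one you are running.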
Python essentials — Jupyter
Jupyter Notebook is a web-based IDE for developing Python applications that combines code with rich text elements (paragraphs, equations, figures, links, etc.). It allows easy experimentation with Python programs and lets you start hacking away with minimal setup (see previous section).
It’s particularly useful for data analysis and machine learning since Python code can be combined with visuals and comments about the data, which makes it easy to document data analysis and machine learning outcomes. Please pay it forward and share your own Jupyter notebooks on GitHub!
Python essentials — Language
Python is used very widely in data science and has great libraries and frameworks. It’s more popular than R in my opinion: Python is not only easier to learn, but there are also so many clear, well-written example projects on the web that are implemented in Python. Everything else rests on knowing Python, so it is definitely the place to start.
If you have used other languages extensively, then check out these resources to get into Python quickly:
If you’re an absolute beginner to programming, then try:
- Python Basics
- The Python Guru
- Programming in Python 3: A Complete Introduction to the Python Language (book)
Jumping into machine learning without learning some statistics and data analysis techniques is a mistake in my opinion. Understanding the statistical quality and distribution of the data helps you decide if machine learning is even possible and which techniques are most appropriate.
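To make that concrete, here is a minimal sketch of the kind of first look I mean, using only Python's built-in `statistics` module. The sample numbers are made up for illustration and `summarise` is a hypothetical helper name, not something from a library.

```python
import statistics


def summarise(values):
    """Return a few basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Made-up sample containing one large outlier
sample = [12, 15, 14, 13, 16, 15, 14, 90]
summary = summarise(sample)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the mean sits well above the median, which hints that the data is skewed by an outlier. Spotting that sort of thing early is useful before choosing a machine learning technique.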
Start with learning the basics of statistics. At this stage, don’t worry about mastering this area; focus on the basics instead. If you are refreshing your knowledge then this could help:
If you are new to statistics and want to cover it more thoroughly, then the following will help you form a solid understanding of the basics. Be aware that a course will add a few months to your learning timeline:
- Head First Statistics
- Statistics without Tears: An Introduction for Non-Mathematicians
- Statistics: An Introduction: Teach Yourself: The Easy Way to Learn Stats
- Workshop in Probability and Statistics
- Intro to Statistics
- Basic Statistics
- Statistics and Probability
Get familiar with Pandas — it is a Python library for data analysis and really quite easy to get into once you have a foundation of statistics and Python. It makes learning data analysis fun — you get a sense of progress, it is easy to use and abstracts the details away (statistical functions, plotting).
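As a small taste of what that looks like, here is a hedged sketch assuming Pandas is installed as in the environment setup. The data and column names are invented purely for illustration.

```python
import pandas as pd

# Made-up data purely for illustration
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 172, 180, 185],
    "weight_kg": [52, 58, 63, 70, 80, 86],
    "group": ["a", "a", "b", "b", "c", "c"],
})

# Count, mean, std, min, quartiles and max for each numeric column
print(df.describe())

# Mean weight per group
group_means = df.groupby("group")["weight_kg"].mean()
print(group_means)

# Pearson correlation between two columns
correlation = df["height_cm"].corr(df["weight_kg"])
print(f"height/weight correlation: {correlation:.3f}")
```

A few lines give you summary statistics, grouped aggregates and a correlation, which is exactly the sense of progress I mentioned.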
This book will help you learn Pandas:
Machine learning — introductory articles
If you need a bird’s-eye view of machine learning before jumping into the details, here are a few short blog posts to dip your toes into:
Of course, there are loads of introductory articles of this type on Medium, so go explore!
Machine learning — specialist course
The mathematical aspects of machine learning were a scary thought. To be fair, there is no way to get away from the mathematics entirely, though it definitely helps if you are willing to accept that the math works and just apply it.
Try Coursera’s Machine Learning by Andrew Ng. This is an in-depth course, quite academic in style, and will take several weeks to complete. It requires you to learn Octave/Matlab, which can be a distraction, but Andrew encourages you to trust that the math works (yet provides you with enough intuition about why it works) and moves you on swiftly to applying the algorithms to practical examples. The exercises are invaluable so do not skip them; they become so addictive! This course is the de facto standard for anyone entering the world of machine learning.
Also check out Google’s Machine Learning Crash Course. It’s lighter than the Coursera Machine Learning course and could be completed with a week of solid effort. There is a prerequisite of basic machine learning knowledge for this course (it skips over some details), whereas the Coursera one starts right from the beginning and covers everything.
Open data sets
There are many open data sets on the web. My favourite place is Kaggle, which hosts data science competitions. If you are not ready to jump into the competitions, you could simply explore the data sets and practise machine learning on them.
Two simple data sets to start out with:
To really understand some machine learning algorithms under the covers, you could try to implement them from scratch instead of using libraries (e.g. scikit-learn). You can compare the results of your algorithm implementation with results from scikit-learn to get a sense of confidence that your own implementation works.
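As one hedged illustration of that idea, the sketch below fits a one-feature linear regression from scratch using the textbook least-squares formulas, then compares it with scikit-learn's `LinearRegression` if scikit-learn is installed. The data points and the function name `fit_simple_linear_regression` are made up for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data, roughly following y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"from scratch: slope={slope:.4f}, intercept={intercept:.4f}")

# Optional comparison; skipped if scikit-learn is not installed
try:
    from sklearn.linear_model import LinearRegression
except ImportError:
    LinearRegression = None

if LinearRegression is not None:
    model = LinearRegression().fit([[x] for x in xs], ys)
    print(f"scikit-learn: slope={model.coef_[0]:.4f}, intercept={model.intercept_:.4f}")
```

If the two lines of output agree to several decimal places, that is a good sign your own implementation is doing the right thing. The same compare-against-the-library habit carries over to more complex algorithms.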
Learn from past work by exploring the many open-source Jupyter notebooks (e.g. on GitHub) for various Kaggle competitions. Look out for the ones with data analysis, exploration of multiple machine learning approaches, and plentiful comments and conclusions throughout.
Build a reference library
Reading these books cover-to-cover will be overwhelming. Instead, use them for reference when solving real problems, e.g. with open data sets.
- An Introduction to Statistical Learning (there is a free copy on the web)
- Data Science from Scratch: First Principles with Python
- Python for Data Analysis, 2e
- Programming Collective Intelligence: Building Smart Web 2.0 Applications
- Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow
- Hands-On Machine Learning with Scikit-Learn and TensorFlow
Form a list of high-quality books by checking reviews, asking other people in the community (Stack Overflow, Quora, at work) or reading a few chapters before you buy. Some books are really not well written. I find this to be quite common in the data science community, where everyone seems to be writing books!
Blogs & papers
A few blog posts were mentioned in the introductory articles section above. This section is about reading much more widely and deeply. Interesting topics could include:
- Machine learning in relation to data science as a whole
- New or less well known machine learning algorithms
- Machine learning in specific domains e.g. finance, weather forecasting, medicine, advertising, image/video classification
- Performance analysis of machine learning techniques
- Implementing algorithms from scratch in Python or other languages
- Patterns/technologies/frameworks/libraries for machine learning
There are many blogs out there: my favourite is Towards Data Science. This author has captured some good ones too:
Most papers come out of the research departments of top academic institutes. These are too numerous to list — but search around for papers from Stanford, Berkeley, MIT etc. They usually have repositories somewhere. Some other sources of research papers:
- ResearchGate: www.researchgate.net
- Google research publications: ai.google
- SpringerOpen: www.springeropen.com
Meetups can be hit and miss. A few can be unclear on the level of proficiency required or tend towards extreme sales pitches. Meetups on machine learning are much easier once you have some foundational knowledge; before that they can be super confusing and a waste of time from a learning perspective.
A few groups I recommend in London:
Do your research and check whether the talk offers something of interest to you and is within your current range of understanding. The other benefit of meetups is meeting people in the community and getting their recommendations.
I cannot comment much on conferences as I haven’t yet been to a data science conference (though I plan to attend ODSC in 2018, yay!). A bit of research and a few recommendations have helped form this list so far:
As with meetups, most conferences assume a degree of proficiency. It might not be wise to go to one right at the start of your journey, but it could be a great way to validate your learning much later on.
Find a mentor
After several months there were a million things jumbling around in my head. After playing with a few data sets I had learnt a lot, failed some more and was left confused about what was next and what was missing in my learning.
With help from my mentor, I formed a mind map of my known unknowns, and this provided structure around the details I needed to learn.
A mentor in data science can also direct you to learn the most important things first and help you apply the 80/20 rule effectively.
An effective way to structure learning is to take real problems (e.g. from Kaggle) and work with a mentor to explore the different ways to solve the problem and tick things off your list as you go along.
Join a data team
If you can, get onto a data science team in an organisation that is solving real data problems with machine learning. Not only do you get a local community to engage with on a daily basis, there is no better end-to-end immersive learning experience than working on a machine learning data product alongside talented people.
This was a broad overview of how you could get going with learning machine learning and data science. There is so much that is not captured in this post, simply because I have not got there yet! I welcome your suggestions for other resources that you have found useful.
Be resourceful in your approach to accelerated learning. Some things will not be applicable in your context or may not work for your learning style. Be willing to try things and throw them away if they do not work for you, and form your own learning pathway.
Thank you to Unruly (my current workplace) for supporting me in learning about data science! Without being part of the data team, a learning budget (20% time, conferences and books) and access to a community of people with similar interests, my rate of learning would have been much slower!