Books, Courses and Datasets that Every Data Scientist Should Know

Mohammed Alhamid
5 min read · Oct 31, 2019

My Roadmap to Machine Learning

Photo by Nick Morrison on Unsplash

First of all, you are on the right track. Working on Machine Learning (ML) to build your knowledge of Artificial Intelligence (AI) is an excellent choice. AI is making rapid progress in software development as well as in almost every aspect of our daily lives.

“YOU DON’T HAVE TO BE GREAT TO START, BUT YOU HAVE TO START TO BE GREAT.” –ZIG ZIGLAR

Advancements in AI

AI is evolving quickly, but the essential thing is to start. Begin by understanding the terminology, what AI can and cannot do, and what the benefits of the various machine learning algorithms are.

Many organizations are exploring AI in an attempt to boost performance or to monetize the data they have.

Machine Learning is Fun

Although the terminology in this domain makes it look like a complicated topic to learn, it is not. It is simpler today than ever before!

In a few simple steps, you can load your data, analyze the correlations, and specify the target you want to predict. Designing and developing the right algorithm to solve a problem can be a little hard and requires some expertise, but it is surely a lot of fun. Tuning the algorithm and working on improving the results is worth the effort.
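
As a rough sketch of those few steps (load the data, look at correlations, pick a target), here is a minimal example. It uses only the Python standard library so that it runs anywhere. The toy housing data, the column names, and the helper functions `load_columns` and `pearson` are all made up for illustration; in a real project you would more likely read a file from disk with a data-analysis library.

```python
import csv
import io
import math

# Hypothetical toy data; in practice this would come from a file on disk.
RAW_CSV = """size_sqm,rooms,age_years,price
50,2,30,150
65,3,25,200
80,3,20,240
95,4,15,290
110,4,10,330
130,5,5,400
"""


def load_columns(text):
    """Parse CSV text into a dict mapping column name -> list of floats."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Step 1: load the data.
columns = load_columns(RAW_CSV)

# Step 2: specify the target we want to predict.
target_name = "price"
target = columns[target_name]

# Step 3: measure how strongly each feature correlates with the target.
correlations = {
    name: pearson(values, target)
    for name, values in columns.items()
    if name != target_name
}

# Print features from the strongest to the weakest correlation.
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:+.3f}")
```

Features with a correlation close to +1 or -1 are usually the first candidates to feed into a model, although correlation alone does not tell the whole story.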

What amazes me about ML and AI is their ability to tackle different domains such as Computer Vision (e.g., handwriting recognition), Linguistics (e.g., translation), Finance (e.g., fraud detection), and others. With a ton of open source projects available, you can play around with many different ML algorithms; all you need is a “clone.”

My way of learning is splitting topics into a list of stations so that I can manage my time and goals.

Programming Languages?

Many data scientists use Python in their projects. You will find that the vast majority of ML books and online courses teach ML in Python. There is also a growing body of open source code written in Python, which can significantly speed up your start. Here is a great article that compares the evolution of the different languages used for ML.

If you are coming from a Software Engineering background, you are not alone! We understand the Software Development Life Cycle (SDLC), and our minds are used to seeing development as a continuous, iterative process. Programming an ML model is slightly different: once the data becomes available, the majority of the effort is spent on feature engineering and modeling. Many frameworks describe the ML development life cycle; here is my view of it.

Machine Learning Development Process

Books

Books in Machine Learning

Books in Statistics

  • Title: “Passion Driven Statistics.” This book is a handy reference for implementing different statistical functions, such as the “Chi-Square Test.” It shows how such functions can be implemented in different languages, including Python.
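
To give a feel for what implementing such a function looks like, here is a minimal sketch of a Chi-Square test of independence in plain Python. It is not taken from the book; the function names and the 2x2 table of counts are made up for illustration, and the p-value shortcut shown is valid only for one degree of freedom. Statistics libraries offer tested versions of this test, and their results can differ slightly from this sketch (for example, SciPy's `chi2_contingency` applies a continuity correction to 2x2 tables by default).

```python
import math


def chi_square_independence(observed):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count we would expect if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """P-value of a chi-square statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts: rows are two groups, columns are two outcomes.
observed = [
    [30, 10],
    [20, 40],
]

statistic, dof = chi_square_independence(observed)
p_value = p_value_one_dof(statistic)  # valid here because dof == 1
print(f"chi2 = {statistic:.3f}, dof = {dof}, p = {p_value:.6f}")
```

A small p-value suggests that the two variables are unlikely to be independent.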

Courses

Here I highlight a selection of courses that I think offer high-quality material.

Learning Python:

  • Title: “Applied Data Science with Python” on Coursera

Machine Learning Courses:

  • Title: “Deep Learning Specialization.” A group of five courses on Coursera
  • A group of free courses offered by “Fast.ai”

Place to Practice

“I’M STILL LEARNING.” –MICHELANGELO

Taking an online course is great, but wouldn’t it be better to learn something by engaging in a real project? Here is a place you should consider:

  • Kaggle: Kaggle is an online community for data scientists that Google acquired in 2017, and it is one of my favorite places to learn and find real-world problems. You can find many useful datasets to practice on, build models, and compete with some of the greatest data scientists around the world.

Datasets

Data is gold, and finding a high-quality dataset for your experiment is quite a challenge. Here are a few great places to look for data:

  • Kaggle: An online community for data scientists, owned by Google.
  • Google Dataset Search: Similar to Google Scholar, but for finding available datasets, including data used in published scientific articles.
  • Amazon Datasets: You can find many datasets hosted publicly on Amazon AWS.
  • Microsoft Research Open Data: Microsoft has made many datasets available here, along with some other hosted ones.
  • OpenML: An online platform for sharing data, models, and algorithms.
  • Visualdata: An invaluable source for Computer Vision-related experiments.
  • PapersWithCode: A platform to find published articles with their code and datasets.

Lessons Learned

There is a vast selection of online resources for learning ML. The best thing I did was to focus on mastering Python while learning just the “theory” behind different ML algorithms. Once I had done that, I moved on to building and training different models. Engaging in real-world projects, especially on Kaggle, is by far the best way to accelerate your learning. At the beginning, I focused on building and training models on datasets that do not require extensive data cleaning or preprocessing. Doing that helped me concentrate on essential topics, like model selection and parameter tuning.

Improving model performance is great, but building some knowledge of model deployment and integration is crucial. Feature engineering can help us pick the right features for the model, but deployment experience makes our engineering judgment more realistic. Even accurate models that are meant to automate business processes are useless if we cannot put them into a production environment.

Final Note

At the beginning of your learning journey, many models may feel like black boxes that work like magic! You might struggle to figure out why a model failed to predict certain test cases, or how a single hyperparameter could change the results so significantly. You might find it useful to start with small datasets that have only a few features, so that you can trace and debug the model.
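
To make that concrete, here is a small sketch of what a traceable experiment can look like. It is only an illustration under stated assumptions: the nine data points, the deliberately mislabeled point, and the helper functions `predict` and `leave_one_out_accuracy` are all made up, and the model is a hand-written k-nearest-neighbors classifier rather than a library implementation. The dataset is small enough to check every prediction by hand while changing a single hyperparameter, `k`.

```python
from collections import Counter

# Hypothetical toy dataset: ((feature_1, feature_2), label).
POINTS = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.5), "a"), ((2.5, 2.5), "a"),
    ((6.0, 6.0), "b"), ((6.5, 7.0), "b"), ((7.0, 6.5), "b"), ((7.5, 7.5), "b"),
    ((6.8, 6.8), "a"),  # a deliberately noisy point inside the "b" cluster
]


def predict(train, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Hold out each point in turn, predict it from the rest, and score the hits."""
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if predict(train, features, k) == label:
            correct += 1
    return correct / len(data)


# Try several values of the hyperparameter k and compare the results.
results = {k: leave_one_out_accuracy(POINTS, k) for k in (1, 3, 5, 7)}
for k, accuracy in results.items():
    print(f"k = {k}: accuracy = {accuracy:.2f}")
```

With so few points, you can follow exactly which neighbors voted for each prediction, and see why a very small `k` is thrown off by the noisy point while a very large `k` lets the other cluster outvote the right answer.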

Bonne Chance!

I would like to thank Chad Marston and Jorge Castañón, Ph.D. for their great feedback on this story.
