12 Pitfalls of Machine Learning

Published in create4D, Jun 1, 2019

In "A Few Useful Things to Know About Machine Learning", Pedro Domingos stated that developing successful machine learning applications requires a substantial amount of “black art” that is difficult to find in textbooks. In other words, one needs to learn witchcraft, or develop some intuition, to design the best algorithms. Below, I summarize the pitfalls and how to avoid them, as described by Pedro Domingos.

1. Learning = Representation + Evaluation + Optimization

Machine learning has three major components: 1) representation, i.e. choosing the right classification/regression model; 2) evaluation, i.e. choosing the right cost function; 3) optimization, i.e. choosing the right optimizer. The combination of these three choices determines the performance on test data.
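To make the three components concrete, here is a minimal sketch on a toy problem. The choices are assumptions for illustration only: a one-variable linear model as the representation, mean squared error as the evaluation, and plain gradient descent as the optimizer. The function names (`predict`, `mse`, `fit`) and the data are hypothetical.

```python
# Representation: a linear model y = w * x + b (toy choice for illustration).
def predict(w, b, x):
    return w * x + b

# Evaluation: mean squared error between predictions and targets.
def mse(w, b, xs, ys):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Optimization: batch gradient descent on the mean squared error.
def fit(xs, ys, lr=0.05, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b = fit(xs, ys)
print(w, b, mse(w, b, xs, ys))
```

Swapping any one of the three pieces (a different model, a different cost, a different optimizer) gives a different learner, which is the point of the decomposition.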

2. It is generalization that counts

Generalization to unseen data is the ultimate goal of designing a machine learning algorithm, so performance must be measured on data that was not used for training. The importance of cross-validation cannot be overemphasized.
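As a rough sketch of what k-fold cross-validation does, the snippet below splits the data into k folds, trains on k-1 of them and scores on the held-out fold. Everything here is an assumption for illustration: the toy nearest-centroid classifier, the helper names, and the one-dimensional data. In practice a library routine would normally be used instead.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_fit(X, y):
    """Toy 1-D classifier: remember the mean value of each class."""
    centroids = {}
    for label in set(y):
        vals = [x for x, t in zip(X, y) if t == label]
        centroids[label] = sum(vals) / len(vals)
    return centroids

def nearest_centroid_predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def cross_val_accuracy(X, y, k=5):
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then score on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = nearest_centroid_fit([X[j] for j in train_idx],
                                     [y[j] for j in train_idx])
        correct = sum(nearest_centroid_predict(model, X[j]) == y[j] for j in test_idx)
        scores.append(correct / len(test_idx))
    return scores

# Two well-separated 1-D clusters (made-up data).
X = [0.1 * i for i in range(20)] + [5 + 0.1 * i for i in range(20)]
y = [0] * 20 + [1] * 20

scores = cross_val_accuracy(X, y, k=5)
print(scores)
```

Each example is used for testing exactly once, so the average of `scores` estimates performance on data the model has not seen.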

3. Data alone is not enough

Feeding raw data to the algorithm might not be a good approach. The programmer should have some prior knowledge about the data, so that a better representation can be chosen. Prior knowledge and assumptions help the algorithm generalize beyond the examples it has seen.

4. Overfitting has many faces

One should keep in mind that overfitting is best understood in terms of both bias and variance, not variance alone. Overfitting is often blamed on noise in the training data, but noise is not necessarily the cause; it can also occur with noise-free data. When many hypotheses are tested, it might be a good idea to control the false discovery rate.
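A small hand-built example can show one face of overfitting. Here the data, the two flipped labels, and the two toy learners (a 1-nearest-neighbour memorizer and a single-threshold rule) are all assumptions chosen for illustration. The memorizer is flexible enough to fit the flipped labels, while the threshold rule is too rigid to do so.

```python
def one_nn_predict(train_x, train_y, x):
    """Flexible learner: copy the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def fit_threshold(train_x, train_y):
    """Rigid learner: pick the threshold t that best fits 'predict 1 if x >= t'."""
    def hits(t):
        return sum(int(x >= t) == y for x, y in zip(train_x, train_y))
    return max(sorted(set(train_x)), key=hits)

def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# True rule: label is 1 when x >= 5. Two training labels (x=2, x=7) are flipped.
train_x = list(range(10))
train_y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
test_x = [i + 0.4 for i in range(10)]
test_y = [int(x >= 5) for x in test_x]  # clean labels

nn_train = accuracy([one_nn_predict(train_x, train_y, x) for x in train_x], train_y)
nn_test = accuracy([one_nn_predict(train_x, train_y, x) for x in test_x], test_y)

t = fit_threshold(train_x, train_y)
th_train = accuracy([int(x >= t) for x in train_x], train_y)
th_test = accuracy([int(x >= t) for x in test_x], test_y)

print("1-NN      train/test:", nn_train, nn_test)
print("threshold train/test:", th_train, th_test)
```

The memorizer scores perfectly on the training set and worse on the test set, while the threshold rule does the opposite. A training score alone therefore says little about which model to prefer.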

5. Intuition fails in high dimensions

It is obvious that we cannot visualize a classification or regression problem in our minds when the data has more than 3 dimensions. It is a good idea to look at dimensionality reduction algorithms where possible.
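One standard way to see how intuition breaks down is to compare a ball with the cube that encloses it. The sketch below uses the closed-form volume of a d-dimensional ball of radius 0.5 inscribed in a unit cube; the function name is hypothetical.

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of the unit hypercube's volume that lies inside its inscribed ball (radius 0.5)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d

for d in (2, 3, 10, 20):
    print(d, inscribed_ball_fraction(d))
```

In 2 dimensions the circle covers most of the square, but by 10 dimensions the ball holds well under 1% of the cube's volume. Almost all of the volume sits in the corners, which is not something a 3-dimensional mental picture suggests.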

6. Theoretical guarantees are not what they seem

Theoretical guarantees are mainly a source of understanding; in practice, an algorithm is judged by its measured performance. Even 100% test performance doesn’t guarantee the performance on new data, because we haven’t trained the algorithm with an infinite number of examples. Even then, it might not be possible to design the perfect algorithm mathematically.

7. Feature engineering is the key

The biggest share of effort and time should go into preparing the data with good feature engineering. That is usually more important than designing a good machine learning algorithm. (Feature engineering was mentioned at our School of AI lecture.)
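A minimal sketch of why features matter: two rings of points cannot be separated by a threshold on the raw x coordinate, but a single engineered feature (the squared distance from the origin) separates them perfectly. The data and helper names are made up for illustration.

```python
import math

def ring(radius, n=8):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

points = ring(1.0) + ring(3.0)   # inner ring = class 0, outer ring = class 1
labels = [0] * 8 + [1] * 8

def best_threshold_accuracy(values, labels):
    """Best accuracy of any rule 'predict 1 if value >= t' (or its mirror image)."""
    best = 0.0
    for t in values:
        preds = [int(v >= t) for v in values]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        best = max(best, acc, 1 - acc)
    return best

raw_x = [x for x, _ in points]                  # raw feature
radius_sq = [x * x + y * y for x, y in points]  # engineered feature

print("raw x:     ", best_threshold_accuracy(raw_x, labels))
print("x^2 + y^2: ", best_threshold_accuracy(radius_sq, labels))
```

The learner is the same in both cases; only the representation of the data changed.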

8. More data beats a cleverer algorithm

Ironic but true. More data, rather than a cleverer design, is what makes the algorithm smarter. That is probably because more data helps generalization.

9. Learn many models, not just one

The truth is that you cannot choose the best machine learning algorithm without trying the alternatives. (There are algorithms that combine several models or help to choose the best one. Check AdaBoost and AdaNet in the School of AI lecture on this link.)
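The simplest way to combine several models is a majority vote. Note that this is plain voting, not AdaBoost or AdaNet, and the three "models" below are just hand-written prediction lists whose errors were chosen not to overlap; it is a sketch of the idea only.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model prediction lists by taking the most common label per example."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

truth   = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [0, 0, 1, 1, 0, 0, 1, 1]  # wrong on examples 0 and 7
model_b = [1, 1, 1, 0, 0, 0, 1, 0]  # wrong on examples 1 and 3
model_c = [1, 0, 0, 1, 1, 0, 1, 0]  # wrong on examples 2 and 4

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("a", model_a), ("b", model_b), ("c", model_c), ("vote", ensemble)]:
    print(name, accuracy(preds, truth))
```

Each individual model is right 75% of the time, yet the vote is right on every example because no two models fail on the same one. Real models have correlated errors, so the gain is usually smaller.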

10. Simplicity doesn’t imply accuracy

The number of parameters to set (or the number of layers in a neural network) has no direct connection with the accuracy of the results. One shouldn’t assume that a simpler model is automatically more accurate, nor that adding more parameters (or more layers) will increase the accuracy.