Main Challenges of Machine Learning

Uttam Singh
5 min read · Apr 10, 2023

In this era of Generative AI, it seems that the fundamentals of machine learning are often overlooked. Achieving the level of accuracy and precision seen in popular models such as GPT-4 requires a significant investment of time and effort, as well as overcoming numerous obstacles. That’s why I’m writing this blog post: to shed light on some of the challenges involved in creating machine learning models and to emphasize the tremendous amount of work required to build a reliable model.

Challenge 1. Insufficient Amount of Data

For a toddler to learn what a cat or a dog is, you just need to show them a cat or a dog once, and they can recognize one most of the time afterward, even across different colors or breeds. For a machine learning model to learn the same distinction, however, it may take thousands or even millions of images.

In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different machine learning algorithms, including fairly simple ones, performed almost identically on a complex natural language problem once they were given enough data. It should be noted, though, that small and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data.
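To see this effect for yourself, here is a minimal sketch using scikit-learn’s learning_curve on a synthetic stand-in dataset (the dataset, model, and parameters are illustrative assumptions, not taken from the Banko and Brill study). It evaluates the same model at increasing training-set sizes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Evaluate the same model at 10%, 32.5%, ..., 100% of the training data
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} training examples -> validation accuracy {score:.3f}")
```

Validation accuracy typically climbs as the training set grows and then flattens out, which is the same pattern Banko and Brill observed at a much larger scale.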

Challenge 2. Non-representative Training Data

Let’s take the example of building a model to predict animal classes. Imagine that your training data only contains images of cats and dogs, but your test data includes images of horses and rabbits. In this scenario, your training data is not representative of the real-world problem you are trying to solve. You might wonder why you would use a model that only classifies cats and dogs to classify horses and rabbits. It’s clear that the training data needs to be more diverse and representative of the real-world problem to achieve accurate results.
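Before training anything, a cheap sanity check is to verify that every class you will be asked to predict actually appears in your training data. A minimal sketch, with hypothetical label sets:

```python
# Hypothetical label sets; in practice, collect these from your datasets
train_labels = {"cat", "dog"}
test_labels = {"cat", "dog", "horse", "rabbit"}

unseen = test_labels - train_labels
if unseen:
    print(f"Warning: test set contains classes never seen in training: {unseen}")
```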

Let me provide a different example that may make more sense. Suppose you are trying to predict crude oil prices using data that only includes crude oil prices from the previous 20 years. Even if you use the best possible model, you may not be able to reduce your errors. The reason for this is that your training data is not representative of the complex web of factors that affect crude oil prices, such as the balance between supply and demand, geopolitical events, OPEC policies, economic conditions, natural disasters, and currency exchange rates. Therefore, to achieve accurate results, you need to incorporate a more diverse range of data that is representative of the real-world factors that impact crude oil prices.

Still, the best-known example of non-representative data is the 1936 Literary Digest poll: the magazine sampled millions of voters drawn largely from telephone directories and automobile registrations, a sample that skewed wealthy, and confidently predicted that Landon would beat Roosevelt in the US presidential election. Roosevelt won in a landslide.

Challenge 3. Poor Quality Data

If your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.

It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend almost 80% of their time doing just that.
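As an illustration, here is a minimal cleaning sketch using pandas (the column names and values are made up). It drops duplicates and missing values, then flags outliers with the interquartile-range (IQR) rule:

```python
import pandas as pd

# Tiny made-up dataset with missing values and an impossible measurement
df = pd.DataFrame({
    "height_cm": [170, 165, 180, 9999, None, 172],
    "weight_kg": [70, 60, 80, 75, 65, None],
})

df = df.drop_duplicates()   # remove duplicate rows
df = df.dropna()            # drop rows with missing values

# Keep only rows whose height falls within 1.5 IQR of the quartiles
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["height_cm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df)   # the 9999 cm "height" is gone
```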

An example of an outlier (image credit: https://unsplash.com/photos/ku_ttDpqIVc)

Challenge 4. Irrelevant Features

As the saying goes, “garbage in, garbage out.” The performance of a machine learning model is highly dependent on the relevance of the features present in the dataset. If your data contains too many irrelevant features, your model will learn from those irrelevant features as well, which ultimately hurts its performance.

A critical part of the success of a machine learning project is coming up with a good set of features to train on. This process, called feature engineering, involves the following steps (a short code sketch follows the list):

  1. Feature Selection: choosing the most relevant features to train on among all existing features.
  2. Feature Extraction: combining existing features to produce a more useful one (dimensionality reduction algorithms such as PCA or t-SNE can help).
  3. Creating New Features: gathering new data to build features that do not yet exist.
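Here is a minimal sketch of the first two steps using scikit-learn (the synthetic dataset and parameter choices are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 30 features, only 5 of which carry real signal
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)

# 1. Feature selection: keep the 5 features most strongly related to the target
X_selected = SelectKBest(f_classif, k=5).fit_transform(X, y)

# 2. Feature extraction: combine the selected features into 2 components
X_extracted = PCA(n_components=2).fit_transform(X_selected)

print(X.shape, "->", X_selected.shape, "->", X_extracted.shape)
# (500, 30) -> (500, 5) -> (500, 2)
```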

Challenge 5. Overfitting the Training Data

Say, for example, you start your master’s program and meet a person who is not social at all, and you presume that everyone in the program is unsociable and unwilling to help. Here you are overgeneralizing after interacting with just one person; the person you met might be an outlier in the whole class.

Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In machine learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.

Just imagine if our model learned from the messy room below that this is the best way to keep everything in a room.

A photo depicting a messy room

Overfitting usually happens when there is a lot of noise in the data, when there is not enough data to learn from, or when the model is too complex relative to the data.
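Here is a minimal sketch of overfitting in action, assuming scikit-learn (the noisy sine-wave data and the polynomial degrees are illustrative choices). The high-degree model fits its training points almost perfectly but scores far worse on fresh data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=30)        # noisy sine wave
X_test = rng.uniform(-3, 3, size=(100, 1))
y_test = np.sin(X_test).ravel() + rng.normal(0, 0.2, size=100)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(f"degree {degree:>2}: train R2 = {model.score(X, y):.2f}, "
          f"test R2 = {model.score(X_test, y_test):.2f}")
```

Degree 1 underfits, degree 3 captures the sine shape reasonably well, and degree 15 memorizes the noise: a large gap between training and test scores is the telltale sign of overfitting.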

Challenge 6. Underfitting the Training Data

An example of underfitting in real life could be a student who is studying for an exam. If the student only reads the material once and does not practice any questions or review the material in depth, they may not perform well on the exam. This is because they have not learned the material well enough to apply it to different scenarios or questions.

In this case, the student is underfitting the material by not studying it thoroughly enough to achieve a good understanding of it. This is similar to underfitting in machine learning, where the model is not complex enough to accurately capture the relationships between the features and the target variable.

Just like the student needs to practice and review the material to perform well on the exam, a machine learning model needs to be trained with sufficient data and a suitable level of complexity to avoid underfitting and achieve good performance on the task at hand.
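To make that concrete, here is a minimal sketch of underfitting, assuming scikit-learn (the quadratic data is an illustrative assumption). A straight line is too simple for a quadratic relationship, so it scores poorly even on the data it was trained on; giving the model a squared feature fixes that:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.3, size=200)   # quadratic target

# Too simple: a line cannot bend to follow a parabola
linear = LinearRegression().fit(X, y)
print(f"train R2, linear model: {linear.score(X, y):.2f}")          # near 0

# Enough capacity: add the squared feature and refit
X_quad = np.hstack([X, X ** 2])
quad = LinearRegression().fit(X_quad, y)
print(f"train R2, quadratic feature: {quad.score(X_quad, y):.2f}")  # near 1
```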

These are just a few of the challenges that a data scientist or machine learning engineer faces when building an effective machine learning model. There are ways to overcome these challenges, but for now, I want to focus on understanding the challenges themselves. In my next post, I will delve into ways to tackle these challenges.

For future updates, please follow me.

References

  1. Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd edition: Concepts, Tools, and Techniques to Build Intelligent Systems (O’Reilly, 2022).
