TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Avoid These Easily Missed Mistakes in Machine Learning Workflows — Part 1

7 min read · Jan 22, 2025

A collage of the three errors this article focuses on: misusing identifiers, ignoring rare feature values, and incorrect data partitioning.
Image by the Author.

One of the most enjoyable things about having worked in machine learning for as long as I have is that there is always something new to learn. That can be a new tool or methodology (given how quickly the field moves, there’s never a shortage of those), but it can also be the discovery of flawed processes in our own work that we had simply never been aware of.

Some of these flaws are obscure and hard to spot at first glance. If they do slip into your model development, there’s a good chance they will hurt the model’s predictive power and thus its reliability and, ultimately, its applicability.

In this article, the first in a series exploring common pitfalls in machine learning, we’ll focus on three data handling errors that can occur during both the preprocessing phase and the modeling phase:

  1. Using Numerical Identifiers as Features
  2. Random Partitioning Instead of Group Partitioning
  3. Including Feature Values with Insufficient Observations
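
As a rough preview of what these three pitfalls look like in practice, here is a minimal sketch in plain Python. The toy records, the field names (`patient_id`, `site`, `label`), the helpers `group_split`, `rare_values`, and `drop_keys`, and the thresholds are all hypothetical illustrations, not taken from the article.

```python
import random
from collections import Counter

# Hypothetical toy records: several rows can belong to one patient (the "group").
rows = [
    {"patient_id": 101, "site": "A", "label": 0},
    {"patient_id": 101, "site": "A", "label": 0},
    {"patient_id": 102, "site": "A", "label": 1},
    {"patient_id": 102, "site": "B", "label": 1},
    {"patient_id": 103, "site": "A", "label": 0},
    {"patient_id": 104, "site": "C", "label": 1},
]


def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Assign whole groups to train or test so that no group spans both."""
    groups = sorted({r[group_key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


def rare_values(rows, feature, min_count=2):
    """Return the values of `feature` seen fewer than `min_count` times."""
    counts = Counter(r[feature] for r in rows)
    return {value for value, count in counts.items() if count < min_count}


def drop_keys(rows, keys):
    """Return copies of the rows without the given keys."""
    return [{k: v for k, v in r.items() if k not in keys} for r in rows]


# Pitfall 2: split by group (patient), not by individual row.
train, test = group_split(rows, "patient_id")

# Pitfall 3: flag feature values with too few observations
# (counted on the full toy data here, purely for simplicity).
rare_sites = rare_values(rows, "site")

# Pitfall 1: use the identifier for splitting only, then keep it
# (and the label) out of the feature set.
train_features = drop_keys(train, {"patient_id", "label"})
```

In a real project, a library splitter that accepts a groups argument would likely replace `group_split`; the point of the sketch is only that the identifier drives the split and then stays out of the features.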

Published in TDS Archive

Written by Thomas A Dorfer

Senior Data Scientist @ BCG. I mainly write about data science and technology.