Seven Pitfalls to Avoid in Machine Learning
As I start my career in data science, I am gradually realizing the differences between academic projects and industrial applications.
The setting of an academic project tends to be simple and straightforward, just like what one sees with a Scikit-learn Pipeline: some encoding and transformation, a train-test split, train the model, then evaluate it. But real-world applications hold many more pitfalls.
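For reference, here is a minimal sketch of that textbook workflow, assuming a feature matrix X and labels y have already been built (the scaler and model choices are just placeholders):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The "academic" loop: split, fit, score.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Most of the pitfalls below happen one step earlier, when X and y are assembled.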
The idea of this post comes from my work experience as a newbie data scientist. It is definitely not an exhaustive list, and I look forward to your feedback!
So let’s get started!
1. Primary key leakage
This is the most easily identifiable pitfall. It might sound stupid, but sometimes we do forget to remove the ID column from our feature set.
y = df["target"]
X = df.drop(columns="target")  # ID is still kept!
Most of the time, this can be easily spotted: regressions fail to converge, there is a huge gap between train accuracy and test accuracy, or the ROC curve has an unreasonably perfect shape.
However, including an ID does not necessarily result in suspiciously high performance. Below is the ROC curve I got by including the appointment ID as a numeric feature to predict no-show appointments. The logistic regression had a "reasonably" poor performance and failed to raise an alert.
Therefore, the best solution is to always check for unique identifiers when building your dataset for prediction.
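One cheap safeguard is to flag any column whose values are unique per row. This is only a heuristic (timestamps or free text can trip it too), and the DataFrame name is an assumption:

# Flag columns that look like row-level identifiers.
id_like = [col for col in X.columns if X[col].nunique() == len(X)]
print(id_like)  # candidates to drop before training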
2. Composite primary key leakage
This one is less obvious and might be hard to identify when one lacks background knowledge. For example, ZIP code, date of birth and gender together uniquely identify most people in the United States¹. Including all of them as features lets the model memorize individuals and generalize poorly.
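A quick way to probe for this is to measure how close a column combination comes to being a unique key. The column names below are assumptions:

# Fraction of rows uniquely identified by the combination.
combo = ["zip", "date_of_birth", "gender"]
uniqueness = df.groupby(combo).ngroups / len(df)
print(f"{uniqueness:.1%} of rows are unique on {combo}")  # near 100% = leakage risk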
3. Time machine
This is a mistake one makes when the use case hasn't really been thought through, e.g. using the actual flight time to predict flight delay. The actual flight time cannot be observed before the delay status is observed; we don't have a time machine to peek at the future.
Sometimes it works more like Schrödinger's cat: once you have observed the feature, the uncertainty dissolves and you are certain about the outcome. Such a feature looks great, with high variable importance, but it tells you nothing interesting. In the appointment no-show example, if you know the check-in time for a patient, you immediately know he/she showed up for the appointment. It's meaningless to use "check-in time" for prediction.
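One defensive pattern is to keep an explicit allowlist of features known before the prediction moment, rather than dropping leaks one by one. The column names here are hypothetical:

# Only features observable before the appointment outcome is known.
KNOWN_AT_PREDICTION_TIME = ["age", "booking_time", "prior_no_shows"]
X = df[KNOWN_AT_PREDICTION_TIME]  # "check_in_time" never makes it in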
4. Label as predictor
Yes, this happens. In appointment no-show prediction, we have the information "whether a patient has called to cancel or not". Canceled appointments should be treated at the same level as no-show appointments: they all represent appointment status. If we use "whether a patient has called to cancel" as a feature, the model performance might be unreasonably high, but meaningless. In this case, we can either add "canceled" as a label (if it's not there yet), or use a rule-based statement to determine the appointment status directly rather than confuse the model:
def appointment_status(call_to_cancel):
    # A cancellation call settles the status by itself; no model needed.
    if call_to_cancel:
        return "Canceled"
    else:
        ...  # fall through to the model for the remaining appointments
5. Different populations
Different populations means that the population in our training and test sets differs from the real-world population. For the appointment no-show problem, one appointment can have several possible outcomes: no-show, completed, rescheduled, canceled. Should we include only the samples with "no-show" and "completed", or keep all of them and group "completed", "rescheduled" and "canceled" into a single non-no-show category? In my opinion, the first approach is problematic. It manually skews the sample distribution, making it different from the true population. When new data comes in with the true status "rescheduled", the model might have no idea how to classify it, for it may carry patterns never seen in training.
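A minimal sketch of the second approach, assuming the outcomes live in a "status" column:

# Keep every appointment; no-show is the positive class.
df["label"] = (df["status"] == "no-show").astype(int)
# The problematic alternative silently narrows the population:
# df = df[df["status"].isin(["no-show", "completed"])]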
6. Biased representation
Bias in machine learning is a huge topic. Simply put, bias can be introduced during data collection, data preparation, data modeling or come from the world itself. It’s a complicated issue. More articles should be dedicated to this topic.
The biased representation here is about certain populations appearing more or less often in the dataset in a way we don't want. When predicting 3-month in-hospital mortality on a rolling basis as shown below, patients with longer hospital stays are over-represented in the sample set. Subsampling can be used to fix this, e.g. randomly drawing 3 samples from each patient's full data.
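Here is a minimal pandas sketch of that subsampling step, assuming the rolling samples sit in a DataFrame with a "patient_id" column:

import pandas as pd

def subsample_per_patient(df: pd.DataFrame, n: int = 3, seed: int = 42) -> pd.DataFrame:
    # Keep at most n rows per patient so long stays don't dominate.
    return df.groupby("patient_id", group_keys=False).apply(
        lambda g: g.sample(n=min(n, len(g)), random_state=seed)
    )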
7. Lack of practical utility
Since I am working in the healthcare industry, for me this is mostly about clinical utility.
A common mistake is looking backward when making predictions². For example, when predicting in-hospital mortality, one looks backward from the end of the admission all the way to the beginning: specifically, predicting in-hospital mortality for the whole hospital stay using data from 1 month before the end of the admission. In the real world, however, we don't know when a patient's admission will end, so the model cannot be used.
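The fix is to anchor the feature window on a time that is actually known at prediction time. A hedged sketch, where the "admissions" DataFrame, its column names and the 2-day offset are all hypothetical:

import pandas as pd

# Broken: anchors on the (future) discharge time.
# cutoff = admissions["discharge_time"] - pd.Timedelta(days=30)
# Usable: anchors on admission time, known when we predict.
cutoff = admissions["admit_time"] + pd.Timedelta(days=2)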
Another example: using data from the last hour for prediction while the actual data ingestion has a delay of 2 hours. In practice, when it's 11 am you can't yet see the 10 am data; you can only get the "stale" data from 9 am.
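To simulate this honestly in both training and serving, only use events that would have landed by prediction time. A sketch assuming an "event_time" column and a known 2-hour ingestion lag:

import pandas as pd

def available_events(events, prediction_time, lag=pd.Timedelta(hours=2)):
    # An event is visible only once the ingestion delay has elapsed.
    return events[events["event_time"] <= prediction_time - lag]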
Conclusion
As mentioned before, this list is not exhaustive, and some items are not strictly exclusive, e.g. "Different populations" is similar to "Biased representation". Essentially, it all comes down to simulating the real use case when designing, building and testing models.
Thanks for reading. I hope you find this article useful!
You can also reach out to me on LinkedIn.
References
[1] Latanya Sweeney. (2000). Simple Demographics Often Identify People Uniquely. https://dataprivacylab.org/projects/identifiability/
[2] Eli Sherman, Hitinder Gurm, Ulysses Balis, Scott Owens, Jenna Wiens. (2018). Leveraging Clinical Time-Series Data for Prediction: A Cautionary Tale.