Checklist for Data Science Research Review

Philip Tannor
8 min read · Jan 23, 2020


This post is meant to serve as a checklist for Data Science professionals reviewing their own research or that of a peer. This is typically done before declaring “success!” and deploying the model you’ve created to production (or bragging about the results to non-technical management).

Imagine that you’re a Data Science team lead, and one of your Data Scientists comes up to you and says “I’ve finally completed the research, and I’m performing better than the naive benchmark by 10%”. And then you scratch your head and think: “Wait, did I double check everything that’s important to double check? What are all of the issues that I’ve missed in the past?”

Similarly, you may be the sole Data Scientist in your organization (or the most experienced one), and you’ve just reached results that seem significant. But then your brain says: “There’s nobody here to peer-review my work. How can I check myself?”

In either case, this post is for you!!!

I suspect that this list will go through a few changes (based on your feedback). When there are major changes it may be worth re-publishing, and after a few versions it may be worth printing out =]

Edit: After posting this, Shay Palachy wrote a great follow-up post which takes a much more thorough approach to peer reviewing data science projects throughout their different phases. I would highly recommend reading/using his post as well, possibly even before reading this one. You can find it here.

Limitations of this List

The goal of this list is to check for common pitfalls regarding the data, the code and the modelling. Most of the focus is on clear “mistakes” that can be corrected, not on instilling a coding culture or best practices.

Why is the List so Long?

This list is meant to contain all of the important issues you should be checking in one place. It’s meant to be returned to when you’re actually checking a Data Science project, not necessarily to dive deeply into each of the issues. We’re planning to post more about this topic from a few different points of view, so if you’d like to learn more about some of these issues feel free to follow our LinkedIn page.

In any case, please let me know if this ever contributes directly to something you’re working on. And more importantly, enjoy!

Assumptions Regarding the Dataset

  1. What does your sampling procedure assume about future data? In which cases will these assumptions break?
  2. Will all of the features available in the dataset be available in the future? Will any of them be calculated differently (and if so, will you know about it)?
  3. Did characteristics of your dataset change over the time it was collected (e.g. different mix of genders, climate change, before/after people had cellphones)? Did you limit your dataset to periods of time during which the characteristics were similar to today?
  4. Often your dataset is sampled to reflect a phenomenon in real life. What are the limitations of your sampling procedure? What can’t your sampling see?
  5. Given that your data is collected over time, are there any low-frequency phenomena that might have been missed? Make sure you collect your data over a long enough time period.

Preprocessing

  1. Make sure you used the same preprocessing procedure for your train and test sets.
  2. Did you normalize before modeling according to the best practice of that algorithm? (This doesn’t matter if the model is decision-tree-based.)
  3. If you normalize the data: Did you calculate the parameters based only on the training data? If not, is the calculation repeatable in the production environment? (A sketch follows this list.)
  4. Is your scaling method sensitive to anomalies? Make sure there isn’t a single sample that strongly affects your scaling
  5. Do you have any delta-like distribution for your features? If the model isn’t decision-tree-based, consider using a non-linear scaling method (log, percentiles) to prevent such situations.
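
As an illustration of item 3, here is a minimal sketch using scikit-learn with placeholder data: the scaling parameters are learned from the training set only and then reused on the test set, so no test-set statistics leak into training.

```python
# A minimal sketch of item 3, with placeholder data: fit the scaler on train only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)        # placeholder features
y = np.random.randint(0, 2, 1000)  # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # parameters (mean/std) come from train only
X_test_scaled = scaler.transform(X_test)        # the same parameters are reused on test

# Common pitfall: calling scaler.fit_transform(X) on the full dataset before splitting,
# which leaks test-set statistics into training.
```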

Leakage and Bias (excluding time related issues)

  1. Does the index contain any information?
  2. Does the index, or any feature calculated while using the index, appear in the final features (before modelling)?
  3. Train a simple decision tree or a random forest on your data and look at the feature importances it produces. Make sure that there isn’t a suspiciously important feature — that might imply a leakage.
  4. Was any of your data generated? If so, check if it’s possible that generation was done differently for different classes/values of the label.
  5. Did you collect samples of your data to train the model? If so, check if this was done differently for different classes/values of the label (e.g. positive samples from one city and negative from a different one).
  6. How many experiments were run to obtain the final evaluation on the test set? If more than 5, this score may potentially include “leaderboard likelihood” (a term which I heard from Seffi Cohen and Nir Malbin for choosing your Kaggle final submission based on too many submissions to the public leaderboard).
  7. Given that you expect your train segment to resemble your test segment, how different are they? Try training a simple model to tell them apart; if it’s significantly successful, this may be a source of leakage (see the sketch after this list).
  8. Can you split your data into train-test segments differently? If so, try doing so and make sure you get the same results.
  9. Does the model performance seem reasonable/possible? Is there some known unavoidable error threshold that you have passed?
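
Item 7 is sometimes called “adversarial validation”. A minimal sketch, assuming your train and test features sit in two pandas DataFrames with identical columns (the data here is synthetic):

```python
# A minimal sketch of item 7 ("adversarial validation"), with synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

cols = [f"f{i}" for i in range(5)]
X_train = pd.DataFrame(np.random.rand(800, 5), columns=cols)
X_test = pd.DataFrame(np.random.rand(200, 5), columns=cols)

# Label each row by the segment it came from, and try to predict that label.
X_all = pd.concat([X_train, X_test], ignore_index=True)
is_test = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X_all, is_test, cv=5, scoring="roc_auc").mean()
print(f"train-vs-test AUC: {auc:.3f}")
# ~0.5 means the segments look alike; close to 1.0 means a model can tell them apart,
# and the most important features of clf will point at the likely culprits.
```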

Causality

  1. How long after your predictions do you expect to obtain new labels in the future? Is this true for all possible values of the label? Did you assume otherwise while planning your system?
  2. Does the training set include any information from the future? (Specifically from the timeframe of the dev/eval/test set or later)
  3. Do any of the features include information that necessarily comes from the future? (e.g. a moving average with a window extending both forwards and backwards; see the sketch after this list)
  4. When predicting “far into the future”, be careful when using a “rolling model”. Make sure that you use it as a “rolling model” during training.
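
For example, here is what item 3 looks like with pandas rolling windows, on a synthetic series (a sketch; the window size is arbitrary):

```python
# A minimal sketch of item 3: a centered rolling window leaks future values,
# a trailing window does not. The series and window size are arbitrary.
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(100))

leaky = s.rolling(window=5, center=True).mean()   # uses t+1, t+2 -> future leakage
safe = s.rolling(window=5).mean()                 # uses only t-4 ... t
stricter = s.shift(1).rolling(window=5).mean()    # also excludes the current value
```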

Loss Function/Evaluation Metric

  1. Does the loss function in the code measure what it should be measuring? (e.g. what was previously discussed)
  2. Does the loss function match the evaluation metric? If not, is the relationship between them monotonic?
  3. Does the evaluation metric match the business metrics (including for all edge cases)? If not, can we add post-processing to take care of the edge cases?
  4. For an ensemble or model with complex post-processing: Does minimizing the different loss functions ensure the optimization of the final output?
  5. For Neural Nets: Did the loss function really stop improving? Is it smooth and flat towards the end of the training, or does it exhibit different behaviors (noisy, still going down)?
  6. For Neural Nets: Does the training process look reasonable? Do the train and validation curves over epochs exhibit standard behavior? (See the sketch after this list.)
  7. For custom/uncommon loss functions: What kind of anomalies is your loss function sensitive to? Make sure it doesn’t reach extremely high values or have unbounded gradients for specific data samples.
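
For items 5–6, the simplest check is to actually plot the curves. A minimal sketch, assuming `history` is a dict of per-epoch losses in the shape of a Keras `History.history` object (adapt to whatever your framework logs):

```python
# A minimal sketch for items 5-6: plot train/validation loss per epoch.
import matplotlib.pyplot as plt

def plot_loss_curves(history):
    plt.plot(history["loss"], label="train loss")
    plt.plot(history["val_loss"], label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()

# Example with made-up numbers:
plot_loss_curves({"loss": [0.9, 0.6, 0.45, 0.40, 0.39],
                  "val_loss": [0.95, 0.7, 0.55, 0.54, 0.56]})

# Things to look for: is the train curve flat at the end (converged) or still dropping
# (train longer)? Is it noisy (learning rate may be too high)? Is validation loss rising
# while train loss keeps falling (overfitting)?
```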

Overfit

  1. Does changing the random seed change the results dramatically?
  2. Does evaluating your model on a random sample of the test set yield the same results?
  3. Did you use a few different folds while evaluating your model? (While making sure the dev/eval/test set doesn’t appear in any of them.) A sketch covering items 1–3 follows this list.
  4. How did you perform hyperparameter tuning/feature selection? Make sure you used the training set only.
  5. Look at your train and test graphs as a function of a complexity parameter (training epochs, depth of tree, etc.). Use them to make sure you aren’t overfitting on your training set.
  6. Can you obtain similar results while reducing the number of parameters in your model? (Note that this point is controversial from an academic point of view, especially for Neural Nets.)
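
A minimal sketch covering items 1–3, with a placeholder model and synthetic data: re-run cross-validation with a few different seeds and look at both the mean and the spread of the scores.

```python
# A minimal sketch for items 1-3: several seeds, several folds, synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)

for seed in [0, 1, 2]:
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"seed={seed}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# If the mean shifts dramatically between seeds, or the std across folds is large,
# the reported result probably depends on a lucky split rather than a real signal.
```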

Runtime

  1. Can you obtain similar results while reducing the number of features in your model? (Similar to the previous point but a bit different)
  2. Which feature takes the longest to calculate? How much of the runtime is spent on this process? Is its addition to the accuracy worth the runtime? (A timing sketch follows this list.)
  3. Which hardware did you assume you’ll have in production? Assuming you’re using a machine that costs half the price, is there still a way to obtain similar results?
  4. For an ensemble: How much better is this than your single best model?
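
One low-tech way to answer item 2 is to time each feature-engineering step in isolation. A minimal sketch with synthetic data and placeholder feature functions (the names stand in for whatever your pipeline actually computes):

```python
# A minimal sketch for item 2: time each (placeholder) feature function separately.
import time
import numpy as np

raw_data = np.random.rand(1_000_000)

feature_functions = {
    "rolling_mean": lambda x: np.convolve(x, np.ones(7) / 7, mode="valid"),
    "rank": lambda x: np.argsort(np.argsort(x)),
    "log_transform": lambda x: np.log1p(np.abs(x)),
}

timings = {}
for name, fn in feature_functions.items():
    start = time.perf_counter()
    fn(raw_data)
    timings[name] = time.perf_counter() - start

for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds:.3f}s")
# Compare the slowest features against how much each one actually adds to the score.
```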

Stupid Bugs

  1. Was the original index column ever deleted? If so, which mechanism was used to make sure the order stayed the same?
  2. Were the names of the columns ever removed? If so, which mechanism was used to make sure the order stayed the same?
  3. Does the index column of the label completely match the index column of the features?
  4. Were there any merges/joins during the preprocessing phase? If so, did any of them: a. create a lot of nulls, or b. change the number of rows to an unexpected number? (A sketch of these checks follows this list.)
  5. Was any dictionary loaded to memory during preprocessing? If so, is it the correct version of the dictionary?
  6. Is the model you are using the model that yields the reported results? (i.e. make sure you are using the correct weight file)
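
Items 3 and 4 are cheap to guard against with a few asserts. A minimal sketch with toy DataFrames (the column names are hypothetical):

```python
# A minimal sketch for items 3-4: asserts that catch silent misalignment after a merge.
import pandas as pd

df_features = pd.DataFrame({"id": [1, 2, 3], "x": [0.1, 0.2, 0.3]})
df_labels = pd.DataFrame({"id": [1, 2, 3], "y": [0, 1, 0]})

# Item 3: the label index must match the feature index exactly, in the same order.
assert df_features["id"].equals(df_labels["id"]), "feature/label indices don't match"

# Item 4: after a merge, check the row count and any nulls it introduced.
n_before = len(df_features)
merged = df_features.merge(df_labels, on="id", how="left")
assert len(merged) == n_before, f"merge changed row count: {n_before} -> {len(merged)}"
assert merged["y"].notna().all(), "merge introduced null labels"
```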

Trivial Questions that Must Be Asked

  1. Is there any important library that you didn’t install and experiment with, just because of technical issues/laziness?
  2. Is there any feature that someone else in the organization is creating especially for you? If so, did you check if you can do without it?
  3. Did you compare with an intelligent benchmark which doesn’t use machine learning? (e.g. average value, most common class, random labels with the same distribution as the training set; see the sketch after this list)
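
scikit-learn’s DummyClassifier (and DummyRegressor) implements exactly these kinds of non-ML benchmarks. A minimal sketch with synthetic data:

```python
# A minimal sketch for item 3: non-ML baselines with DummyClassifier, synthetic data.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for strategy in ["most_frequent", "stratified", "uniform"]:
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    dummy.fit(X_train, y_train)
    print(f"{strategy}: accuracy {dummy.score(X_test, y_test):.3f}")

# Your actual model should beat all of these by a convincing margin.
```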

Conclusion

There are a million things you should be checking, and it’s very hard to find a shortcut that lets you avoid this. This list may save you some of the time you’d otherwise spend recalling your own past mistakes, or calling up colleagues and asking them about theirs. More importantly, though: hire good people! And spend time and energy on training (your people, not only your models)!

[Comic omitted; source: SMBC Comics]

Credit Where It’s Due

For years I’ve been waiting to stumble across a list of this sort, but nobody (that I’m aware of) stepped up to the plate. I started working on it recently after the need came up within a group (which I love =]) of Data Science team leads. I’d like to thank Omri Allouche, Ittai Haran and Amiel Meiseles for their contributions to this list. I’d also like to thank Aviad Klein, Ori Cohen, Noam Bressler, Shay Palachy and Shir Chorev for their important comments regarding this post.
As stated at the beginning of the post, I highly recommend also checking out Shay’s follow up post, which has a more holistic approach towards peer reviewing data science projects at their different phases.
If there’s anything else you think I should add, please reach out (via LinkedIn) and I’ll do my best to try to fit it in.

Philip Tannor is the co-founder and CEO of deepchecks, a company that arms organizations with tools to check and monitor their Machine-Learning-based systems. Philip has a rich background in Data Science, and has experience with projects including NLP, image processing, time series, signal processing, and more. Philip holds an M.Sc. in Electrical Engineering and a B.Sc. in Physics and Mathematics, although he barely remembers anything from his studies that doesn’t relate to computer science or algorithms.
