15 Lessons We Learned About Successfully Managing Data Science Projects

Saurav Dhungana
CraftData Labs
Oct 27, 2017 · 4 min read

During the short time CraftData Labs has existed, we’ve been lucky enough to work on many different types of Data Science projects. These range from doing original AI research and productionizing real-time Machine Learning pipelines to creating Data Products and Custom Dashboards.

However, along with the successes, we’ve also made our share of mistakes. This post is our attempt at sharing some of the hard-earned knowledge we gained working on a wide variety of complex Machine Learning projects. We hope it will help others avoid the mistakes we made. Sharing your own experiences in the comments is highly encouraged.

  1. Get a comprehensive description of the project and the data from the client/stakeholder before starting. Ask for as much documentation, metadata and context about the data as possible. It is better to surface problems at the beginning than after you are committed to an approach.
  2. Don’t give unreasonably short timelines just to impress the client. In fact, overestimating is better in most cases, because something will invariably go wrong in the data collection and preparation phases.
  3. Be really thorough when doing the initial assessment of the data. Have a standard checklist ready for every project. The most important potential issues to look for are badly formed data fields, missing dates, date format inconsistencies, extreme or outlier values in columns, and missing values and nulls (a minimal checklist sketch follows this list).
  4. Be really, really careful about how you handle NULL values, and be aware of the implications of the assumptions you make. For example, do you use a blank string, the string “NULL”, a numeric 0, or the native NULL value of the platform you are developing on? In most cases these do not mean the same thing (see the second sketch after this list).
  5. Having good features is more important than using the latest models. Although the latter might take precedence in some projects, carefully study the domain of the project and create as many features as you can to capture the relations between the variables. This will give you a more generalizable model with higher prediction accuracy over a greater variety of data.
  6. When creating training and test data sets, make sure the classes/labels are properly balanced in both. Use some form of stratification to achieve this; it helps a lot in improving the accuracy of the model (see the stratified-split sketch after this list).
  7. Stale or outdated data can gradually degrade an algorithm’s performance. This is especially true for time-series data. If you notice this type of degradation, first check that the data is as fresh as it can be, then retrain the model.
  8. Incorporate comprehensive tests in your production data pipelines as well as model training code. The test environments for the two phases should be independent of one another.
  9. Test getting data into the training algorithm. Check that all the important feature columns are populated, and manually inspect the input to your test set as well.
  10. Test getting models out of the training algorithm. Make sure that the model in your production environment gives the same score as the model in your training environment on the same dataset (see the round-trip sketch after this list).
  11. Database reads and writes are expensive in production, especially for real-time systems. Serialize the model coefficients in memory wherever possible (the round-trip sketch below also shows this).
  12. Machine Learning has an element of unpredictability, so make sure you have tests for the code used in both training and production. Always use k-fold cross-validation to tune the model parameters (see the cross-validation sketch after this list), and update production with a retrained model as frequently as possible.
  13. Try an ensemble of 2–3 models for added accuracy. Do this at the end, after you have optimized the initial model as much as you can, then combine the models with a voting classifier using soft or hard voting (see the final sketch after this list).
  14. Do sanity checks right before you export the model. Make sure the model’s performance is reasonable on the test dataset. If you have lingering concerns about the data, don’t export a model.
  15. Make sure the user/stakeholder isn’t misled by your results, as this can cause serious problems later on. In fact, having no results is preferable to having wrong results. A wrong model deployed to production can be disastrous in terms of cost, time and even reputation. There are many reasons why even the best Machine Learning model won’t give us the prediction accuracy we desire, but this doesn’t mean the results have to be perfect. Be honest when interpreting the end results for stakeholders who are not well versed in Data Science.
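
Below are a few short sketches illustrating some of the points above. They are minimal examples using pandas and scikit-learn, not production code, and the file names, column names and parameters are hypothetical. First, a data-assessment checklist in the spirit of point 3:

```python
import pandas as pd

df = pd.read_csv("project_data.csv")  # hypothetical input file

# Standard checks we run on every new dataset
print(df.isnull().sum())   # missing values and nulls per column
print(df.dtypes)           # badly formed fields often load as 'object'
print(df.describe())       # spot extreme or outlier values in numeric columns

# Date format inconsistencies: rows that fail to parse become NaT
dates = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
print(f"{dates.isnull().sum()} rows have missing or malformed dates")
```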
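
For point 4, a short illustration of why a blank string, the string “NULL”, a numeric 0 and a native null are not interchangeable in pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series(["", "NULL", 0, np.nan])

# Only the native null (np.nan) is treated as missing
print(s.isnull())     # False, False, False, True
print(s.fillna(-1))   # only the last element is replaced
```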
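
Point 6 is a one-liner in scikit-learn: passing the labels to the stratify argument keeps the class proportions the same in the training and test sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset standing in for your real features and labels
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# stratify=y preserves the 90/10 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```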
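
For points 10 and 11, one way (not the only one) to test the hand-off is to serialize the fitted model in memory, load it back the way production would, and assert that the scores match:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression().fit(X, y)

# pickle.dumps keeps the fitted coefficients as an in-memory byte string,
# with no database round-trip involved
restored = pickle.loads(pickle.dumps(model))

# The restored model must score identically on the same dataset
assert model.score(X, y) == restored.score(X, y)
```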
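
For point 12, GridSearchCV in scikit-learn runs k-fold cross-validation over a parameter grid; the grid below is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# 5-fold cross-validation over a small, illustrative parameter grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```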
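
Finally, for point 13, scikit-learn’s VotingClassifier combines a few already-tuned models. Setting voting="soft" averages the predicted class probabilities, while voting="hard" takes a majority vote on the predicted labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)

# Ensemble of three base models; 'soft' voting averages class probabilities
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```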
