Data Science Is More Than Machine Learning: An Enterprise-Grade Data Science Case Study (With Docs and Code)

Calvin De Lima
Dec 26, 2018

For aspiring data scientists, or data scientists new to the field who work in a business-centric rather than a research-centric role, it is easy to be blinded by the technical aspects of data science, such as exploratory data analysis and machine learning, and to miss the forest for the trees.

My goal with this post and accompanying GitHub project is to provide a more accurate example of an enterprise-grade data science project that includes what is often missed in these sorts of examples: a rigorous understanding and articulation of the business problem.

Most of the time spent on this project went to understanding and defining the business problem rather than to optimizing a machine learning model. As with any other complex problem, spending some time upfront to work through the various aspects of a business problem systematically reduces the risk of coming up with the wrong solution, which wastes both time (read: opportunity cost) and money.

Another Data Science Example?

There are tons of resources available that discuss and teach us about fundamental aspects of data science including:

  • Machine Learning & AI
  • Statistics & Probability (including Experimental Design and Causality)
  • Software & Systems Development and Engineering

Some do a great job of providing an integrated view of at least some of these aspects, but I have yet to find a resource that provides a true end-to-end example of a real-world, enterprise-grade data science project.

While most resources focus on the technical aspects of data science (exploratory analysis and machine learning), I haven’t found any that cover, in adequate depth, the most critical aspect: thinking deeply about the business context and needs. As a result, I decided to create a detailed case study of working through a common business problem, lead scoring, under a real-world time constraint. Assuming a median data scientist salary of ~$73K [1] and a 40-hour work week, this project would have cost approximately $2.8K to complete and is poised to deliver $187.5K in value, under some strong assumptions.
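The arithmetic behind these figures can be sketched as follows. This is a minimal back-of-the-envelope calculation, not the project's own costing: the salary, work week, cost, and value come from the paragraph above, while the 52-week working year and the helper names are assumptions made for illustration.

```python
# Back-of-the-envelope project economics using the figures quoted in the post.
MEDIAN_SALARY = 73_000      # from the post: ~$73K per year [1]
HOURS_PER_WEEK = 40         # from the post
WEEKS_PER_YEAR = 52         # ASSUMPTION: salary is spread over 52 weeks
PROJECT_COST = 2_800        # from the post: ~$2.8K
PROJECTED_VALUE = 187_500   # from the post: $187.5K, under strong assumptions


def hourly_rate(salary, hours_per_week=HOURS_PER_WEEK, weeks_per_year=WEEKS_PER_YEAR):
    """Convert an annual salary into an hourly labour cost."""
    return salary / (hours_per_week * weeks_per_year)


def implied_hours(cost, rate):
    """Hours of work that a given labour cost corresponds to."""
    return cost / rate


def return_multiple(value, cost):
    """Net return expressed as a multiple of the cost."""
    return (value - cost) / cost


rate = hourly_rate(MEDIAN_SALARY)
hours = implied_hours(PROJECT_COST, rate)
multiple = return_multiple(PROJECTED_VALUE, PROJECT_COST)

print(f"Hourly rate:     ${rate:,.2f}")
print(f"Implied effort:  {hours:.0f} hours ({hours / HOURS_PER_WEEK:.1f} weeks)")
print(f"Return multiple: {multiple:.0f}x the project cost")
```

Under these assumptions, the quoted cost corresponds to roughly 80 hours, or about two working weeks. The return multiple is only as reliable as the $187.5K value estimate it is built on.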

A Thoughtful Approach

This project uses a real-world dataset [2] and addresses a real-world problem (lead scoring), taking a business-first rather than a machine learning-first approach. The narrative that frames the data is slightly contrived, and parts of it may or may not align with reality, but in the real world, data scientists need to collaborate with project stakeholders to address strong assumptions such as the ones made here, to increase the likelihood of a successful project.

You’ll find in the project’s GitHub repo [3] common artifacts of enterprise-grade data science projects including: documentation / project scope, reproducible analyses and experiments, and machine learning code that was extracted from preliminary notebooks into reusable Python modules. The artifacts capture the considerations required to scope and deliver a data science solution that addresses a business problem through the application of critical thinking, exploratory analysis, experimental design and execution, and software engineering.

The primary deliverable of the project is a lead scoring model that could be used to support a bank’s telemarketing campaign for a term deposit product [4]. To some, the notebooks may be the most interesting part, but I recommend that you start by reviewing the documentation in the docs folder to gain a deeper understanding of the business context and of the value of moving forward beyond the project scope.

While this case study presents lead scoring as a specific example of a business problem upon which data science can be brought to bear, the results could be seen as supporting a higher-level mandate of using limited sales resources more efficiently. From a strategic perspective, the results may directly support competitive differentiation and cost-leadership.
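To make "using limited sales resources more efficiently" concrete, here is a minimal sketch of how lead scores might be used to prioritize calls. The leads, scores, and outcomes are made-up toy values rather than results from the project, and the function names are hypothetical.

```python
# TOY DATA ONLY: scores and outcomes are invented for illustration and do not
# come from the project or the bank marketing dataset.
leads = [
    # (lead_id, model_score, subscribed)
    ("a", 0.91, True),
    ("b", 0.15, False),
    ("c", 0.78, True),
    ("d", 0.40, False),
    ("e", 0.66, False),
    ("f", 0.08, False),
    ("g", 0.55, True),
    ("h", 0.22, False),
    ("i", 0.35, False),
    ("j", 0.12, False),
]


def conversions_in_top_k(leads, k):
    """Count conversions among the k highest-scored leads."""
    ranked = sorted(leads, key=lambda lead: lead[1], reverse=True)
    return sum(1 for _, _, converted in ranked[:k] if converted)


def lift_at_k(leads, k):
    """Conversion rate in the top k leads relative to the overall rate."""
    base_rate = sum(1 for _, _, converted in leads if converted) / len(leads)
    top_rate = conversions_in_top_k(leads, k) / k
    return top_rate / base_rate


# Suppose the sales team only has capacity to call 4 of the 10 leads.
capacity = 4
print(f"Conversions reached by calling the top {capacity}: "
      f"{conversions_in_top_k(leads, capacity)}")
print(f"Lift over calling leads at random: {lift_at_k(leads, capacity):.2f}")
```

A lift above 1 means that calling leads in score order reaches more subscribers per call than calling them in arbitrary order, which is the efficiency argument in a single number.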

The Key to Success

Arguably, nothing matters more to a data science project that generates a positive ROI than deeply understanding the problem you are trying to solve and scoping it appropriately. To this end, the CoNVO (Context, Need, Vision, Outcome) framework [5] was applied to synthesize a complete description of the business problem and a formulation of the data science solution.

Once the problem was defined and possible solutions were identified, the following iterative process was carried out to understand how the project’s efforts aligned with expectations:

  • Refining the business understanding
  • Understanding the data
  • Preparing the data for analysis and modelling
  • Designing and executing baseline and advanced analytics (machine learning) model experiments
  • Evaluating the results

The steps above are part of a more technical framework, CRISP-DM [6], that should be applied once the project has been scoped and signed off on. The framework includes a sixth step, deploying the model and evaluating the results, which I have left out for the sake of time (and really, there is an abundance of examples out there).

While the process could have continued indefinitely in the hope of further improving the predictive performance of the lead scoring model, the results after a few iterations would have been sufficient to evaluate in a real-world experiment.

Conclusion

As data scientists in an enterprise or for-profit role, our job is often to design solutions to business problems that are demonstrably likely to have a positive financial impact on the organization. By applying best practices and being systematic in our pursuit of data-driven decision making, we maximize our chance of success, reduce costs, and make it easier to learn what needs to be done better next time.

I would love to hear any questions, thoughts, or feedback you have about this post, so please comment below!

[1] https://www.payscale.com/research/CA/Job=Data_Scientist%2C_IT/Salary

[2] https://archive.ics.uci.edu/ml/datasets/bank+marketing

[3] https://github.com/calvdee/end-to-end-lead-scoring

[4] https://www.investopedia.com/terms/t/termdeposit.asp

[5] http://shop.oreilly.com/product/0636920029182.do

[6] https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
