Part-4 Data Science Methodology From Modelling to Evaluation

Ashish Patel
Aug 12, 2019 · 8 min read

From Modelling to Evaluation

Source : Coursera.org

Welcome back to the data science methodology series. So far we have covered three stages of the methodology: from problem to approach, from requirements to collection, and from understanding to preparation, each discussed through a case study. If you have not read the earlier articles, start with the links below; if you have, read on. In this article, you will learn how to select a model and how to evaluate whether that model is ready for deployment.

Article Series :

  1. Overview of Data Science Methodology

#1) Modeling

Modeling is the phase of the data science methodology in which the data scientist has the opportunity to taste the sauce and determine whether it is spot on or needs more seasoning!

This part of the course is designed to answer two key questions:

  • First, what is the purpose of data modeling, and

Data modeling focuses on the development of descriptive or predictive models.

  • An example of a descriptive model might examine things like: if a person did this, then they are likely to prefer that.
  • The success of data collection, preparation and modeling depends on an understanding of the problem in question and the appropriate analytical approach.
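
The descriptive pattern above ("if a person did this, then they are likely to prefer that") can be sketched as a simple conditional share. This is only an illustration under stated assumptions: the `HISTORIES` data and the `confidence` helper are hypothetical and do not come from the course.

```python
# Hypothetical histories: the set of things each person did or chose.
HISTORIES = [
    {"hiking", "camping"},
    {"hiking", "camping", "cycling"},
    {"hiking", "cycling"},
    {"camping"},
    {"hiking", "camping"},
]


def confidence(histories, did, prefers):
    """Share of the people who did `did` that also chose `prefers`."""
    did_it = [h for h in histories if did in h]
    if not did_it:
        return 0.0  # nobody did it, so there is no evidence either way
    return sum(1 for h in did_it if prefers in h) / len(did_it)


# Of the people who went hiking, what share also chose camping?
share = confidence(HISTORIES, "hiking", "camping")
print(f"hiking -> camping: {share:.0%}")
```

A high share suggests a relationship worth describing, but it says nothing about cause; whether it is useful still depends on the question being asked.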

In the descriptive data science methodology of John Rollins, the framework is designed for three things:

  • First, understand the question that concerns you.

The ultimate goal is to bring the data scientist to a point where it is possible to create a data model to answer the question.

  • Once dinner is served and the hungry guests are at the table, the key question is: have I made enough to eat? We hope so. At this stage of the methodology, the model evaluation, deployment and feedback cycles ensure that the answer is relevant and close to what is needed.

Case study:

  • Recall that modeling is the phase in which the data scientist tastes the sauce and decides whether it needs more seasoning. Now let's apply the case study to the modeling phase of the data science methodology.

Here we will discuss one of the many aspects of model construction, in this case optimizing the parameters to improve the model.

  • With a prepared training set, it is possible to build the first decision tree classification model for congestive heart failure readmission. We are looking for patients at high risk of readmission, so the outcome of interest is a congestive heart failure readmission equal to “yes”. In this first model, the overall accuracy in classifying the “yes” and “no” outcomes was 85%. That sounds good, but only 45% of the “yes” outcomes, the actual readmissions, were classified correctly, which means that the model is not very accurate for the outcome we care about.
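  • To improve accuracy on the “yes” outcome, the parameter to adjust is the relative cost of misclassifying “yes” and “no” outcomes. Think of it this way: when a true non-readmission is misclassified and action is taken to reduce that patient's risk, the cost of the error is a wasted intervention.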
  • A statistician calls this a Type I error, or a false positive. But when a true readmission is misclassified and no action is taken to reduce the risk, the cost of the error is the readmission and all its associated costs, as well as the trauma to the patient. This is a Type II error, or a false negative.
  • For the second model, the relative cost was set at 9 to 1. This ratio is very high, but it gives more insight into the behavior of the model. This time the model correctly classified 97% of the “yes” outcomes, but at the expense of very low accuracy on the “no” outcomes, for an overall accuracy of only 49%. Clearly, this is not a good model.
  • For the third model, the relative cost was set to a more reasonable 4:1 ratio. This time the model obtained 68% accuracy on the “yes” outcomes, which statisticians call sensitivity, and 85% accuracy on the “no” outcomes, called specificity, with an overall accuracy of 81%.
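
To make the terms above concrete, here is a minimal sketch of how sensitivity, specificity, overall accuracy and a cost-weighted error total follow from the four cells of a confusion matrix. The counts are hypothetical, chosen only so that the rates resemble those quoted for the third model; they are not the case study's data, and the `summarize` helper is an illustrative name.

```python
def summarize(tp, fn, tn, fp, relative_cost=1.0):
    """Diagnostic measures from confusion-matrix counts.

    tp/fn: actual "yes" cases predicted yes/no.
    tn/fp: actual "no" cases predicted no/yes.
    relative_cost: cost of one missed "yes" (false negative, Type II)
    relative to one wasted intervention (false positive, Type I).
    """
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),  # accuracy on the "yes" outcomes
        "specificity": tn / (tn + fp),  # accuracy on the "no" outcomes
        "accuracy": (tp + tn) / total,  # overall accuracy
        "weighted_cost": fn * relative_cost + fp,
    }


# Hypothetical counts: 100 actual readmissions, 300 non-readmissions.
m = summarize(tp=68, fn=32, tn=255, fp=45, relative_cost=4)
for name, value in m.items():
    print(f"{name}: {value:.4g}")
```

With these made-up counts the sensitivity is 68%, the specificity is 85% and the overall accuracy is about 81%. Overall accuracy is pulled toward the larger class, which is why a model can look accurate while missing most of the "yes" cases.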

#2) Model Evaluation

Model evaluation goes hand in hand with model building. The modeling and evaluation steps are performed iteratively. The model is evaluated during its development and before it is deployed.

  • Evaluation assesses the quality of the model, and also provides the opportunity to determine whether it meets the initial requirements.

The evaluation answers the question:

  • Does the model used really answer the original question or should it be adapted?

The evaluation of the model can have two main phases.

  • The first phase is the diagnostic measures phase, which ensures that the model works as intended. If the model is predictive, a decision tree can be used to assess whether the answer the model produces is aligned with the initial design. This reveals the areas where adjustments are required. If the model is a descriptive model that assesses relationships, a set of tests with known results can be applied and the model refined as necessary.

Case study :

  • Let’s go back to our case study to apply the Evaluation component in the data science methodology.
  • Let’s look at one way to find the optimal model through a diagnostic measure based on tuning one of the model-building parameters. Specifically, we will examine how the relative cost of misclassifying “yes” and “no” outcomes can be adjusted. In the course's table, four models were built with four different relative misclassification costs.
  • With each increase in this model-building parameter, the true positive rate (the sensitivity, or accuracy in predicting “yes”) increases at the expense of lower accuracy in predicting “no”, that is, an increasing false positive rate.
  • So how do we determine which model is optimal? On a ROC chart, which plots the true positive rate against the false positive rate, the optimal model is the one that gives the maximum separation between the blue ROC curve and the red baseline.
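
The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the case study's procedure: instead of rebuilding a decision tree for each relative cost, it thresholds a hypothetical readmission score, and the `SCORED` data, the cost ratios and the function names are all made up. The baseline is the diagonal where the true positive rate equals the false positive rate, so the separation at each point is simply their difference.

```python
# Hypothetical (score, actual) pairs: score is an estimated probability
# of readmission; actual is 1 for "yes" and 0 for "no".
SCORED = [
    (0.05, 0), (0.08, 0), (0.12, 0), (0.15, 1), (0.18, 0), (0.22, 0),
    (0.30, 1), (0.35, 0), (0.45, 1), (0.55, 0), (0.60, 1), (0.75, 1),
]


def roc_point(scored, relative_cost):
    """Return (false positive rate, true positive rate) for one cost ratio.

    If a missed "yes" costs `relative_cost` times a wasted intervention,
    predicting "yes" is the cheaper choice whenever
    score * relative_cost > 1 - score, i.e. score > 1 / (1 + relative_cost).
    """
    threshold = 1.0 / (1.0 + relative_cost)
    positives = sum(y for _, y in scored)
    negatives = len(scored) - positives
    tp = sum(1 for s, y in scored if y == 1 and s > threshold)
    fp = sum(1 for s, y in scored if y == 0 and s > threshold)
    return fp / negatives, tp / positives


def best_by_separation(scored, costs):
    """Pick the cost ratio whose ROC point lies farthest above the diagonal."""
    points = {c: roc_point(scored, c) for c in costs}
    best = max(costs, key=lambda c: points[c][1] - points[c][0])
    return best, points


best, points = best_by_separation(SCORED, [1, 2, 4, 9])
for cost, (fpr, tpr) in points.items():
    print(f"cost {cost}:1  FPR={fpr:.2f}  TPR={tpr:.2f}  gap={tpr - fpr:.2f}")
print("largest separation at relative cost", best)
```

Raising the relative cost lowers the threshold, so both rates climb together; the separation measure rewards the point where sensitivity has grown the most relative to the false positives it brings with it.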

Thanks for reading! Happy learning!

References :

  1. https://www.coursera.org/learn/data-science-methodology

Ashish Patel

Written by

Data Scientist | Kaggle Kernel Master | Deep learning Researcher

ML Research Lab

The vision of the ML Research Lab is to provide best technical tutorial to ML aspirant and Researcher to gain the Knowledge of Machine Learning, Deep Learning, Natural Language Processing, Statistics and Computer Vision.