**Part-4 Data Science Methodology From Modelling to Evaluation**

From Modelling to Evaluation

Welcome to the data science methodology series. So far we have covered the first **three stages of the data science methodology**: *Problem to Approach, Requirement to Collection, and Understanding to Preparation.*

**We discussed each stage with a case-study approach. If you haven't read the earlier articles in this series, start from the links below; if you have already read them, continue with this article. Here you will learn how to build a model and how to evaluate whether that model is ready for deployment.**

**Article Series:**

- **Overview of Data Science Methodology**
- **Part-1 Data Science Methodology- From Problem to Approach**
- **Part-2 Data Science Methodology From Requirement to Collection**
- **Part-3 Data Science Methodology From Understanding to Preparation**
- **Part-4 Data Science Methodology From Modelling to Evaluation**
- **Part-5 Data Science Methodology From Deployment to Feedback**

# #1) Modeling

Modeling is the phase of the data science methodology in which the data scientist has the chance to taste the sauce and determine whether it is just right or needs more seasoning!

*This part of the course is designed to answer two key questions:*

*First, what is the purpose of data modeling, and*
*Second, what are the characteristics of this process?*

**Data modeling** focuses on the development of *descriptive or predictive models.*

- An example of a descriptive model might be the following: if a person did this, then they are likely to prefer that.
- A **predictive model** attempts to provide **yes/no** or **stop/go** outcomes. These models are based on the analytic approach taken, either statistically or through machine learning. The data scientist will use a training set for predictive modeling. *A training set is a set of historical data in which the outcomes are already known.* The training set acts as an *indicator to determine whether the model needs to be calibrated*.
- At this point, the data scientist will try several algorithms to ensure that the variables in play are really needed.
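The idea of fitting a predictive yes/no model on a training set of historical records with known outcomes can be sketched as follows. This is a minimal illustration assuming Python with scikit-learn; the feature values and labels are toy data invented for this example, not from the case study.

```python
# Sketch: fitting a yes/no predictive model on historical data with known outcomes.
# All feature values and labels below are invented toy data.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Historical records: each row is one case, the label is the known outcome (1 = yes, 0 = no).
X = [[65, 1], [72, 3], [50, 0], [80, 4], [45, 0], [77, 2], [58, 1], [69, 3]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

# The training set is the historical data the model learns from; a held-out
# portion serves as the indicator of whether the model needs to be calibrated.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
```

With known outcomes in hand, the held-out score is what tells us whether the recipe needs adjustment before anyone tastes the final dish.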

- The **success** of *data compilation, preparation and modeling depends on an understanding of the problem at hand and the appropriate analytical approach.*
- The data support the answering of the question; just as the quality of the ingredients in the kitchen is the basis of the dish, the quality of the data is the basis of the result.
- Each stage requires constant refinement, adjustment and tweaking to ensure a strong result.

*In John Rollins' descriptive data science methodology, the framework is designed to do three things:*

*First, understand the question that concerns you.*
*Secondly, choose an analytical approach or method to solve the problem.*
*Thirdly, obtain, understand, prepare and model the data.*

The ultimate goal is to bring the data scientist to a point where it is possible to create a data model to answer the question.

- While dinner is being served and a hungry guest sits at the table, the key question is: have I made enough to eat?
- *We hope that at this stage of the methodology, the model evaluation, deployment and feedback cycles will ensure that the answer is relevant and close to the desired result.*
- This relevance is essential for the whole field of data science, as it is a relatively new field and we are interested in the possibilities it offers.
- The more people benefit from the results of this practice, the more the field develops.

## Case study:

- Modeling is the phase of the data science methodology in which the data scientist gets to taste the sauce and decide whether it is just right or needs more seasoning! Now let's apply the case study to the modeling phase of the data science methodology.

Here we will discuss one of the many aspects of model building, in this case tuning the parameters to improve the model.

- With a prepared training set, the first decision-tree classification model for congestive heart failure readmission can be built. We are looking for patients at high risk of readmission, so the outcome of interest is congestive heart failure readmission equal to "yes". In this first model, the overall accuracy in classifying the yes and no outcomes was 85%. This sounds good, but only 45% of the actual "yes" readmissions were classified correctly, which means the model is not very accurate.
- The question is: *how to improve the model's accuracy in predicting the yes outcome?* For decision-tree classification, the best parameter to adjust is the relative cost of misclassified yes and no outcomes.

*Think of it this way:* when a true *non-readmission is misclassified* and **actions are taken to reduce the risk for that patient**, the *cost of that error is the wasted intervention*.

- A statistician calls this a **Type I error, or a false positive.** But the cost is different *when a real readmission is misclassified and no action is taken to reduce the risk*: the cost of that error is the *readmission* and all its associated costs, along with the trauma to the patient.
- That is a **Type II error, or a false negative**. So *we can see that the costs of the two different kinds of misclassification errors* can be very different. For this reason, it is reasonable to adjust the relative weights of misclassified yes and no outcomes.
- The default is 1-to-1, but the decision-tree algorithm allows a higher value to be set for yes.
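The two error types can be made concrete with a small confusion-matrix sketch in plain Python. The actual and predicted outcomes below are invented for illustration (1 = readmitted, "yes"; 0 = not readmitted, "no"):

```python
# Sketch: counting Type I (false positive) and Type II (false negative) errors.
# The outcome and prediction values are invented toy data.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # Type I: wasted intervention
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # Type II: missed readmission

sensitivity = tp / (tp + fn)  # accuracy on the "yes" outcomes
specificity = tn / (tn + fp)  # accuracy on the "no" outcomes
print(f"FP (Type I) = {fp}, FN (Type II) = {fn}")
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Each false positive costs one wasted intervention, while each false negative costs a full readmission, which is why the two counts deserve different weights.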

- For the **second model**, the relative cost was set at **9:1**. This is a very high yes-to-no ratio, but it gives more insight into the model's behavior. This time the model correctly classified *97% of the yes* outcomes, but at a *very low accuracy on the no* outcomes, *with an overall accuracy of only 49%.* Obviously, *this is not a good model.*
- The **problem** with this outcome is the *large number of false positives*, which would **suggest unnecessary** and **costly interventions** for patients who would never have been readmitted.
- Therefore, the data scientist must try again to find a better balance between the *yes and no accuracies*.

- For the **third model**, the relative cost was set to a more reasonable *4:1 ratio*. This time, **68% accuracy was obtained on the yes**, which a statistician calls **sensitivity**, and **85% accuracy on the no**, called **specificity**, with an overall **accuracy of 81%.**
- This is the best balance achievable with a relatively small training set by adjusting the relative cost of the misclassified yes and no outcome parameters. Of course, modeling requires much more work, including iterating back to the data preparation stage to redefine some of the other variables so that they better represent the underlying information, and thereby improve the model.
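The cost-tuning loop above can be sketched in code. The case study does not name a toolkit, but in scikit-learn (one common choice) the relative misclassification cost can be approximated with the `class_weight` parameter of a decision tree; the data below are synthetic, invented only to show the mechanics of the 1:1, 9:1 and 4:1 settings:

```python
# Sketch: varying the relative weight of the "yes" class in a decision tree,
# mirroring the case study's 1:1, 9:1 and 4:1 cost settings. Synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
# Imbalanced synthetic outcome: "yes" (readmission, 1) is the rarer class.
y = (X[:, 0] + rng.normal(scale=1.5, size=n) > 1.2).astype(int)

results = {}
for cost in (1, 9, 4):  # relative cost of a misclassified "yes"
    model = DecisionTreeClassifier(max_depth=3, class_weight={0: 1, 1: cost}, random_state=0)
    model.fit(X, y)
    pred = model.predict(X)
    sensitivity = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()  # accuracy on yes
    specificity = ((pred == 0) & (y == 0)).sum() / (y == 0).sum()  # accuracy on no
    results[cost] = (sensitivity, specificity)
    print(f"{cost}:1 cost -> sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

As in the case study, raising the yes weight trades specificity for sensitivity; the exact numbers here depend on the synthetic data, not on the real patient records.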

# #2) Model Evaluation

Model evaluation goes hand in hand with model building. The modeling and evaluation stages are done iteratively. Model evaluation is performed during *model development and before deployment.*

*The evaluation assesses the quality of the model, but it also provides an opportunity to determine whether the model meets the initial requirements.*

*The evaluation answers the question:*

*Does the model used really answer the original question or should it be adapted?*

*The evaluation of the model can have two main phases.*

**The first phase** is the diagnostic measurement phase, *which ensures that the model is working as intended.* If the **model is predictive**, a *decision tree can be used to evaluate whether the answer the model provides is aligned with the initial design.* This makes it possible to see where adjustments are required. If the **model is a descriptive model**, one that **evaluates relationships**, a *set of tests with known results can be applied, and the model refined as needed.*

**The second phase** of evaluation that may be used is the **statistical significance test.** This type of *evaluation* can be applied to the *model to ensure that the data are being handled and interpreted correctly within the model.* This is designed to *avoid unnecessary second-guessing when the answer is revealed.*
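One simple form of statistical significance test (a sketch of the idea, not the only option) is a permutation test: shuffle the known outcomes many times and ask how often the model's observed accuracy could have arisen by chance. The predictions and labels below are invented for illustration:

```python
# Sketch: a permutation test of whether a model's accuracy beats chance.
# The actual/predicted values are invented toy data.
import random

random.seed(0)
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

def accuracy(a, p):
    return sum(x == y for x, y in zip(a, p)) / len(a)

observed = accuracy(actual, predicted)

# Null hypothesis: the predictions carry no information about the outcomes.
trials = 2000
count = 0
labels = actual[:]
for _ in range(trials):
    random.shuffle(labels)
    if accuracy(labels, predicted) >= observed:
        count += 1

p_value = count / trials
print(f"observed accuracy = {observed:.2f}, permutation p-value = {p_value:.3f}")
```

A small p-value gives confidence that the model's answers reflect real structure in the data rather than luck, which is exactly the second-guessing this phase is meant to avoid.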

## Case study:

- Let's go back to our case study to apply the *evaluation component* of the data science methodology.

- Let's look for a way to find the *optimal model* through a diagnostic measure based on tuning one of the model-building parameters. *Specifically, we will look more closely at how the relative costs of misclassifying yes and no outcomes can be tuned. As shown in this table, four models were built with four different relative misclassification costs.*

- As we see, each value of this *model-building parameter increases the true positive rate*, or *sensitivity*, of the *accuracy in predicting yes*, at the expense of *lower accuracy in predicting no*, that is, an increasing *rate of false positives.*
- The question is, *which model is best based on the tuning of this parameter?* For budgetary reasons, the risk-reduction intervention could not be applied to most or all congestive heart failure patients, many of whom would not have been readmitted anyway.
- On the other hand, the intervention would not be as effective in improving patient care as it should be if too few high-risk heart failure patients received it.

- *So how do we determine which model was optimal?* As you can see in the image above, the *optimal model is the one giving the maximum separation between the blue ROC curve and the red baseline.*
- We can see that **model 3**, with a **relative misclassification cost** of **4-to-1**, *is the best of the four models*. And in case you were wondering, *ROC stands for receiver operating characteristic curve*, which was **first developed during World War II to detect enemy aircraft on radar.**
- Since then it has been used in many other fields as well. Today it is commonly used in machine learning and data mining. *The ROC curve is a useful diagnostic tool for determining the optimal classification model.*
- This curve quantifies how well a **binary classification model** performs in classifying the yes and no outcomes as some discrimination criterion is varied. In this case, the criterion is the relative misclassification cost. By plotting the true positive rate against the false positive rate for different values of the relative misclassification cost, the ROC curve helped select the optimal model.
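The "maximum separation from the baseline" idea can be sketched in plain Python. The four (FPR, TPR) points below are hypothetical stand-ins for the four models (only the 4:1 point is loosely derived from the case study's 68% sensitivity and 85% specificity); the separation measure is the vertical distance above the diagonal, known as Youden's J statistic:

```python
# Sketch: picking the optimal model as the ROC point farthest above the
# chance baseline (the diagonal TPR = FPR). Points are hypothetical.
models = {
    "1:1": (0.03, 0.45),
    "2:1": (0.08, 0.56),
    "4:1": (0.15, 0.68),  # e.g. specificity 85% -> FPR 0.15, sensitivity 68%
    "9:1": (0.60, 0.97),
}

def separation(point):
    """Vertical distance above the TPR = FPR baseline (Youden's J)."""
    fpr, tpr = point
    return tpr - fpr

best = max(models, key=lambda name: separation(models[name]))
for name, pt in models.items():
    print(f"cost {name}: TPR - FPR separation = {separation(pt):.2f}")
print("optimal model:", best)
```

With these illustrative points the 4:1 model wins, matching the case study's conclusion: the 9:1 model has the highest sensitivity but buys it with so many false positives that its separation from the baseline shrinks.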

*Thanks for reading…!!! Happy Learning…!!!*