What to keep in mind to successfully build a good machine learning model

Lewi Kim
Published in ailys
Jan 24, 2022 · 7 min read

In the previous post, “Basics of Machine Learning”, we covered what machine learning is, what types of models it can produce, and what learning methods exist. This time we will focus specifically on supervised modeling and the points we need to keep in mind in order to develop a good model.

1. Definition of model development

As we learned earlier, DAVinCI LABS is a solution that automates and optimizes the development of machine learning models. Does that mean we will get an excellent prediction model simply by uploading any data to DAVinCI LABS?

The short answer is no. Or, more precisely, it is half right and half wrong. How so?

The big question: which data should we use for the best model?

To see why, we need to look closely at the phrase ‘machine learning model development’. DAVinCI LABS can carry out the training process by applying algorithms to data, but two major premises must hold before a model is truly “fit for prediction”: first, the data must be prepared to suit its purpose; second, the model must be evaluated and selected appropriately against that original purpose.

Ultimately, before model development there must be a preceding step of determining the purpose and laying the groundwork. The development work itself can be automated by DAVinCI LABS, but setting the purpose and the target remains in the user’s hands. Of course, that automation is still enormously appealing, considering that conventional machine learning requires data experts to code the algorithms themselves.

2. Process of model development

Target determination → Data preparation → Data feature-engineering → Optimization of algorithm → Model evaluation and interpretation → Model management

The machine learning development process runs from target determination through to model management. DAVinCI LABS is involved from data feature engineering through model extraction and its ongoing updates.

In other words, target determination, data preparation, and the decisions around managing the model, the domain outside automation, depend on human judgment. Decisions based on the given data can be handled by AI, but assessing the results and shaping them into business strategy can only be done by humans.

The process resembles the relationship between a teacher and a student. The student can train by solving problems (data), grasp the best solution method (algorithm), and thereby improve their skill (performance). But deciding what to teach, what the goals should be, and how to evaluate the student’s achievement rests entirely on the teacher. These are the roles left to DAVinCI LABS users.

Think of machine learning as a student who is brilliant at training on data and making predictions.

In other words, we could confidently say the following:

The key to developing a good predictive machine-learning model lies in defining a clear target

In this context, the first step to developing an excellent predictive model is to clearly set the ‘training objective’ and the ‘definition of the subject’. Unless the model is told precisely what it should learn and what specific result it should produce, we are likely to end up with a poorly trained model.

Before we develop the model, we must clarify our subject. It could be customer churn or price prediction. The subject is directly tied to the user’s understanding of their domain and task: only with background knowledge drawn from business experience, plus an understanding of the available data, can a workable subject be pinned down. Once the subject and the prediction target are clearly set, you can confidently say you have crossed the first line toward a great prediction model.
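As a small illustration of what “defining the subject” means in practice (the field names and the 90-day inactivity rule below are hypothetical assumptions, not something the post or DAVinCI LABS prescribes), turning a business definition of churn into an explicit training target might look like this:

```python
# Hypothetical customer activity records; choosing the churn cutoff
# is a business decision the user makes before any modeling happens.
customers = [
    {"customer_id": 1, "days_since_last_purchase": 12},
    {"customer_id": 2, "days_since_last_purchase": 150},
    {"customer_id": 3, "days_since_last_purchase": 95},
    {"customer_id": 4, "days_since_last_purchase": 30},
]

CHURN_THRESHOLD_DAYS = 90  # assumed definition: inactive > 90 days = churned

# Materialize the target column the model will be trained to predict
for c in customers:
    c["churned"] = int(c["days_since_last_purchase"] > CHURN_THRESHOLD_DAYS)

print([c["churned"] for c in customers])  # [0, 1, 1, 0]
```

Everything downstream, from data selection to evaluation, depends on this one human-made definition.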

Worthy results derive from worthy data

After determining the subject, we need to decide which data to feed into DAVinCI LABS. Could we develop a brilliant prediction model right away just by pouring in every piece of data we have? Of course not. It would be like drilling a student on Korean, English, and math when the student only needs to focus on math.

There is a well-known saying in machine learning: “Garbage in, garbage out”. Feed the model useless data and you will get only useless results back. That is precisely why an appropriate decision by the business user (the “teacher” or “director”) is required even at the stage of building the data set.
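To make “garbage in, garbage out” concrete (the record and field names are purely hypothetical), the business user’s job at this stage amounts to deciding which fields plausibly relate to the subject and leaving the rest out:

```python
# A hypothetical raw customer record with fields of mixed usefulness
raw = {
    "customer_id": 1,
    "monthly_spend": 42.0,
    "support_tickets": 3,
    "favorite_color": "red",   # noise: no plausible link to churn
    "signup_channel": "web",
}

# The business user, not the tool, judges which fields plausibly
# drive churn; everything else stays out of the training set.
selected_fields = ["monthly_spend", "support_tickets", "signup_channel"]
training_row = {k: raw[k] for k in selected_fields}

print(sorted(training_row))
# ['monthly_spend', 'signup_channel', 'support_tickets']
```

The same filtering logic applies whether the data lives in a spreadsheet, a database, or a file uploaded to a modeling tool.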

3. Model performance evaluation

After DAVinCI LABS trains on the given data and develops predictive models, we then need to decide which model is appropriate and whether it aligns with the current objective.

DAVinCI LABS provides suitable performance metrics according to the type of subject (target field). Performance is reported as a number, but a higher number does not automatically mean a better model in any absolute sense, because the right metric depends on the modeling objective. The user should therefore judge against the metric recommended for the subject originally set, in order to fully achieve the prediction goal.

Let’s take an example. Suppose DAVinCI LABS has developed models predicting customer churn, and two finished models show almost identical overall performance.

Confusion matrix of Model 1(left) & Model 2(right)

Above is a captured image of the two models’ confusion matrices. A confusion matrix is a table comparing a model’s predicted values against the actual values, showing which decisions were right and which were wrong.

Model 1.

-Retained customers correctly predicted as retained: 1437

-Retained customers incorrectly predicted as churned: 147

-Churning customers incorrectly predicted as retained: 135

-Churning customers correctly predicted as churned: 250

Model 2.

-Retained customers correctly predicted as retained: 1545

-Retained customers incorrectly predicted as churned: 39

-Churning customers incorrectly predicted as retained: 207

-Churning customers correctly predicted as churned: 178

If you were the one in charge, which model would you select? If your focus were customer retention and churn prevention, Model 1 on the left would be the apt choice: it correctly catches more of the actual churners (250 versus 178), flagging at-risk customers on a larger scale.

However, what if the priority is efficiency? Say the marketing budget is limited, so we can only target a small group of customers. Model 1 flagged 397 (147+250) customers as “alert customers” (those likely to churn), of whom 250 actually churn, a precision of about 63% (250/397) in marketing terms. Model 2 flagged 217 (39+178) alert customers and achieved about 82% precision (178/217). With a limited budget, Model 2 would be the suitable choice.
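The trade-off above can be sketched with the standard precision/recall metrics (the counts come from the example matrices; the metric names are standard terminology, not DAVinCI LABS labels):

```python
def churn_metrics(tn, fp, fn, tp):
    """Precision and recall for the churn (positive) class.

    tn is unused in the formulas but kept so the call mirrors the
    four cells of the confusion matrix."""
    precision = tp / (tp + fp)  # of customers flagged as churn, how many really churn
    recall = tp / (tp + fn)     # of actual churners, how many were caught
    return precision, recall

# Counts taken from the two confusion matrices above
p1, r1 = churn_metrics(tn=1437, fp=147, fn=135, tp=250)  # Model 1
p2, r2 = churn_metrics(tn=1545, fp=39, fn=207, tp=178)   # Model 2

print(f"Model 1: precision {p1:.1%}, recall {r1:.1%}")  # precision ~63.0%
print(f"Model 2: precision {p2:.1%}, recall {r2:.1%}")  # precision ~82.0%
```

Model 1 wins on recall (catching churners broadly), Model 2 on precision (spending the budget efficiently): which number matters is exactly the business decision the user must make.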

As this shows, model assessment must draw on the user’s business experience and subject goal in order to arrive at a good prediction model. In a future post we’ll cover the meaning of the various performance metrics and how to use them.

4. Models age as well!

Now comes the final step of model development: managing the completed model. It may seem puzzling that an already generated model still needs the user’s decisions. The reason human involvement is needed is that models and data grow old, just like we humans do.

What were you thinking, bringing in ancient data carved on a monument?

Time passes for machine learning models just as it does for everyone. As a model is developed and then managed, the characteristics and tendencies of the data change compared with the point of development. Some models hold their initial prediction performance, but most gradually deteriorate over time.

Therefore the business experts, in other words the users, should decide when to retrain in order to keep the model performing at its best. Retraining means training the model on newly updated data, folding recent data into the existing training set.
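One simple way to think about “folding recent data in” is a rolling training window (a hypothetical policy for illustration; the 12-month window and field names are assumptions, not a DAVinCI LABS feature):

```python
from datetime import date, timedelta

# Hypothetical retraining policy: keep a rolling 12-month window so the
# training set reflects recent behavior rather than stale patterns.
WINDOW = timedelta(days=365)

def select_training_rows(rows, today):
    """Drop rows older than the window before retraining."""
    cutoff = today - WINDOW
    return [r for r in rows if r["date"] >= cutoff]

rows = [
    {"date": date(2020, 3, 1), "churned": 0},   # stale: dropped
    {"date": date(2021, 6, 1), "churned": 1},   # recent: kept
    {"date": date(2021, 12, 1), "churned": 0},  # recent: kept
]
fresh = select_training_rows(rows, today=date(2022, 1, 24))
print(len(fresh))  # 2
```

How wide the window should be, and how often to retrain, is again a judgment call for the business user, guided by how fast customer behavior shifts.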

Model performance drops over time
With regular updates, the model is refreshed and performance bounces back!

Fortunately, DAVinCI LABS includes a model update function that enables regular performance management. As shown below, by setting the update cycle and timing, the user receives update alerts and can retrain the model with new data.

That’s all for today: we’ve gone over the core points of developing a good model. We hope you’ll look forward to future posts, where we will discuss data feature engineering and algorithm optimization!
