Before Model Building
In this article, we will discuss the challenges of collecting the right data, exploring the data, refining metrics, and getting the Product Manager on the same page.
Data Extraction: Getting the right data is the first and crucial part. It is often not that easy.
For the CRS, canned responses are widely used, but the DB had no direct link from a canned response to the corresponding email. The link was supposed to be in the logs, but nobody was tracking it. After we finally finished crying for data, we relied on Elasticsearch to match the sent responses to the canned responses.
For CCP, some of the available features are excellent predictors but are not useful in practice. For example, “Customer exporting all the data” is an outstanding feature for customer churn, but by then it is too late for any action because the customer has already moved on. We collected the past three months of activity for model training.
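The idea of keeping only activity that arrives early enough to act on can be sketched as below. This is a minimal illustration, not the pipeline we actually ran: the `activity_window` helper, the `cutoff` date, the event fields, and the sample events are all hypothetical, and 90 days stands in for "three months".

```python
from datetime import date, timedelta

def activity_window(events, cutoff, days=90):
    """Keep events from the `days` before `cutoff`; drop anything on or after it."""
    start = cutoff - timedelta(days=days)
    return [e for e in events if start <= e["day"] < cutoff]

# Hypothetical activity log for one customer.
events = [
    {"day": date(2023, 1, 5), "action": "login"},
    {"day": date(2023, 3, 20), "action": "ticket_created"},
    {"day": date(2023, 4, 2), "action": "export_all_data"},  # arrives too late to act on
]

cutoff = date(2023, 4, 1)  # assumed prediction date
window = activity_window(events, cutoff)
print([e["action"] for e in window])
```

Features are then computed only from `window`, so a signal such as the data export, which shows up after the prediction date, never reaches the model.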
For the SA problem, we hired people to annotate the data. Annotation can be costly if it requires domain expertise, such as medical or legal knowledge.
Data Exploration: Once the raw data is collected, basic data exploration is the next step. The image shows the basic steps of data exploration. Click here for more details.
Some problems that are rarely discussed are:
- Class imbalance:
  - In FC, two or three classes account for 90% of the tickets, while the remaining 50 classes share only 10% of the tickets.
- Getting more features:
  - In FC, the same ticket, “we need VPN Access”, was filed under different classes. On close inspection, we found that the class depends on whether the person is an Employee or a Contractor. Thus we need to add an Employee Flag feature.
- Extracting the correct feature:
  - For CRS, extracting a team feature from the email id, i.e. which team (Sales, Marketing, etc.) the person belongs to, helps the model achieve better accuracy.
- Merging classes:
  - For CRS, some canned responses are almost duplicates. It is better to combine duplicate classes into a single class.
  - For FC, some classes are often confused. The classes “Network” and “Wifi” are often filled in interchangeably. We should merge them.
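Two of the checks above, measuring class imbalance and merging confused classes, can be sketched in a few lines. The ticket labels below are made up for illustration, and `class_shares` and `merge_classes` are hypothetical helper names, not part of any library.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the data, largest class first."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.most_common()}

def merge_classes(labels, mapping):
    """Relabel classes using `mapping`; classes not in the mapping stay unchanged."""
    return [mapping.get(lbl, lbl) for lbl in labels]

# Hypothetical ticket classes.
labels = ["Network", "Wifi", "Network", "VPN", "Wifi", "Hardware", "Network", "Wifi"]

merged = merge_classes(labels, {"Wifi": "Network"})  # fold "Wifi" into "Network"
shares = class_shares(merged)
print(shares)
```

Comparing the shares before and after a merge shows how much of the long tail the merge removes and how dominant the merged class becomes.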
Once the data analysis is over, it is important to discuss the findings with the Product Manager (PM) so that they have more context. Two points often cause confusion, and the data scientist has to clarify them:
- ML is probabilistic. It cannot always be correct the way traditional software is.
- Model performance always depends on the data. We cannot achieve great accuracy with poor-quality or insufficient data.
The next step is to talk to the domain experts and understand the scenario.
Once the scenario is clear, it is important to derive the metric from the PM. Their requirements can be ambiguous at first, but in most cases the right metric becomes clear.
For CCP: The PM said that predicting a churn customer as a non-churn customer is a blunder, but predicting a non-churn customer as a churn customer is no big deal. Thus recall on the churn class is more important than precision on the churn class.
For FC: The PM only wants to minimize the number of corrections made by the end user. Thus overall accuracy is more important than the recall or precision of individual classes.
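To make the difference between these metrics concrete, here is a small sketch that computes recall and precision for one class, plus overall accuracy. The labels and predictions are invented for illustration, and the helper names are hypothetical.

```python
def recall_precision(y_true, y_pred, positive):
    """Recall and precision for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

def accuracy(y_true, y_pred):
    """Fraction of predictions that need no correction."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical churn labels and model predictions.
y_true = ["churn", "churn", "stay", "stay", "stay", "churn"]
y_pred = ["churn", "churn", "churn", "stay", "churn", "stay"]

recall, precision = recall_precision(y_true, y_pred, positive="churn")
acc = accuracy(y_true, y_pred)
print(recall, precision, acc)
```

For CCP, the missed churner (a false negative) is the costly error, so recall on "churn" is the number to watch. For FC, every wrong prediction costs the user one correction, which is what accuracy counts.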
Data Preparation: Now that we have decided on the final data, we should store it in a suitable format in the database for model building. This process is called ETL (extract, transform, load). Model building is based on the data stored by the ETL process.
Cross Validation: Cross-validation helps to find the right parameters for the model. Wrong parameters will cause overfitting or underfitting of the model. For a detailed description, please refer to this link.
Cross-validation should mimic the real-world situation. For problems like sentiment analysis, K-Fold or leave-one-out might be enough. For FC and CRS, future data will be predicted by a model trained on present data, so time-based CV is the best fit.
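A time-based split can be sketched as an expanding window: sort the records by time, then always train on the past and test on the block that follows. This is one common way to do it, not necessarily the exact scheme we used; `time_based_splits` and the `created_at` field are hypothetical names.

```python
from datetime import date

def time_based_splits(records, n_splits=3, key=lambda r: r["created_at"]):
    """Yield (train, test) pairs where every test record is later than all train records."""
    ordered = sorted(records, key=key)
    fold = len(ordered) // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough records for the requested number of splits")
    for i in range(1, n_splits + 1):
        train = ordered[: i * fold]               # everything seen so far
        test = ordered[i * fold : (i + 1) * fold]  # the next block in time
        yield train, test

# Hypothetical tickets, deliberately out of order.
days = [5, 1, 8, 3, 7, 2, 6, 4]
records = [{"created_at": date(2023, 1, d), "id": d} for d in days]

splits = list(time_based_splits(records, n_splits=3))
for train, test in splits:
    print(len(train), "train ->", [r["id"] for r in test])
```

Unlike shuffled K-Fold, no fold here trains on tickets created after the ones it is evaluated on, which matches how the model is used in production.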
That completes the model preparation, or the boring stage. In the next article, we will discuss model building. Please click here for the next article.
Please click here to go to the first article