Detecting defective segments with Laplace smoothing

Uri Itai
Apr 24, 2023 · 10 min read


TLDR

In this blog post, we explain the Laplace smoothing method. Laplace smoothing is used to tackle the trade-off between the observed effect and its statistical significance. It is a Bayesian technique: we update the probability after sampling new data.

Motivation

A few years ago, I was working on log-file analysis. One of my tasks was to detect defective segments. This was challenging due to two main constraints. The first is that a defective segment has a high failure rate; otherwise, it would not be a defective segment. The second is the size of the segment. Small segments are more prone to show a high failure rate, but this does not necessarily mean much; this phenomenon is known as the law of small numbers. In addition, there is no business motivation to tackle small segments.

The naive approach is to take hard thresholds, namely, to select segments by hard criteria. This path can lead to a combinatorial explosion of threshold settings. Moreover, edge cases can make this approach very messy.

Therefore, we understood that we needed to construct a metric that trades off the failure rate against the segment size.

Example of a log file

Segment

The notion of a segment plays an important role in data analysis. To clarify it, we will use the GPT-3 definition of a segment:

In data analysis, a segment refers to a subset of a larger data set that has been selected based on certain criteria or characteristics.

Segments can be created based on various factors such as demographics, geographic location, behavior, or purchase history, depending on the specific goals of the analysis.

Segments can be created from categorical, ordinal, or continuous features.

The Laplace smoothing

To overcome this, we decided to take a more Bayesian approach; in particular, Laplace smoothing. For a given segment S of the data, we define:

Laplace(S) = ( Σ_{s in S} Fail(s) + a·p ) / ( |S| + a )

Where

  • Fail is a function that takes the value 1 for a failure and 0 otherwise.
  • |S| is the number of instances in the segment S.
  • p is the probability of failure on the entire data set (the prior).
  • a is the Laplace parameter.

It is worth noting that for large segments, the Laplace-smoothed value is very close to the segment's failure rate, because the number of failures in S and |S| are large compared to a·p and a. For small segments, the Laplace-smoothed value is close to the failure rate of the entire set, since there a·p and a are large with respect to the number of failures in S and |S|.

The Laplace parameter determines the statistical significance required of the rate.

When I applied this to the log-file data, the defective segments were easily spotted. It worked like a charm. It was so good that the domain experts could not believe we had tracked down these segments without deep knowledge of the field.

We end this section with the Python code:

import numpy as np

def Laplace(S, fail_func, p, a):
    # (number of failures in the segment + a * prior) / (segment size + a)
    return (np.sum([fail_func(s) for s in S]) + a * p) / (len(S) + a)
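For example (the segment below and the parameter values are made up for illustration):

segment = [True, False, False, True, False]      # two failures out of five
p, a = 0.05, 10                                  # prior failure rate and Laplace parameter
print(Laplace(segment, lambda s: int(s), p, a))  # about 0.17: the raw 40% rate pulled toward the 5% prior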

Numerical example

To make things concrete, we will work through a numerical example.

Consider a data set whose overall failure rate is 5%, with three suspected segments:

  1. Fail rate of 10% and a population of 1000.
  2. Fail rate of 15% and a population of 100.
  3. Fail rate of 30% and a population of 10.

Next is a plot of each segment's Laplace-smoothed fail rate as a function of the Laplace constant.
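A minimal sketch of how such a plot can be produced from the numbers above (the plotting details are illustrative):

import numpy as np
import matplotlib.pyplot as plt

p = 0.05                                              # failure rate of the entire set
segments = [(0.10, 1000), (0.15, 100), (0.30, 10)]    # (fail rate, population) per segment
a_values = np.linspace(0, 300, 301)

for rate, n in segments:
    smoothed = (rate * n + a_values * p) / (n + a_values)   # Laplace-smoothed fail rate
    plt.plot(a_values, smoothed, label=f"rate {rate:.0%}, n={n}")

plt.xlabel("Laplace constant a")
plt.ylabel("smoothed fail rate")
plt.legend()
plt.show()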

Of course, the choice of the parameter is determined by the application. However, one can point out that the first segment has a lower fail rate but is more statistically significant. Akin to the elbow method used with K-means and PCA, we look for the inflection point of the above graph. This hints at the choice of the parameter.

The above is a special case of a conjugate prior, where the prior is a Dirichlet distribution of order two (i.e., a Beta distribution). The posterior is also a Dirichlet distribution of order two, with different parameters. The author encourages the reader to redo the above with other prior distributions.

The goodness of fit

After the success in detecting the defective segments in the sample data, we thought to apply this technique to the model's predictions.

Accuracy

Analyzing goodness of fit plays a significant role in model performance analysis. Using the above, we can spot segments where the model performs poorly. We assume that the data is in a Pandas DataFrame df containing a predict column and a truth column.

We define:

df['error'] = df.truth != df.predict

Now we focus on finding segments with significant error rates. This is done with the above, i.e., Laplace smoothing, where the failure function is the identity on the error column. We note that one interested in accuracy needs to define:

df['accuracy'] = df.truth == df.predict

Or

df['accuracy'] = 1 - df['error']

Thus, for each segment, we have the smoothed accuracy operator:

Laplace(S, fail_func, p, a)

Where

  • S is the segment.
  • fail_func is the success indicator (here, the accuracy column).
  • p is the accuracy of the entire set.
  • a is the Laplace parameter.
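For concreteness, here is a minimal sketch of the smoothed accuracy per segment, assuming the DataFrame also carries a segment column (that column name, and the choice of groupby, are assumptions of the sketch rather than the original project code):

import pandas as pd

def smoothed_accuracy_per_segment(df, a):
    p = df['accuracy'].mean()                   # accuracy of the entire set (the prior)
    grouped = df.groupby('segment')['accuracy']
    # (correct predictions + a * p) / (segment size + a) for every segment
    return (grouped.sum() + a * p) / (grouped.count() + a)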

Recall, Precision, Specificity, and NPV

In many problems, accuracy gives only a partial description, in particular for imbalanced data. Therefore, the following metrics are introduced.

  1. Recall: the probability of predicting positive given the instance is positive.
  2. Precision: the probability that the instance is positive given the predictor predicts positive.
  3. Specificity: the probability of predicting negative given the instance is negative.
  4. NPV: the probability that the instance is negative given the predictor predicts negative.

Using Pandas boolean masks, we can use the accuracy column to calculate the above.

For example, the smoothed Recall is obtained by restricting to the instances that are actually positive:

S_recall = S[S.truth]

Laplace(S_recall.accuracy, lambda x: x, p, a)

where p is the Recall of the entire data set.

Similarly, we can calculate the other metrics. One can use groupby: by truth for Recall and Specificity, and by predict for Precision and NPV, as in the sketch below.
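As a sketch of the above (not the exact project code), assuming truth and predict are boolean columns and accuracy is defined as before, the four smoothed metrics of a segment S can be computed with boolean masks:

def smoothed_rate(rows, p, a):
    # Laplace-smoothed mean of the accuracy column over the selected rows
    return (rows['accuracy'].sum() + a * p) / (len(rows) + a)

def smoothed_metrics(S, df, a):
    # the priors are the corresponding metrics over the entire data set df
    return {
        'recall':      smoothed_rate(S[S.truth],    df[df.truth]['accuracy'].mean(),    a),
        'specificity': smoothed_rate(S[~S.truth],   df[~df.truth]['accuracy'].mean(),   a),
        'precision':   smoothed_rate(S[S.predict],  df[df.predict]['accuracy'].mean(),  a),
        'npv':         smoothed_rate(S[~S.predict], df[~df.predict]['accuracy'].mean(), a),
    }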

Composed metrics

The above metrics are very informative. However, in order to assess the goodness of fit, a single metric is often needed. Thus, we average (in some sense) the above metrics. In particular:

  1. F1 score
  2. Matthews correlation coefficient
  3. AUC score of the ROC.

Next, we will show how to construct each of them.

F1 score

We recall that the F1 score is the harmonic mean of Recall and Precision. Thus we have:

F1 = 2 · Precision · Recall / ( Precision + Recall )

Replacing Recall and Precision with the smoothed Recall and smoothed Precision gives the F1 score with respect to the Laplace smoothing.

For the general F1 score, see our post on goodness of fit.
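A minimal sketch, reusing the hypothetical smoothed_metrics helper from the previous section:

def smoothed_f1(S, df, a):
    m = smoothed_metrics(S, df, a)
    # harmonic mean of the smoothed Precision and the smoothed Recall
    return 2 * m['precision'] * m['recall'] / (m['precision'] + m['recall'])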

The Matthews correlation coefficient

The Matthews correlation coefficient has been gaining popularity in recent years. Some research papers have shown this metric's superiority over the F1 score.

To define this metric, we recall the geometric mean.

Consider the following matrix, whose entries are geometric means of the four metrics above:

M = | sqrt(Precision · Recall)              sqrt((1 - Precision) · (1 - Recall)) |
    | sqrt((1 - Specificity) · (1 - NPV))   sqrt(Specificity · NPV)              |

Then the determinant of the matrix M is the Matthews correlation coefficient.

Substituting the smoothed metrics into the matrix, one gets the smoothed Matthews correlation coefficient.
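In code, again reusing the hypothetical smoothed_metrics helper, a sketch of the smoothed Matthews correlation coefficient:

import numpy as np

def smoothed_mcc(S, df, a):
    m = smoothed_metrics(S, df, a)
    pos = np.sqrt(m['precision'] * m['recall'] * m['specificity'] * m['npv'])
    neg = np.sqrt((1 - m['precision']) * (1 - m['recall'])
                  * (1 - m['specificity']) * (1 - m['npv']))
    return pos - neg    # the determinant of the matrix of geometric means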

AUC score of the ROC

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. Consider the curve of TPR against FPR, where

TPR is the Recall, and

FPR = 1 - Specificity.

We measure the area under this curve.

As above, one can use the smoothed Recall and smoothed Specificity to draw the graph.
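A sketch of such a smoothed ROC for a segment S, assuming the DataFrame carries a score column of model scores (the column name and the threshold grid are assumptions); the whole-data-set TPR and FPR at each threshold serve as priors:

import numpy as np

def smoothed_roc_auc(S, df, a, thresholds=np.linspace(0, 1, 101)):
    tprs, fprs = [], []
    for t in thresholds:
        seg_pos, all_pos = S.score >= t, df.score >= t
        p_tpr = all_pos[df.truth].mean()              # whole-set TPR at threshold t
        p_fpr = all_pos[~df.truth].mean()             # whole-set FPR at threshold t
        tprs.append((seg_pos[S.truth].sum() + a * p_tpr) / (S.truth.sum() + a))
        fprs.append((seg_pos[~S.truth].sum() + a * p_fpr) / ((~S.truth).sum() + a))
    order = np.argsort(fprs)                          # sort by FPR before integrating
    return np.trapz(np.array(tprs)[order], np.array(fprs)[order])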

This method was used on the logs project very successfully. Yet it can be applied to other machine-learning settings. Decision trees, categorical data, and regression problems will be discussed next. We will start with decision trees.

Decision Trees

Decision trees are the fundamental component of every tree-based method, such as Random Forest, AdaBoost, XGBoost, and more.

One of the issues one faces while constructing such a tree is choosing its depth. Too deep might lead to overfitting, while too shallow leads to underfitting. One approach is choosing a hard threshold, such as a maximal depth. However, this might mean that for some nodes the tree is too deep while for others it is too shallow. To overcome this, one might require a minimal number of instances in each node. However, for homogeneous nodes this does not make sense, since we would be using splits with very little information gain. Thus, it seems logical to select splits by maximal information gain; however, this approach favors splits with a small number of instances.

To overcome this, we define a pseudo probability for each node, as above:

p̂(node) = ( #positives(node) + a·p ) / ( |node| + a )

Where p is the class probability of the parent node in the tree.

This pseudo probability allows the construction of entropy measures, which in turn lets us build decision trees that overcome the node-split issue.
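A minimal sketch for binary labels; here parent_p, the positive rate of the parent node, is passed in rather than derived from an actual tree structure:

import numpy as np

def node_probability(labels, parent_p, a):
    # pseudo probability: pulls the node's positive rate toward the parent node's rate
    return (np.sum(labels) + a * parent_p) / (len(labels) + a)

def node_entropy(labels, parent_p, a):
    p = node_probability(labels, parent_p, a)
    # with 0 < parent_p < 1 and a > 0 the smoothed p never reaches exactly 0 or 1
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))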

Furthermore, instead of a single cut, one can choose two cuts, namely cuts along two features, which yields four sets instead of two.

Here we chose a naive prior. By choosing a more suitable prior, one can achieve better results. This leads to BART: Bayesian Additive Regression Trees.

We defined this for binary trees. Nevertheless, it can be generalized to multiple labels, as we will see next.

Categorical features

An additional application of Laplace smoothing is for features that take categorical values.

We wish to check if the distribution of the values in a segment differs from the distribution of the entire set. This needs to be done with respect to statistical significance.

First, let's set up the framework without the smoothing. We will use a variant of the Chi-square test. Namely, for each categorical value i we look at its frequency in the segment,

p̂_i = #val_i / |S|,

and compare it with p_i, the probability of the value i in the entire set. Thus we can calculate

χ²(S) = Σ_i ( p̂_i - p_i )² / p_i

This forms a metric between the distribution of the segment and that of the entire set.

However, taking the plain frequencies raises the problem we faced earlier. To tackle the problem of small data, we define the Laplace smoothing:

p̃_i = ( #val_i + a·p_i ) / ( |S| + a )

Where

  • #val_i is the number of times the value i appears in the segment.
  • p_i is the probability of the value i in the entire set.
  • a is the Laplace constant.

The problem is that summing the smoothed values over all categories might not give exactly one; i.e., they do not form a probability measure. Thus we normalize:

q_i = p̃_i / Σ_j p̃_j

Therefore, we can use this normalized distribution to find segments whose distribution differs from that of the entire set in a statistically significant way.
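A minimal sketch, assuming segment and population are pandas Series holding the categorical feature for the segment and for the entire set:

import pandas as pd

def smoothed_chi_square(segment, population, a):
    p = population.value_counts(normalize=True)        # whole-set probabilities (the prior)
    counts = segment.value_counts().reindex(p.index, fill_value=0)
    smoothed = (counts + a * p) / (len(segment) + a)    # smoothed frequencies per value
    smoothed = smoothed / smoothed.sum()                # normalize to a probability measure
    return ((smoothed - p) ** 2 / p).sum()              # chi-square style distance to the whole set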

This approach can be generalized to multiple features, where we take the Cartesian product of the categorical features. However, this is very likely to produce small segments; thus, taking more than two features might be inefficient.

One can use this technique for numerical data by taking bins. Nonetheless, it might not be the best choice, since it discards the ordering of such data.

If the number of categories is large, for example words in a language model, the smoothing might be too strong. In such cases, Good-Turing smoothing might be a better choice.

Regression Error Smoothing

As with the goodness of fit, analyzing regression performance plays a major role in model analysis, and detecting underperforming segments is an essential step. As reviewed above, before reporting such a segment we need to be sure that it is statistically significant. Namely, we take the Laplace smoothing into the continuous domain. First, we define the Bayesian average:

Bayesian average

Define the Bayesian average of a segment S as

Avg_B(S) = ( Σ_{x in S} x + C·μ ) / ( |S| + C )

Where C is the Laplace constant.

μ is the entire set average.

The Bayesian standard deviation is defined in the same spirit, where

σ is the entire-set standard deviation.

We note that the two Laplace constants might vary.
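A minimal sketch of both operators; the standard-deviation formula below is one natural choice and an assumption of this sketch rather than a canonical definition, and the two constants c1 and c2 may differ, as noted above:

import numpy as np

def bayesian_average(x, mu, c1):
    # pulls the segment mean toward the entire-set mean mu
    return (np.sum(x) + c1 * mu) / (len(x) + c1)

def bayesian_std(x, mu, sigma, c1, c2):
    # pulls the segment spread toward the entire-set standard deviation sigma
    m = bayesian_average(x, mu, c1)
    return np.sqrt((np.sum((np.asarray(x) - m) ** 2) + c2 * sigma ** 2) / (len(x) + c2))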

This can serve as a tool to check whether a segment differs from the entire set in its mean; in particular, the mean of the regression model's error.

This provides us with a method to test the performance of a regression model in a statistically significant sense. From the point of view of AI fairness, this is an important tool.

For an unbiased model, we have μ = 0, and σ is the L2 error up to normalization.

We note that both of these operators are not linear. Moreover, they do not fit the definition of averages described in a previous blog post.

We note that a segment can be regarded as a data region; namely, we can test whether the regression is being applied out of scope.

Closing remarks

We defined Laplace smoothing and demonstrated how to use it. This can be useful for:

  • EDA
  • Goodness of fit
  • Decision trees
  • Categorical data
  • Regression problems

This line of thought can be extended to target encoding, data drift, and more applications in machine learning.


Uri Itai

Mathematician in exile, researching algorithms and machine learning, applying data science, and expanding my ideas.