A Deeper Look into Extreme Value Theory
In early 2017 I decided to cross over into the business world and apply cognitive computing to information security. Yet computer and data science have always been a huge passion of mine, and I still enjoy reading a good paper. In our latest blog post, Adam Bali described how Cognigo uses EVT to reduce uncertainty in document classification. This is a novel approach to applying statistical concepts to deep-learning-based NLP. So, for all the techies and ML fans out there, I decided to go deep into the statistical foundations of EVT (Yeah!). Note: the major concepts in this post rely on published research by Walter J. Scheirer.
Motivation
The main idea, as Adam explained in his great article, is as follows: given training samples drawn from |L| different classes L = {l1, l2, …}, find f(x) such that if sample x "belongs" to a class that is a member of L, it returns that class, and otherwise it returns "Other". One may think of this as a new supervised classification problem over L_open = {l1, l2, l3, …, "Other"}, yet how do you learn the distribution of the underlying classes with no examples? (You still have the probability constraint: Σ_{l ∈ L_open} p(l | x) = 1.) That usually brings the discussion to non-parametric learning (SVM, random forests, etc.), and methods like One-Class SVM and Isolation Forest are in wide use in industry today (yet, unfortunately, they are not a common subject in academia). The shortcoming of these methods is that they don't provide a distribution for the threshold/boundary of an out-of-class sample. In other words, we would like to get a confidence interval and not a margin. At Cognigo, we use similar ideas on the output of (highly) parametric learning methods (i.e., neural networks).
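To make the setup concrete, here is a minimal, hypothetical sketch of such an f(x). The per-class scoring functions and the threshold are placeholders for illustration, not Cognigo's actual method:

```python
def open_set_classify(x, score_fns, threshold):
    """Return a label from L if x looks like it belongs there, else "Other".

    score_fns: {label: callable} mapping each known class to a confidence
    scoring function (hypothetical placeholders for whatever model you trained).
    """
    best_label = max(score_fns, key=lambda label: score_fns[label](x))
    if score_fns[best_label](x) >= threshold:
        return best_label
    return "Other"
```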
The image below illustrates the challenge with our sample space X.

In the example above, x could belong to L (i.e., it is either a dog or an owl), but it could also belong to "Other" if it is a raccoon.
Central Limit Theorem
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size grows, no matter what the shape of the population distribution is. This is something you usually learn in a college statistics course, and it is one of the key results of modern statistics. The CLT also underpins ANOVA's normality assumption, which is used in the life sciences, economics, and other fields of research.

For example, if we draw 100 samples from a uniform distribution on [0, 1], calculate their mean, and repeat this for 10,000 trials, the distribution of the means will converge to a normal distribution. But remember: we could sample from ANY distribution and we would get a similar result.
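Here is a minimal sketch of that experiment, assuming NumPy and matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 10,000 trials: in each one, draw 100 samples from Uniform[0, 1] and keep the mean
means = rng.uniform(0.0, 1.0, size=(10_000, 100)).mean(axis=1)

# The histogram of the means is approximately normal, centered around 0.5
plt.hist(means, bins=50, density=True)
plt.title("Distribution of 10,000 sample means (n = 100)")
plt.show()
```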

The inference of a normal distribution is quite trivial, since we only need to estimate the mean and the variance of the data with simple math.
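In code, fitting the normal really is just a mean and a standard deviation (using scipy.stats here, continuing the simulation above):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
means = rng.uniform(0.0, 1.0, size=(10_000, 100)).mean(axis=1)

# norm.fit is nothing more than the sample mean and standard deviation
mu, sigma = norm.fit(means)
print(mu, sigma)  # roughly 0.5 and sqrt(1/12) / sqrt(100) ≈ 0.029
```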
Extreme Value Theory and Funky Distributions
What if we keep the conditions of the last experiment, but instead of taking the mean we take the maximum value over the 100 samples, and then repeat that for 10,000 trials?
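The same simulation as before, only keeping the maximum of each trial instead of the mean (same assumed dependencies):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 10,000 trials: keep the maximum of each batch of 100 Uniform[0, 1] samples
maxima = rng.uniform(0.0, 1.0, size=(10_000, 100)).max(axis=1)

# The histogram piles up just below 1 and is heavily skewed; clearly not a normal
plt.hist(maxima, bins=50, density=True)
plt.title("Distribution of 10,000 sample maxima (n = 100)")
plt.show()
```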

Of course, we didn't get a normal distribution, but something else (maybe a one-sided gamma distribution?).
Can we generalize this distribution in a similar fashion to the CLT? This is exactly the question that R. A. Fisher and L. H. C. Tippett asked themselves in 1928.

They proved that, under several assumptions, it can indeed be generalized, but the math behind it is beyond the scope of this blog post (you can read more about it here). In the example above, I generalized using the Weibull distribution, which is the go-to distribution for extreme value inference.
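As an illustration, here is one way to fit such a distribution to the simulated maxima with SciPy's reversed Weibull (weibull_max). Fixing the location at the known upper bound of 1.0 is my own simplifying choice for this toy example:

```python
import numpy as np
from scipy.stats import weibull_max

rng = np.random.default_rng(0)
maxima = rng.uniform(0.0, 1.0, size=(10_000, 100)).max(axis=1)

# Reversed Weibull: the EVT limit for maxima of a variable bounded from above.
# Fixing loc at the upper bound (1.0) keeps the fit simple for this example.
shape, loc, scale = weibull_max.fit(maxima, floc=1.0)
print(shape, loc, scale)
```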
Cool, but how does it help me with open-set recognition?
Now we can infer the probability that a sample x lies beyond the extreme values (maximum or minimum); we can call those values "boundaries". You could also set a confidence interval for sample x belonging to each class, and detect anomalies along the way.
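For instance, with the Weibull fitted above, one possible (hypothetical) way to turn it into a boundary and a tail probability could look like this:

```python
import numpy as np
from scipy.stats import weibull_max

rng = np.random.default_rng(0)
maxima = rng.uniform(0.0, 1.0, size=(10_000, 100)).max(axis=1)
shape, loc, scale = weibull_max.fit(maxima, floc=1.0)

# How unlikely is it to see a maximum as small as x under the fitted model?
def tail_probability(x):
    return weibull_max.cdf(x, shape, loc=loc, scale=scale)

# A lower boundary at the 1% quantile: anything below it is flagged as unusual
boundary = weibull_max.ppf(0.01, shape, loc=loc, scale=scale)
print(boundary, tail_probability(0.9))
```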
This method is also somewhat noise-proof, since we need a critical mass of evidence to change the shape of the distribution. In other words, if we were to artificially introduce noise, such as a single value of 1.1 in our previous example, the inferred Weibull distribution would barely change.
For deep-learning NLP this is key, since the sample is not the document itself; rather, the activations (or the document's "signal") form the sample space ("x") that we work with. So the strategy is as follows (a rough sketch appears after the list).
- Train a model
- Get the activations (or other metrics) as the signal
- Check whether the signal is within the minimum/maximum thresholds; if it falls outside the boundaries, tag the sample as "Other".
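Below is a rough, self-contained sketch of that strategy. It is not Cognigo's implementation: the `model` callable, the per-class collections of training activations, and the choice of quantile are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import weibull_max

def fit_boundaries(train_activations, alpha=0.005):
    """Fit a reversed Weibull per class and keep a low quantile as its boundary.

    train_activations: {class_index: list of top activation values observed on
    correctly classified training samples} (a hypothetical input format).
    """
    boundaries = {}
    for c, scores in train_activations.items():
        scores = np.asarray(scores, dtype=float)
        # Fix the location just above the observed maximum so every score lies
        # inside the support; a pragmatic choice, not the exact published recipe.
        shape, loc, scale = weibull_max.fit(scores, floc=scores.max() + 1e-6)
        boundaries[c] = weibull_max.ppf(alpha, shape, loc=loc, scale=scale)
    return boundaries

def predict_open_set(model, x, boundaries):
    """model(x) is assumed to return a vector of class activations (the 'signal')."""
    activations = np.asarray(model(x))
    c = int(np.argmax(activations))
    if activations[c] < boundaries[c]:   # signal falls outside the fitted boundary
        return "Other"
    return c
```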
I really believe EVT can be a great tool for improving DL results and filtering noise, and I'd love to hear feedback or ideas on how to improve open-set recognition.
