GZIP + KNN > BERT for Text Classification??
Review of the ACL 2023 gzip paper, “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
TL;DR
gzip-compressed representations of raw strings + a compression-based distance metric + a k-NN classifier > BERT!
Motivation
Deep neural networks are very strong learners that solve many different tasks. But they can be overkill for simpler tasks such as topic classification: they are data-hungry, compute-intensive, and need hyper-parameter tuning to work well.
This work focuses on a simpler alternative, “compressor-based text classification,” which is ridiculously uncomplicated yet shows surprisingly strong performance.
Method
There are three parts:
- A plain old lossless compression algorithm (gzip in this work)
- A compressor-based distance metric (Normalized Compression Distance in this work)
- A k-NN classifier
In fact, the method is so simple that the entire classifier fits in 14 lines of Python (comments aside), including the import statements:
import gzip
import numpy as np
# training_set and test_set hold (text, label) pairs; k is the number of neighbors
for (x1, _) in test_set:
    # compressed length of the test document on its own
    Cx1 = len(gzip.compress(x1.encode()))
    distance_from_x1 = []
    for (x2, _) in training_set:
        Cx2 = len(gzip.compress(x2.encode()))
        # compressed length of the two documents concatenated
        x1x2 = " ".join([x1, x2])
        Cx1x2 = len(gzip.compress(x1x2.encode()))
        # Normalized Compression Distance (NCD)
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distance_from_x1.append(ncd)
    # majority vote over the k nearest training examples
    sorted_idx = np.argsort(np.array(distance_from_x1))
    top_k_class = [training_set[i][1] for i in sorted_idx[:k]]
    predict_class = max(set(top_k_class), key=top_k_class.count)
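The snippet assumes that training_set, test_set, and k are already defined. Purely for illustration (these toy examples and k = 3 are my own, not from the paper), a minimal setup defined above the loop could look like:
# hypothetical toy data: (text, label) pairs
training_set = [
    ("the market rallied on strong earnings", "business"),
    ("the central bank raised interest rates", "business"),
    ("the striker scored twice in the final", "sports"),
    ("the home team clinched the championship", "sports"),
]
test_set = [("shares jumped after the earnings report", "business")]
k = 3  # number of nearest neighbors to vote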
That’s it! How does this work?
Theory
The main intuitions are:
- compressors → capture regularity
- objects from the same category → share more regularity
Let C(·) be the length of the compressed bitstring, and let C(x1, x2) be the compressed length of the concatenation of x1 and x2. If x1 and x2 share the same category and x3 is from a different category, then
C(x1, x2) - C(x1) < C(x1, x3) - C(x1)
Since x1 and x2 come from the same category and therefore share more regularity, fewer bits (less entropy) are needed to encode x2 given x1 than to encode x3 given x1. This intuition is formalized by Kolmogorov complexity and the “information distance” built on it, but both are incomputable, so the authors use a computable approximation, the Normalized Compression Distance (NCD), calculated as below.
NCD(x, y) = (C(x, y) - min(C(x), C(y))) / max(C(x), C(y))
This is exactly the ncd line in the Python snippet above.
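To see this numerically, here is a small sanity check; the ncd helper simply packages the formula above, and the toy sentences are my own:
import gzip

def ncd(x, y):
    Cx = len(gzip.compress(x.encode()))
    Cy = len(gzip.compress(y.encode()))
    Cxy = len(gzip.compress(" ".join([x, y]).encode()))
    return (Cxy - min(Cx, Cy)) / max(Cx, Cy)

# same topic, shared vocabulary -> the pair compresses well together
same = ncd("the fed raised interest rates again", "interest rates were raised by the fed")
# different topics, little shared regularity -> the pair compresses worse together
diff = ncd("the fed raised interest rates again", "the striker scored a late winner")
print(same < diff)  # expected: True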
Results
In-Domain
Below are the results on the in-domain datasets:
Key points:
- gzip, a training-free, preprocessing-free approach (even modality-free, as it can take any digital format as input), performs better than most previous baselines, except for BERT-based models.
- It even surpasses SentBERT on the DBpedia and R8 datasets.
Out-of-Domain (OOD)
And below are the results on the OOD datasets. Here, OOD means the datasets were not included in BERT's pre-training corpus. Full means full fine-tuning; 5-shot means 5 labeled examples per class.
gzip outperforms all.
Critiques
- Only likely to work well when gzip actually compresses the input (compression ratio below 100%).
- Short texts will be difficult, as they give the compressor little redundancy to exploit.
- In particular, negation handling is questionable: a single word can completely flip the meaning of a sentence, but the compressor barely registers the change (see the quick check below).
- It does not work well for images.
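A quick way to probe the negation concern (toy strings of my own choosing, with the same NCD helper as before):
import gzip

def ncd(x, y):
    Cx = len(gzip.compress(x.encode()))
    Cy = len(gzip.compress(y.encode()))
    Cxy = len(gzip.compress(" ".join([x, y]).encode()))
    return (Cxy - min(Cx, Cy)) / max(Cx, Cy)

# opposite meanings, nearly identical bytes -> gzip should call these very close
print(ncd("the movie was great", "the movie was not great"))
# same sentiment, different wording -> gzip should call these farther apart
print(ncd("the movie was great", "a wonderful and enjoyable film"))
If gzip behaves as expected, the opposite-meaning pair scores as far more similar than the paraphrase, which is exactly the failure mode the critique points at.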
Takeaways
This work suggests that we have been so focused on making deep learning work, by adding more data and scaling up models, that we have lost sight of traditional methods…
It also adds an interesting insight for the community: we often refer to embeddings / latent variables as “compressions,” whereas this method literally uses compression. If compressors work this well, should this serve as another inductive bias for learning representations, say, a loss that encourages lossless compression?