GZIP + KNN > BERT for Text Classification??

Sehyun Choi
3 min read · Jul 14, 2023

Review of the ACL2023 GZIP paper — “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

TLDR;

GZIP-based compressed representation (of pure strings) + compressed space distance + KNN-classifier > BERT!

Motivation

Deep neural networks are very strong learners that solve many different tasks. But sometimes they are overkill for simpler tasks such as topic classification, since they are data-hungry, compute-costly, and require hyper-parameter tuning to work well.

This work focuses on a simpler alternative, “compressor-based text classification,” which is ridiculously uncomplicated yet shows surprisingly strong performance.

Method

There are three parts:

  1. Plain-old lossless compression algorithm (gzip in this work)
  2. compressor-based distance metric (Normalized Compression Distance in this work)
  3. k-NN classifier

In fact, the method is so simple that it can be written in 14 lines of Python code (including import statements):

import gzip
import numpy as np

for (x1, _) in test_set:
    Cx1 = len(gzip.compress(x1.encode()))
    distance_from_x1 = []
    for (x2, _) in training_set:
        Cx2 = len(gzip.compress(x2.encode()))
        x1x2 = " ".join([x1, x2])
        Cx1x2 = len(gzip.compress(x1x2.encode()))
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distance_from_x1.append(ncd)
    sorted_idx = np.argsort(np.array(distance_from_x1))
    top_k_class = [training_set[i][1] for i in sorted_idx[:k]]
    predict_class = max(set(top_k_class), key=top_k_class.count)
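To make this concrete, here is a self-contained toy run of the same procedure. The dataset sentences and the value of k are invented purely for illustration; they are not from the paper.

```python
import gzip
import numpy as np

# Hypothetical toy data: (text, label) pairs, invented for illustration.
training_set = [
    ("the stock market rallied as investors bought shares", "finance"),
    ("central bank raised interest rates to curb inflation", "finance"),
    ("the striker scored twice in the championship final", "sports"),
    ("the home team won the league title after a late goal", "sports"),
]
test_set = [("investors sold shares as the stock market fell", "finance")]
k = 1  # nearest single neighbor, chosen arbitrarily for this tiny dataset

for (x1, _) in test_set:
    # Compressed length of the test text alone.
    Cx1 = len(gzip.compress(x1.encode()))
    distance_from_x1 = []
    for (x2, _) in training_set:
        # Compressed length of the training text alone and of the pair.
        Cx2 = len(gzip.compress(x2.encode()))
        x1x2 = " ".join([x1, x2])
        Cx1x2 = len(gzip.compress(x1x2.encode()))
        # Normalized Compression Distance.
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distance_from_x1.append(ncd)
    # Majority vote over the labels of the k nearest training texts.
    sorted_idx = np.argsort(np.array(distance_from_x1))
    top_k_class = [training_set[i][1] for i in sorted_idx[:k]]
    predict_class = max(set(top_k_class), key=top_k_class.count)
    print(predict_class)
```

The test sentence shares long substrings ("the stock market", "investors", "shares") with the first finance example, so gzip compresses that concatenation best and the vote lands on "finance".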

That’s it! How does this work?

Theory

The main intuitions are:

  1. compressors → capture regularity
  2. objects from the same category → share more regularity

Let C(·) be the length of the compressed bitstring. If x1 and x2 share the same category and x3 is from a different category, then

C(x1, x2) - C(x1) < C(x1, x3) - C(x1)

since x1 and x2, being from the same category, share more regularity, fewer bits (less entropy) are needed to compress them together than x1 and x3. This intuition underlies compressor-based distance metrics rooted in Kolmogorov complexity, such as the "information distance" — but these are incomputable, so the authors instead use a computable approximation called the Normalized Compression Distance (NCD), calculated as below.

NCD(x, y) = (C(x, y) - min(C(x), C(y))) / max(C(x), C(y))

This corresponds to the `ncd` line in the Python code above.
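The inequality from the previous section is easy to sanity-check with gzip directly. The sentences below are invented for illustration; x1 and x2 are "same-topic", x3 is not:

```python
import gzip

def C(s: str) -> int:
    # Length of the gzip-compressed byte string.
    return len(gzip.compress(s.encode()))

def ncd(x: str, y: str) -> float:
    # Normalized Compression Distance, as defined above.
    Cx, Cy, Cxy = C(x), C(y), C(x + " " + y)
    return (Cxy - min(Cx, Cy)) / max(Cx, Cy)

x1 = "the stock market rallied as investors bought shares"
x2 = "the stock market fell as investors sold shares"      # same topic as x1
x3 = "the striker scored twice in the championship final"  # different topic

# Concatenating with a same-topic text adds fewer bits, so its NCD is smaller.
print(ncd(x1, x2), ncd(x1, x3))
```

gzip's back-references exploit the long shared substrings ("the stock market", "as investors", "shares") in the same-topic pair, so the first distance comes out smaller.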

Results

In-Domain

Below are the results in the in-domain datasets:

Key points:

  1. gzip, which is a training-free, preprocessing-free approach (even modality-free, as it can take any digital format as input), performs better than most previous baselines, except for BERT-based models.
  2. It even surpasses SentBERT on the DBpedia and R8 datasets.

Out-of-Domain (OOD)

And below are the results on the OOD datasets. Here, OOD means the datasets were not included in BERT's pre-training corpus. "Full" means full fine-tuning; "5-shot" means only 5 labeled examples per class.

gzip outperforms all.

Critiques

  • Only likely to work well when gzip actually compresses the input (compression ratio below 100%).
  • Short texts will be difficult for the same reason.
  • In particular, negation handling is questionable: a single word can completely flip the meaning of a sentence, but its semantics won't be captured by a compression algorithm.
  • It does not work well for images.
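The short-text concern is easy to verify: gzip's fixed header and trailer overhead means very short inputs come out longer, not shorter, after compression, so there is little regularity signal to exploit. The example strings below are arbitrary:

```python
import gzip

short = "not good"            # a short text: overhead dominates
long_ = "not good " * 50      # repetitive text: compresses very well

# For the short string, the "compressed" form is longer than the original.
print(len(short), len(gzip.compress(short.encode())))
# For the long repetitive string, compression shrinks it substantially.
print(len(long_), len(gzip.compress(long_.encode())))
```

This also hints at why negation is hard for this method: "not good" and "good" differ by a few bytes of compressed length, regardless of how much they differ in meaning.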

Takeaways

This work suggests that we have been so focused on making deep learning work, by adding more data and increasing model size, that we have lost sight of traditional methods…

This also offers an interesting insight to the community: we often refer to embeddings / latent variables as "compressions," whereas this method literally uses compression. If compressors work so well, could this serve as another inductive bias for learning representations — say, by designing a loss that encourages lossless compressibility?
