# GZIP + KNN > BERT for Text Classification??

Review of the ACL2023 GZIP paper — “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

# TL;DR

GZIP-based compressed representation (of pure strings) + compressed space distance + KNN-classifier > BERT!

# Motivation

Deep Neural Networks are very strong learners that solve a lot of different tasks. But sometimes, they are overkill for simpler tasks such as topic classification, as they are data-hungry, compute-costly, and require hyper-parameter tuning to work.

This work focuses on a simpler alternative, “compressor-based text classification,” which is ridiculously uncomplicated yet shows surprisingly strong performance.

# Method

There are three parts:

- a plain-old lossless compression algorithm (*gzip* in this work)
- a compressor-based distance metric (Normalized Compression Distance in this work)
- a *k*-NN classifier

In fact, the method is so simple that it can be written in 14 lines of Python code (including import statements):

```python
import gzip
import numpy as np
for (x1, _) in test_set:
    Cx1 = len(gzip.compress(x1.encode()))
    distance_from_x1 = []
    for (x2, _) in training_set:
        Cx2 = len(gzip.compress(x2.encode()))
        x1x2 = " ".join([x1, x2])
        Cx1x2 = len(gzip.compress(x1x2.encode()))
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distance_from_x1.append(ncd)
    sorted_idx = np.argsort(np.array(distance_from_x1))
    top_k_class = list(training_set[sorted_idx[:k], 1])
    predict_class = max(set(top_k_class), key=top_k_class.count)
```

(Note: `training_set[sorted_idx[:k], 1]` assumes `training_set` is an `(n, 2)` NumPy array of (text, label) pairs; the `list(...)` conversion is needed because NumPy arrays have no `.count` method.)

That’s it! How does this work?
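To make the snippet concrete, here is a self-contained variant that wraps the same logic in a function and runs it on a tiny made-up dataset (the sentences, labels, and the function name `gzip_knn_predict` are illustrative, not from the paper):

```python
import gzip
import numpy as np

def gzip_knn_predict(x1, training_set, k=2):
    """Classify x1 by k-NN over Normalized Compression Distance (NCD)."""
    Cx1 = len(gzip.compress(x1.encode()))
    distances = []
    for x2, _ in training_set:
        Cx2 = len(gzip.compress(x2.encode()))
        Cx1x2 = len(gzip.compress(" ".join([x1, x2]).encode()))
        distances.append((Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2))
    sorted_idx = np.argsort(np.array(distances))
    top_k = [training_set[i][1] for i in sorted_idx[:k]]
    # Majority vote among the k nearest neighbors
    return max(set(top_k), key=top_k.count)

training_set = [
    ("cats and dogs are popular pets and many people keep cats or dogs at home", "animal"),
    ("a dog or a cat makes a good pet and pets like cats and dogs need daily care", "animal"),
    ("the compiler translates source code into machine code and optimizes the code", "tech"),
    ("programmers write source code and the compiler checks the code for errors", "tech"),
]

print(gzip_knn_predict("many people love their pet cats and dogs and keep pets at home", training_set))
```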

# Theory

The main intuitions are:

- compressors → capture regularity
- objects from the same category → share more regularity

Let *C(·)* be the length of the compressed bitstring. If *x1* and *x2* share the same category and *x3* is from a different category, then

*C(x1, x2) - C(x1) < C(x1, x3) - C(x1)*

Since *x1* and *x2* come from the same category and therefore share more regularity, fewer bits (less entropy) are needed to compress them together than to compress *x1* with *x3*. This intuition connects to compressor-based distance measures rooted in *Kolmogorov complexity*, such as the *information distance*; both are incomputable, so the authors instead use a computable approximation, the *Normalized Compression Distance (NCD)*, calculated as below.

*NCD(x, y) = (C(x, y) - min(C(x), C(y))) / max(C(x), C(y))*

This corresponds to the `ncd` line in the Python code above.
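The inequality above is easy to check empirically: appending a same-topic sentence to *x1* costs fewer extra compressed bytes than appending a different-topic sentence of similar length. A minimal sketch (the sentences are illustrative, not from the paper):

```python
import gzip

def C(s: str) -> int:
    """Compressed length in bytes."""
    return len(gzip.compress(s.encode()))

x1 = "the team won the championship game in overtime last night"
x2 = "the home team lost the final championship game of the season"  # same topic as x1
x3 = "the new phone ships with a faster chip and a bigger battery"   # different topic

# Extra bytes needed on top of C(x1) for each candidate:
print(C(x1 + " " + x2) - C(x1))  # same-category: shares many substrings, cheap
print(C(x1 + " " + x3) - C(x1))  # different-category: little shared regularity
```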

# Results

## In-Domain

Below are the results in the in-domain datasets:

Key points:

- *gzip*, a training-free, preprocessing-free approach (even modality-free, as it can take any digital format as input), performed better than most previous baselines, except for BERT-based models.
- It even surpasses SentBERT on the DBPedia and R8 datasets.

## Out-of-Domain (OOD)

And below are the results on the OOD datasets. Here, OOD means the datasets were not included in BERT's pre-training data. "Full" means full fine-tuning; "5-shot" means 5 labeled examples per class.

*gzip* outperforms all.

# Critiques

- Only likely to work well when gzip actually compresses the input (compression ratio < 100%).
- Short texts will be difficult.
- In particular, negation handling is questionable: a single word can completely flip the meaning of a sentence, but its semantics won't be captured by the compression algorithm.
- Does not work well for images.
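The negation point can be demonstrated directly: adding a couple of "not"s barely changes the compressed length, so NCD sees two near-identical strings even though the meaning has flipped. A small sketch (the sentences are illustrative, not from the paper):

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance using gzip."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress(" ".join([x, y]).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

pos = "the movie was good and I would recommend it to my friends"
neg = "the movie was not good and I would not recommend it to my friends"
unrelated = "quarterly earnings exceeded analyst expectations this year"

# The opposite-meaning pair looks far closer than the unrelated pair:
print(ncd(pos, neg), ncd(pos, unrelated))
```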

# Takeaways

This work suggests that we have been so focused on making deep learning work, by adding more data and increasing model size, that we have lost sight of traditional methods…

It also adds an interesting insight for the community: we often refer to embeddings / latent variables as "*compressions*," whereas this method literally uses compression. If compressors work so well, could this serve as another inductive bias for learning representations, say, a loss that encourages lossless compression?