Compressing unsupervised fastText models

Dec 14, 2021


Image by author: exploiting the smolness meme for fastText (generated with meme-arsenal)

How to use it?

Working with an existing model

import compress_fasttext

small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model['hello'])
# [ 1.847366e-01 6.326839e-03 4.439018e-03 ... -2.884310e-02]
# a 300-dimensional numpy array

def cosine_sim(x, y):
    return sum(x * y) / (sum(x**2) * sum(y**2)) ** 0.5

print(cosine_sim(small_model['cat'], small_model['cat']))
# 1.0
print(cosine_sim(small_model['cat'], small_model['dog']))
# 0.6768642734684225
print(cosine_sim(small_model['cat'], small_model['car']))
# 0.18485135055040858

print(small_model.most_similar('Python'))
# [('PHP', 0.5253), ('.NET', 0.5027), ('Java', 0.4897), ... ]
# The compressed model can also serve as a feature extractor
# in a scikit-learn pipeline:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin

class FastTextTransformer(BaseEstimator, TransformerMixin):
    """ Convert texts into their mean fastText vectors """
    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.stack([
            np.mean([self.model[w] for w in text.split()], 0)
            for text in X
        ])

# Train a toy classifier that separates food words from non-food words
classifier = make_pipeline(
    FastTextTransformer(model=small_model),
    LogisticRegression()
).fit(
    ['banana', 'soup', 'burger', 'car', 'tree', 'city'],
    [1, 1, 1, 0, 0, 0]
)
classifier.predict(['jet', 'cake'])
# array([0, 1])

What models are available

Image by author: screenshot of https://github.com/avidale/compress-fasttext/releases/tag/gensim-4-draft.

Compressing a model

# Install the package with all the extras needed for compression
pip install compress-fasttext[full]

# Option 1: the original model is in the Facebook binary format
from gensim.models.fasttext import load_facebook_model
big_model = load_facebook_model('path-to-original-model').wv

# Option 2: the original model is in the gensim format
import gensim
big_model = gensim.models.fasttext.FastTextKeyedVectors.load('path-to-original-model')

# Compress the model and save it to disk
import compress_fasttext
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')

# The same compression with its parameters spelled out
small_model = compress_fasttext.prune_ft_freq(
    big_model,
    new_vocab_size=20_000,    # number of words
    new_ngrams_size=100_000,  # number of character ngrams
    pq=True,                  # use product quantization
    qdim=100,                 # dimensionality of quantization
)
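For a rough sense of scale (assuming, purely as an illustration, that product quantization stores each row as qdim one-byte codes plus small codebooks): the settings above keep 20,000 word rows and 100,000 ngram rows, i.e. on the order of 120,000 × 100 bytes ≈ 12 MB, compared to the several gigabytes occupied by the original matrices of a model like cc.en.300 (roughly 2 million words and 2 million ngram buckets of 300 float32 values each).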

How does it work?

Related work

How can fastText be compressed?

# Pseudocode: how fastText computes the embedding of a (possibly out-of-vocabulary) word
def embed(word, model):
    if word in model.vocab:
        result = model.vectors_vocab[word]
    else:
        result = zeros()
    n = 1
    for ngram in get_ngrams(word, model.min_n, model.max_n):
        result += model.vectors_ngrams[hash(ngram)]
        n += 1
    return result / n
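Roughly speaking, prune_ft_freq combines two ideas: dropping the rows of these two matrices (vectors_vocab and vectors_ngrams) that are used least often, and storing the surviving rows in a lossy, quantized form. The snippet below is a minimal sketch of product quantization only; it is not the code of compress-fasttext, and the function names and parameters are made up for this illustration.

import numpy as np
from sklearn.cluster import KMeans

def product_quantize(matrix, n_subvectors=10, n_centroids=256):
    """Toy product quantization: split every row into sub-vectors and
    replace each sub-vector with the index of its nearest k-means centroid,
    so a row of 300 float32 values becomes n_subvectors one-byte codes."""
    n_rows, dim = matrix.shape
    sub_dim = dim // n_subvectors
    codebooks, codes = [], []
    for i in range(n_subvectors):
        block = matrix[:, i * sub_dim:(i + 1) * sub_dim]
        km = KMeans(n_clusters=n_centroids, n_init=1).fit(block)
        codebooks.append(km.cluster_centers_.astype(np.float32))  # n_centroids x sub_dim floats
        codes.append(km.labels_.astype(np.uint8))                 # one byte per row
    return codebooks, np.stack(codes, axis=1)

def reconstruct_row(codebooks, codes, row):
    """Approximately restore one original row from its codes."""
    return np.concatenate([codebooks[i][codes[row, i]] for i in range(len(codebooks))])

Storing one-byte codes instead of full float32 rows is where most of the size reduction comes from; pruning the vocabulary and the ngram buckets accounts for the rest.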

Are the compressed models good enough?

Image by author: Evaluation of models on SentEval

Conclusion

@misc{dale_compress_fasttext,
    author = {Dale, David},
    title = {Compressing unsupervised fastText models},
    editor = {towardsdatascience.com},
    url = {https://towardsdatascience.com/eb212e9919ca},
    month = {December},
    year = {2021},
    note = {[Online; posted 12-December-2021]},
}
