Simple Python Downloader for Available Word Embeddings

Hiroki Nakayama
Published in chakki
2 min read, May 30, 2017

In natural language processing, word embeddings are used for many tasks, such as document classification, named-entity recognition, and question answering. Many pre-trained word embeddings are available these days, so we don’t need to train them ourselves.

Pre-trained word embeddings are easy to use. However, finding and downloading them takes a long time, because they are made by different people and published on different sites. It’s a waste of time.

To save you time, I made a simple tool for downloading available word embeddings. It is called chakin. Its features: it is written in Python, it can search and download datasets, and it supports 23 sets of vectors (as of May 29, 2017).

Let me show you how to use it.

Installation

To install chakin, simply:

$ pip install chakin

Satisfaction, guaranteed.

Usage

You need only three lines to download a dataset. As an example, let’s download fastText (English version), one of the available word embeddings. First, start the Python interpreter:

$ python

Before downloading the dataset, import chakin and look for word embeddings with the search method. In this case, we will search datasets by language:

>>> import chakin
>>> chakin.search(lang="English")
             Name  Dimension                     Corpus  VocabularySize
2    fastText(en)        300                  Wikipedia            2.5M
11   GloVe.6B.50d         50  Wikipedia+Gigaword 5 (6B)            400K
12  GloVe.6B.100d        100  Wikipedia+Gigaword 5 (6B)            400K
...

Currently, the search method can only filter by language.
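
Since the search only filters by language, narrowing the list further (by dimension, for example) is up to you. Here is a minimal sketch of that idea in plain Python. It does not use chakin at all: ROWS is a hand-made copy of the three rows from the search output above, and filter_rows is a hypothetical helper name, not part of the library.

```python
# Rows copied by hand from the search output above:
# (index, name, dimension, corpus, vocabulary size).
# This is illustrative data, not something chakin returns.
ROWS = [
    (2, "fastText(en)", 300, "Wikipedia", "2.5M"),
    (11, "GloVe.6B.50d", 50, "Wikipedia+Gigaword 5 (6B)", "400K"),
    (12, "GloVe.6B.100d", 100, "Wikipedia+Gigaword 5 (6B)", "400K"),
]


def filter_rows(rows, max_dimension):
    """Keep only the rows whose dimension is at most max_dimension."""
    return [row for row in rows if row[2] <= max_dimension]


# Print the index, name and dimension of the smaller embeddings.
for index, name, dimension, corpus, vocab in filter_rows(ROWS, max_dimension=100):
    print(index, name, dimension)
```

The index in the first position is the number you would then pass to the download method.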

Once you find the dataset you want, download it by calling the download method with the dataset’s index (2 for fastText(en) in the output above):

>>> chakin.download(number=2, save_dir='./')
Test: 100% || | Time: 0:18:32 6.7 MiB/s
'./wiki.en.vec'
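
Once the file is on disk, you will probably want to read it. The sketch below is not part of chakin. It assumes the fastText text (.vec) layout: a header line with the vocabulary size and the dimension, then one word per line followed by its values. The names read_vec and load_vec are hypothetical helpers. The full file is large, so the example runs on a tiny in-memory sample and offers a limit argument for reading only the first few words.

```python
import io


def read_vec(stream, limit=None):
    """Parse word vectors in the fastText text (.vec) layout.

    The first line is "<vocabulary size> <dimension>"; every following
    line is a word followed by <dimension> space-separated numbers.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so that only the last `dim` fields are
        # treated as numbers, whatever the word itself contains.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the header's dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def load_vec(path, limit=None):
    """Open a .vec file (for example './wiki.en.vec') and parse it."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return read_vec(f, limit)


# A tiny sample in the same layout, so the sketch runs without the download.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vectors = read_vec(sample)
print(vectors["world"])  # [0.4, 0.5, 0.6]
```

With the real file, something like load_vec('./wiki.en.vec', limit=10000) would read only the first 10,000 words, which is usually enough for a quick experiment.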

Conclusion

Public word embeddings are often used in natural language processing, because training word embeddings yourself can take a long time. In this post, I introduced a tool for downloading pre-trained word embeddings. It should save you some time.

If you star the repository, it will be very encouraging for me!

Hiroki Nakayama, editor for chakki

Open source developer. Interested in machine learning and natural language processing. GitHub: https://github.com/Hironsan