Training a Word2Vec model for Swedish

I read a blog post at http://mfcabrera.com/research/2013/11/14/word2vec-german.blog.org/ and it was well timed, since I need a word2vec model for various applications, but for Swedish. So I decided to do a small write-up covering this.

I will use Deeplearning4j in Java to train and test my model.

Downloading and cleaning the data

First get the Swedish wiki dump (or the dump for another language you are interested in) from https://dumps.wikimedia.org/svwiki/ (pick the date of the latest complete dump).

After you’ve downloaded it, it is time to clean the data. I used the Perl script written by Matt Mahoney, which mfcabrera refers to in his blog post. I put it on GitHub to make it easier to get (https://github.com/klintan/wikidump-xml-clean). For Swedish you might want to change the script to handle Swedish characters (by handle, I mean replace) using this Gist written by mfcabrera (https://gist.github.com/mfcabrera/7674065).

Effectively adding:

tr/A-Z/a-z/;
tr/ÅÄÖ/åäö/; # lowercase the Swedish uppercase letters
s/å/aa/g; # then transliterate å, ä and ö to ASCII digraphs
s/ä/ae/g;
s/ö/oe/g;

To run the Perl script and clean the wiki dump, run the following command:

perl wikifil.pl wikidump > wikitext.txt

Now you have a raw text file to train your word2vec model.
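Keep in mind that every word you later look up in the trained model has to go through the same transformation as the training text. Here is a minimal Java sketch that mirrors the Perl substitutions above. The class and method names are my own and not part of any library:

```java
import java.util.Locale;

public class SwedishNormalizer {

    // Mirrors the Perl substitutions: lowercase first, then replace the
    // Swedish letters a-ring, a-umlaut and o-umlaut with "aa", "ae", "oe".
    // The letters are written as Unicode escapes so the source stays ASCII.
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT)
                .replace("\u00e5", "aa")
                .replace("\u00e4", "ae")
                .replace("\u00f6", "oe");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Gr\u00e4dde"));  // graedde
        System.out.println(normalize("\u00d6rebro"));  // oerebro
    }
}
```

Running query words through a helper like this before calling the model avoids lookups that miss just because of casing or an "ä".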

Training time!

Start a new project in IntelliJ or your IDE of choice. I use Maven for building, so first I put the dependencies in the pom.xml:

<properties>
    <nd4j.version>0.0.3.5.5.3</nd4j.version>
    <dl4j.version>0.0.3.3.3.alpha1</dl4j.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-ui</artifactId>
        <version>${dl4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-nlp</artifactId>
        <version>${dl4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-jblas</artifactId>
        <version>${nd4j.version}</version>
    </dependency>
</dependencies>

Then create a Java class (any name will do) and use the following code to train your word2vec model:

public static void main(String[] args) throws Exception {
    // Raw text file produced by the cleaning step
    File file = new File("wikitext.txt");
    SentenceIterator iter = new FileSentenceIterator(file);
    TokenizerFactory t = new DefaultTokenizerFactory();
    int layerSize = 300;

    Word2Vec vec = new Word2Vec.Builder()
            .sampling(1e-5)
            .minWordFrequency(5)
            .batchSize(1000)
            .useAdaGrad(false)
            .layerSize(layerSize)
            .iterations(3)
            .learningRate(0.025)
            .minLearningRate(1e-2)
            .negativeSample(10)
            .iterate(iter)
            .tokenizerFactory(t)
            .build();
    vec.fit();

Pretty straightforward. First put the full path and file name of your text file in the “new File” constructor call. From that File object, create a FileSentenceIterator, which iterates over the sentences in the file. Then create a tokenizer factory to split each sentence into tokens before training, since word2vec trains on a per-word basis.
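To make the two steps concrete, here is a rough plain-Java illustration of what "iterate over sentences, then tokenize" amounts to. It does not use Deeplearning4j, and it is not how FileSentenceIterator or DefaultTokenizerFactory are implemented. It simply assumes one sentence per line and whitespace-separated tokens, and the class name is made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizeSketch {

    // Treats every non-empty line as one sentence and splits it on whitespace.
    static List<List<String>> tokenize(String text) {
        List<List<String>> sentences = new ArrayList<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            sentences.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = "pannkaka med graedde\n\nsylt och vaafflor\n";
        System.out.println(tokenize(sample));
        // [[pannkaka, med, graedde], [sylt, och, vaafflor]]
    }
}
```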

Next, create the Word2Vec model object using the Word2Vec builder. You can always tweak the parameters. I used these, but they could probably be optimized much more (there are lots of resources out there).
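As one example of what these knobs mean: sampling(1e-5) sets the threshold for subsampling frequent words. In the original word2vec paper (Mikolov et al., 2013), a word with relative corpus frequency f is discarded with probability 1 - sqrt(t / f), where t is the threshold. The sketch below evaluates that formula for a few made-up frequencies. Whether this Deeplearning4j version applies exactly the same formula is an assumption I have not verified, and the class name is hypothetical:

```java
public class SubsamplingSketch {

    // Discard probability from the word2vec paper: 1 - sqrt(t / f),
    // clamped at 0 for words rarer than the threshold.
    static double discardProbability(double frequency, double threshold) {
        return Math.max(0.0, 1.0 - Math.sqrt(threshold / frequency));
    }

    public static void main(String[] args) {
        double t = 1e-5; // same value as sampling(1e-5) in the builder
        double[] frequencies = {1e-2, 1e-4, 1e-6}; // hypothetical relative frequencies
        for (double f : frequencies) {
            System.out.println("f=" + f + " discard=" + discardProbability(f, t));
        }
    }
}
```

With t = 1e-5, a very common word that makes up 1% of the corpus is dropped about 97% of the time under this formula, while words rarer than the threshold are always kept.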

Training will take some time: on the order of 10–18 hours, depending on your machine.

To make sure you don’t have to train it again, save the model:

Nd4j.ENFORCE_NUMERICAL_STABILITY = true;

SerializationUtils.saveObject(vec, new File("w2v_model.ser"));
WordVectorSerializer.writeWordVectors(vec, "w2v_vectors.txt");

Here it is saved both in serialized object format and raw text format, where you can see each vector representation in plain text.
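If you want to use the plain-text file outside Deeplearning4j, it can be parsed by hand. The sketch below assumes each line holds a word followed by its space-separated vector components. That is an assumption about the format, so check your own file before relying on it. The class name and the sample data are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VectorFileSketch {

    // Parses lines of the assumed form "word v1 v2 ... vn" into a map.
    static Map<String, double[]> parse(String text) {
        Map<String, double[]> vectors = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            double[] vector = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vector[i - 1] = Double.parseDouble(parts[i]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Two made-up 3-dimensional vectors standing in for w2v_vectors.txt
        String sample = "pannkaka 0.1 -0.2 0.3\ngraedde 0.0 0.5 -0.1\n";
        Map<String, double[]> vectors = parse(sample);
        System.out.println(vectors.keySet());               // [pannkaka, graedde]
        System.out.println(vectors.get("pannkaka").length); // 3
    }
}
```

For the real file you would read it line by line rather than as one big string, since 300-dimensional vectors for a whole Wikipedia vocabulary add up quickly.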

Testing it

Now for the fun part! Testing the model for some results:

double sim = vec.similarity("pannkaka", "graedde");
System.out.println("Similarity between pannkaka and graedde: " + sim);
Collection<String> similar = vec.wordsNearest("pannkaka", 20);
System.out.println(similar);

Note that the words are written the way they appear in the cleaned text: all lowercase, and with “ä” replaced by “ae”, so “Grädde” becomes “graedde”.

“vec.similarity” gives you a similarity score for the two words (higher means more similar), and “vec.wordsNearest” gives you the 20 words closest to your word.
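For intuition, a score like this is typically the cosine of the angle between the two word vectors. The plain-Java sketch below computes it for two made-up vectors. It is an illustration only, not Deeplearning4j's code, and the class name is hypothetical:

```java
public class CosineSketch {

    // Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 0.0}; // made-up 2-dimensional "word vectors"
        double[] b = {1.0, 1.0};
        System.out.println(cosine(a, a)); // 1.0, identical direction
        System.out.println(cosine(a, b)); // about 0.707, 45 degrees apart
    }
}
```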

That is all! Please feel free to post some of your results in the comment section.

PS. I also get to do NLP and deep learning at Meltwater, where I work in the data science team.