Word Embedding and Language Modeling | Towards AI

How to Get Deterministic word2vec/doc2vec/paragraph Vectors

Jun Wang
Nov 6, 2018 · 4 min read

Welcome to our Word Embedding Series. This post is the first story of the series. It is best suited to readers at an intermediate level or above who have trained, or at least tried, word2vec or doc2vec/paragraph vectors. But no worries: in the following posts I will introduce the background, the prerequisites, and how the code implements the ideas from the papers.

I will try my best not to redirect you to other links that ask you to read tedious tutorials until you give up (trust me, I am a victim of the countless online tutorials :) ). I want you to understand word vectors at the code level together with me, so that we know how to design and implement our own word embeddings and language models.


If you have ever trained word vectors yourself, you may have noticed that the model and the vector representations differ from one training run to the next, even when you feed in the same training data. This is because of randomness introduced at training time. The code can speak for itself, so let’s look at where the randomness comes from and how to eliminate it thoroughly. I will use DL4J’s implementation of paragraph vectors to show the code. If you want to look at another package, go to gensim’s doc2vec, which is implemented in the same way.

Where the randomness comes from

The initialization of model weights and vector representation

We know that before training, the model weights and vector representations are initialized randomly, and that this randomness is controlled by a seed. Hence, if we fix the seed (for example, to 0), we get exactly the same initialization every time. Here is where the seed takes effect: syn0 holds the model weights, and it is initialized by Nd4j.rand.

// Nd4j takes seed configuration here
Nd4j.getRandom().setSeed(configuration.getSeed());
// Nd4j initializes a random matrix for syn0
syn0 = Nd4j.rand(new int[] {vocab.numWords(), vectorLength}, rng).subi(0.5).divi(vectorLength);
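To see the effect of a fixed seed without pulling in ND4J, here is a minimal sketch that uses java.util.Random as a stand-in for the Nd4j random generator. The class SeededInit and its init method are my own illustrative names, not DL4J code; the arithmetic only mimics the shape of the rand(...).subi(0.5).divi(vectorLength) expression above.

```java
import java.util.Arrays;
import java.util.Random;

final class SeededInit {
    // Fills a rows x cols matrix with (uniform[0,1) - 0.5) / cols,
    // loosely mirroring rand(...).subi(0.5).divi(vectorLength).
    // java.util.Random is an assumption here, standing in for Nd4j's generator.
    static double[][] init(int rows, int cols, long seed) {
        Random rng = new Random(seed);
        double[][] m = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                m[r][c] = (rng.nextDouble() - 0.5) / cols;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] a = init(3, 4, 0L);
        double[][] b = init(3, 4, 0L);
        double[][] c = init(3, 4, 1L);
        // Same seed: the two matrices are identical element by element.
        System.out.println("same seed equal:      " + Arrays.deepEquals(a, b));
        // Different seed: the matrices differ.
        System.out.println("different seed equal: " + Arrays.deepEquals(a, c));
    }
}
```

The point is only that the seed fully determines the starting weights, so two runs with the same seed start from the same place.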

PV-DBOW algorithm

If we use the PV-DBOW algorithm (I will explain its details in the following posts) to train paragraph vectors, then during the training iterations it randomly subsamples words from the text window to calculate and update the weights. But this randomness is not really random. Let’s take a look at the code.

// nextRandom is an AtomicLong initialized with the thread id
this.nextRandom = new AtomicLong(this.threadId);

And nextRandom is used in

trainSequence(sequence, nextRandom, alpha);

Inside trainSequence, it is updated like this:

nextRandom.set(nextRandom.get() * 25214903917L + 11);

If we go deeper into the training steps, we find that every subsequent value of nextRandom is generated in the same way, i.e., by the same arithmetic (the multiplier 25214903917 and the increment 11 are the constants of the linear congruential generator behind java.util.Random). The numbers therefore depend only on the thread id, and the thread ids are 0, 1, 2, 3, …. Hence, the sequence is no longer random.
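To make this concrete, here is a small sketch that replays the update from the snippet above, starting from a given thread id. The class ThreadSeededLcg and its sequence method are hypothetical helpers I wrote for illustration; only the update line itself comes from the code shown earlier.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

final class ThreadSeededLcg {
    // Applies nextRandom = nextRandom * 25214903917 + 11 `steps` times,
    // starting from the thread id, and records each value.
    static long[] sequence(long threadId, int steps) {
        AtomicLong nextRandom = new AtomicLong(threadId);
        long[] out = new long[steps];
        for (int i = 0; i < steps; i++) {
            nextRandom.set(nextRandom.get() * 25214903917L + 11);
            out[i] = nextRandom.get();
        }
        return out;
    }

    public static void main(String[] args) {
        // Thread 0 always starts with 11, then 277363943098, and so on.
        System.out.println(Arrays.toString(sequence(0L, 3)));
        // Thread 1 always starts with 25214903928.
        System.out.println(Arrays.toString(sequence(1L, 3)));
    }
}
```

Run it as many times as you like: the values for a given thread id never change, which is why this part of training is already deterministic as long as the same thread sees the same data.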

Parallel tokenization

Parallel tokenization is used because processing complicated text can be time-consuming, so tokenizing in parallel helps performance; however, consistency between training runs is then not guaranteed. The sequences produced by the tokenizer can be fed to the training threads in a different order each time. As you can see from the code, if we set allowParallelBuilder to false, the runnable that performs the tokenization is awaited until it finishes, so the order in which data is fed is preserved.

if (!allowParallelBuilder) {
    try {
        runnable.awaitDone();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}
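The difference between waiting and not waiting can be sketched with a plain ExecutorService. This is not DL4J code: TokenizationOrder, tokenize, tokenizeAwaitingEach and tokenizeInParallel are names I made up, and the whitespace tokenizer is a stand-in for a real tokenizer factory. The first method waits for each task before moving on, in the spirit of awaitDone() above; the second lets the tasks finish in whatever order the scheduler picks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class TokenizationOrder {
    // Hypothetical whitespace tokenizer, standing in for a real tokenizer.
    static List<String> tokenize(String doc) {
        return Arrays.asList(doc.trim().split("\\s+"));
    }

    // Waits for each task before submitting the next one,
    // so the output order always equals the input order.
    static List<List<String>> tokenizeAwaitingEach(List<String> docs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<List<String>> out = new ArrayList<>();
        try {
            for (String doc : docs) {
                Callable<List<String>> task = () -> tokenize(doc);
                out.add(pool.submit(task).get()); // block until this document is done
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return out;
    }

    // Lets all tasks run freely; results are appended in completion order,
    // which may differ from the input order and from run to run.
    static List<List<String>> tokenizeInParallel(List<String> docs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<List<String>> out = Collections.synchronizedList(new ArrayList<>());
        for (String doc : docs) {
            pool.execute(() -> out.add(tokenize(doc)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return new ArrayList<>(out);
    }
}
```

Both methods return the same tokenized documents; only the first one guarantees the order, and it pays for that guarantee by giving up the parallel speed-up.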

Queue that provides sequences to every thread to train

This LinkedBlockingQueue gets sequences from the iterator over the training text and hands them to the threads. Since the threads arrive in no fixed order, each thread can get different sequences to train on in every training run. Let’s look at the implementation of this data provider.

// initialize a sequencer to provide data to the threads
val sequencer = new AsyncSequencer(this.iterator, this.stopWords);
// every thread points to the same sequencer
// workers is the number of threads we want to use
for (int x = 0; x < workers; x++) {
    threads.add(x, new VectorCalculationsThread(x, ..., sequencer));
    threads.get(x).start();
}
// the sequencer initializes a LinkedBlockingQueue buffer
// and keeps its size between [limitLower, limitUpper]
private final LinkedBlockingQueue<Sequence<T>> buffer;
limitLower = workers * batchSize;
limitUpper = workers * batchSize * 2;
// threads get data from the queue through
buffer.poll(3L, TimeUnit.SECONDS);

Hence, if we set the number of workers to 1, training runs in a single thread and the data is fed in exactly the same order in every training run. But note that a single thread will slow training down tremendously.
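Here is a simplified sketch of that behaviour, with integers standing in for sequences. SingleWorkerQueue and consume are hypothetical names, and unlike the real sequencer the queue is filled up front and the workers stop as soon as it is empty. With one worker the consumed order equals the queue order; with several workers the order in which items are recorded can change between runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

final class SingleWorkerQueue {
    // Starts `workers` threads that all poll the same queue and record
    // the order in which sequences are picked up for "training".
    static List<Integer> consume(List<Integer> sequences, int workers) {
        LinkedBlockingQueue<Integer> buffer = new LinkedBlockingQueue<>(sequences);
        List<Integer> trainedOrder = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int x = 0; x < workers; x++) {
            Thread t = new Thread(() -> {
                Integer seq;
                // poll() returns null once the queue is empty
                while ((seq = buffer.poll()) != null) {
                    trainedOrder.add(seq);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return new ArrayList<>(trainedOrder);
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        // One worker: always the original order.
        System.out.println(consume(data, 1));
        // Several workers: same items, but the order is not guaranteed.
        System.out.println(consume(data, 4));
    }
}
```

This is the trade-off in miniature: one worker gives a reproducible feeding order, more workers give speed but no ordering guarantee.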

Summary

To summarize, here is what we need to do to eliminate the randomness thoroughly:
1. Set the seed to a fixed value, e.g. 0;
2. Set allowParallelTokenization to false;
3. Set the number of workers (threads) to 1.

Then we get exactly the same word vectors and paragraph vectors whenever we feed in the same data.

Finally, our training code looks like this:

ParagraphVectors vec = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .labels(labelsArray)
        .layerSize(100)
        .stopWords(new ArrayList<String>())
        .windowSize(5)
        .iterate(iter)
        .allowParallelTokenization(false)
        .workers(1)
        .seed(0)
        .tokenizerFactory(t)
        .build();

vec.fit();

Please follow the next stories about word embeddings and language models; I have prepared a feast for you.
