# How to Get Deterministic word2vec/doc2vec/paragraph Vectors

Welcome to our Word Embedding Series. This post is the first story of the series. It is aimed at intermediate or more advanced readers who have trained, or at least tried to train, word2vec or doc2vec/paragraph vectors. But no worries: in the following posts I will introduce the background, the prerequisites, and how the code implements the ideas from the papers.

I will try my best not to redirect you to other links that ask you to read tedious tutorials until you give up (trust me, I am a victim of the endless online tutorials :) ). I want you to understand word vectors at the code level together with me, so that we know how to design and implement our own word embeddings and language models.

If you have ever trained word vectors yourself, you may have found that the model and the vector representations differ from one training run to the next, even when you feed in the same training data. This is because of the randomness introduced at training time. The code can speak for itself, so let’s take a look at where the randomness comes from and how to eliminate it thoroughly. I will use DL4J’s implementation of paragraph vectors to show the code. If you want to look at another package, go to gensim’s doc2vec, which is implemented in the same way.

# Where the randomness comes from

## The initialization of model weights and vector representation

Before training, the model weights and vector representations are initialized randomly, and that randomness is controlled by a seed. Hence, if we set the seed to a fixed value such as 0, we get exactly the same initialization every time. Here is the place where the seed takes effect: `syn0` holds the model weights, and it is initialized by `Nd4j.rand`.

```java
// Nd4j takes seed configuration here
Nd4j.getRandom().setSeed(configuration.getSeed());
// Nd4j initializes a random matrix for syn0
syn0 = Nd4j.rand(new int[] {vocab.numWords(), vectorLength}, rng).subi(0.5).divi(vectorLength);
```
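To see why a fixed seed pins down the initialization, here is a minimal sketch that swaps ND4J for plain `java.util.Random`. The class `SeededInit` and the method `initWeights` are made up for illustration; only the `(u - 0.5) / vectorLength` formula is taken from the snippet above.

```java
import java.util.Arrays;
import java.util.Random;

class SeededInit {
    // Fill a numWords x vectorLength matrix with (u - 0.5) / vectorLength,
    // where u is uniform in [0, 1). This mirrors the syn0 formula above but
    // uses java.util.Random instead of ND4J (an assumption for this sketch).
    static double[][] initWeights(int numWords, int vectorLength, long seed) {
        Random rng = new Random(seed);
        double[][] syn0 = new double[numWords][vectorLength];
        for (double[] row : syn0) {
            for (int j = 0; j < vectorLength; j++) {
                row[j] = (rng.nextDouble() - 0.5) / vectorLength;
            }
        }
        return syn0;
    }

    public static void main(String[] args) {
        double[][] a = initWeights(3, 4, 0L);
        double[][] b = initWeights(3, 4, 0L);
        double[][] c = initWeights(3, 4, 1L);
        System.out.println("seed 0 vs seed 0 identical: " + Arrays.deepEquals(a, b));
        System.out.println("seed 0 vs seed 1 identical: " + Arrays.deepEquals(a, c));
    }
}
```

Two calls with the same seed produce element-for-element identical matrices, which is the property the training code relies on.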

## PV-DBOW algorithm

If we use the PV-DBOW algorithm (I will explain its details in the following posts) to train paragraph vectors, then on each training iteration it randomly subsamples words from the text window to calculate and update the weights. But this randomness is not really random. Let’s take a look at the code.

```java
// next random is an AtomicLong initialized by thread id
this.nextRandom = new AtomicLong(this.threadId);
```

And `nextRandom` is used in

`trainSequence(sequence, nextRandom, alpha);`

Inside `trainSequence`, it does

`nextRandom.set(nextRandom.get() * 25214903917L + 11);`

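If we go deeper into the training steps, we find that every later value of `nextRandom` is generated the same way, by repeating the same arithmetic (the multiplier 25214903917 and the increment 11 are the constants of the linear congruential generator in `java.util.Random`; see the Java API reference at the end of this post). So the sequence depends only on the thread id, and the thread ids are 0, 1, 2, 3, …. Hence, it’s no longer random.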
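The update is easy to replay outside of DL4J. The following sketch applies the same line repeatedly, starting from a thread id; the class `ThreadSeededSequence` and the method `sequenceFor` are hypothetical names used only here.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

class ThreadSeededSequence {
    // Apply the update quoted above `steps` times, starting from the thread id,
    // and record every value that nextRandom takes.
    static long[] sequenceFor(int threadId, int steps) {
        AtomicLong nextRandom = new AtomicLong(threadId);
        long[] values = new long[steps];
        for (int i = 0; i < steps; i++) {
            nextRandom.set(nextRandom.get() * 25214903917L + 11);
            values[i] = nextRandom.get();
        }
        return values;
    }

    public static void main(String[] args) {
        // The same thread id always yields the same sequence.
        System.out.println(Arrays.toString(sequenceFor(0, 5)));
        System.out.println(Arrays.toString(sequenceFor(0, 5)));
        // A different thread id yields a different, but equally fixed, sequence.
        System.out.println(Arrays.toString(sequenceFor(1, 5)));
    }
}
```

For thread id 0 the first two values are 11 and 277363943098; later values wrap around on `long` overflow, which is still fully deterministic.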

## Parallel tokenization

Tokenization can run in parallel: processing complicated text can be time-consuming, so tokenizing in parallel helps performance, but consistency between training runs is no longer guaranteed. The sequences produced by the tokenizer can be fed to the training threads in a different order each time. As you can see from the code, if we set `allowParallelBuilder` to false, the `runnable` that does the tokenization is awaited until it finishes, so the order in which the data is fed is preserved.

```java
if (!allowParallelBuilder) {
    try {
        runnable.awaitDone();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}
```

## Queue that provides sequences to every thread to train

This `LinkedBlockingQueue` gets sequences from the iterator over the training text and provides them to each thread. Since the threads poll the queue in no fixed order, each thread can receive different sequences in every training run. Let’s look at the implementation of this data provider.

```java
// initialize a sequencer to provide data to threads
val sequencer = new AsyncSequencer(this.iterator, this.stopWords);

// all threads point to the same sequencer
// workers is the number of threads we want to use
for (int x = 0; x < workers; x++) {
    threads.add(x, new VectorCalculationsThread(x, ..., sequencer));
    threads.get(x).start();
}

// sequencer will initialize a LinkedBlockingQueue buffer
// and maintain the size between [limitLower, limitUpper]
private final LinkedBlockingQueue<Sequence<T>> buffer;
limitLower = workers * batchSize;
limitUpper = workers * batchSize * 2;

// threads get data from the queue through
buffer.poll(3L, TimeUnit.SECONDS);
```

Hence, if we set the number of workers to 1, training runs in a single thread and the data is fed in exactly the same order in every run. But note that a single thread will slow training down tremendously.
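The effect of a shared queue can be reproduced with a small stand-alone sketch. It is a simplification, not DL4J code: the queue is filled up front instead of being refilled between `limitLower` and `limitUpper`, the "sequences" are just integer ids, and the class `SharedQueueDemo` and the method `consume` are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class SharedQueueDemo {
    // Drain `items` sequence ids from one shared queue using `workers` threads
    // and return, for each worker, the ids it happened to receive.
    static List<List<Integer>> consume(int items, int workers) {
        LinkedBlockingQueue<Integer> buffer = new LinkedBlockingQueue<>();
        for (int i = 0; i < items; i++) {
            buffer.add(i);
        }
        List<List<Integer>> received = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            List<Integer> mine = new ArrayList<>(); // written by one thread only
            received.add(mine);
            Thread t = new Thread(() -> {
                try {
                    Integer id;
                    // Stop once the queue has stayed empty for the whole timeout.
                    while ((id = buffer.poll(100L, TimeUnit.MILLISECONDS)) != null) {
                        mine.add(id);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println("1 worker:  " + consume(8, 1));
        System.out.println("4 workers: " + consume(8, 4));
        System.out.println("4 workers: " + consume(8, 4));
    }
}
```

With one worker the ids always arrive in insertion order. With four workers every id is still consumed exactly once, but which worker receives which id depends on thread scheduling and may change between runs.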

## Summary

To summarize, this is what we need to do to eliminate the randomness thoroughly:
1. Set the seed to 0;
2. Set `allowParallelTokenization` to false;
3. Set the number of workers (threads) to 1.

Then we get exactly the same word vectors and paragraph vectors whenever we feed in the same data.

Finally, our training code looks like this:

```java
ParagraphVectors vec = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .labels(labelsArray)
        .layerSize(100)
        .stopWords(new ArrayList<String>())
        .windowSize(5)
        .iterate(iter)
        .allowParallelTokenization(false)
        .workers(1)
        .seed(0)
        .tokenizerFactory(t)
        .build();

vec.fit();
```
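One way to confirm that these settings took effect is to train twice and compare the resulting vectors. The sketch below is not part of DL4J: it assumes you have already exported the vector of the same word or label from two runs as plain `double[]` arrays, and the class name, helper names and numbers are made up for illustration.

```java
import java.util.Arrays;

class DeterminismCheck {
    // True only if both vectors match element for element.
    static boolean identical(double[] a, double[] b) {
        return Arrays.equals(a, b);
    }

    // Largest absolute element-wise difference; useful for seeing how far
    // two runs drift apart when they are not identical.
    static double maxAbsDiff(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        // Placeholder values standing in for one vector from each run.
        double[] run1 = {0.12, -0.03, 0.40};
        double[] run2 = {0.12, -0.03, 0.40};
        System.out.println("identical: " + identical(run1, run2));
        System.out.println("max abs diff: " + maxAbsDiff(run1, run2));
    }
}
```

With the three settings above, the expectation is an exact match; a nonzero difference suggests that some source of randomness is still active.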

Please follow the next stories about word embeddings and language models; I have prepared a feast for you.

## References

1. Deeplearning4j, ND4J, DataVec and more: deep learning & linear algebra for Java/Scala with GPUs + Spark, from Skymind. http://deeplearning4j.org https://github.com/deeplearning4j/deeplearning4j
2. Java™ Platform, Standard Edition 8 API Specification. https://docs.oracle.com/javase/8/docs/api/
