Really great, I have some questions,
Somnath Kadam

  1. Trained on a 64GB Amazon instance on a single core. The max memory I saw the C training process use was about 18GB, but it’s the same machine used to generate the data, which requires more memory to cache the polygons. GPUs and specialized hardware are not needed for this type of model; they’re only really useful for the matrix-matrix multiplications common in neural networks.
  2. The previous tag is what the model predicted. In the greedy averaged perceptron, we take the model’s prediction as given, so essentially you just call the feature function once per token in the string, predict a label for the current token using the current estimate of the weights, and pass that prediction in to the next call (the feature is turned off for the first token since there is no previous token); see the first sketch after this list. For a simple Python implementation of the averaged perceptron, see the post I reference above: https://explosion.ai/blog/part-of-speech-pos-tagger-in-python. In the CRF model, we’re actually computing an LxL matrix of scores at each token, where L is the number of distinct labels. My solution in this case was to define two classes of weights. The local features are stored in an NxL sparse matrix (a double hash table during training), where N is the number of features. For the previous tag features, we do exactly the same thing except it’s an Nx(LxL) matrix. The weights can still be stored sparsely because we only need to update weights when there’s a mistake, and even then we only need to update two entries in the row, one for the predicted class and one for the true class. The difference is that instead of the class label being simply e.g. “tag=road”, the class labels are now e.g. “tag=road and prev_tag=house_number”.
  3. Our CRF implementation would need to be retrained from scratch when adding new samples, mostly because certain data structures needed during training are discarded and/or compacted for the runtime model. For training with averaged perceptrons, the effects of weight averaging (a form of regularization for the perceptron) on incremental training have not been studied; the second sketch after this list shows the bookkeeping involved. CRFs can also be trained with an optimizer that is more amenable to incremental training, such as stochastic gradient descent.
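To make the greedy decoding in (2) concrete, here’s a minimal Python sketch in the spirit of the averaged perceptron post linked above (not libpostal’s actual C implementation): the tagger feeds its own prediction back in as the prev_tag feature and stores weights sparsely, updating only the true and predicted labels when it makes a mistake. The class name, feature strings, and labels are made up for illustration.

```python
from collections import defaultdict


class GreedyPerceptronTagger:
    """Minimal sketch of greedy decoding with a previous-tag feature."""

    def __init__(self, labels):
        self.labels = labels
        # Sparse weights: feature string -> {label: weight}, analogous to
        # the NxL double hash table described above.
        self.weights = defaultdict(lambda: defaultdict(float))

    def features(self, tokens, i, prev_tag):
        token = tokens[i]
        feats = ["bias", "word=" + token.lower()]
        if token.isdigit():
            feats.append("is_digit")
        # The previous tag is the model's own prediction, taken as given
        # (greedy decoding). The feature is absent for the first token.
        if prev_tag is not None:
            feats.append("prev_tag=" + prev_tag)
        return feats

    def predict(self, feats):
        scores = defaultdict(float)
        for f in feats:
            for label, weight in self.weights.get(f, {}).items():
                scores[label] += weight
        return max(self.labels, key=lambda label: scores[label])

    def tag(self, tokens):
        prev_tag = None
        tags = []
        for i in range(len(tokens)):
            feats = self.features(tokens, i, prev_tag)
            prev_tag = self.predict(feats)
            tags.append(prev_tag)  # the prediction feeds the next call
        return tags

    def update(self, feats, truth, guess):
        # Mistake-driven update: only two entries per feature row change,
        # one for the true label and one for the predicted label.
        if truth == guess:
            return
        for f in feats:
            self.weights[f][truth] += 1.0
            self.weights[f][guess] -= 1.0
```

The nested dict plays the role of the double hash table; in the CRF version the previous tag moves from the feature string into the class label itself (giving the Nx(LxL) table), but the sparse, mistake-driven update pattern is the same.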
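For (3), here’s a rough sketch of the averaging bookkeeping that gets discarded at the end of training (again in Python for readability, not libpostal’s actual data structures): the running totals and timestamps needed to compute the averaged weights are training-only state, so once the weights are averaged and compacted there is nothing left to resume from, and new samples mean retraining from scratch. All names here are hypothetical.

```python
from collections import defaultdict


class AveragedWeights:
    """Sketch of lazy weight averaging for an averaged perceptron."""

    def __init__(self):
        self.weights = defaultdict(float)    # current weight per (feature, label) key
        self.totals = defaultdict(float)     # accumulated weight-steps for averaging
        self.timestamps = defaultdict(int)   # step at which each key was last updated
        self.step = 0

    def tick(self):
        # Call once per training example.
        self.step += 1

    def update(self, key, delta):
        # Lazy averaging: first credit the old weight for the steps it was
        # live, then apply the new update.
        self.totals[key] += (self.step - self.timestamps[key]) * self.weights[key]
        self.timestamps[key] = self.step
        self.weights[key] += delta

    def finalize(self):
        # Replace each weight with its average over all steps, then discard
        # the training-only accumulators. After this compaction the model
        # cannot be updated incrementally.
        averaged = {}
        for key, weight in self.weights.items():
            total = self.totals[key] + (self.step - self.timestamps[key]) * weight
            averaged[key] = total / max(self.step, 1)
        self.weights = averaged
        self.totals = None
        self.timestamps = None
        return self.weights
```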