Multi-Task Learning With TF.Keras


A year ago I posted an article showing how to build a trivial sentence breaker and tokenizer in Java with DeepLearning4J. Recently I needed to build a similar model in Python, so I decided to write a follow-up post that shows the DeepLearning4J -> Keras migration, explains some of the “DL4J vs Keras” nuances, and highlights some of the issues Keras has at the moment.

The primary goal of this small project is to get a model that can segment raw Russian/Ukrainian text into independent sentences and independent tokens. On top of that, each individual token must get a Part of Speech attribution. So, by definition, it’s a multi-task learning problem: raw text as input and 3 target functions. In terms of architecture it will be a fairly trivial recurrent model in a Many-to-Many setup.

Model architecture

Preparing the data:

Keras expects input in NTS format: [examples, timesteps, features]. Building a one-hot dataset is fairly trivial in this case: for each example in a batch I pick a number of sentences, glue them together, and split the result into individual characters. These characters are one-hot encoded, and the resulting tensor is fed into the neural network.

Dataset creation
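The original snippet isn’t reproduced here, but a minimal sketch of the idea could look like this (the alphabet, the CHAR2IDX mapping and the encode_example helper are placeholders of mine, not the original code):

```python
import numpy as np

# Hypothetical character vocabulary; the real one would cover the full
# Russian/Ukrainian alphabets, digits and punctuation.
ALPHABET = list("абвгдежзийклмнопрстуфхцчшщьэюя .,!?")
CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}

def encode_example(sentences, max_len):
    """Glue a few sentences together and one-hot encode them character by
    character into a [timesteps, features] matrix, zero-padded to max_len."""
    text = " ".join(sentences)[:max_len]
    onehot = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for t, ch in enumerate(text):
        idx = CHAR2IDX.get(ch.lower())
        if idx is not None:
            onehot[t, idx] = 1.0
    return onehot

# Stacking individual examples produces the [examples, timesteps, features]
# tensor Keras expects.
batch = np.stack([encode_example(["Привет, мир.", "Как дела?"], max_len=128)])
```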

So, I have 2 inputs, the LTR reading and the RTL reading, and exactly 3 outputs: tokenization, sentence breaks and parts of speech.

Class imbalance:

Due to the design I have, there will be a certain class imbalance: the distribution of parts of speech across typical text isn’t uniform. Sentence breaks are even rarer events: you get exactly one sentence start/sentence end pair per sentence, and lots of characters between them. It’s not really possible to solve this problem with augmentation or by manually balancing the dataset (since it’s a natural imbalance), so I had to use per-class weights.

Passing class weights into Model.fit()
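The embedded snippet isn’t included here; conceptually, passing per-class weights looks roughly like the sketch below. The weight values are placeholders, model/train_x/train_y are assumed to exist already, and depending on the TF version and the number of outputs, weights may have to be expressed differently (e.g. via sample_weight):

```python
# Rare classes (e.g. sentence start/end) get larger weights so the loss isn't
# dominated by the overwhelmingly common "nothing happens here" class.
class_weights = {
    0: 1.0,    # ordinary character
    1: 50.0,   # sentence start
    2: 50.0,   # sentence end
}

model.fit(
    train_x,
    train_y,
    batch_size=32,
    epochs=10,
    class_weight=class_weights,
)
```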

There’s a problem with this approach, though: the class_weight option is incompatible with tf.distribute.MirroredStrategy.

NotImplementedError: `class_weight` is currently not supported when using tf.distribute.Strategy.
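For reference, the combination that triggers the error is roughly the following (build_model here is just a stand-in for whatever constructs the network, not code from this post):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()  # stand-in for the model construction code
    model.compile(optimizer="adam", loss="categorical_crossentropy")

# Combining class_weight with a distribution strategy is what raises the
# NotImplementedError above; the same call works fine on a single device.
model.fit(train_x, train_y, class_weight={0: 1.0, 1: 50.0, 2: 50.0})
```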

In my case it’s not a showstopper, since the model is pretty small and I can definitely live with single-GPU training. But if someone is going to train big networks on huge datasets and rely on distributed training, that’s something to keep in mind.

Porting masking:

In DeepLearning4J, input/output masks are defined explicitly and provided at training time as part of the DataSet/MultiDataSet. In Keras you have a couple of options available. You can use the Masking layer: it “masks” timesteps that contain only mask_value, excluding them from the gradient calculation and propagating the mask through the network. The other way is to provide a mask tensor explicitly as an input and use multiplication somewhere within your model graph to nullify the gradients of unused timesteps.
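A rough sketch of the first option (the NUM_CHARS constant and layer sizes are placeholders of mine):

```python
import tensorflow as tf

NUM_CHARS = 64  # one-hot vocabulary size; a placeholder, not the real value

inputs = tf.keras.Input(shape=(None, NUM_CHARS))
# Timesteps whose feature vector consists only of mask_value (all zeros here)
# are masked out, and the mask is propagated to downstream layers that
# support it, such as LSTM.
masked = tf.keras.layers.Masking(mask_value=0.0)(inputs)
hidden = tf.keras.layers.LSTM(128, return_sequences=True)(masked)
```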

Since I’m using casual one-hot encoding, it perfectly justifies the use of the Masking layer, with only one nuance: I’d also like to use a Bidirectional layer to make classifications based on both LTR and RTL readings. Since we’re dealing with variable-length timeseries and masks, it’s easier to provide the model with manually reversed text as a second input rather than work around the issues that arise in Keras with the Bidirectional layer. Mathematically, the amount of information provided to the model this way is equal to the Bidirectional approach, and it causes exactly zero issues.
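A small helper for producing that second, reversed input could look like this (the name and signature are mine, not from the original code):

```python
import numpy as np

def reverse_within_length(batch, lengths):
    """Reverse each example along the time axis, but only within its real
    (unpadded) length, so the zero padding stays at the end and the Masking
    layer still sees it there."""
    reversed_batch = np.zeros_like(batch)
    for i, length in enumerate(lengths):
        reversed_batch[i, :length] = batch[i, :length][::-1]
    return reversed_batch
```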

Final model code:

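The original gist isn’t embedded here; a condensed sketch of a model with this shape (two masked inputs, recurrent encoding, three per-timestep softmax heads) could look like the following. All sizes are placeholders of mine, and the way the two readings are merged is a simplification, not necessarily what the original model does:

```python
import tensorflow as tf

NUM_CHARS = 64   # one-hot vocabulary size (placeholder)
NUM_POS = 20     # number of PoS classes (placeholder)

def build_model():
    ltr = tf.keras.Input(shape=(None, NUM_CHARS), name="ltr")
    rtl = tf.keras.Input(shape=(None, NUM_CHARS), name="rtl")

    def encode(x):
        x = tf.keras.layers.Masking(mask_value=0.0)(x)
        return tf.keras.layers.LSTM(128, return_sequences=True)(x)

    # In this sketch the two readings are simply concatenated per timestep.
    merged = tf.keras.layers.Concatenate()([encode(ltr), encode(rtl)])

    # Three per-timestep heads: sentence breaks, token boundaries, PoS tags.
    brk = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(3, activation="softmax"), name="brk")(merged)
    tok = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(5, activation="softmax"), name="tok")(merged)
    pos = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(NUM_POS, activation="softmax"), name="pos")(merged)

    model = tf.keras.Model(inputs=[ltr, rtl], outputs=[brk, tok, pos])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```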

After some time spent training, the model is able to generate outputs that look like this:

BRK: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 2]
POS: [ 0 0 0 0 0 0 0 0 17 7 7 7 7 7 11]
TOK: [0 1 1 1 1 1 1 2 4 0 1 1 1 1 3]

This allows me to split raw input text into sentences and tokens and attribute each token with its PoS, which is exactly what I was looking for.

Thanks for reading. Feel free to reach out to me if you have any questions :)
