Structuring unstructured documents

Nested Bi-LSTM for Supervised Document Parsing

A Deep Learning based supervised approach for text segmentation

Wacim Belblidia
Illuin

--

Parsing a document consists of splitting its text into several blocks of semantic consistency. This classical NLP task is often referred to as text segmentation.

Parsing a document in order to extract structured information is a key challenge for many organisations that either handle a constant document stream (insurance companies, health administrations) or sit on tremendous legacy document bases (law firms, consultancy practices, etc.).

This article walks through the implementation and training of a Deep Learning model that achieves document parsing on Wikipedia articles.

Several approaches

Various methods can be imagined to parse a text into blocks.

Rule based

Rule-based methods consist of manually crafted strategies: looking for keywords, seeking numbering alongside titles, extracting styles to identify section headers, or looking for patterns in the document organisation. These simple rules can achieve satisfactory performance on documents that share a common formatting structure.
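
As an illustration, here is a minimal sketch of such a rule, assuming block headers are numbered headings or short all-caps lines (the pattern and helper name are hypothetical, not taken from any production system):

    import re

    # Hypothetical rule: a line starts a new block if it looks like a numbered
    # heading ("3.", "2.1", "IV.") or is a short line written in capital letters.
    HEADING_PATTERN = re.compile(r"^(\d+(\.\d+)*\.?|[IVXLC]+\.)\s+\S+")

    def starts_new_block(line: str) -> bool:
        line = line.strip()
        if HEADING_PATTERN.match(line):
            return True
        return line.isupper() and 0 < len(line.split()) <= 10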

Content-based

As a human would, the natural approach to parsing a document is to read its content and detect when that content changes. Detecting semantic ruptures is very powerful, but it requires understanding the text's meaning.

Such understanding can be achieved with Machine Learning techniques such as LDA (Latent Dirichlet Allocation) for topic modelling, or TF-IDF (Term Frequency, Inverse Document Frequency) and word embedding methods, which both build vectorized text representations.

A simple way to distinguish the blocks would then be to train a clustering algorithm such as K-Means or DBSCAN, and identify meaningful clusters of sentences.
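
A minimal sketch of this idea, assuming scikit-learn's TfidfVectorizer and KMeans (the toy sentences are purely illustrative):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    sentences = [
        "The plant grows best in full sunlight.",
        "Water it twice a week during summer.",
        "The company was founded in 1998.",
        "It went public three years later.",
    ]

    # Vectorize each sentence with TF-IDF, then group the sentences into k clusters.
    vectors = TfidfVectorizer().fit_transform(sentences)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    # A block boundary can then be placed wherever the cluster label changes.
    boundaries = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    print(boundaries)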

Supervised text segmentation

In this approach, an algorithm is trained in a supervised fashion to detect separations in a document. The document is walked through sentence by sentence, and for each sentence the model predicts whether this piece of text starts a new block or not. This is the approach we selected.

Very few annotated corpora are available in the open-source community. By using a semi-supervised learning paradigm, knowledge learnt on generic datasets can hopefully be transferred to domain-specific corpora, possibly with only a very small number of annotated documents.

Our approach

We implemented a supervised text segmentation approach proposed in March 2018 by a team from Tel Aviv University in Koshorek et al., Text Segmentation as a Supervised Task.

Data

This approach uses the English Wikipedia corpus. The title headers of top-level sections are used to define the block boundaries. The first sentence of each section is labelled as a 1, and the other sentences are labelled as 0. Of course, the headers’ text is removed, so that we only focus on the semantic ruptures from one paragraph to the next.

The learning target for each article is then a sequence of 0s and 1s. The research team released already-processed Wikipedia corpuses containing the articles with highlighted block separations. The full dataset, Wiki727k, includes 727k Wikipedia articles. For computational reasons, we use only 10% of this huge dataset.
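
To make the labelling concrete, here is a minimal sketch of how an article, seen as a list of sections with headers already removed, can be turned into a training sample (the function name is hypothetical):

    def sections_to_sample(sections):
        """Flatten a list of sections (each a list of sentences, headers already
        removed) into one sentence sequence and its 0/1 boundary labels."""
        sentences, labels = [], []
        for section in sections:
            for i, sentence in enumerate(section):
                sentences.append(sentence)
                labels.append(1 if i == 0 else 0)
        return sentences, labels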

As a first approximation, overly long documents are not processed, as they bring both convergence and memory challenges. We used the 95% quantile as an upper limit on article length, so that every article longer than 150 sentences is discarded.

The corpus is randomly split into a training set (80%) and a validation set (20%). The complementary dataset Wiki50, containing only 50 articles, will be used as a test set.

Preprocessing

Classical preprocessing steps are applied:

  • The text is lowercased for creating the embedding representations.
  • Tokenization. Two tokenization processes are applied: tokenizing articles into sentences, and sentences into words. We used NLTK’s WordPunctTokenizer and Punkt sentence tokenizer.
  • Digit characters are replaced by the special token “DIGIT”. For instance, the year “2019” will be replaced by “DIGIT DIGIT DIGIT DIGIT”. This way, we extract only the formatting information from these tokens, not the value.

For instance, this Wikipedia article is represented as follows:

Representation of one training sample as an aligned sequence of sentences and labels.
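
These steps can be sketched as follows, assuming NLTK's pre-trained Punkt model (via sent_tokenize) and WordPunctTokenizer:

    import re
    from nltk.tokenize import WordPunctTokenizer, sent_tokenize

    word_tokenizer = WordPunctTokenizer()

    def preprocess(article: str):
        """Split the article into sentences, lowercase them, mask digit
        characters with the DIGIT token, then tokenize each sentence into words."""
        tokenized_sentences = []
        for sentence in sent_tokenize(article):  # pre-trained Punkt model
            sentence = re.sub(r"[0-9]", " DIGIT ", sentence.lower())
            tokenized_sentences.append(word_tokenizer.tokenize(sentence))
        return tokenized_sentences

    # preprocess("The film was released in 2019. It was a hit.")
    # -> [['the', 'film', 'was', 'released', 'in', 'DIGIT', 'DIGIT', 'DIGIT', 'DIGIT', '.'],
    #     ['it', 'was', 'a', 'hit', '.']]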

Vocabulary management

As in every Deep Learning NLP application, words are represented by their IDs in a vocabulary and then projected as continuous vectors through an embedding matrix. Correctly managing the size of our vocabulary is critical, as it directly impacts the number of trainable weights. Training the word embedding is a major part of the network’s convergence effort, and an overly large vocabulary might lead to poor convergence and, most likely, to overfitting.

To keep the vocabulary small enough, we used GloVe embeddings, which offer a pre-trained English word embedding space of dimension 50 (against 300 for Word2Vec or FastText embeddings).

In order to take the dataset’s specifics into account, we built a hybrid vocabulary between the GloVe pre-trained embedding and the corpus words. To do so, the vocabulary combines the top 30k words of GloVe with the top 5k words from the dataset (excluding those already in GloVe).
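
A minimal sketch of this hybrid vocabulary construction (the padding and unknown tokens, and the assumption that the GloVe word list comes sorted by frequency, are ours):

    from collections import Counter

    def build_vocabulary(glove_words, corpus_counts: Counter,
                         n_glove=30_000, n_corpus=5_000):
        """Combine the top GloVe words with the most frequent corpus words
        that GloVe does not already cover, and map each word to an id."""
        kept_glove = list(glove_words)[:n_glove]
        glove_set = set(kept_glove)
        extra = [w for w, _ in corpus_counts.most_common() if w not in glove_set]
        vocab = ["<PAD>", "<UNK>"] + kept_glove + extra[:n_corpus]
        return {word: idx for idx, word in enumerate(vocab)}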

The same approach has been adopted in another ILLUIN Technology research work, detailed below.

Model

To solve this task, we trained a Bi-LSTM based neural network that learns a sentence embedding and predicts sentence-wise probabilities of starting a new block.

Architecture

Each sentence is projected into a sentence embedding and then processed by an LSTM layer. The sentence embedding is learnt from the word-level embeddings: a first recurrent network processes every word of the input sentence through two Bi-LSTM layers. The output sequence is max-pooled to remove the dependency on sentence length. This way, sentence embedding vectors have a fixed size.

The computed sentence embeddings are then passed as input to a second recurrent neural network: two Bi-LSTM layers process the sequence of sentence vectors, and the LSTM output at each time step feeds a final classification layer (sigmoid activation) that predicts the binary label of the corresponding sentence.

Embedding and LSTM dimensions are set to 50.

Nested Bi-LSTM for text segmentation

This architecture has been implemented in Keras. An overview of the network implementation is shown below.
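
Here is a minimal Keras sketch of this nested architecture, not the exact implementation: the vocabulary size constant and the use of GlobalMaxPooling1D for the max-pooling step are our assumptions.

    from tensorflow.keras import layers, models

    DIM = 50             # embedding and LSTM dimensions, as stated above
    VOCAB_SIZE = 35_000  # assumption: ~30k GloVe words + 5k corpus words

    # Word-level encoder: two Bi-LSTM layers over the words of one sentence,
    # max-pooled over time to obtain a fixed-size sentence embedding.
    word_ids = layers.Input(shape=(None,), dtype="int32", name="word_ids")
    x = layers.Embedding(input_dim=VOCAB_SIZE, output_dim=DIM)(word_ids)
    x = layers.Bidirectional(layers.LSTM(DIM, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(DIM, return_sequences=True))(x)
    sentence_embedding = layers.GlobalMaxPooling1D()(x)
    sentence_encoder = models.Model(word_ids, sentence_embedding, name="sentence_encoder")

    # Sentence-level network: the encoder is applied to every sentence of the
    # article, then two Bi-LSTM layers and a sigmoid head predict, for each
    # sentence, the probability that it starts a new block.
    article = layers.Input(shape=(None, None), dtype="int32", name="article")
    s = layers.TimeDistributed(sentence_encoder)(article)
    s = layers.Bidirectional(layers.LSTM(DIM, return_sequences=True))(s)
    s = layers.Bidirectional(layers.LSTM(DIM, return_sequences=True))(s)
    probabilities = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(s)
    model = models.Model(article, probabilities)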

Training

We ran the training with the following hyperparameters:

  • Batch size: 16
  • Learning rate: 0.002
  • Optimizer: Adam
  • Number of iterations: 56,000

The network is trained by minimizing an element-wise binary cross-entropy objective function:

Loss function: averaged binary cross-entropy, starting from sentence nº2. Yi: ground truth, Pi: predicted probability.

Each timestep contributes through its negative log-likelihood. The first sentence is always labeled as 1 because it starts the first block, so the first timestep is not taken into account in the loss function. The loss is averaged so that it does not depend on the number of sentences in the article.
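
Written out, with y_i the ground truth label of sentence i, p_i the predicted probability, and n the number of sentences in the article, the loss described above reads:

    \mathcal{L} = -\frac{1}{n-1} \sum_{i=2}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]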

Batch generation also has its specifics, as it requires a double padding process. Indeed, a batch contains b articles, each of them containing a variable number of sentences n_sentences, and each sentence contains a variable number of words n_words. Hence, a batch is represented by a tensor of shape (b, max_n_sentences, max_n_words), where max_n_sentences is the highest number of sentences contained in one article (among the b articles), and max_n_words is the length of the longest sentence (token-wise) among all the sentences of the b articles.

Padding tokens are used to fill the blanks in these tensors, and they are ignored by the loss function during back-propagation.
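
A minimal sketch of this double padding, assuming the padding id is 0 (labels are padded in the same fashion along the sentence axis):

    import numpy as np

    def pad_batch(articles, pad_id=0):
        """Stack b articles (lists of sentences, each a list of token ids) into
        one tensor of shape (b, max_n_sentences, max_n_words) filled with pad_id."""
        max_n_sentences = max(len(article) for article in articles)
        max_n_words = max(len(sent) for article in articles for sent in article)
        batch = np.full((len(articles), max_n_sentences, max_n_words), pad_id, dtype="int32")
        for i, article in enumerate(articles):
            for j, sentence in enumerate(article):
                batch[i, j, :len(sentence)] = sentence
        return batch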

Results

Network convergence

The convergence curves can be observed below. The accuracy metric does not make much sense for this task, as most labels are 0. Instead, let's dive into the loss / precision / recall / F1 metrics for the training and validation phases, to get a sense of the network's performance.

Overfitting occurs after epochs 70–80. Our final model is selected with an early stopping process. The final model can be picked following two main options:

  • Selecting the epoch at which the loss value is minimal
  • Selecting the epoch where the loss value lies within a minimal interval of, let's say, min_loss +/- 10%, while maximizing the validation F1 score. Indeed, the convergence curves show that over a large epoch range (epochs 30 to 70), the loss stays quite close to its minimal value while the F1 score still rises. A small selection sketch follows this list.
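
A minimal sketch of this second selection criterion, assuming the training history exposes per-epoch val_loss and val_f1 values (the latter requires logging a custom F1 metric):

    def select_epoch(history, tolerance=0.10):
        """Pick the epoch with the best validation F1 among the epochs whose
        validation loss stays within (1 + tolerance) of the minimum loss."""
        losses, f1s = history["val_loss"], history["val_f1"]
        ceiling = min(losses) * (1 + tolerance)
        candidates = [e for e, loss in enumerate(losses) if loss <= ceiling]
        return max(candidates, key=lambda e: f1s[e])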

Moreover, to assess the model's performance, the classification threshold has to be tuned, as its value strongly influences the metrics. To choose the right threshold, we will use the Pk score (defined below).

Metrics assessment

The Pk score is mentioned in the paper as the main metric for text segmentation tasks.

Pk is the probability that when passing a sliding window of size k over sentences, the sentences at the boundaries of the window will be incorrectly classified as belonging to the same segment (or vice versa).

Following (Glavas et al., 2016), we set k to half of the average segment size in the ground-truth segmentation.

We will use it to compare our model’s performance with the reference implementation, and also to set the classification threshold.

The lower the Pk value is, the better.
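
Pk can be computed directly with NLTK, as in this small sketch where segmentations are encoded as strings of per-sentence 0/1 labels (the helper name and the default threshold are ours):

    from nltk.metrics.segmentation import pk

    def pk_score(true_labels, probabilities, threshold=0.3):
        """Binarize the predicted probabilities and compute Pk with NLTK;
        k defaults to half of the average ground-truth segment size."""
        reference = "".join(str(label) for label in true_labels)
        hypothesis = "".join("1" if p >= threshold else "0" for p in probabilities)
        return pk(reference, hypothesis, boundary="1")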

To tune the classification threshold, we computed Pk on the validation set for various values of the threshold. Results are shown below.

Pk score for different classification thresholds.

0.3 seems like an optimal threshold value for this binary classification task.

The final performance is assessed on the Wiki50 dataset, used as an evaluation set in the paper.

Our trained implementation reached a very satisfactory Pk score of 24.10% on the Wiki50 test set. This score does not match the reported state of the art (18.24%), but it still shows a massive improvement over the random baseline (52.65%) and the GraphSeg model (63.56%) (both mentioned in the paper).

Several factors explain this performance difference:

  • We trained on only a small part of the dataset (~10%).
  • We used GloVe embeddings instead of Word2Vec (because they are available in a smaller dimension).
  • Our embedding method differs: we tried a custom method for merging the pre-trained and dataset vocabularies in order to properly manage OOV words.
  • Small implementation details might differ (Batch Normalization, Dropout, etc.).

More research work is underway to implement a fully pre-trained embedding method with a bigger training infrastructure.

Moreover, our trained model achieves an F1 score of 61.15% on the Wiki50 dataset.

Segmentation examples

Let’s get a concrete feel for the model’s performance. The following examples show the original article, the parsed version of the text, and a macro segmentation view, for a few random articles from the Wiki50 dataset. We picked a wide range of article sizes and themes.

On the macro visualisations, the ground truth article separators are shown in orange, and the predicted separators in blue. The chosen threshold (0.3) is also displayed.

Example 1: Beer Checkers game article

https://en.wikipedia.org/wiki/Beer_checkers

Parsed text

Parsing visualization. The ground truth separators are perfectly predicted with pretty good confidences.

Example 2: Jason Joseph Euell article

https://en.wikipedia.org/wiki/Jason_Euell
[Note: The current version of the article is slightly different. It might have changed after the Wiki50 dataset creation]

Parsed version

Parsing visualization (a much longer article). Most of the ground truth separators are predicted (7/10), though the model lacks confidence. Two false positives are also predicted.

Example 3: Microsoft BASIC

https://en.wikipedia.org/wiki/Microsoft_BASIC

Parsed text

Parsing visualization.

Wrap up & Next steps

We walked through the main steps of our research work on supervised text segmentation, inspired by a 2018 paper from Tel Aviv University. Our implementation differs slightly from the paper's in its dataset size and its embedding process. We still achieved a good Pk score of 24.10%, approaching the 18.24% state of the art.

More research is underway to train our models on the full dataset and with matching implementation choices.

Next steps are to push the text segmentation capability further by implementing more advanced networks, including for instance attention mechanisms or pointer network modules, such as the SegBot model from NTU.

References

Similar research work is underway at ILLUIN Technology, and some of it is already shared on our blog page.

Our team is constantly growing. Interested? We’re hiring!
https://www.illuin.tech
