Sentence Extraction with Custom Trained NLP Models
Breaking natural language text into individual sentences.
Introducing Sentence Extraction
A common task in natural language processing (NLP) is extracting sentences from natural language text. This can be a task on its own or part of a larger NLP system. There are several ways to go about sentence extraction. The naive way is to split the text wherever a period appears. That works great until you remember that periods don’t always indicate a sentence break. Other tools break text into sentences based on rules, and there is even a standard for communicating these rules called Segmentation Rules eXchange, or SRX. These rules often work very well, but they are language dependent. Additionally, implementing them can be difficult because the rules are written as regular expressions, and not every programming language’s regular expression engine supports the constructs they use.
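To see why splitting on periods breaks down, here is a minimal Python sketch of the naive approach. The function name `naive_split` and the sample text are made up for illustration; they are not part of any tool discussed in this post.

```python
def naive_split(text):
    """Split on every period, which is the naive approach to sentence extraction."""
    pieces = [piece.strip() for piece in text.split(".")]
    # Re-attach the period that split() consumed and drop empty fragments.
    return [piece + "." for piece in pieces if piece]


text = "Dr. Smith arrived at 3 p.m. on Tuesday. He left soon after."
for piece in naive_split(text):
    print(piece)
```

The sample text holds two sentences, but the naive splitter returns five fragments because it also breaks at the periods in "Dr." and "p.m.".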
Model-Based Sentence Extraction
This brings us to model-based sentence extraction. In this approach we use trained models to identify sentence boundaries in natural language text. In short, we take training text, run it through a training process, and get a model that can be used to extract sentences. A significant benefit of model-based sentence extraction is that you can adapt your model to the actual text you will be processing, which can yield much better accuracy on that text. Our NLP Building Block product, Prose Sentence Extraction Engine, uses this model-based approach.
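One common way to frame the learning problem (roughly how statistical sentence detectors such as OpenNLP's work) is to treat every period, exclamation mark, or question mark as a candidate boundary and have the model label each candidate as a real break or not. The sketch below builds such labeled examples from one-sentence-per-line text. It is a simplified illustration under our own assumptions, not Prose's actual feature set or training code; `boundary_events` and the feature names are hypothetical.

```python
EOS_CHARS = ".!?"  # characters treated as candidate sentence boundaries


def boundary_events(sentences):
    """Build (features, label) pairs from one-sentence-per-line training text."""
    events = []
    for sentence in sentences:
        sentence = sentence.strip()
        for i, ch in enumerate(sentence):
            if ch not in EOS_CHARS:
                continue
            words = sentence[:i].split()
            prefix = words[-1] if words else ""  # token right before the candidate
            features = {
                "char": ch,
                "prefix": prefix,
                "prefix_is_short": len(prefix) <= 2,
            }
            # Only the last character of a training line is a true boundary.
            label = "split" if i == len(sentence) - 1 else "no_split"
            events.append((features, label))
    return events


training_lines = ["Dr. Smith arrived at noon.", "He left soon after."]
for features, label in boundary_events(training_lines):
    print(label, features)
```

Here the period in "Dr." becomes a "no_split" example while the two line-ending periods become "split" examples, which is the kind of signal a trained model can pick up from text that resembles your own.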
Training a Custom Sentence Model with Prose Sentence Extraction Engine
Prose Sentence Extraction Engine 1.1.0 introduced the ability to create custom models for extracting sentences from natural language text. Using a custom model typically provides a much greater level of accuracy than relying on the internal Prose logic to extract sentences. Creating a custom model is fairly simple and this blog post demonstrates how to do it.
To get started we are going to launch Prose Sentence Extraction Engine via the AWS Marketplace. The benefit of doing this is that in just a few seconds (okay, maybe 30 seconds) we will have an instance of Prose fully configured and ready to go. Once the instance is up and running in EC2 we can SSH into it. (Note that the SSH username is ec2-user.) All commands presented in this post are executed through SSH on the Prose instance.
SSH to the Prose instance on EC2:
ssh -i key.pem ec2-user@<your-instance-public-dns>
Once connected, change to the Prose directory:
cd /opt/prose
Training a sentence extraction model requires training text. This text needs to be formatted with one sentence per line. This is how Prose learns to recognize a sentence for any given language. We have some training text for you to use for this example. When creating a model for production use you should use text representative of the real text that you will be processing. This gives the best performance.
Download the example training text to the instance:
wget https://s3.amazonaws.com/mtnfog-public/a-christmas-carol-sentences.txt -O /tmp/a-christmas-carol-sentences.txt
Take a look at the first few lines of the file you just downloaded. You will see that it contains one sentence per line. This file is also attached to this blog post and can be downloaded at the bottom of this post.
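If you prepare your own training text, a quick look at the file before training can save a wasted run. The following sketch previews the first lines of a training file and counts its non-empty lines. The helper name `summarize_training_file` is hypothetical and not part of Prose; the path is the one used in the download step above.

```python
from pathlib import Path


def summarize_training_file(path, preview=5):
    """Return the first `preview` non-empty lines and the total non-empty line count."""
    lines = [
        line.strip()
        for line in Path(path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return lines[:preview], len(lines)


if __name__ == "__main__":
    training_path = Path("/tmp/a-christmas-carol-sentences.txt")
    if training_path.exists():
        first_lines, total = summarize_training_file(training_path)
        print(f"{total} non-empty lines; first {len(first_lines)}:")
        for line in first_lines:
            print("  " + line)
```

Each previewed line should read as exactly one complete sentence. If a line holds several sentences or only a fragment, fix the training text before training.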
Now, edit the example training definition file:
sudo nano example-training-definition-template.xml
Set the file attribute of the trainingdata element to “/tmp/a-christmas-carol-sentences.txt” and set the output model file as shown below:
<?xml version="1.0" encoding="UTF-8"?>
<trainingdata file="/tmp/a-christmas-carol-sentences.txt" format="opennlp"/>
<model name="sentence" file="/tmp/sentence.bin" encryptionkey="random" language="eng" type="sentence"/>
This training definition says we are creating a sentence model for English (eng) text. The trained model file will be written to /tmp/sentence.bin. Now we are ready to train the model:
./bin/train-model.sh example-training-definition-template.xml
You will see some output quickly scroll by. Since the input text is rather small, training takes a few seconds at most. Your output should look similar to:
$ ./bin/train-model.sh example-training-definition-template.xml
Prose Sentence Model Generator
Beginning training using definition file: /opt/prose/example-training-definition-template.xml
2017-12-31 19:21:03,451 DEBUG [main] models.ModelOperationsUtils (ModelOperationsUtils.java:40) - Using OpenNLP data format.
2017-12-31 19:21:03,567 INFO [main] training.SentenceModelOperations (SentenceModelOperations.java:281) - Beginning sentence model training. Output model will be: /tmp/sentence.bin
Indexing events with TwoPass using cutoff of 0
Computing event counts... done. 1990 events
Collecting events... Done indexing in 0.41 s.
Incorporating indexed data for training...
Number of Event Tokens: 1990
Number of Outcomes: 2
Number of Predicates: 2274
Computing model parameters...
Performing 100 iterations.
1: . (1827/1990) 0.9180904522613065
2: . (1882/1990) 0.9457286432160804
3: . (1910/1990) 0.9597989949748744
4: . (1915/1990) 0.9623115577889447
5: . (1940/1990) 0.9748743718592965
6: . (1950/1990) 0.9798994974874372
7: . (1953/1990) 0.9814070351758793
8: . (1948/1990) 0.978894472361809
9: . (1962/1990) 0.985929648241206
10: . (1954/1990) 0.9819095477386934
20: . (1979/1990) 0.9944723618090452
30: . (1986/1990) 0.9979899497487437
40: . (1990/1990) 1.0
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1990/1990) 1.0
Compressed 2274 parameters to 707
1 outcome patterns
2017-12-31 19:21:04,491 INFO [main] manifest.ModelManifestUtils (ModelManifestUtils.java:108) - Removing existing manifest file /tmp/sentence.bin.manifest.
Sentence model generated complete. Summary:
Model file : /tmp/sentence.bin
Manifest file : sentence.bin.manifest
Time Taken : 1056 ms
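Each numbered line in the training output appears to report how many training events the model classified correctly at that iteration, followed by the resulting accuracy. As a quick worked check, this sketch parses one of those lines and recomputes the ratio. The regular expression is an assumption based only on the output shown above, and `parse_iteration` is a hypothetical helper.

```python
import math
import re

# Matches lines such as "1: . (1827/1990) 0.9180904522613065".
ITERATION_LINE = re.compile(r"^\s*(\d+):\s+\.\s+\((\d+)/(\d+)\)\s+([0-9.]+)\s*$")


def parse_iteration(line):
    """Return (iteration, correct, total, reported_accuracy), or None if no match."""
    match = ITERATION_LINE.match(line)
    if match is None:
        return None
    iteration, correct, total = (int(match.group(i)) for i in (1, 2, 3))
    return iteration, correct, total, float(match.group(4))


parsed = parse_iteration("1: . (1827/1990) 0.9180904522613065")
if parsed is not None:
    iteration, correct, total, reported = parsed
    # The reported accuracy is simply correct / total.
    print(iteration, correct / total, math.isclose(correct / total, reported))
```

In the run above, accuracy climbs from about 0.918 at iteration 1 to 1.0 by iteration 40, and training stops before the full 100 iterations. Keep in mind that this is accuracy on the training text itself, so it does not tell you how the model will do on text it has not seen.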
Our model has been created and we can now use it. First, let’s stop Prose in case it is running:
sudo service prose stop
Next, copy the model file and its manifest file to /opt/prose/models:
sudo cp /tmp/sentence.* /opt/prose/models/
Since the model file now lives in a different directory, let’s also update the model’s file name in the manifest file:
sudo nano models/sentence.bin.manifest
Change the model.filename property to be sentence.bin (remove the /tmp/). The manifest should now look like:
With our model in place, we can now start Prose. If we tail Prose’s log while it starts, we can see that it finds and loads our custom model:
sudo service prose start && tail -f /var/log/prose.log
In case you are curious, the lines in the log that show the model was loaded will look similar to these:
[INFO ] 2017-12-31 19:25:57.933 [main] ModelManifestUtils - Found model manifest ./models//sentence.bin.manifest.
[INFO ] 2017-12-31 19:25:57.939 [main] ModelManifestUtils - Validating model manifest ./models//sentence.bin.manifest.
[WARN ] 2017-12-31 19:25:57.942 [main] ModelManifestUtils - The license.key in ./models//sentence.bin.manifest is missing.
[INFO ] 2017-12-31 19:25:58.130 [main] ModelManifestUtils - Entity Class: sentence, Model File Name: sentence.bin, Language Code: en, License Key:
[INFO ] 2017-12-31 19:25:58.135 [main] DefaultSentenceDetectionService - Found 1 models to load.
[INFO ] 2017-12-31 19:25:58.138 [main] LocalModelLoader - Using local model loader directory ./models/
[INFO ] 2017-12-31 19:25:58.560 [main] ModelLoader - Model validation successful.
[INFO ] 2017-12-31 19:25:58.569 [main] DefaultSentenceDetectionService - Found sentence model for language eng
Yay! This means that Prose has started and loaded our model. Requests to Prose to extract sentences for English text will now use our model. Let’s try it:
curl http://ec2-54-174-13-245.compute-1.amazonaws.com:8060/api/sentences -d "This is a sentence. This is another sentence. This is also a sentence." -H "Content-type: text/plain"
The response we receive from Prose is:
["This is a sentence.","This is another sentence.","This is also a sentence."]
Our sentence model worked! Prose successfully took in the natural language English text and returned the three sentences that make up the text.
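The same request can be made from code. Below is a minimal Python sketch using only the standard library. It assumes the endpoint, port, and content type shown in the curl command above, and that the response is a JSON array of strings as in the sample response. The names `build_request`, `parse_response`, and `extract_sentences`, and the host `prose.example.com`, are placeholders of ours and not part of Prose.

```python
import json
import urllib.request


def build_request(host, text, port=8060):
    """Build a POST to /api/sentences with the text as a plain-text body."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/api/sentences",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def parse_response(body):
    """Decode the JSON array of sentences returned by the service."""
    sentences = json.loads(body)
    if not isinstance(sentences, list):
        raise ValueError("expected a JSON array of sentences")
    return sentences


def extract_sentences(host, text, timeout=10):
    """Send the text to a running Prose instance and return its sentences."""
    with urllib.request.urlopen(build_request(host, text), timeout=timeout) as resp:
        return parse_response(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the request; call extract_sentences() against a real instance.
    request = build_request("prose.example.com", "This is a sentence. This is another sentence.")
    print(request.get_method(), request.full_url)
```

Calling `extract_sentences` with the hostname of your own running instance should return a list like the response shown above.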
Prose Sentence Extraction Engine is available on the AWS Marketplace, Azure Marketplace, and Docker Hub. You can launch Prose Sentence Extraction Engine on any of those platforms in just a few seconds.
At the time of publishing, Prose 1.1.0 was in the process of being published to the Azure and AWS Marketplaces. If 1.1.0 is not yet available on those marketplaces, it should appear within a few days, once the update has been published.