Forte: Building Modular and Re-purposable NLP Pipelines
Authors: Hector Liu, Xin Gao and Petuum CASL Team
Natural Language Processing (NLP) is the science and engineering behind AI applications that interpret and respond to human language. Such applications can help with day-to-day problems: highlighting key information in clinical notes for medical practitioners, providing interactive medical advice about COVID and future pandemics through web and mobile applications, pre-filling clinical reports to improve operational consistency in healthcare processes, and building searchable “webs” of information from companies’ annual financial reports with knowledge-graph bots.
To apply NLP technologies to real-world applications, like a clinical report management system, one often needs to “stitch together” several NLP tools, such as a Text Search System to retrieve relevant reports, a Named Entity Recognizer to identify key entities like Symptom or Time, and an Entity Relation Extractor that associates diseases with their causes and treatments. That’s because no single tool handles the full range of NLP sub-tasks. But mixing NLP tools can lead to frustration. Here are some problems we’ve faced:
- Spending days or weeks harmonizing different tokenization schemes, because every NLP tool has its own schema requirements
- Trying to debug which NLP model is producing the wrong data outputs when the NLP pipeline is scattered across several scripts and code files. It could take days just to write the necessary debug code
- Planning to swap to a newer NLP model, only to find that many parts of the NLP pipeline need to be rewritten
- Applying similar NLP pipelines to different domains, e.g. healthcare and finance, only to find that the new domains’ data formats do not work out of the box with your favorite NLP tools, forcing you to insert hacks throughout your NLP pipeline. Code quality suffers, and over time you wind up having to maintain two separate codebases.
Stitching NLP tools into functioning NLP pipelines turns out to be quite hard! But, what if we could do most of the heavy lifting with one code statement, like this:
That’s a code example from Forte, a new data-centric pipeline builder from Petuum and our CASL team. Forte takes care of data interoperability between different NLP tools based on one simple idea: we simply need one central data format that works with many different tools.
To this end, Forte introduces the data pack, a carefully designed data structure that is rich enough to represent NLP information. Think of it like this: you now have a Pandas data-frame-like object for unstructured text. Using this data pack, Forte allows you to compose an entire NLP pipeline in a few lines of Python code, or extend that pipeline gracefully to different domains. Data packs are universal in their design, allowing you to build NLP pipelines where processors (such as your favorite NLP tools or AI models) can be easily swapped. This encourages component reusability, allows NLP tools to be seamlessly integrated with custom code, and ultimately gives you greater flexibility (and less frustration) when building and repurposing NLP pipelines.
Why use Forte?
Breaks down complex problems into composable pipelines
Forte can stitch together different modules or tools to construct a composable NLP pipeline, broken down into tasks that can each be solved by an individual module. For example, a pipeline can combine data readers that read from complex data formats (e.g., HTML, CSV), NLP processors (e.g., Named Entity Recognizer, Sentiment Analyzer), and downstream consumers (e.g., visualization, serialization).
Enables full interoperation across tasks through a unified data format
Forte makes shared modeling approaches possible across different NLP tasks by introducing a data ontology system that unifies input/output representations. Forte can spawn models or wrap the same toolkit on different tasks with minimal modification. For example, Forte reduces the number of lines of code by 65%, and only 8 lines of code need to change to switch from training a Named Entity Recognition (NER) task to a Part-of-Speech (POS) task.
Supports easy customization for different domains
Forte helps users define new data types easily by establishing a unique inheritance strategy through its ontology system. Users can freely add custom data types to fulfill some domain-specific needs. For example, a user can add a few attributes on top of a template data type like EntityMention to introduce a new domain-specific data type like MedicalEntityMention, to store fine-grained attributes such as Unified Medical Language System (UMLS) links.
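The inheritance idea above can be sketched in plain Python. This is a minimal illustration, not Forte's actual ontology code: the two classes below are hypothetical stand-ins built with standard dataclasses, and the attribute names `ner_type` and `umls_links` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-in for a template data type; not Forte's real class.
@dataclass
class EntityMention:
    begin: int          # character offset where the mention starts
    end: int            # character offset where the mention ends (exclusive)
    ner_type: str = ""  # coarse entity label, e.g. "Disease"


# Domain-specific type: inherits the span and label, adds one attribute.
@dataclass
class MedicalEntityMention(EntityMention):
    umls_links: List[str] = field(default_factory=list)


text = "axillary vein thrombosis"
mention = MedicalEntityMention(begin=0, end=len(text), ner_type="Disease")
# Placeholder value for illustration only; not a real UMLS identifier.
mention.umls_links.append("UMLS:PLACEHOLDER")
```

Because `MedicalEntityMention` is still an `EntityMention`, any component that only knows about the template type can keep working on the new type unchanged.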
Improves debuggability with data packs
Forte helps demystify data lineage and increases the traceability of how data flows along the pipeline and how features are generated to connect data to models. Similar to a cargo ship that loads and transports goods from one port to another, a data pack carries information as it passes through each module and updates the ontology states along the way.
Compatible with popular 3rd party libraries
Forte wraps a rich collection of popular 3rd party NLP toolkits, including AllenNLP, HuggingFace, SpaCy, NLTK, Stanza, etc., so that users can call these libraries directly through Forte. Forte also provides easy-to-use readers for common NLP data inputs, including HTML, plain text and CoNLL. Forte aims to provide a one-stop shop where users can find all useful off-the-shelf tools. With the aid of Forte, users can be freed from cumbersome bookkeeping jobs to focus on more important tasks.
You can start using the Forte library by:
- Installing it through PyPI
$ pip install forte
- Or you can install the bleeding-edge version
$ git clone https://github.com/asyml/forte.git
$ cd forte && pip install .
Build a Clinical Note Analyzer
We’ll use an example to walk you through how to build a clinical note analysis pipeline with Forte. Building such a pipeline involves three main steps (refer to Figure 1):
- Load HTML pages as Data Packs with Reader
- Build the data pack processors
- Assemble the pipeline and see it running
1. Start with the Reader
First, we’ll set up an HTML Reader, the starting point for building a workflow. A Reader ingests data of any format from external sources, parses the information, and converts it into data packs. For example, we can use the HTMLReader provided by Forte, which cleans up the HTML tags from web pages.
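To make the reader's job concrete, here is a minimal sketch of tag stripping using only Python's standard `html.parser` module. It does not use or reproduce Forte's HTMLReader: the function `read_html` and the dictionary it returns are hypothetical stand-ins for a reader and a data pack.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of script/style tags."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse whitespace left by removed tags.
        return " ".join(" ".join(self._chunks).split())


def read_html(html: str) -> dict:
    """Hypothetical reader: returns a dict standing in for a data pack."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return {"text": extractor.text(), "annotations": []}


pack = read_html(
    "<html><body><p>She does not have <b>SVC syndrome</b></p></body></html>"
)
print(pack["text"])
```

The point of the sketch is the contract rather than the parsing details: whatever the source format, the reader hands downstream components clean text plus an empty slot for annotations.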
2. Build the Processor
A pipeline usually consists of multiple processors; as data flows along the pipeline, each processor can use or add information in the data pack. Below is a sketch implementation of our Medical Entity Detector.
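As a rough stand-in for such a processor, the sketch below tags known terms by dictionary lookup and writes the resulting spans back into the pack. The classes `Span`, `MiniPack`, and this toy `MedicalEntityDetector` are assumptions made for illustration; they do not use Forte's processor base classes, and a real detector would typically wrap a trained model rather than a lexicon.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    begin: int   # start character offset
    end: int     # end character offset (exclusive)
    label: str   # entity label, e.g. "Disease"


@dataclass
class MiniPack:
    """Hypothetical stand-in for a data pack: text plus annotations."""
    text: str
    annotations: List[Span] = field(default_factory=list)


class MedicalEntityDetector:
    """Toy processor: exact, case-insensitive lexicon matching."""

    def __init__(self, lexicon: Dict[str, str]):
        self.lexicon = {term.lower(): label for term, label in lexicon.items()}

    def process(self, pack: MiniPack) -> None:
        # Assumes text where lower() keeps character offsets unchanged.
        lowered = pack.text.lower()
        for term, label in self.lexicon.items():
            start = lowered.find(term)
            while start != -1:
                pack.annotations.append(Span(start, start + len(term), label))
                start = lowered.find(term, start + len(term))
        # Keep annotations in reading order for downstream components.
        pack.annotations.sort(key=lambda span: span.begin)


pack = MiniPack("she does not have SVC syndrome from an axillary vein thrombosis")
detector = MedicalEntityDetector(
    {"SVC syndrome": "Disease", "axillary vein thrombosis": "Disease"}
)
detector.process(pack)
for span in pack.annotations:
    print(span.label, pack.text[span.begin:span.end])
```

The processor reads from the pack and adds to it without returning a new object, which is the pattern the paragraph above describes.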
3. See the Pipeline Running in Action
After building the components we need, creating the pipeline takes little effort. In the snippet below, we piece together a couple of ready-to-use modules such as Tokenizer, NER, Entity Linker, and Relation Extractor.
Now that the pipeline is ready, we can see it in action:
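The assemble-then-run pattern can be sketched with a few lines of plain Python. This is an illustration under stated assumptions, not Forte's pipeline class: `MiniPipeline`, its `set_reader`, `add`, and `run` methods, and the three toy components are all hypothetical names chosen for the example.

```python
class MiniPipeline:
    """Hypothetical pipeline: one reader followed by ordered processors."""

    def __init__(self):
        self._reader = None
        self._processors = []

    def set_reader(self, reader):
        self._reader = reader
        return self  # returning self allows chained configuration

    def add(self, processor):
        self._processors.append(processor)
        return self

    def run(self, source):
        if self._reader is None:
            raise ValueError("set_reader() must be called before run()")
        pack = self._reader(source)
        for processor in self._processors:
            processor(pack)  # each processor updates the shared pack in place
        return pack


def plain_text_reader(source: str) -> dict:
    # A dict stands in for the data pack in this sketch.
    return {"text": source, "tokens": [], "entities": []}


def whitespace_tokenizer(pack: dict) -> None:
    pack["tokens"] = pack["text"].split()


def keyword_entity_tagger(pack: dict) -> None:
    # Consumes the tokens written by the previous processor.
    known_terms = {"thrombosis"}
    pack["entities"] = [t for t in pack["tokens"] if t.lower() in known_terms]


pipeline = (
    MiniPipeline()
    .set_reader(plain_text_reader)
    .add(whitespace_tokenizer)
    .add(keyword_entity_tagger)
)
result = pipeline.run("axillary vein thrombosis")
print(result["entities"])
```

Since every component only talks to the shared pack, swapping the tokenizer or the tagger for another implementation means changing a single `add(...)` call.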
The resulting data pack contains the processed NLP results, along with the original text. Visually, it looks like the following example:
- Raw text: “she does not have SVC syndrome from an axillary vein thrombosis”
- Parent: SVC syndrome
- Child: axillary vein thrombosis
- Relation Type: CAUSED_BY
Most of the data can be obtained through the `data_pack.get` function, which makes it convenient to browse the data pack.
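The idea of browsing a pack by entry type can be sketched as follows. This is a simplified model and not Forte's implementation: `MiniPack`, `EntityMention`, and `RelationLink` here are hypothetical stand-ins, and the `get` method is assumed to simply yield every stored entry of the requested type.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Type, TypeVar

T = TypeVar("T")


@dataclass
class EntityMention:
    begin: int
    end: int


@dataclass
class RelationLink:
    parent: EntityMention
    child: EntityMention
    rel_type: str


@dataclass
class MiniPack:
    """Hypothetical stand-in for a data pack with typed entries."""
    text: str
    entries: List[object] = field(default_factory=list)

    def get(self, entry_type: Type[T]) -> Iterator[T]:
        # Yield entries of the requested type in insertion order.
        for entry in self.entries:
            if isinstance(entry, entry_type):
                yield entry


text = "she does not have SVC syndrome from an axillary vein thrombosis"
pack = MiniPack(text)

parent_start = text.index("SVC syndrome")
parent = EntityMention(parent_start, parent_start + len("SVC syndrome"))
child_start = text.index("axillary vein thrombosis")
child = EntityMention(child_start, child_start + len("axillary vein thrombosis"))
pack.entries += [parent, child, RelationLink(parent, child, "CAUSED_BY")]

for link in pack.get(RelationLink):
    parent_text = text[link.parent.begin:link.parent.end]
    child_text = text[link.child.begin:link.child.end]
    print(parent_text, link.rel_type, child_text)
```

Entries store offsets into the original text rather than copies of it, so the raw text and every annotation layer stay aligned in one object.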
Try out more Examples
Visit our GitHub page for more examples: https://github.com/asyml/forte
Let’s keep building!
We’ll be continuously enriching Forte’s gallery of useful modules at different levels, while integrating Forte with trending off-the-shelf libraries and training frameworks. In parallel, we are developing Stave, an extensible annotation and visualization tool with an intuitive interface, to enhance human-machine interaction and make “keeping humans in the loop” part of our design philosophy. In addition, Forte will support learning from multi-modal data such as images and audio, and will further improve module-level evaluation and validation.
We are excited for the community to try it out and share feedback or new ideas. We’re looking forward to continuing and deepening our collaboration with the CASL community!
CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction. CASL consists of many powerful open-source components that are built to work in unison or to be used individually for specific tasks, providing flexibility and ease of use.
Thanks for reading! Please visit the CASL website to stay up to date on upcoming CASL and Forte announcements: https://www.casl-project.ai. If you’re interested in working professionally on CASL, visit our careers page at Petuum!