Legal, do you speak it?

Orchestrating legal NLP services for a portfolio of use cases

Lynx Service Platform architecture

Artem Revenko
Semantic Tech Hotspot

--

This work would not be possible without our great Lynx team. And special thanks to Filippo Maganza who provided invaluable input in the preparations of this post.

In this post, I focus on general usage scenarios and how we enable them from an architectural point of view. But the Lynx project is much more than that! The individual services are state-of-the-art models customized to the legal domain. Every use case implements a unique solution to the below-mentioned challenges as a web application. We are glad to tell you more! We will have a series of webinars 📺 covering all the different aspects of Lynx. You can also get updates through our newsletters ✉️. And if you want more details, have a look at our publications 📜. And check out the bonus video at the end of this post.

Legal challenges pop up across many different industries and applications. Emerging areas that require complex engineering solutions often lack thorough legislation. Moreover, records of previously successful practices may be scattered across different storage systems. Often, industry-wide standards and even terminology do not yet exist. How can we help these emerging areas?

  1. Converge faster on common standards and terminologies?
  2. Enable agents to identify relevant best practices and previously successful projects?
  3. Identify potential risks at earlier stages of project preparation?

Speaking with legal consultancy experts, we identified the problem of legal search. Full-text search is not always sufficient to find relevant articles of law, and web search engines are not tuned for this task; the legal “language” confuses them. As a result, consultants spend a significant amount of their valuable time searching for relevant documents. How can we improve this?

4. Quickly find relevant documents?

5. Suggest potential exact answers inside these documents?

The third target audience we considered was the general public — not legal professionals — who need access to the law and some guidance through it. This audience benefits from different kinds of enrichment of the original text, such as types and definitions of legal terms found in the text, as well as potential relationships between these terms. How can we produce this information?

6. Produce terminology relevant to the users, with rich information about each term?

7. Identify new entities of pre-defined types in unseen documents?

8. Identify new types of entities?

Multilingual Europe adds another dimension of complexity to all these questions. Many regional documents are written in local languages and therefore cannot be compared without some understanding of their content (translation also requires understanding).

So we want to design a platform that can implement workflows of quite complex NLP services that often depend on each other. The services are developed and deployed by different partners — in Lynx we are 11 partners from 7 countries, including institutions, research centers, and private companies. And of course, we need a secure way of processing and retrieving some very sensitive information. Not to mention that we expect Big Data and want to scale the services efficiently. How do we do this?

First, we identify some common needs. We will deal with two main types of data: textual documents and structured knowledge in the form of a legal knowledge graph (LKG).

For the LKG, we develop ontologies — mostly manual work that is done once — and extract domain-specific terminologies. The latter task is repeated for each new domain and language, so we need a way to perform it (semi-)automatically. Next, we use the developed terminologies, together with many other domain-specific models, to enrich textual documents.

These two processes — terminology extraction and document enrichment — should happen in advance, as preparation for solving domain-specific tasks. The interaction with the end user is a different usage scenario, and for that further end-user services are deployed. **Actually, terminology extraction is just one of many possible domain adaptation tasks; others include the preparation of domain-specific training data and the training of models on it.**

We demonstrate the usage of the Lynx Service Platform (LySP) with three workflows: terminology extraction, document enrichment, and end-user interaction.

Terminology Extraction

The input for this process is a collection (corpus) of domain-specific documents. The corpus is processed by a specialized TermEx service that relies on linguistic processing and textual metrics, such as TF-IDF and many others, to find the terms that are most specific to the provided corpus.
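
To give a rough idea of the underlying principle, here is a minimal sketch of how TF-IDF can surface corpus-specific term candidates, assuming scikit-learn. It is only an illustration; the actual TermEx service combines linguistic processing with many more metrics.

```python
# A minimal illustration of TF-IDF-based term-candidate scoring using scikit-learn.
# This is not the TermEx service itself, only a sketch of the underlying idea.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The tenant shall pay the rent to the landlord on a monthly basis.",
    "The landlord may terminate the lease if the tenant breaches the contract.",
    "Termination of the contract requires written notice by the landlord.",
]

# Uni- and bigrams as candidates; a real pipeline would first filter
# candidates by part-of-speech patterns and lemmatize them.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

# Rank candidates by their maximum TF-IDF score across the corpus.
scores = tfidf.max(axis=0).toarray().ravel()
terms = vectorizer.get_feature_names_out()
for term, score in sorted(zip(terms, scores), key=lambda pair: -pair[1])[:10]:
    print(f"{score:.3f}  {term}")
```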

However, first, the user has to authenticate with a separate authentication service and use the obtained token in an API manager service that represents a single point of access to all other Lynx services. Our authentication service is based on Keycloak. Keycloak acts as an OAuth2 authorization server and stores the user permissions, thus managing user roles and access to private areas of storage. Overall, this provides security and user rights management handled by a single service for a whole portfolio of other services.
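
As a minimal sketch, the flow from a client's point of view could look as follows, assuming the Python requests library. The Keycloak host, realm, client ID, and API manager routes below are placeholders, not the real Lynx endpoints.

```python
# A minimal sketch of the authentication flow with Keycloak and an API manager.
# All URLs, credentials, and routes are placeholders.
import requests

KEYCLOAK_TOKEN_URL = (
    "https://auth.example.org/auth/realms/lynx/protocol/openid-connect/token"
)
API_MANAGER_URL = "https://api.example.org"

# 1. Authenticate against Keycloak (OAuth2 resource-owner password grant here;
#    other grant types work analogously) and obtain an access token.
token_response = requests.post(
    KEYCLOAK_TOKEN_URL,
    data={
        "grant_type": "password",
        "client_id": "lynx-client",
        "username": "alice",
        "password": "secret",
    },
)
access_token = token_response.json()["access_token"]

# 2. Call any Lynx service through the API manager, passing the token.
response = requests.post(
    f"{API_MANAGER_URL}/termex",  # hypothetical route
    headers={"Authorization": f"Bearer {access_token}"},
    json={"documents": ["..."]},
)
print(response.status_code)
```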

When the terminology has been extracted, it is usually refined and enriched manually using a tool able to manage terminologies — in Lynx we use PoolParty. Once the terminology is prepared, it is used for entity linking.

Document Ingestion

The goal of this workflow is to receive documents, annotate them using the annotation services, and store them. LySP includes seven services that enrich a document with additional annotations. Some services — for example, the relation extraction service — depend on annotations from other services; others can run in parallel. To orchestrate the different services efficiently, we use a dedicated workflow manager based on Camunda.

BPMN diagram of document ingestion
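
For illustration, an enrichment service could be connected to the workflow manager as an external-task worker. Here is a minimal sketch against Camunda 7's REST API, assuming the requests library; the engine URL, topic name, and variable names are placeholders.

```python
# A minimal external-task worker sketch for Camunda 7's REST API, assuming the
# requests library. Engine URL, topic name, and variable names are placeholders.
import requests

ENGINE = "https://workflow.example.org/engine-rest"
WORKER_ID = "ner-worker-1"

# Fetch and lock up to one task published for the (hypothetical) "annotate-ner" topic.
tasks = requests.post(
    f"{ENGINE}/external-task/fetchAndLock",
    json={
        "workerId": WORKER_ID,
        "maxTasks": 1,
        "topics": [{"topicName": "annotate-ner", "lockDuration": 60000}],
    },
).json()

for task in tasks:
    document_uri = task["variables"]["documentUri"]["value"]  # placeholder variable
    # ... call the actual enrichment service for document_uri here ...

    # Report completion so the workflow can continue with dependent services.
    requests.post(
        f"{ENGINE}/external-task/{task['id']}/complete",
        json={"workerId": WORKER_ID, "variables": {}},
    )
```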

In order to enable parallel execution of the enrichment services, it is essential that annotations can simply be added to the document without invalidating previously obtained annotations — we do not know in advance which annotations will come first. To address this, we have defined a specialized ontology (data schema) around the LynxDocument class. The LynxDocument is a subclass of NIF Context and inherits all its nice features. Namely, annotations are added as separate blocks that link to the original text using character offsets. This way we can always add annotations without modifying the original text. Moreover, the annotation units allow us to annotate the exact same piece of text with different information coming from different services.

Example of a LynxDocument in JSON-LD
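
As a rough, hand-written sketch (not the official Lynx schema), a LynxDocument with a single entity annotation could look like the following. The NIF and ITS RDF properties are standard; the context URL, the class names, and the annotations key are assumptions made for illustration.

```python
# A hand-written sketch (not the official Lynx schema) of a LynxDocument with one
# entity annotation, serialized as JSON-LD. The NIF properties are standard; the
# @context URL, the lkg: class name, and the "annotations" key are assumptions.
import json

lynx_document = {
    "@context": "http://lynx-project.eu/doc/jsonld/lynxdocument.json",  # assumed
    "@id": "http://example.org/documents/123",
    "@type": ["nif:Context", "lkg:LynxDocument"],
    "nif:isString": "The tenant shall pay the rent to the landlord.",
    "annotations": [
        {
            "@type": "nif:OffsetBasedString",
            "nif:beginIndex": 4,
            "nif:endIndex": 10,
            "nif:anchorOf": "tenant",
            # Link to a terminology entry in the legal knowledge graph.
            "itsrdf:taIdentRef": "http://example.org/lkg/terms/tenant",
        }
    ],
}

print(json.dumps(lynx_document, indent=2))
```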

To efficiently store, update, and retrieve the documents or their individual annotations, we use a document manager service (DCM). The DCM provides an efficient way to store and manipulate RDF data. Its functionality is inspired by the Linked Data Platform — a W3C recommendation.
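
A minimal sketch of what such LDP-inspired document management can look like from a client's perspective, assuming the requests library and placeholder endpoints:

```python
# A minimal sketch of LDP-style interaction with a document manager.
# The base URL, collection name, and token are placeholders.
import json
import requests

DCM = "https://dcm.example.org"
AUTH = {"Authorization": "Bearer <token>"}

# Create a document inside a collection (an LDP container); the Slug header
# suggests the name of the new resource.
doc = {"nif:isString": "The tenant shall pay the rent to the landlord."}
resp = requests.post(
    f"{DCM}/collections/contracts",
    headers={**AUTH, "Content-Type": "application/ld+json", "Slug": "contract-123"},
    data=json.dumps(doc),
)
document_url = resp.headers["Location"]  # URL of the newly created resource

# Retrieve the document (or, analogously, an individual annotation) again.
stored = requests.get(
    document_url, headers={**AUTH, "Accept": "application/ld+json"}
).json()
print(stored)
```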

All the services in the platform follow common rules for the development of their APIs, expressed as OpenAPI 3 specifications. The rules include common codes for (error) messages, naming conventions for parameters, and conventions for the routes of endpoints (API gateway patterns). Compliant services can be properly called from the workflow manager with relatively low development effort, their responses can be properly processed, and the user can get information about the execution of a service directly from the workflow manager. The APIs of the different services can be found at https://lynx-project.eu/doc/api/. These principles enable a REST-style architecture.
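
As an illustration only, a service endpoint following such conventions could be defined like this, assuming FastAPI (which generates an OpenAPI 3 description automatically). The route, parameter names, and error payload are made up for this sketch and are not the actual Lynx conventions.

```python
# A minimal sketch of a convention-following service endpoint, assuming FastAPI
# (which generates an OpenAPI 3 description automatically). Route, parameter
# names, and error payload are illustrative, not the actual Lynx conventions.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Example enrichment service", version="1.0.0")


class Document(BaseModel):
    text: str
    lang: str = "en"


@app.post("/document/annotate")
def annotate(doc: Document):
    """Return the document enriched with annotations."""
    if not doc.text.strip():
        # Common error codes and messages keep responses uniformly
        # processable by the workflow manager.
        raise HTTPException(status_code=400, detail="Empty document")
    return {"text": doc.text, "annotations": []}
```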

End-User Services

For usage by end users, we have some additional services, for example, question answering (QA). Yet the enrichment services can also be called in the interaction with the end user, for example, to enrich the user input. In this case, the response is expected in near real time. To handle this challenge efficiently, we containerized most services and deployed them in OpenShift, an orchestrated application platform. Such a deployment strategy allows for additional scalability, as additional instances can be deployed on demand. Moreover, the services can be quickly deployed on a new infrastructure — all together or individually — for example, to enable local processing of some sensitive data. Overall, this idea follows the microservice architecture pattern.
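
As a sketch of the on-demand scaling idea, assuming the services run as plain Kubernetes Deployments and the official kubernetes Python client is available (in OpenShift, the same effect can also be achieved through the web console or the oc CLI); the deployment name and namespace are placeholders.

```python
# A sketch of scaling an enrichment service on demand, assuming it runs as a
# plain Kubernetes Deployment and the official kubernetes Python client is used.
# Deployment name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # inside the cluster: config.load_incluster_config()
apps = client.AppsV1Api()

# Scale a (hypothetical) QA service up to three instances.
apps.patch_namespaced_deployment_scale(
    name="qa-service",
    namespace="lynx",
    body={"spec": {"replicas": 3}},
)
```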

Summary

The key components of LySP architecture:

  1. Token-based OAuth2 authorization, together with centralized access control and authorization rules management based on Keycloak,
  2. LynxDocument schema,
  3. Containerized deployment in an orchestrated application platform — microservices architectural pattern,
  4. Workflow manager based on Camunda,
  5. LDP-inspired document manager,
  6. Common rules for the development of APIs — REST + API gateway patterns.

We find this setup particularly efficient for enabling the integration of new services, modifying existing workflows, and providing secure and scalable processing of Big textual Data. At the same time, LySP can be flexibly integrated into existing services or frontends, or even deployed locally.

BONUS

[Bonus video]

--

Artem Revenko
Semantic Tech Hotspot

PhD in applied math and CS. Interested in Semantic Web, NLP, Information Extraction, Machine Learning and friends