Leveraging microservices for relationship-based text exploration
Alex Anikiev, Arman Rahman and I recently worked on building a text exploration tool which enables users to quickly dive into a collection of documents and use knowledge buried in the corpus to get insights on new topics and questions. For example, one might ingest a set of articles about science and then automatically establish links between scientists and their research domains.
A core part of the document exploration tool we built is a visualization that shows how various entities in the corpus relate to each other and that lets the user interactively explore further connections from each of those entities. This type of visualization enables the user to explore cross-document and cross-corpus linked ideas which otherwise may be difficult to discern.
In this article we’ll show how to use open information extraction and the resource description framework to build such an interactive entity-graph exploration widget. We’ll also show how microservices enable us to build a solution that leverages best-of-breed libraries across a wide variety of language ecosystems with minimal amounts of custom code.
There are three main components to our relationship-based text exploration widget:
- A relationship extraction component enables a user to ingest a new document into the system and runs all the analysis steps required to go from raw documents to interesting links and connections.
- A relationship storage component takes the extracted relationships and stores them in a manner that can be efficiently and flexibly queried.
- A relationship visualization component enables a user to explore the corpus of documents ingested into the system.
We’ll explore each of these components in more detail in the following sections.
Taking an unstructured document and extracting meaningful relationships from it is an active area of research in the natural language processing community. Information retrieval can be used to extract insights from documents, but it’s often hard to structure or interconnect the outputs to create more complete knowledge representations. Information extraction can be used to map documents onto rich structured ontologies, but creating and maintaining comprehensive ontologies requires substantial amounts of expensive manual effort. A promising way to approach the structured relationship extraction problem in the absence of formal ontologies is open information extraction (OpenIE).
OpenIE systems take unstructured input sentences and produce structured subject-relation-argument triples that encode the core pieces of information contained in the sentences. For example, for a sentence such as “Albert Einstein was born in Ulm and died in Princeton”, an OpenIE system might produce the subject-relation-argument triples (“Albert Einstein”, “was born in”, “Ulm”) and (“Albert Einstein”, “died in”, “Princeton”). The information captured in these triples is inherently more machine consumable than the original sentences since multiple facts can now be linked to specific actors and a network of relationships can be established automatically. For example, with the extracted relationships above, we can now create connections between the entities “Albert Einstein”, “Ulm” and “Princeton” via the “was born in” and “died in” predicates.
Several great OpenIE systems have been published, for example by Stanford University, Freiburg University and the University of Washington. In this section, we’re going to show how to use ClausIE, the OpenIE system created by the Max Planck Institute, since it is published under a relatively easy-to-integrate open source license and offers near-state-of-the-art performance.
ClausIE is implemented in Java, so to turn it into a microservice consumable from other ecosystems, we used Spark to wrap the OpenIE functionality. Spark is a lightweight web framework for Java (not to be confused with Apache Spark), comparable to Ruby’s Sinatra or Python’s Flask. Spark’s high-level API makes it easy to implement a simple route that wraps the ClausIE system:
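A minimal sketch of such a route, following the ClausIE API as used in its bundled demo code; the `/extract` route name and the temp-file location for uploads are our choices:

```java
import static spark.Spark.post;

import de.mpii.clausie.ClausIE;
import de.mpii.clausie.Proposition;

import javax.servlet.MultipartConfigElement;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ClausIEServer {
    public static void main(String[] args) {
        // Initializing the underlying parser is expensive, so do it once at startup.
        ClausIE clausIE = new ClausIE();
        clausIE.initParser();

        post("/extract", (request, response) -> {
            // Enable multipart handling so large documents are streamed through
            // a temp directory instead of being buffered entirely in memory.
            request.attribute("org.eclipse.jetty.multipartConfig",
                new MultipartConfigElement("/tmp"));

            StringBuilder triples = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    request.raw().getPart("file").getInputStream()))) {
                String sentence;
                // The input file is expected to contain one sentence per line.
                while ((sentence = reader.readLine()) != null) {
                    clausIE.parse(sentence);
                    clausIE.detectClauses();
                    clausIE.generatePropositions();
                    for (Proposition proposition : clausIE.getPropositions()) {
                        triples.append(proposition.toString()).append('\n');
                    }
                }
            }
            response.type("text/plain");
            return triples.toString();
        });
    }
}
```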
Note that in the example above we’re explicitly enabling multipart requests in Spark so that large text documents can be processed in a streaming fashion, which reduces the memory requirements of the service.
As shown in the example above, the open information extraction endpoint expects a text file as input where each line corresponds to one sentence. To prepare arbitrary documents to fit this format, we use two additional microservices for document pre-processing.
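First, an Apache Tika server converts binary formats such as PDF or Word into plain text. A call against a local Tika server (assumed to be listening on 9998, the Tika server default port) might look like:

```shell
# Convert a PDF to plain text via a local Apache Tika server.
# The Accept header asks Tika for plain text rather than its default markup.
curl -X PUT \
  --header "Accept: text/plain" \
  --data-binary @document.pdf \
  http://localhost:9998/tika \
  > document.txt
```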
Note the “Accept” header in the code snippet above. This header is required when calling the Apache Tika service to ensure that the document parser returns plain text output; without it, Tika falls back to its default XHTML output, which would break the downstream sentence-splitting step.
After turning raw documents into plain text, we then use a simple Sanic service that leverages the excellent text-processing tools available in Python, such as NLTK, to parse the extracted document text, split it into sentences and perform clean-ups such as discarding poorly converted sentences:
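A minimal sketch of the clean-up service; the service name, route and filtering heuristics below are illustrative choices, not the only reasonable ones:

```python
import re

from sanic import Sanic
from sanic.response import text
from nltk.tokenize import sent_tokenize

app = Sanic("cleanup")


def is_well_formed(sentence: str) -> bool:
    """Heuristic filter that discards poorly converted sentences."""
    words = sentence.split()
    if not (3 <= len(words) <= 100):
        return False
    # Drop sentences dominated by non-alphabetic noise (tables, URLs, etc.).
    alpha = sum(1 for c in sentence if c.isalpha())
    return alpha / max(len(sentence), 1) > 0.6


@app.post("/clean")
async def clean(request):
    document = request.body.decode("utf-8", errors="ignore")
    # Collapse whitespace left over from the raw-text conversion.
    document = re.sub(r"\s+", " ", document)
    sentences = [s.strip() for s in sent_tokenize(document)]
    sentences = [s for s in sentences if is_well_formed(s)]
    # One sentence per line: the format expected by the OpenIE endpoint.
    return text("\n".join(sentences))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```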
The cleanup service can then be run via Docker, just like the OpenIE server and the Apache Tika server:
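For example (the `apache/tika` image is published on Docker Hub; the other two image names are placeholders for images built from the services above):

```shell
# Document-to-text conversion (Apache Tika server)
docker run -d -p 9998:9998 apache/tika

# Sentence clean-up service (placeholder image name)
docker run -d -p 8000:8000 example/sentence-cleanup

# OpenIE extraction service (placeholder image name)
docker run -d -p 8080:8080 example/clausie-server
```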
In addition to the pre-processing steps described above, relationship-pruning post-processing steps are often beneficial to ensure that the OpenIE extractions are of high quality for downstream use. OpenIE is domain independent by design, which means that adding domain-specific processing steps can improve the salience of the extracted relationships for the particular context of the ingested documents. For example, canonicalizing the multiple written forms of a logical entity to one root textual string increases the density of the OpenIE relationship graph for that entity. We’ll cover the topic of OpenIE post-processing in a future article.
Note that the pre-processing, relationship extraction and post-processing steps are each embarrassingly parallel: every document can be processed independently of the rest of the corpus. Given that the services introduced above are already containerized, it is easy to scale the ingestion process, for example by deploying the pipeline to a managed Kubernetes cluster and setting up a horizontal pod autoscaler.
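As a sketch, an autoscaler for one of the pipeline services could look like the following (the deployment name and scaling thresholds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sentence-cleanup
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentence-cleanup   # placeholder deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```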
In the previous section we described three microservices that enable us to extract semantically interesting subject-relation-argument triples from arbitrary documents. Ingesting a document via the pipeline takes a few minutes of processing, so it’s impractical to run the pipeline live each time a user makes a query. As such, we need a way to store the produced relationships at ingestion time, as well as a flexible querying mechanism to retrieve them at exploration time.
A natural fit for storing triple information that represents linked facts is the resource description framework (RDF). RDF is a data format that was created as part of research into the semantic web to enable machine-consumable description of information with support for complex querying and inference. RDF is supported by a mature ecosystem of tools and libraries which make it easy to implement functionality on top of it.
We can convert the JSONL output of our OpenIE microservice to RDF in Turtle format using Python’s RDFlib. Turtle is an efficient textual storage format for RDF triples that can be ingested by most RDF-based tools:
We can then store the RDF relationships in the Apache Jena database. Apache Jena is a high performance single-node RDF data-store. The Fuseki REST interface makes it easy to interact with Apache Jena:
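For example, the Turtle file can be loaded into the default graph of a Fuseki dataset via its graph store endpoint (the dataset name `corpus` is a placeholder):

```shell
# Upload the Turtle triples into the default graph of the "corpus" dataset.
curl -X POST \
  --header "Content-Type: text/turtle" \
  --data-binary @triples.ttl \
  "http://localhost:3030/corpus/data?default"
```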
Similarly, the Fuseki REST interface also makes it easy to query our RDF database using SPARQL. SPARQL is a powerful SQL-like querying language that enables complex queries across RDF relationships. For example, to find all relationships about victories one could run the following query:
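A sketch of such a query, assuming the relation URIs percent-encode the extracted phrases as in the ingestion snippet above:

```sparql
# Match any triple whose relation phrase mentions "won" or "victory".
SELECT ?subject ?relation ?argument
WHERE {
  ?subject ?relation ?argument .
  FILTER(CONTAINS(LCASE(STR(?relation)), "won") ||
         CONTAINS(LCASE(STR(?relation)), "victory"))
}
```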
Note that RDF inherently stores relationships as pure graphs. As such, duplicate relationships are not supported. To preserve duplicates across a collection of documents, the snippets above can be adapted to use reification or a named graph per ingested document.
Reification is the process of introducing meta-statements to the RDF store that record the provenance of each factual statement, such as (“document X”, “said that”, “relationship Y”) where relationship Y is a pointer to a statement like (“Albert Einstein”, “was born in”, “Ulm”). The disadvantage of this approach is that for each fact stored in the RDF database, multiple meta-facts now have to be stored, which drastically increases storage size. Named graphs, on the other hand, solve the provenance issue by assigning statements to explicit sub-graphs in the RDF store at ingestion time, similar to having shards in a traditional SQL database. SPARQL then enables retrieval of facts from one or more named graphs at querying time via the FROM NAMED statement.
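For example, with named graphs a query can be restricted to the relationships of specific documents (the graph URIs below are placeholders chosen at ingestion time):

```sparql
# Retrieve relationships only from the named graphs of two ingested documents.
SELECT ?subject ?relation ?argument
FROM NAMED <http://example.com/documents/paper-1>
FROM NAMED <http://example.com/documents/paper-2>
WHERE {
  GRAPH ?document { ?subject ?relation ?argument }
}
```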
If a use-case makes it necessary to preserve duplicates inside of a single document (e.g. to visualize how often a particular entity relationship occurred in the document), it’s easiest to mirror the relationships to another datastore such as PostgreSQL in addition to storing them in Apache Jena. This polyglot storage solution enables in-document duplicate-based queries executed against PostgreSQL in addition to graph-based SPARQL queries executed against Apache Jena.
Tools like Sparqlify could be used to run SPARQL queries directly against PostgreSQL and remove the need to maintain two datastores, but this is not recommended for a production setup since SPARQL-to-SQL translation often only supports a limited subset of the SPARQL querying language, which limits query flexibility.
One advantage of using a technology like RDF to store and query the entity relationships extracted from our documents is that we can draw from a rich ecosystem of user interaction tools that have been built in this area.
For example, we can leverage the YASQE library to easily create a rich user interface for writing SPARQL queries with code completion, syntax highlighting, query execution, and so forth:
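A minimal setup, assuming YASQE version 2 is loaded on the page; the container element id and the Fuseki endpoint URL are placeholders:

```javascript
// Attach a YASQE SPARQL editor to a container element on the page.
var yasqe = YASQE(document.getElementById("sparql-editor"), {
  sparql: {
    showQueryButton: true,                           // render a "run query" button
    endpoint: "http://localhost:3030/corpus/query"   // Fuseki SPARQL endpoint
  }
});

// Pre-populate the editor with a starter query.
yasqe.setValue(
  "SELECT ?subject ?relation ?argument\n" +
  "WHERE { ?subject ?relation ?argument }\n" +
  "LIMIT 25"
);
```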
Given that the result of a SPARQL query always describes a graph, we can implement a simple generic re-usable visualization for the query response using the Cytoscape graphing library:
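A sketch of such a visualization, assuming the query was executed with an `application/sparql-results+json` accept header and that the query's SELECT variables follow the `?subject`/`?relation`/`?argument` naming convention:

```javascript
// Render the bindings of a SPARQL SELECT response as a Cytoscape graph.
function renderResults(container, response) {
  var elements = [];
  var seen = {};
  response.results.bindings.forEach(function (binding, i) {
    var subject = binding.subject.value;
    var argument = binding.argument.value;
    var relation = binding.relation.value;
    // Add each entity URI as a node exactly once.
    [subject, argument].forEach(function (id) {
      if (!seen[id]) {
        seen[id] = true;
        elements.push({
          data: { id: id, label: decodeURIComponent(id.split("/").pop()) }
        });
      }
    });
    // Add the relation as a labeled edge between subject and argument.
    elements.push({
      data: {
        id: "edge-" + i,
        source: subject,
        target: argument,
        label: decodeURIComponent(relation.split("/").pop())
      }
    });
  });
  return cytoscape({
    container: container,
    elements: elements,
    style: [
      { selector: "node", style: { label: "data(label)" } },
      { selector: "edge", style: { label: "data(label)", "curve-style": "bezier" } }
    ],
    layout: { name: "cose" }  // force-directed layout
  });
}
```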
Note that the snippet above assumes that the SELECT statement in the user’s SPARQL queries will always contain a subject clause, an argument clause and a relation clause.
We can use the query editor and graph visualization shown above to find, for example, all relationships that start with the entity “Albert Einstein” (as in previous examples in this article):
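Assuming entity names were percent-encoded into the hypothetical namespace used at ingestion time, the query might look like:

```sparql
# All outgoing relationships of the entity "Albert Einstein".
SELECT ?relation ?argument
WHERE {
  <http://example.com/openie/Albert%20Einstein> ?relation ?argument .
}
```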
Cytoscape also enables us to easily post-process the displayed graph. For example, to only display interesting dense relationships that span more than two entities, we could execute the following:
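A sketch using Cytoscape’s collection API, where `cy` is the instance created for the query result:

```javascript
// Collect nodes that connect to two or fewer other entities.
var sparse = cy.nodes().filter(function (node) {
  return node.degree(false) <= 2;  // false: ignore self-loops
});

// Removing a node also removes its connected edges,
// leaving only the dense part of the relationship graph.
cy.remove(sparse);

// Re-run the layout on the pruned graph.
cy.layout({ name: "cose" }).run();
```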
In this article we introduced a text exploration pipeline that visualizes relationships between entities within a document and across a collection of documents.
The pipeline leverages a number of open source tools such as Apache Tika, Apache Jena Fuseki, ClausIE, NLTK, YASQE and Cytoscape. Using Docker and a microservices approach makes it easy to integrate these services that otherwise live in their own ecosystems.
We’ve demonstrated how to leverage powerful tools from the semantic web community such as RDF and SPARQL for natural language processing explorations. We also introduced ClausIE-Server, an open source microservice for open information extraction. If your application needs to make sense of unstructured text, give ClausIE-Server a try!
In future articles we’ll cover enhancements to the text exploration tool, such as how to improve the reliability of the pipeline using transparent task queues, how to improve the quality of the generated relationships via coreference resolution, named entity recognition and domination theory, and how to increase the dynamism of the visualization by auto-generating SPARQL queries based on user interactions with the entity relationship graph.