Applying Natural Language Processing for intelligent document analysis

Kamila Hankiewicz
Published in Untrite · 12 min read · Feb 2, 2020


Natural Language Processing (NLP) and Machine Learning (ML) technologies are ideal for intelligent document analysis and comprehension. They help derive insights from unstructured data: text documents, social media posts, email, images, etc. Considering that an estimated 80% of all enterprise data is unstructured, NLP is a perfect fit for digital transformation projects.

NLP can deliver tangible benefits across industries and business functions, such as improving compliance, data governance, and risk management, or increasing internal operational efficiency by intelligently augmenting business processes (e.g. project information retrieval and gathering).

In this article I will describe the main NLP techniques we use at Untrite for intelligent document comprehension and cognitive search, provide examples of various business use cases, and discuss some key considerations for applying these technologies in an enterprise environment.

Intelligent document analysis techniques

Modern NLP offers a handful of common techniques which, combined, can be used for intelligent document comprehension and analysis. Namely:

Named Entity Recognition

Named Entity Recognition (NER) identifies named entity mentions within unstructured data, such as text documents, and classifies them into predefined categories: person names, organisations, locations, time expressions, percentages, medical codes, quantities, etc. NER systems have been built using linguistic grammar-based techniques as well as statistical models such as machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semi-supervised approaches are a common choice to avoid part of the annotation effort.
The four most common approaches for performing Named Entity Recognition are:

  • Out-of-the-box entity recognition — Most NLP packages or services include pre-trained machine learning models for identifying entities. This is the easiest approach: key entity types such as person names, organisations, and locations can be identified with a simple API call, without the need to obtain data for training a machine learning model.
  • Machine-learned entity recognition — Out-of-the-box entities may be convenient, but they are typically too generic. In many cases where a business uses specific naming, it will be necessary to train a custom model to identify additional entity types. For example, when processing documents in a legal context, we would want to identify contract types, abnormal clauses, etc.
  • Deterministic entity recognition — This approach works best where the entities you want to identify are finite and pre-defined. It gives more accurate results than training a machine learning model. In this approach, a dictionary of the entities is provided; the entity recogniser then identifies in the text any instance of an entry from the dictionary. For example, the dictionary could contain a list of all products from a company. It is also possible to combine the dictionary approach with machine learning: the dictionary is used to annotate training data for the machine learning model, which then learns to identify instances of the entities that were not in the dictionary. Deterministic entity recognition is not commonly supported in out-of-the-box NLP packages but may be provided as a custom service. Some NLP packages that do support this deterministic approach use an ontology rather than a dictionary. The ontology defines relationships and related terms for the entities, which enables the entity recogniser to disambiguate between ambiguous entities using the context of the document.
  • Pattern-based entity recognition — If an entity type can be defined by regular expressions, its instances can be identified using regular expression matching. For example, product codes or citation references could be identified this way. A simplified regex for a UK National Insurance Number is [A-Z]{2}[0-9]{6}[A-Z] (2 uppercase letters, followed by 6 digits, followed by 1 uppercase letter); see the sketch below. This approach is used in KYC and identity verification, where document formats are standardised.
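As a toy illustration of two of these approaches, here is a minimal sketch assuming spaCy and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm); the sample text and NI number are invented:

```python
import re

import spacy

# Out-of-the-box entity recognition: a pre-trained statistical model.
nlp = spacy.load("en_core_web_sm")
text = "Jane Smith joined Acme Ltd in London. NI number: AB123456C."
for ent in nlp(text).ents:
    print(ent.text, ent.label_)  # e.g. "Jane Smith PERSON", "London GPE"

# Pattern-based entity recognition: the simplified UK NI number regex.
ni_pattern = re.compile(r"[A-Z]{2}[0-9]{6}[A-Z]")
print(ni_pattern.findall(text))  # ['AB123456C']
```

The deterministic approach could be layered onto the same pipeline, for example via a dictionary-driven component that tags every occurrence of a known product name.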

Text Classification

Text Classification, also known as text tagging or text categorisation, is the process of assigning tags or categories to text according to its content. Using NLP, text classifiers can automatically analyse text and assign a set of pre-defined tags or categories. It is applied in sentiment analysis, topic labelling, spam detection, and intent detection. Text Classification uses the words, entities, and phrases in the document to predict the classes, and usually feeds additional features into the decision, such as headings, metadata, or images contained in the document.
Text tagging works on two dimensions:

  • Number of classes — The simplest form of classification is binary classification (true, false, this or that) where there are only two possible classes into which an item can be classified. An example of this is spam filtering where emails are categorised as either spam or not spam. Multi-class or multinomial classification has more than two classes into which an item can be classified.
  • Number of labels — Single-label classification categorises an item into precisely one class, whereas multi-label classification categorises an item into multiple classes. Classifying a blog post by its description into multiple labels (e.g. technology, sustainability) is an example of multi-label classification. Multi-label classification originated from the investigation of the text categorisation problem, where each document may belong to several predefined topics simultaneously.

In general, the lower the number of classes and labels, the higher the expected accuracy.

It’s also worth noting that the difference between multi-class classification and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas for multi-label problems each label represents a different classification task, but the tasks are somehow related.

For example, multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either a banana or a pear, but not both at the same time. In multi-label classification, by contrast, a text might be about any of religion, politics, finance, or education at the same time, or about none of them.
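To make the distinction concrete, here is a minimal sketch of multi-label classification; the library choice (scikit-learn), the tiny corpus, and the topic tags are all illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "the central bank raised interest rates again",
    "the election results sparked protests",
    "new budget allocates more funds for schools",
]
labels = [["finance"], ["politics"], ["finance", "education"]]  # several tags per item

X = TfidfVectorizer().fit_transform(docs)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per topic

# One binary classifier per label; a document may trigger several at once.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(mlb.inverse_transform(clf.predict(X[:1])))
```

For a binary or multi-class problem the same pipeline applies, but Y collapses to a single column of mutually exclusive classes.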

An example use case for Text Classification is the automated routing of documents, such as email in a customer service department. If an email is about a refund, Text Classification determines the queue to which it should be sent so that it can be processed by the appropriate team of specialists (in this case, the refund-processing team), saving time and resources.

Sentiment Analysis

Sentiment Analysis identifies and categorises emotion expressed within the text, such as news reports, social media content, reviews, etc. In its simplest form, it may categorise the sentiment as positive or negative; but it could also quantify the sentiment (e.g. -1 to +1) or categorise it at a more granular level (e.g. very negative, negative, neutral, positive, very positive). Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
Sentiment Analysis, like many NLP techniques, must deal with the complexities of language. For example:

  • Level — Sentiment can be expressed in varying degrees. For example, there is increasing positivity in “I enjoyed it,” “I loved it,” and “I absolutely loved it”.
  • Negation — Words like “not” and “never” will change the sentiment of the words used. For example, “This film does not have a gripping plot or likeable characters.”
  • Conflicting — The text may include both positive and negative sentiment. For example, should “Their first movie was great, but their second one was a complete misunderstanding” be considered a positive, negative, or neutral statement?
  • Slang — Slang can often have the opposite meaning to its conventional meaning. For example, the word “sick” would have a very different meaning depending on the context in which it is used (“The food at this restaurant made me sick” vs. “That new video game release is sick!”) or on the demographic of the author.
  • Implied — In the sentence “I’ll be angry if the delivery is late,” the negative sentiment is conditional on something which has not happened and may not happen. In the sentence “They used to be good,” a positive sentiment is expressed about the past, while a negative sentiment about the present is implied but not stated.
  • Entity level — Entity-level sentiment analysis provides a more granular understanding of the sentiment by considering the sentiment at an entity level rather than document or sentence level. This will resolve the ambiguity seen in the example in the “Conflicting” scenario (“Their first movie was great, but their second one was a complete misunderstanding.”). It does so by assigning a positive sentiment to the first movie (the first entity) but a negative sentiment to the second movie (the second entity).
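Lexicon-based tools handle some of these complexities out of the box. A minimal sketch, assuming NLTK is installed, using its VADER scorer (which accounts for degree modifiers and negation):

```python
import nltk

nltk.download("vader_lexicon")  # one-off download of the sentiment lexicon
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
examples = [
    "I enjoyed it",
    "I loved it",
    "I absolutely loved it",
    "This film does not have a gripping plot or likeable characters.",
]
for text in examples:
    # 'compound' is a normalised sentiment score in [-1, +1]
    print(f"{sia.polarity_scores(text)['compound']:+.2f}  {text}")
```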

Sentiment Analysis is often used to analyse user generated posts relating to a company or its competitors. It can be a powerful tool to:

  • Track sentiment trends over time
  • Analyse the impact of an event (e.g. a product launch or redesign)
  • Provide an early warning of a crisis
  • Identify key influencers

Text Similarity

Finding Text Similarity between two sentences is central to many NLP applications. This technique calculates the similarity between sentences, paragraphs, and documents. To calculate the similarity between two items, the text must first be converted into an n-dimensional vector which represents the text. This vector might contain the keywords and entities in the document or a representation of the topics expressed in the content. The similarity between the vectors and therefore the documents can then be measured by techniques such as cosine similarity.
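A minimal sketch of the idea, assuming scikit-learn; TF-IDF vectors stand in here for whichever representation you choose:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The supplier's financial liability is capped by this contract.",
    "This contract limits the financial liability of the supplier.",
]
vectors = TfidfVectorizer().fit_transform(docs)  # one n-dimensional vector per text
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(score)  # close to 1.0 for near-duplicate wording, near 0.0 for unrelated texts
```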


Text Similarity can be used to detect duplicates and near-duplicates in documents or parts of a document. Near-duplicate identification calculates document similarity based on textual content. For example, if you had two documents containing exactly the same text (one being a native email file and the other being a PDF version of that same email), the hash values of the two files would be entirely different. However, near-duplicate identification looks at just the textual content of the two documents and can determine that they are very similar to each other.
Here are examples:

  • In legal or HR matters, the Text Similarity task helps mitigate risk on a new contract: if a new contract is similar to an existing one that has proved resilient, the risk of the new contract causing financial loss is minimised.
  • Search engines need to model the relevance of a document to a query, beyond the overlap in words between the two. For instance, question-and-answer sites such as Quora or StackOverflow need to determine whether a question has already been asked before.

Information Extraction

Information Extraction extracts structured information from unstructured text. It benefits many text and web applications, for example: integration of product information from various websites, question answering, contact information search, finding the proteins mentioned in a biomedical journal article, and removal of noisy data.

This is not to be confused with information retrieval. The main difference between the two approaches is that in information extraction the relevant facts of interest are specified in advance, while information retrieval tries to discover documents that may contain facts of interest which the user is not yet aware of.

An example use case for Information Extraction is legal review, where a paralegal would normally go through an entire document and highlight the points that matter to the team, such as identifying financial liability, finding information relevant to a legal decision, or checking that a contract is complete and avoids risk.
When it comes to Information Extraction, it’s the understanding of the context of the entities that helps to determine which is the correct answer.
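The "identify the financial liability" task above, for instance, can be cast as pulling a pre-specified fact out of the text. A toy sketch, again assuming spaCy's small English model (the contract sentence is invented):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The supplier's total liability shall not exceed £500,000.")

# The fact of interest (a monetary amount) is specified in advance.
amounts = [ent.text for ent in doc.ents if ent.label_ == "MONEY"]
print(amounts)  # typically ['£500,000'] with a pre-trained English model
```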

Relationship Extraction

Relation Extraction is the task of extracting semantic relationships from text; these usually occur between two or more entities and can be of different types. For example, “The Louvre is in France” states an “is in” relationship from the Louvre to France. This can be denoted as a triple: (Louvre, is in, France).

Similar to Information Extraction, Relationship Extraction relies on Named Entity Recognition, but the difference is that it is specifically concerned with the type of relationship between the entities. Relationship Extraction can be used to perform Information Extraction.

Some NLP packages and services provide out-of-the-box models for extracting relationships, such as “employee of,” “married to,” and “location born at.” As with Named Entity Recognition, custom relationship types can be extracted by training specific machine learning models.

Relationship Extraction can be used to process unstructured documents to identify specific relationships which can then be used to populate a Knowledge Graph.

For example, this technique can extract the relationships between diseases, symptoms, drugs, etc., by processing unstructured medical documents.
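A toy sketch of that idea: once relationships have been extracted as triples, they can be indexed into a simple in-memory graph (the medical triple below is a hypothetical extraction, shown for illustration):

```python
# Triples produced by a relationship-extraction step.
triples = [
    ("Louvre", "is in", "France"),
    ("aspirin", "treats", "headache"),  # hypothetical extracted relation
]

# Index by subject to build a minimal knowledge graph.
graph: dict[str, list[tuple[str, str]]] = {}
for subject, relation, obj in triples:
    graph.setdefault(subject, []).append((relation, obj))

print(graph["Louvre"])  # [('is in', 'France')]
```

A production system would use a graph database instead, but the triple representation is the same.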

Summarisation

The main aim of text summarisation is to create a coherent and fluent summary consisting of all the main points from the text. There are two different approaches:

  • Extraction-based summarisation extracts sentences or phrases without modifying the original text. This approach generates a summary composed of the top N most important sentences from the document (a frequency-based version is sketched after this list).
  • Abstraction-based summarisation uses Natural Language Generation to paraphrase and condense the document. This is much more complex and experimental than the extraction-based approach.
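Here is a minimal sketch of the extraction-based approach using only the Python standard library: score each sentence by the frequency of the words it contains and keep the top N in their original order.

```python
import re
from collections import Counter

def extractive_summary(text: str, n: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    # Score each sentence by the summed frequency of its words.
    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:n])
    # Keep the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```

A real implementation would at least remove stop words before scoring; this sketch only shows the shape of the approach.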

Text Summarisation can be used to enable humans to quickly digest the content of large volumes of documents without the need to read them in full. Examples include news feeds and scientific publications, where a large volume of documents is constantly being generated.

Complexities of intelligent document analysis tasks

Machine learning tends to be much more complex on unstructured text than on structured data, so it is much harder to match human-level performance when analysing text documents.
Things to consider:

Language complexity

In linguistics, complexity is a characteristic of a text, but there are multiple measures and hence multiple implied definitions in practice. In Natural Language Processing, these measures are useful for descriptive statistics. The two most popular ways of assessing textual complexity are how readable the text is (textual readability) and how rich it is (textual richness).

It takes humans years to understand language because of the variation, ambiguity, context, and relationships that it contains. We use different styles depending on the subject, author, and audience, and choose synonyms to add interest and avoid repetition. Intelligent Document Analysis (IDA) techniques must be able to make sense of those differences to derive accurate insights.

IDA requires the understanding of both general language and domain-specific terminology. One approach for handling domain-specific terminology is to use custom dictionaries or build custom machine learning models for entity extraction, relationship extraction, etc.

An alternative approach to combining general language with domain-specific terminology is Transfer Learning: the ability to transfer the knowledge of a pre-trained model to a new task or domain.
By taking an existing neural network that has been trained on huge volumes of general text and adding extra layers, we can train the combined model with a smaller amount of content specific to the problem. This is analogous to how we develop knowledge through years of schooling; the extra layers correspond to the domain- or task-specific learning that happens when a person leaves school and starts working.
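A minimal sketch of that recipe, assuming the Hugging Face transformers library: a model pre-trained on general text receives a fresh task-specific classification head, which is then fine-tuned on a small amount of in-domain data.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # weights pre-trained on huge volumes of general text
    num_labels=3,         # new, randomly initialised head for 3 domain classes
)
# Fine-tune `model` on a small labelled in-domain dataset as usual.
```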

Accuracy

The accuracy of IDA techniques depends on the variation, style, and complexity of the language used. It can also depend on:

  • Training data — The quality of a machine learning model depends on the volume and quality of the training data.
  • Document size — For some techniques, such as Text Classification and Similarity, large documents are helpful because they provide more context. Other techniques including Sentiment Analysis and Summarisation are harder on large documents.
  • Number of classes — The accuracy of techniques such as Text Classification, Sentiment Analysis, Entity Extraction, and Relationship Extraction, will vary depending on the number of classes and types of entity/relation and the overlap between them.


How can I apply Intelligent Document Analysis to my projects?

Depending on the project, you will be able to choose one of two ways of integrating NLP into your business:

  • Automation — possible where a business process follows repetitive, logical steps. This approach automates an existing or new process without any human intervention.
  • Human-in-the-loop — a semi-automated process with agent support. This approach provides support for a human making a decision, but the human retains final responsibility.

Before you go sourcing an NLP provider, consider these three steps:

  1. Pick a use case which either has a low cost of incorrect decisions or where a human makes the final decision.
  2. Start with a proof of concept to determine if the approach is feasible.
  3. Iteratively add complexity to increase the accuracy of the application.

This process will allow you to become familiar with the techniques, and your business sponsors to gain confidence in them, before you tackle the more complex use cases with higher benefits.

With thorough planning and a sound implementation strategy, your organisation can leverage the NLP and machine learning techniques discussed above to build smart applications that improve business processes and outcomes.

Kamila Hankiewicz
Untrite

I'm all about tech, business and everything in between | @untrite.com, @oishya.com, @hankka.com | ex-MD Girls In Tech