Utilizing dependency trees and NER models for relation extraction task

Dx dy — Fri, 26 Mar 2021 13:03:33 GMT

In some areas, machine learning solutions rarely consist of a single model, and NLP is a prime example. You do a lot of linguistic preprocessing when you prepare your data for the model. This article focuses on syntactic and semantic analysis of natural language, using both modern approaches and some “old school” technology for the relation extraction task.

For the sake of simplicity, imagine we are building a conversational model, chatbot, intelligent assistant, etc. It is not so hard to create one, but it is definitely hard to make one that stands out. Say you need to develop a module where the user enters text or uses voice (voice is another story entirely: it may raise the “level of cool” from the user’s perspective and make the NLP engineer’s life much harder :)). Usually the pipeline consists of an intent recognition module for understanding what the user wants, a NER component for recognizing entities and their properties, and a dialogue policy for managing the whole conversation. Let’s focus on one problem of the NER component, which may not seem like a problem at all, but is very interesting to solve once you get into the details. Call it the “lemon pasta problem”.

“I want an orange juice and lemon pasta”

Pasta al limone. Created using Dx dy

Let’s import everything we need and create a stub for the NER model.

The main idea of the NER setup here is to keep the target entities (products) separate from the tokens or phrases that describe the entity, like taste, color, material, etc. The same approach extends easily to requests where the user enters parameter names and parameter values.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")

doc = nlp("I want an orange juice and lemon pasta")
tags = ["O", "O", "O", "U-product-description", "U-product", "O", "U-product-description", "U-product"]

# The tags above are a hard-coded stub standing in for real NER output
for w, t in zip(doc, tags):
    print(f"{w}\t\t{t}")

Assuming we have a reasonably well-trained NER model, we expect the following output for this particular case. The BILOU tagging scheme is preferable here; BILOU stands for Begin-Inside-Last-Outside-Unit. If the term is unfamiliar, look up the different NER tagging schemes.

I		O
want		O
an		O
orange		U-product-description
juice		U-product
and		O
lemon		U-product-description
pasta		U-product
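To make the scheme concrete, here is a minimal sketch that turns entity spans into BILOU tags. The function name `spans_to_bilou` and the span format (inclusive token indices plus a label) are assumptions made for this illustration, not part of spaCy or any other library.

```python
def spans_to_bilou(n_tokens, spans):
    """Convert (start, end, label) spans with inclusive token indices to BILOU tags."""
    tags = ["O"] * n_tokens  # everything is Outside until a span says otherwise
    for start, end, label in spans:
        if start == end:
            # Single-token entity: Unit
            tags[start] = f"U-{label}"
        else:
            # Multi-token entity: Begin, zero or more Inside, Last
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
            tags[end] = f"L-{label}"
    return tags


# Spans for "I want an orange juice and lemon pasta"
spans = [
    (3, 3, "product-description"),
    (4, 4, "product"),
    (6, 6, "product-description"),
    (7, 7, "product"),
]
print(spans_to_bilou(8, spans))
```

A multi-token entity such as “lemon pasta” tagged as a single product would come out as B-product followed by L-product.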

The question is: how does the ML part understand that the juice needs to be orange and the pasta needs to be lemon, and not vice versa? I dare say some might not notice any difference between lemon and orange pasta if they haven’t tried either, but one can certainly tell the difference between lemon and orange juice.

Dependency trees

Using a dependency tree seems the most straightforward idea. It can be built using spaCy or NLTK. Consider reading the spaCy documentation and/or Wikipedia if the term does not ring a bell. The visualization below was created with spaCy. Nouns (“juice” and “pasta” in this case) are often connected to the adjectives, adverbs, and other nouns that describe them. Those describing words, in turn, have a “head connection” to the noun. The dependency tree below shows that the words “orange” and “lemon” have a “head connection” (against the direction of the arrow) to the nouns “juice” and “pasta”.

doc = nlp("I want an orange juice and lemon pasta")
options = {"bg": "white", "distance": 130,
           "color": "black", "font": "Source Sans Pro"}

displacy.render(doc, style="dep", options=options)

Dependency tree. The dependency labels are shown just below the arrows.

Bearing this in mind, we have to traverse certain nodes of the tree to find the most accurate pairs of products (dishes) and the words or phrases that describe them. There are two options. You can start from a “noun node” like “juice” or “pasta” and do a BFS over all of its child nodes, or you can start from the opposite side, at the description, and simply go up to the parent node. In my opinion, the second approach is simpler to implement and therefore leaves less room for error.
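The two options can be compared on a toy version of the tree. This is only a sketch: the `heads` list is a hand-written assumption standing in for a parser’s output (the root points to itself, the convention spaCy uses), and the helper names are made up for the example.

```python
from collections import deque

words = ["I", "want", "an", "orange", "juice", "and", "lemon", "pasta"]
tags = ["O", "O", "O", "U-product-description", "U-product",
        "O", "U-product-description", "U-product"]
# Assumed head index of every token; "want" (index 1) is the root
heads = [1, 1, 4, 4, 1, 4, 7, 4]


def is_product(tag):
    # True for "U-product", "L-product", ... but not for "*-product-description"
    return tag.endswith("-product")


def climb_to_product(i):
    """Option 2: follow head links upward until a product token or the root."""
    while True:
        if is_product(tags[i]):
            return i
        if heads[i] == i:  # reached the root without finding a product
            return None
        i = heads[i]


def descriptions_below(product_i):
    """Option 1: BFS over the product's subtree, skipping other products."""
    children = {}
    for child, head in enumerate(heads):
        if child != head:
            children.setdefault(head, []).append(child)

    found = []
    queue = deque(children.get(product_i, []))
    while queue:
        i = queue.popleft()
        if is_product(tags[i]):
            continue  # another product owns everything below this node
        if tags[i].endswith("-product-description"):
            found.append(i)
        queue.extend(children.get(i, []))
    return found


print(words[climb_to_product(3)])                 # product that "orange" describes
print([words[i] for i in descriptions_below(4)])  # descriptions found under "juice"
```

Note the extra rule the BFS needs: in this assumed parse “pasta” hangs under “juice”, so without the check that stops at other products, “lemon” would be attributed to “juice” as well. Climbing from the description avoids that bookkeeping.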

This is a utility function for finding the entities in a sentence along with their start/end indices (token-level, not character-level).

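def get_entity_indices(document, tags):
    """Collect tagged entities (products and descriptions) with token-level spans."""
    result = []

    start_index = -1
    end_index = -1

    for i, tag in enumerate(tags):
        # The substring test is deliberate: it matches both "*-product"
        # and "*-product-description" tags.
        if "U-product" in tag or "B-product" in tag:
            start_index = i
        if "U-product" in tag or "L-product" in tag:
            end_index = i

        # 0 is a valid token index, so compare against 0 inclusively;
        # also ignore a stale end index left over from a malformed tag sequence.
        if 0 <= start_index <= end_index:
            result.append({
                "value": " ".join(document[j].text for j in range(start_index, end_index + 1)),
                "entity_type": tag[2:],
                "start": start_index,
                "end": end_index,
            })

            start_index = -1
            end_index = -1

    return result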

Here comes the tree traversal part. The key idea is simple: keep moving to the head node until you find what you are looking for or reach the root of the tree.

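def find_head_relation_index(token, tags):
    """Climb head links from `token`; return the index of the first product token, or None."""
    product_tags = {"B-product", "I-product", "L-product", "U-product"}

    while True:
        # Check the current token first, so the root itself is also examined
        if tags[token.i] in product_tags:
            return token.i

        # In spaCy the root is its own head, so there is nowhere left to climb
        if token.head.i == token.i:
            return None

        token = token.head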

Now comes the final function that wraps it all up. Since ML does not guarantee 100% accuracy, you have to either leave room for errors or handle them. Models chained in a pipeline depend heavily on each other, so we have to keep in mind that we may be operating on data with errors from the previous model as well as incorrect output from the current one.

We will iterate through tokens and tags simultaneously, and if the tag is *-product-description we will try to “wire it up” to a *-product via “head transitions”. However, we need to keep all the entities found by NER even if tree traversal does not yield a result for a particular entity. That is why there is a workaround with entity_indices.

def process_dependency_tree(document, tags):
    result = []

    # Will be further used for "not found" entities aka "appendix"
    entity_indices = get_entity_indices(document, tags)

    def extract_entity(index):
        value = None
        deletion_ix = -1

        for i, entity in enumerate(entity_indices):
            if entity["start"] <= index <= entity["end"]:
                value = entity["value"]
                deletion_ix = i
                break

        # Once an entity is successfully found, it is removed from the "appendix".
        # The appendix contains entities which cannot be bound or were tagged
        # incorrectly by the NER model, so information loss is minimized.
        # The check is >= 0: otherwise the first entity in the list is never removed.
        if deletion_ix >= 0:
            entity_indices.pop(deletion_ix)

        return value

    for i, (w, t) in enumerate(zip(document, tags)):
        # It is easier to traverse from "the end" of the "description"
        if t == "U-product-description" or t == "L-product-description":
            product_index = find_head_relation_index(token=w, tags=tags)

            # No product above this description: leave it in the appendix
            if product_index is None:
                continue

            product = extract_entity(product_index)
            description = extract_entity(i)

            result.append({
                "product": product,
                "description": description
            })

    return {
        "successfully_processed": result,
        "appendix": entity_indices
    }

Let’s try processing the dependency tree on both correct and incorrect NER tagging results, and focus on the errors and how to deal with them.

Case 1. Everything is correct.

print(process_dependency_tree(doc, tags))

Yields

{'successfully_processed': [{'product': 'juice', 'description': 'orange'},
  {'product': 'pasta', 'description': 'lemon'}],
 'appendix': []}

We have managed to establish dependencies correctly.

Case 2. NER error

Let’s intentionally introduce an error into the tagging results and see how the “appendix trick” handles it.

tags_with_error = ["O", "O", "O", "U-product-description", "U-product", "O", "B-product", "L-product"]

We have merged the last two tokens into one entity, a rather common type of mistake for a NER model.

I		O
want		O
an		O
orange		U-product-description
juice		U-product
and		O
lemon		B-product
pasta		L-product

Nevertheless, the entity has not been lost: it ends up in the appendix.

process_dependency_tree(doc, tags_with_error)

{'successfully_processed': [{'product': 'juice', 'description': 'orange'}],
 'appendix': [{'value': 'lemon pasta',
   'entity_type': 'product',
   'start': 6,
   'end': 7}]}

Case 3. More complex dependency tree

Alas, head traversal does not handle all the cases; the harder ones are omitted here for the sake of simplicity. In such cases you need to check whether there is a path of specific dependency relations between an entity and its description. Try entering examples of your own and examining the dependency tree. Now let’s mention a couple of alternatives which can be used as a “plan B” as well as stand-alone solutions.
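As a rough illustration of the “path of specific dependency relations” check from Case 3, the sketch below collects the dependency labels on the path between a description and a product and compares them with a whitelist. Everything here is an assumption made for the example: the parse of the sentence is written by hand (a real parser may label it differently), and `ALLOWED_PATHS` is a hypothetical whitelist, not a recommended set.

```python
# Hand-written parse of "I want the juice to be orange"; the root points to itself
words = ["I", "want", "the", "juice", "to", "be", "orange"]
heads = [1, 1, 3, 5, 5, 1, 5]
deps = ["nsubj", "ROOT", "det", "nsubj", "aux", "ccomp", "acomp"]

# Hypothetical whitelist of paths:
# (labels climbed from the description, labels climbed from the product)
ALLOWED_PATHS = {
    (("amod",), ()),           # "orange juice": direct modifier
    (("compound",), ()),       # "lemon pasta": noun compound
    (("acomp",), ("nsubj",)),  # "the juice to be orange": siblings under a verb
}


def ancestors(i):
    """Return the chain of token indices from i up to the root, inclusive."""
    chain = [i]
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain


def dependency_path(description_i, product_i):
    """Labels from each token up to (not including) their lowest common ancestor."""
    up_chain = ancestors(description_i)
    down_chain = ancestors(product_i)
    # Both chains end at the root, so a common ancestor always exists
    lca = next(i for i in up_chain if i in down_chain)
    up = tuple(deps[i] for i in up_chain[:up_chain.index(lca)])
    down = tuple(deps[i] for i in down_chain[:down_chain.index(lca)])
    return up, down


def is_related(description_i, product_i):
    return dependency_path(description_i, product_i) in ALLOWED_PATHS


print(dependency_path(6, 3))  # path between "orange" and "juice"
print(is_related(6, 3))
```

In this assumed parse, plain head climbing from “orange” goes through “be” and “want” and never passes through “juice”, while the path check still connects the two.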

Alternatives

Constituency tree

Try operating at the noun-phrase/verb-phrase level for information extraction.

Constituency tree built using a simple regexp grammar and NLTK. Consider using tools like Stanford NLP for more serious tasks.
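The phrase-level idea can be sketched without any dependencies. The snippet below is a plain-Python stand-in for a regexp chunk grammar, roughly in the spirit of an NLTK rule like NP: {<DT>?<JJ>*<NN>+}; it is not NLTK itself. The POS tags are assigned by hand (an assumption, a real tagger may disagree), and the helper names are invented for the example.

```python
import re

# Hand-assigned Penn Treebank style POS tags for the example sentence
tagged = [("I", "PRP"), ("want", "VBP"), ("an", "DT"), ("orange", "NN"),
          ("juice", "NN"), ("and", "CC"), ("lemon", "NN"), ("pasta", "NN")]

# One character per token, so an ordinary regex can act as the chunk grammar
CODES = {"DT": "D", "JJ": "J", "NN": "N", "NNS": "N"}
NP_PATTERN = re.compile(r"D?J*N+")  # optional determiner, adjectives, then nouns


def noun_phrases(tagged_tokens):
    """Return the noun phrases as slices of (word, tag) pairs."""
    encoded = "".join(CODES.get(tag, "x") for _, tag in tagged_tokens)
    return [tagged_tokens[m.start():m.end()] for m in NP_PATTERN.finditer(encoded)]


def to_relation(phrase):
    """Treat the last noun as the product and the other non-determiners as its description."""
    modifiers = [word for word, tag in phrase[:-1] if tag != "DT"]
    return {"product": phrase[-1][0], "description": " ".join(modifiers) or None}


relations = [to_relation(phrase) for phrase in noun_phrases(tagged)]
print(relations)
```

The “last noun is the product” rule is a simplification that happens to fit English noun compounds like “lemon pasta”; it is a heuristic, not a general solution.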

Transformers attention layer

Extract relations from the last attention layers of transformer models like BERT. Tools for visualizing attention layers will be of great help. Take your time selecting an appropriate model and layer. The outputs can be somewhat unpredictable from a human perspective, though :)

BERT base uncased visualization created using BertViz

Wrapping it all up

Relation extraction might not be so beginner-friendly, especially when your journey into the world of NLP has just begun. Here are a couple of links which may be of help.

Stories by Dx dy on Medium
