Dependency parsing & associated algorithms in NLP

Mehul Gupta
Data Science in your pocket
13 min read · May 11, 2020


In continuation of my NLP series, I will be unfolding Dependency parsing this time.

For any confusion related to basic terminologies used for parsing, such as

What is parsing?

Why do we need it?

What is Context-Free Grammar?

Do check here.

My debut book “LangChain in your Pocket” is out now !!

Last time, we covered Syntactic/Constituency parsing and how it creates a parse tree using a Context-Free Grammar, which is essentially a set of rules to follow. So we can say it generates a parse tree based on rules.

But, as we know, text data is very versatile.

Relying only on rules means missing out on a lot of information.

Consider the statement:

‘He is standing behind the wall’.

In some languages, the same statement can be written in such a way that a word-for-word translation into English might yield ‘He is standing wall behind’, i.e. the word order is free. It isn’t necessary that the adjective appears just before the noun, as in ‘handsome boy’; it can be ‘boy handsome’. Hence, using rules for parsing such languages is problematic!!

And that is why dependency parsing is required. Dependency parsing builds a parse tree in which the tags determine the relationships between the words of the sentence, rather than relying on grammar rules as syntactic parsing does. This gives a lot of flexibility even when the order of words changes (like ‘boy handsome’ vs ‘handsome boy’). A sample dependency tree can be observed below:

It has some advantages over syntactic parsing:

  • It can handle morphologically rich languages (this term can be explored here) that have a relatively free word order compared to English. For example:

In Czech, a grammatical object might occur before or after a location adverbial. For example, ‘over the table’ can become ‘the table over’ in Czech & still make sense.

  • The Head-Dependent relationship extraction (discussed later) helps in semantic parsing (especially question answering systems, information extraction, etc.). Do read the previous post on syntactic parsing to get a gist of what semantic parsing is.

Now that we know the intention behind using dependency parsing, let us proceed further!!

Below is an example of Dependency parsing for:

‘I prefer the morning flight through Denver’

Let us go over some terminologies used for dependency parsing with the help of the above example.

  • Head-Dependent: In the arrows representing relationships, the origin word is the Head & the destination word is the Dependent.

For example: (I, ‘nsubj’, prefer), ‘prefer’ is Head & ‘I’ is Dependent.

  • Root: The word which is the root of our parse tree. It is ‘prefer’ in the above example.
  • Grammar Functions and Arcs: The tag between each Head-Dependent pair is a grammar function determining the relation between the Head & the Dependent. The arrow carrying the tag is called an Arc. Below is a table that interprets the meaning of some common grammar functions:

Note: Do look up grammatical terms like ‘clausal complement’, ‘appositional modifier’, etc. mentioned in the post for a better understanding.

Consider the ‘det’ function in the above example. It states that ‘the’ is used as a Determiner for ‘flight’.
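If you want to inspect such Head-Dependent-tag triples yourself, spaCy’s dependency parser can produce them. A minimal sketch, assuming spaCy and its small English model en_core_web_sm are installed:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("I prefer the morning flight through Denver")

# For each token, print (Dependent, relation tag, Head); the root points to itself.
for token in doc:
    print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")
```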

Some examples for Grammar Functions are given below:

Understanding the Dependency parse tree

Basically, we represent dependencies as a directed graph G = (V, A), where V (the set of vertices) represents the words (and punctuation marks as well) in the sentence & A (the set of arcs) represents the grammar relationships between elements of V.

A dependency parse tree is the directed graph mentioned above with the following features (a small sanity check is sketched after this list):

  • The Root has no incoming arcs (it can only be the Head in a Head-Dependent pair)
  • Every vertex (except the Root) has exactly one incoming arc (only one Parent/Head)
  • A unique path should exist between the Root & each vertex in the tree.
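These properties are easy to check programmatically. Below is a minimal sketch that validates them for a list of (head, dependent) arcs, using word indices with 0 standing for the dummy Root; the example arcs are illustrative:

```python
from collections import defaultdict

def is_valid_tree(n_words, arcs):
    heads = defaultdict(list)
    for head, dep in arcs:
        heads[dep].append(head)
    if heads[0]:                                   # Root must have no incoming arc
        return False
    if any(len(heads[w]) != 1 for w in range(1, n_words + 1)):
        return False                               # exactly one Head per word
    for w in range(1, n_words + 1):                # a path to Root must exist
        seen, cur = set(), w
        while cur != 0:
            if cur in seen:                        # a cycle means no path to Root
                return False
            seen.add(cur)
            cur = heads[cur][0]
    return True

# One plausible arc set for 'I prefer the morning flight through Denver'
# (indices 1..7; the exact attachments are illustrative):
arcs = [(0, 2), (2, 1), (2, 5), (5, 3), (5, 4), (5, 7), (7, 6)]
print(is_valid_tree(7, arcs))   # True
```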

Before going on to retrieve a parse tree, we must be aware of a very important concept, i.e. Projectivity.

  • Projective arc: An arc (arrow with a tag) is projective when the ‘Head’ associated with the arc has a path to every word that lies between the ‘Head’ & the ‘Dependent’.

Here, the arc between ‘the’ & ‘flights’ is projective as the ‘Head’, i.e. ‘flights’, has a path to ‘morning’ as well, which lies between ‘the’ & ‘flights’. Similarly, the arc between ‘canceled’ & ‘flights’ is also projective as ‘canceled’ has a path to ‘the’ (canceled → flights → the) & ‘morning’ (canceled → flights → morning), which lie between the Head ‘canceled’ & the Dependent ‘flights’.

  • Projective parse tree: A parse tree with all its arcs projective. The above tree is projective.
  • Non-projective parse tree: A tree with at least one of the arcs as non-projective. Have a look at the below example

Here, the arc between ‘flight’ & ‘was’ is non-projective as there exists no path from the Head ‘flight’ to ‘morning’, which lies between them.
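Projectivity can likewise be checked mechanically. A minimal sketch, again over (head, dependent) index arcs:

```python
# An arc is projective if its Head has a directed path to every word lying
# between the Head & the Dependent.
def reachable(head, target, arcs):
    stack = [head]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        stack.extend(dep for h, dep in arcs if h == node)
    return False

def is_projective_arc(head, dep, arcs):
    lo, hi = sorted((head, dep))
    return all(reachable(head, k, arcs) for k in range(lo + 1, hi))

def is_projective_tree(arcs):
    return all(is_projective_arc(h, d, arcs) for h, d in arcs)

# Reusing the illustrative arcs for 'I prefer the morning flight through Denver':
arcs = [(0, 2), (2, 1), (2, 5), (5, 3), (5, 4), (5, 7), (7, 6)]
print(is_projective_tree(arcs))   # True -> a projective parse tree
```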

Now, we will move forward on how to create a dependency parse tree.

The method below is one of the algorithms from the ‘transition-based’ family, which are helpful for extracting projective dependency parse trees.

Shift-Reduce Parsing (Arc Standard):

We need some setup for this algorithm:

  • A Stack
  • A buffer of input words
  • Oracle function/parser & set of dependency relations (like obj, iobj, etc tags that are used to represent arcs).
  • Initially, the Stack has [Root], a dummy variable. All the words of the sentence to be parsed are in the buffer.
  • A word is popped from the buffer & pushed into Stack.
  • As a word is pushed onto the Stack, the Oracle/parser takes the top two elements of the stack & predicts one of the below 3 transitions along with a relationship type:

  1. LEFTARC_FUNCTION: Assert a head-dependent relation between the word at the top of the stack and the word directly beneath it; remove the lower word (the dependent) from the stack.

  2. RIGHTARC_FUNCTION: Assert a head-dependent relation between the second word on the stack and the word at the top; remove the word at the top of the stack (the dependent).

  3. SHIFT: Remove the word from the front of the input buffer and push it onto the stack.

Here, ‘function’ represents the ‘iobj’, ‘obj’, ‘conj’, etc. tags that we discussed above.

Hence, the actual prediction has two parts:

  • Determining Head & Dependent using LEFTARC/RIGHTARC
  • Determining their relation type i.e. functions like conj, iobj, etc.

Hence, the actual predictions made by the Oracle look like LEFTARC_IOBJ, RIGHTARC_CONJ, etc.

For simplicity purposes, I will be using just LEFTARC & RIGHTARC notations in the below example.

  • We will stop as & when the buffer is empty & stack has only [Root] left.

Some restrictions are also considered:

  • No LEFTARC if we have [Root] as the 2nd element of the stack & the buffer is non-empty
  • At least 2 elements should be on the stack for a LEFTARC/RIGHTARC transition.
  • [Root] can never be a dependent (as it is a dummy variable)

It must be noted that an Oracle is a model trained on a training set extracted using Dependency Treebanks.

A Dependency Treebank is a text corpus in which each sentence has a corresponding dependency tree, which is usually either extracted using Syntactic Treebanks (like the Penn Treebank) or manually marked by humans.

I won’t be going into detail on the training of the Oracle!!

Examples always help!! Consider the statement ‘book me the morning flight’. We will be creating a dependency parse tree for this sentence.

Word List: the word buffer; Action: the transition & relationship tag predicted by the Oracle. Though in this example, we show the Action as only the transition to avoid confusion.

Step-0: As the Stack has only one element [root], SHIFT is predicted (exception cases mentioned above) & ‘book’ is pushed.

Step-1: [root, book] are popped from the Stack, the Oracle predicts SHIFT, hence they are put back, and ‘me’ is pushed onto the Stack.

Step-2: [book, me] (the top 2 elements of the Stack currently) are popped, the Oracle predicts RIGHTARC, hence declaring ‘book’ as Head & ‘me’ as Dependent & removing ‘me’ from the Stack; ‘book’ is pushed back & the relation (book → me) is added.

…….

Step-6: The top 2 elements ‘morning’ & ‘flight’ are popped, and the Oracle predicts LEFTARC, i.e. ‘flight’ as Head & ‘morning’ as Dependent. ‘flight’ is pushed back while ‘morning’ is removed.

………

Step-9: Now, we can pop ‘root’ & ‘book’ as the buffer is empty (which we avoided in Step-0 due to the exception mentioned). RIGHTARC is the transition predicted & with this, the parse tree is obtained. Using the relations found, a directed graph can easily be established with ‘book’ as the root node.
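To make the full trace concrete, here is a minimal sketch of the arc-standard loop. The trained Oracle is replaced by a scripted list of gold transitions for this sentence (purely for illustration; a real Oracle would also predict the relation tag, e.g. RIGHTARC_OBJ instead of plain RIGHTARC):

```python
def arc_standard_parse(words, oracle):
    stack = ["ROOT"]             # the dummy root
    buffer = list(words)         # input words, left to right
    arcs = []                    # collected (head, dependent) relations
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFTARC":    # top of stack is Head, word beneath is Dependent
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHTARC":   # second word is Head, top of stack is Dependent
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs

# Scripted stand-in for the Oracle, replaying the gold transitions from the trace above:
gold = iter(["SHIFT", "SHIFT", "RIGHTARC", "SHIFT", "SHIFT", "SHIFT",
             "LEFTARC", "LEFTARC", "RIGHTARC", "RIGHTARC"])
print(arc_standard_parse("book me the morning flight".split(),
                         lambda stack, buffer: next(gold)))
```

Running it prints the relations in the order they are found: (book, me), (flight, morning), (flight, the), (book, flight), (ROOT, book), matching the Head-Dependent pairs from the steps above.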

A couple of things to note down:

  • The transitions predicted come along with the relationship tag, as I wrote earlier, i.e. the Oracle predicts something like LEFTARC_RELATIONSHIP or RIGHTARC_RELATIONSHIP & not just the transition.
  • The Oracle doesn’t always give correct answers!! It is a machine learning model trained on some dataset & hence can go wrong as well.

The above method is called the Arc Standard method which is a type of shift-reduce parsing.

You can also explore the Arc Eager method which has some enhancements over the Arc Standard. Various other algorithms also exist which I would be skipping for now.

The disadvantage of ‘transition-based’ methods is that they can’t be used for non-projective parse trees (handling those would add computational complexity).

Graph-based parsing:

We have another family of algorithms for creating dependency parse trees, i.e. ‘graph-based’ systems, which have some advantages over ‘transition-based’ algorithms:

1. Better accuracy

2. Can work with non-projective dependencies as well

This family of algorithms aims at going through all possible parse trees for a sentence & choosing the best depending on some sort of score (an ideology similar to the PCFG discussed in the previous post). We will be focusing on edge-factored scores here, in which we consider the summation of all edge scores of a tree to choose the best parse tree (how these scores are calculated will be discussed later in the post).
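Edge-factored scoring is easy to state in code: a candidate tree’s score is just the sum of its edge scores. The numbers below are purely illustrative:

```python
# The per-edge scores would come from a trained model (discussed later).
def tree_score(arcs, edge_score):
    return sum(edge_score[(head, dep)] for head, dep in arcs)

edge_score = {("ROOT", "book"): 12, ("book", "flight"): 7, ("flight", "that"): 7}
print(tree_score([("ROOT", "book"), ("book", "flight"), ("flight", "that")],
                 edge_score))   # 26
```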

Before moving on, have you heard of a Maximum Spanning Tree?

We need it & hence an introduction is a must.

This post explains it the best.

Some key facts from the post: an MST is a subgraph of a graph G (in our case, the MST we find will be the dependency parse tree) with the following features:

  • Connects all vertices with the minimum possible number of edges
  • Has the maximum weight (summation of all edges scores involved) amongst all the possible spanning trees that connect all vertices.
  • The MST shouldn’t be affected if we subtract the same constant from the weights of all edges entering a given vertex (this is the property the algorithm below relies on)
  • Each vertex should have only one incoming edge in an MST
  • It should have no cycles

We will be using the Chu-Liu-Edmonds MST algorithm for deriving these dependency parses.

Wait a minute!!

Why aren’t we using the Prim & Kruskal algorithms for MST (the most common algorithms for MST; CSE folks must have heard of them)? Why do we need a different algorithm for performing the same task that can be done by existing solutions?

A simple answer to this is that Prim & Kruskal work on undirected graphs, while our graph is directed, & they have some other cons that can affect our parse trees. For more on this, read here.

If we greedily choose the incoming edge with the maximum weight for each vertex in the graph, can this form an MST? Just a thought!!

It can sometimes!!

But this approach may lead to cycles as well in many cases. Hence if we can eliminate these cycles, we can use the Greedy approach of selecting edges for MST.

The below algorithm does what we wish!!

Chu-Liu-Edmonds Maximum Spanning Tree:

This algorithm has three major parts. We shall explain them using an example. Let us pick up the sentence

‘book that flight’

  1. Create a directed graph with an additional vertex ‘Root’ having outgoing edges to all words of the sentence. It can be observed that every word has an outgoing edge to every other word (except Root) in the graph G. Also, each word receives an incoming edge from all the other vertices (including Root). Two quick questions:

How do we get the graph G? How are the scores of the edges calculated? We will be discussing this later in the post.

Note: Treat the faded lines the same as the other lines (some issues with the image).

2. Greedily choose the incoming edge with the highest weight for each vertex.

Here, it can be observed that the blue lines depict the greedy selection of the incoming edge for each vertex (word). If we get an MST in the very first iteration, GREAT!!! You can stop right there. But as you can see, we got a graph with a cycle (that → flight, flight → that).

3. Subtract the weight of the greedily chosen edge from the weights of all other incoming edges for that vertex.

It can be observed that for the incoming edges to ‘book’, 12 (the weight of the greedily chosen edge for ‘book’; this value is 8 for ‘flight’ & 7 for ‘that’) has been subtracted, leading to (5 - 12 = -7) for flight → book & (6 - 12 = -6) for that → book. Similarly, all other values are updated throughout the graph G.

4. Collapse the graph. Pick the existing cycle (that → flight, flight → that) & merge its vertices to form a single node. Note that such a cycle can be detected when a vertex has an outgoing & an incoming edge to the same vertex with weight 0, as can be observed here between ‘that’ & ‘flight’. We get a new graph after this step.

Here, ‘tf’ represents the merged vertex formed by ‘that’ & ‘flight’.

5. Repeat from step 2, i.e. the greedy choice of incoming edges for all vertices.

6. As we can see, we don’t encounter any cycle anymore, i.e. we have got an MST.

7. Once we get an MST, we have to trace back to split all the merged vertices. As we trace back recursively, we eliminate one of the edges that formed the cycle.

If you observe, we deleted ‘that’ → ‘flight’ rather than the other way round. This is because, during our 2nd iteration, we found an edge book → ‘tf’ (weight -1; this edge belonged to ‘book’ → ‘flight’ originally). Hence, we had already found an incoming edge for ‘flight’ &, as we can’t have multiple incoming edges for a vertex in an MST, we delete that → flight.

It must be noted that

  • If we find an MST in the 1st step itself, you have your answer; only if a cycle exists do you proceed to the further steps.
  • Here, we got our MST in the 2nd iteration, but this can take more iterations. In every iteration, we merge the vertices between which a cycle is found & create a new graph, until an MST with no cycles is found.
  • Once an MST is found, begin the traceback to split the merged vertices, deleting at least one edge (a compact code sketch of the whole procedure follows this list).
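For the curious, here is a compact, self-contained sketch of the algorithm just described. The scores are illustrative (only ROOT → book = 12, flight → book = 5 & that → book = 6 appear in the walkthrough above; the rest are assumed), and the recursion mirrors the greedy-select, contract and expand steps:

```python
# scores[h][d] = (model-predicted) weight of the edge h -> d; vertex 0 is the ROOT.
# Assumes every non-root vertex has at least one candidate head.

def _find_cycle(heads, root):
    """Return a list of vertices forming a cycle in the greedy choices, or None."""
    for start in heads:
        path, node = [], start
        while node != root and node in heads:
            if node in path:
                return path[path.index(node):]
            path.append(node)
            node = heads[node]
    return None

def chu_liu_edmonds(scores, root=0):
    """Return {dependent: head} for the maximum spanning arborescence."""
    nodes = set(scores) | {d for h in scores for d in scores[h]}

    # Step 1: greedily pick the highest-scoring incoming edge for each vertex.
    best_head = {d: max((h for h in scores if d in scores[h] and h != d),
                        key=lambda h: scores[h][d])
                 for d in nodes if d != root}

    # Step 2: if the greedy choices form no cycle, we already have the MST.
    cycle = _find_cycle(best_head, root)
    if cycle is None:
        return best_head

    # Step 3: contract the cycle into a fresh node `c`; for every edge entering
    # the cycle, subtract the weight of the greedy edge it would replace.
    cyc, c = set(cycle), max(nodes) + 1
    new_scores, origin = {}, {}
    for h in scores:
        for d, w in scores[h].items():
            if h in cyc and d in cyc:
                continue
            nh, nd = (c if h in cyc else h), (c if d in cyc else d)
            nw = w - scores[best_head[d]][d] if d in cyc else w
            if nd not in new_scores.setdefault(nh, {}) or nw > new_scores[nh][nd]:
                new_scores[nh][nd] = nw
                origin[(nh, nd)] = (h, d)     # remember the original edge

    # Step 4: recurse on the contracted graph, then expand the cycle, keeping
    # all its internal edges except the one replaced by the entering edge.
    heads = {}
    for d, h in chu_liu_edmonds(new_scores, root).items():
        oh, od = origin[(h, d)]
        heads[od] = oh
    broken = next(d for d in cyc if d in heads)
    for d in cyc:
        if d != broken:
            heads[d] = best_head[d]
    return heads

# Illustrative scores for 'book that flight' (0=ROOT, 1=book, 2=that, 3=flight):
scores = {0: {1: 12, 2: 4, 3: 4},
          1: {2: 5, 3: 7},
          2: {1: 6, 3: 8},
          3: {1: 5, 2: 7}}
print(chu_liu_edmonds(scores))   # ROOT -> book, book -> flight, flight -> that
```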

You must be feeling that one thing is missing!!

The relation tags (conj, iobj, obj, etc.). How do we assign these tags to the Head-Dependent relations found using the above method? This is out of the scope of this blog & would require an entire post to explain, hence I am skipping it for now.

We left a couple of questions unanswered in the first step!!

How is the score between 2 vertices calculated?

It’s time to answer this. What we have is a trained model that takes in a Head-Dependent pair & outputs a score for this relation. It must be noted that we feed this model every possible Head-Dependent pair, using which the graph G in step 1 is obtained.

For example: For ‘book that flight’, we feed the model with

book →that,

that →book,

that →flight,

flight →that,

book →flight &

flight →book.

Their respective scores are predicted & assigned to the respective edges of graph G.
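To sketch this, assume a hypothetical score_model(head, dependent) standing in for the trained scorer; the graph G is then just the scores of all ordered pairs (in practice word indices would be used so that repeated words don’t collide):

```python
from itertools import permutations

# Hypothetical stand-in for the trained scorer.
def score_model(head, dependent):
    return 1.0   # placeholder value

def build_graph(words):
    """Score every possible Head -> Dependent pair, plus ROOT -> word edges."""
    scores = {"ROOT": {}}
    for w in words:
        scores["ROOT"][w] = score_model("ROOT", w)   # ROOT can head any word
        scores[w] = {}
    for w_head, w_dep in permutations(words, 2):     # every ordered word pair
        scores[w_head][w_dep] = score_model(w_head, w_dep)
    return scores

print(build_graph("book that flight".split()))
```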

We have covered both projective & non-projective dependencies. But as I told you, these methods involve ML models (the Oracle in shift-reduce parsing & the score predictor in MST), so we must have some metrics to evaluate the system. Right?

First, we shall take up an example & then I will walk through a couple of metrics for evaluating parse trees.

Here ‘Reference’ is the ground truth & ‘System’ is the predicted parse tree.

  • Total Head-Dependent pairs in ‘Reference’/Ground_Truth = 6
  • Total Head-Dependent pairs correctly detected along with their tags = 4 (book → flight & flight → me aren’t counted). If you notice, for book → flight the Head-Dependent pair is correctly detected; only the tag is wrong. For flight → me, both are wrong.
  • Total Head-Dependent pairs correctly detected (irrespective of the tag, i.e. the tag might be wrong but the Head & Dependent should be correct) = 5 (flight → me is a wrong pair. As discussed in the above point, we do count book → flight here).

We have basically 2 major metrics:

  • Labeled Attachment Score (LAS):

It is the ratio of correctly detected Head-Dependent pairs along with their tag to the total Head-Dependent pairs in the ground truth.

Hence, 4/6=0.666

  • Unlabeled Attachment Score (UAS):

It is the ratio of correctly detected Head-Dependent pairs (irrespective of the tag) to the total Head-Dependent pairs in the ground truth.

Hence, 5/6=0.833
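Both metrics are straightforward to compute once you have the gold and predicted parses. A minimal sketch, assuming each parse assigns exactly one head & tag to every word:

```python
# gold and predicted are dicts: {dependent: (head, tag)}.
def attachment_scores(gold, predicted):
    uas_hits = sum(predicted[d][0] == head for d, (head, tag) in gold.items())
    las_hits = sum(predicted[d] == (head, tag) for d, (head, tag) in gold.items())
    return las_hits / len(gold), uas_hits / len(gold)

# With the counts from the example above (6 gold pairs, 5 correct heads,
# 4 correct heads with the right tag): LAS = 4/6 ≈ 0.67, UAS = 5/6 ≈ 0.83.
```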

And with this, I finally sign off!!

