A Survey: Deep Semantic Role Labeling: What Works and What’s Next

by Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. Aug 4, 2017

May 4, 2020

Semantic Role Labeling (SRL) is widely regarded as a critical task for natural language understanding. By 2017, much effort had been devoted to building end-to-end models for SRL and other downstream tasks in Natural Language Processing (NLP).

Building on those works, the paper presents a deep highway BiLSTM architecture (figure 6) with a constrained decoding method. At test time, the model advances the state of the art on the CoNLL 2005 and CoNLL 2012 datasets (a gain of 2.1 F1 on both), a relative error reduction of roughly 10% over previously reported results.

Besides reporting state-of-the-art performance, the authors also provide a detailed error analysis. Although previous works seem to suggest that syntactic information might no longer benefit SRL, the paper reasons that such structural information could still improve the consistency and, eventually, the performance of the model.

1. Takeaways, what does the paper aim to show?

1. Deep models (e.g. an 8-layer BiLSTM) are better at recovering long-distance relations but are not always consistent in their predictions
2. Adding syntactic parsers might help push these results further

In terms of performance, the paper combines the best practices for semantic role labeling at that time, including highway connections, RNN dropout, and orthonormal initialization of the weight matrices. With those techniques, the model outperforms the previous state of the art and raises the F1 score on both CoNLL 2005 and CoNLL 2012 by 2.1.

In particular, compared with previous reports, the paper improves the percentage of completely correct predicates by 5.9 points, indicating a strong ability of the deep BiLSTM model to predict semantic arguments.

2. SRL, highway networks

(a) Semantic Role Labeling (SRL)

Given a sentence, we wish to identify the activity, the objects involved, and the role each object plays. The following example is taken from Mark Yatskar's presentation on situation recognition.

Given a sentence, such as:

Figure 1. A Possible Sentence for SRL Task [1]

Semantic role labeling recovers the predicate structure of a sentence, providing representations that answer basic questions about sentence meaning, in the form of “who” did “what” to “whom”.* If we use a table to represent such a finding, figure 2 shows the result.

We wish to identify such a structure for the activity “falling”:

Figure 2. A Structure Prediction of the Activity “Falling” [1]

In such a structure, the semantic relation of each object is identified with respect to the verb (i.e. the activity). This task is crucial because it tests whether a model can not only parse syntactic information but also understand the latent meaning of objects in the context of a sentence.

(b) Highway LSTM

Highway networks differ from plain networks in how information flows. In plain terms, these models let information take a shortcut, passing the original latent features directly on as input to later layers.

Equation 3. A Simplified Version of High-way Connections [2]

More precisely, highway connections use two gates, T and C, which complement each other (i.e. C = 1 - T), where T is the transform gate and C is the carry gate. Setting T = 0 gives an interesting information flow: T = 0, C = 1, y = 0 + x * 1 = x

In this case, the network passes the input directly through as output, which is why these are called “highway” connections. For a detailed introduction, refer to the review of highway networks: Review: Highway Networks — Gating Function To Highway (Image Classification).
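To make the gating concrete, here is a minimal scalar sketch of a highway unit. It assumes the usual form y = H(x) * T + x * C with C = 1 - T; the function names, the tanh choice for H, and the sigmoid choice for T are illustrative assumptions, not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function, used here as the transform gate activation."""
    return 1.0 / (1.0 + math.exp(-z))


def highway_unit(x, h, t):
    """Mix the transformed value h and the raw input x: y = h*T + x*C, C = 1 - T."""
    c = 1.0 - t  # carry gate is the complement of the transform gate
    return h * t + x * c


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Scalar highway layer (hypothetical parameters w_h, b_h, w_t, b_t)."""
    h = math.tanh(w_h * x + b_h)   # nonlinear transform H(x), assumed to be tanh
    t = sigmoid(w_t * x + b_t)     # transform gate T(x), a value in (0, 1)
    return highway_unit(x, h, t)


# T = 0 carries the input through unchanged; T = 1 outputs only the transform.
print(highway_unit(0.7, 0.2, 0.0))  # 0.7
print(highway_unit(0.7, 0.2, 1.0))  # 0.2
```

A strongly negative gate bias (for example b_t = -50) pushes T towards 0, so the layer behaves almost like the identity.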

3. Methods

(1) SRL Task Definition

The task is to predict a sequence y given a sentence-predicate pair (w, v) as input. Each tag y_i ∈ y belongs to a discrete set of BIO tags T, and the goal is to find the sequence y (the structured semantic role prediction) with the highest tagging score f(w, y).

Equation 4. Definition of Semantic Role Labeling Objective [3]

Words outside argument spans have the tag O, and words at the beginning and inside of argument spans with role r have the tags B_r and I_r respectively. Let n = |w| = |y| be the length of the sequence.
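As a small illustration of the tagging scheme just described, the sketch below converts argument spans into BIO tags. The sentence, the span positions, and the role names are hypothetical examples, and the tags are written as "B-r" / "I-r" strings in place of B_r and I_r.

```python
def spans_to_bio(n, spans):
    """Convert argument spans into a BIO tag sequence of length n.

    spans is a list of (start, end_inclusive, role) triples that do not overlap.
    """
    tags = ["O"] * n  # every word starts outside any argument span
    for start, end, role in spans:
        tags[start] = "B-" + role            # first word of the span
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + role            # remaining words of the span
    return tags


# Hypothetical sentence and roles, for illustration only.
words = ["The", "cat", "fell", "from", "the", "roof"]
spans = [(0, 1, "ARG1"), (3, 5, "ARG2")]
tags = spans_to_bio(len(words), spans)
print(list(zip(words, tags)))
# tags == ['B-ARG1', 'I-ARG1', 'O', 'B-ARG2', 'I-ARG2', 'I-ARG2']
```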

The tagging score has two parts. First, the BiLSTM learns a locally decomposed estimate of the score, log p(y_t | w), where y_t is the tag assigned at timestep t. Second, to account for other information such as syntax and structural consistency, the researchers augment the objective function with constraints c:

Equation 5. Semantic Role Labeling Objective Augmented with Constraints c[3]
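Since the equation images are not reproduced here, the following is a reconstruction of Equations 4 and 5 from the surrounding description. Treat it as a sketch: the penalty form of the constraints (subtracted from the local scores) and the use of the prefix y_{1:t} inside each constraint are assumptions that match how the decoding section talks about the score.

```latex
\begin{align}
\hat{y} &= \operatorname*{arg\,max}_{y \in T^{n}} f(w, y) \\
f(w, y) &= \sum_{t=1}^{n} \log p(y_t \mid w) \;-\; \sum_{c \in C} c(w, y_{1:t})
\end{align}
```

With an empty constraint set C, the second line reduces to the purely local score, so the best sequence is simply the per-timestep argmax.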

(2) Deep BiLSTM Model

Figure 6. Highway LSTM with four layers. Curved connections: highway connections; plus symbols: transform gates.

The authors build an 8-layer stacked BiLSTM and ease the flow of information with highway connections. Plain RNN models perform poorly on long-range dependencies, whereas LSTM cells can retain crucial earlier inputs and are therefore more consistent for language understanding. Building on that, the paper stacks the LSTMs in an interleaving manner, alternating direction from layer to layer, so that the top cells receive information processed by LSTMs at different layers and in different directions.

Equation 7. Interleaving Highway Networks. The first-layer input x_1,t is the concatenation of the word embedding and a binary indicator of whether the word at position t is the predicate in the original sentence.

x_l,t is the input to the LSTM at layer l and timestep t. δ_l is either 1 or −1, indicating the directionality of the LSTM at layer l.
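The interleaving can be sketched with a toy recurrence in place of a real LSTM cell. In this sketch, each layer reads its previous state from position t + δ_l, and the output of one layer is the input of the next. The rule that even layers use δ_l = 1 and odd layers use δ_l = -1, and the toy cell itself, are assumptions made for illustration.

```python
def direction(layer):
    """delta_l for a 1-indexed layer: +1 for even layers, -1 for odd (assumed)."""
    return 1 if layer % 2 == 0 else -1


def toy_cell(x, prev):
    """Stand-in for an LSTM cell: mixes the input with the previous state."""
    return x + 0.5 * prev


def run_layer(xs, delta, cell):
    """Compute h_t = cell(x_t, h_{t+delta}) over the whole sequence."""
    n = len(xs)
    hs = [0.0] * n
    # delta = -1 means h_t depends on h_{t-1}, so sweep left to right;
    # delta = +1 means h_t depends on h_{t+1}, so sweep right to left.
    order = range(n) if delta == -1 else range(n - 1, -1, -1)
    for t in order:
        prev = hs[t + delta] if 0 <= t + delta < n else 0.0
        hs[t] = cell(xs[t], prev)
    return hs


def interleaved_stack(xs, num_layers):
    """Feed each layer's output to the next layer, alternating direction."""
    for layer in range(1, num_layers + 1):
        xs = run_layer(xs, direction(layer), toy_cell)
    return xs


print(interleaved_stack([1.0, 0.0, 0.0], 2))  # [1.3125, 0.625, 0.25]
```

After two layers, the first position has already received information that travelled right and then back left, which is the point of alternating directions.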

Equation 8. BiLSTM functions, subscript l,t: at layer l and timestep t; i: input gate; o: output gate; f: forget gate; c^~: new cell state; c: final cell state

Along with the standard definition of the BiLSTM (for more information about LSTMs, check out CS 224D: Deep Learning for NLP), the authors also incorporate highway connections and recurrent dropout.

Through the highway connection, the model controls how much nonlinear transformation is applied before information is passed to the next layer. Here, r_l,t is the transform gate and works like T in the earlier example: the larger r is, the more nonlinearity enters the output.

Equation 8. High-way Connections Formulas. r: transform control gate; h’: new hidden representation; h: final hidden representation
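For readers without the equation image, the formulas can be sketched as follows. This is a reconstruction from the symbol list in the caption and the description of the transform gate; the parameter names W_r, b_r, and W_h, and the exact inputs to the gate, are assumptions.

```latex
\begin{align}
r_{l,t}  &= \sigma\!\left(W_r^{l}\,[\,h_{l,t+\delta_l},\; x_{l,t}\,] + b_r^{l}\right) \\
h'_{l,t} &= o_{l,t} \circ \tanh(c_{l,t}) \\
h_{l,t}  &= r_{l,t} \circ h'_{l,t} + (1 - r_{l,t}) \circ W_h^{l}\, x_{l,t}
\end{align}
```

When r_{l,t} is close to 1, the output is dominated by the nonlinear LSTM state h'; when it is close to 0, a linear projection of the layer input is carried through instead.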

For recurrent dropout, the paper samples a dropout mask z_l for each layer, applies it to the hidden states, and shares the same mask across all timesteps of that layer.
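A minimal sketch of the shared-mask idea: the mask is sampled once and then reused at every timestep, instead of being resampled per step. The rescaling of surviving units by 1 / keep probability and the dropout rate used in the example are assumptions, not values from the paper.

```python
import random


def sample_mask(size, drop_prob, rng):
    """Sample one dropout mask; kept units are rescaled by 1 / keep (assumed)."""
    keep = 1.0 - drop_prob
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]


def apply_shared_mask(hidden_states, mask):
    """Apply the same mask to the hidden state at every timestep."""
    return [[h * m for h, m in zip(h_t, mask)] for h_t in hidden_states]


rng = random.Random(0)                          # fixed seed for reproducibility
mask = sample_mask(4, 0.5, rng)                 # one mask for the whole layer
states = [[1.0, 1.0, 1.0, 1.0] for _ in range(3)]  # 3 timesteps, 4 hidden units
dropped = apply_shared_mask(states, mask)

# Every timestep loses exactly the same hidden units.
print(dropped[0] == dropped[1] == dropped[2])   # True
```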

(3) Constrained Decoding

Referring back to the objective function, the paper also models what makes a good sequence in terms of structural constraints. This is handled by using A* search over tag prefixes when decoding.

The score is given by equation 5: a sum of local scores adjusted by the constraints applied. The A* priority of a prefix is f(w, y1:t) + g(w, y1:t), where g(w, y1:t) is the sum of the scores of the best possible tags after timestep t.

Equation 9. An Admissible A* Heuristics to Model Dependencies between Outputs

In the paper, the authors consider BIO constraints (rejecting invalid BIO transitions), SRL constraints (such as unique roles, continuation roles, and reference roles), and syntactic constraints (enforcing consistency with a given parse tree) for the constraint set C in equation 5.
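The decoding procedure can be sketched as a small A* search over tag prefixes. This sketch handles only the BIO constraint and treats it as a hard constraint (invalid transitions are skipped outright), which is a simplification; the tag names and the toy probabilities are made up for illustration.

```python
import heapq
import math


def bio_valid(prev_tag, tag):
    """An I-r tag may only follow B-r or I-r with the same role r."""
    if not tag.startswith("I-"):
        return True
    role = tag[2:]
    return prev_tag in ("B-" + role, "I-" + role)


def astar_decode(scores, tags):
    """Return the highest-scoring BIO-valid tag sequence and its score.

    scores[t][k] is the local log-probability of tags[k] at timestep t.
    """
    n = len(scores)
    # g[t]: sum of the best local scores from step t onward. It never
    # underestimates the achievable remainder, so the heuristic is admissible.
    g = [0.0] * (n + 1)
    for t in range(n - 1, -1, -1):
        g[t] = g[t + 1] + max(scores[t])
    # heapq is a min-heap, so the priority f + g is stored negated.
    heap = [(-g[0], 0.0, ())]
    while heap:
        _, f, prefix = heapq.heappop(heap)
        t = len(prefix)
        if t == n:
            return list(prefix), f  # first complete sequence popped is optimal
        prev = prefix[-1] if prefix else "O"
        for k, tag in enumerate(tags):
            if not bio_valid(prev, tag):
                continue  # hard constraint: drop invalid transitions
            f_new = f + scores[t][k]
            heapq.heappush(heap, (-(f_new + g[t + 1]), f_new, prefix + (tag,)))
    return None, float("-inf")


tags = ["O", "B-A0", "I-A0"]
probs = [[0.2, 0.3, 0.5], [0.1, 0.2, 0.7], [0.6, 0.1, 0.3]]
scores = [[math.log(p) for p in row] for row in probs]
best, score = astar_decode(scores, tags)
print(best)  # ['B-A0', 'I-A0', 'O']
```

In this toy example the unconstrained per-step argmax would start with I-A0, which is not a valid BIO sequence; the search returns the best sequence that respects the constraint.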

(4) Predicate Detection

Because gold predicate information might not be available in many downstream tasks, the paper also builds models that detect predicates themselves, alongside those that take gold predicates as part of the input. Detection is done with a binary softmax over the outputs of a BiLSTM, which decides for each word whether it is a predicate.

4. Experiments and Results

(1) Experiments Setup

The paper uses the CoNLL 2005 and CoNLL 2012 datasets and follows the standard train-development-test split for both. The model is an 8-layer BiLSTM initialized with orthonormal matrices as in [8]. Since only the BIO constraint was observed to give a significant improvement, the experiments use this constraint alone. The ensemble is a product of experts over five models, each trained on 80% of the training set and validated on the remaining 20%.
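A product of experts can be sketched as follows: the per-model tag distributions for one word are multiplied together and renormalized. The function below works in log space for numerical stability and assumes every probability is strictly positive; how the paper combines scores in detail is not shown here, so take this as a generic illustration.

```python
import math


def product_of_experts(distributions):
    """Multiply the tag distributions of several models and renormalize.

    distributions is a list of probability lists, one per model, all of the
    same length and with strictly positive entries (assumed).
    """
    num_tags = len(distributions[0])
    log_prod = [sum(math.log(d[k]) for d in distributions)
                for k in range(num_tags)]
    top = max(log_prod)                       # subtract the max for stability
    unnorm = [math.exp(v - top) for v in log_prod]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two hypothetical experts over two tags: one undecided, one confident.
combined = product_of_experts([[0.5, 0.5], [0.8, 0.2]])
print([round(p, 3) for p in combined])  # [0.8, 0.2]
```

An undecided expert leaves the combined distribution unchanged, while two experts that agree sharpen it.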

(2) Results

Table 10. Experimental results on CoNLL 2005, in terms of precision (P), recall (R), F1 and percentage of completely correct predicates (Comp.). The results cover the best single and ensemble (PoE) models from the paper and are compared with the previous state of the art (see details in the paper).*

Table 11. Experimental results on CoNLL 2012 in the same metrics as above. The authors compare their best single and ensemble (PoE) models with the previous state of the art (see details in the paper).*

In summary, the ensemble (PoE) achieves an absolute improvement on both datasets. More notably, the percentage of completely correct predicates improves by 5.9 points on CoNLL 2012 and by 5.2 points on CoNLL 2005, as shown in the tables.
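To relate the absolute gain to the relative error reduction quoted at the start, here is a small worked calculation. It assumes the baseline is the 81.07 F1 on CoNLL 2005 quoted for [1] in the Related Works section and applies the 2.1 point gain to it; the tables themselves are not reproduced, so the exact figures may differ.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)


def relative_error_reduction(old_f1, new_f1):
    """Share of the remaining error (100 - F1) removed by the new model."""
    return (new_f1 - old_f1) / (100.0 - old_f1)


baseline_f1 = 81.07               # assumed baseline, see the note above
improved_f1 = baseline_f1 + 2.1   # absolute gain reported in the paper
reduction = relative_error_reduction(baseline_f1, improved_f1)
print(round(100 * reduction, 1))  # 11.1
```

Under these assumptions, a 2.1 point gain removes about 11% of the remaining error, in line with the roughly 10% figure.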

Figure 12. Ablation Study on the Techniques Applied to the Models

In the ablation study, the authors remove individual techniques from the model and observe the change in performance. Figure 12 suggests that orthogonal initialization is surprisingly important: without it, the model reaches only 65 F1 within the first 50 epochs.

Table 13. Predicate detection performance and end-to-end SRL results using predicted predicates. ∆ F1 shows the absolute performance drop compared to our best ensemble model with gold predicates.*

In the predicate detection experiments, the detector achieves 97 F1 on CoNLL 2005 but only around 90 on CoNLL 2012. The authors attribute the difference to the nominal and copula predicates in CoNLL 2012, which make predicate identification more difficult.

5. Analysis

Using the error breakdown method proposed in [9], the paper conducts a comprehensive error analysis. Some of the more notable findings follow.

Figure 13. Performance after doing each type of oracle transformation in sequence, compared to two strong non-neural baselines. The gap is closed after the *Add Arg.* transformation, showing how this approach is gaining from predicting more arguments than traditional systems.*

Compared with previous systems, the model keeps its lead through each oracle transformation until missing arguments are added (the Add Arg. fix). This implies that predicting more arguments than previous methods do is a source of the gain and should be encouraged.

Table 14: Oracle transformations paired with the relative error reduction after each operation. All the operations are permitted only if they do not cause any overlapping arguments. *

Based on the error analysis, the authors trace where the system fails. The most prominent error is label confusion, where the predicted span is an argument but its role is predicted incorrectly. The authors argue that this is mainly because the argument-adjunct distinction is inherently hard: the model tends to label the second argument with adjunct-like functions such as location and direction.

The other commonly observed mistake is attachment error: 62% of the errors in this category come from prepositional phrase (PP) attachment, a well-known difficulty in language analysis.

For structural inconsistencies, the authors analyze the BIO constraint and the SRL constraints in detail. For the first, the paper shows that deeper models produce fewer violations but still suffer from ambiguity in the data. For the second, the models do not benefit much from the constraints, possibly because they already encode this information, or because hard constraints can even hurt performance.

Last but not least, the authors discuss whether syntactic information could further help the model, since it does not currently model syntax explicitly. With gold syntax, the F1 score increases by 2 percent, while additional information from a state-of-the-art model could decrease performance. This indicates that syntactic information does affect SRL prediction.

6. Conclusion

To summarize, the authors emphasize three major contributions of the paper:

1. A deep model (8-layer BiLSTM) that incorporates some of the best practices in deep learning delivers strong results on the SRL task.
2. The current deep SRL model handles predicate-argument relations well compared with its non-neural counterparts, but its ability to maintain syntactic consistency still falls short.
3. With the addition of gold syntax, the model gains 3 F1 points, indicating that syntactic information is still useful for the SRL task.

7. Related Works

[1] End-to-end learning of semantic role labeling using recurrent neural networks.

Up to 2015, state-of-the-art SRL models were built on parsing results. Zhou and Xu attempt to build an end-to-end SRL model without syntactic information that still performs well. Their model beats previous ones on CoNLL 2005 (F1 = 81.07) and CoNLL 2012 (F1 = 81.27) while being computationally cheap. While their work suggests that semantic roles can be recovered without knowing the structure of a sentence, our presented paper argues against this, provides a stronger SRL baseline, and reasons that additional parsing results might push the improvement even further.

[2] The importance of syntactic parsing and inference in semantic role labeling

This paper presents a general framework for semantic role labeling. The model follows the most common SRL pipeline: pruning, argument identification, argument classification, and inference. Because the framework accounts for linguistic and structural constraints, the paper studies in detail the influence of adding syntactic information to SRL, and it is one of the first to argue for its importance. Many works followed this line of research, including [1] and our presented paper, running different experiments and debating whether syntactic information is the most critical part of a successful SRL model.

[3] Deep contextualized word representations

This paper introduces a word representation that makes context-dependent word embeddings possible. Using a pre-trained BiLM, the method encodes each token as a function of the entire input sentence. Before this proposal, fixed pre-trained word embeddings were the most widely used, and a fixed representation can lead to errors when a word appears in different contexts. ELMo solves this problem by making the embeddings context-dependent. With ELMo, many downstream NLP tasks improve, including SRL with the model from our presented paper.

[4] Deep Semantic Role Labeling with Self-Attention

As semantic role labeling is considered a critical task for language understanding, this paper, like many previous works, aims to produce an end-to-end SRL model. While the paper presented in this survey improves on previous results through its model architecture, it still struggles with long-range dependencies, a shared disadvantage of RNN models. More specifically, RNN models suffer from memory compression (i.e. they have to store the information of a long sequence in a fixed-size vector). The authors address the problem by introducing self-attention, which allows any two tokens to connect directly regardless of their distance. With this design, the paper improves on our presented paper and achieves state-of-the-art results on CoNLL 2005 and CoNLL 2012.

[5] Parser showdown at the wall street corral: An empirical investigation of error types in parser output

In 2012, most parser evaluations relied on a single metric, the F-score, which offers no intuitive analysis of the errors that remain. In this paper, the authors use tree transformations to classify errors into linguistically meaningful types. In this way, the deficiencies of a model are revealed in a more interpretable manner and can be tackled accordingly, so the method is widely used to show which constructions are difficult for parsers. Our presented paper also uses this technique to break down the errors and conduct a more detailed study of the possible pitfalls of its model.

8. Works Cited

[1] Mark Yatskar. 2016. Visual Semantic Role Labeling for Image Understanding. http://markyatskar.com/situation_oral.pdf

[2] Sik-Ho Tsang. 2019. Review: Highway Networks — Gating Function To Highway (Image Classification). https://towardsdatascience.com/review-highway-networks-gating-function-to-highway-image-classification-5a33833797b5

[3] Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34(2):257–287.

[4] Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

[5] Luheng He, Kenton Lee, Mike Lewis, and Luke S. Zettlemoyer. 2017. Deep semantic role labeling: What works and what’s next. In ACL.

[6] Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. arXiv preprint arXiv:1804.08199.

[7] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

[8] Andrew M Saxe, James L McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

[9] Jonathan K. Kummerfeld, David Hall, James R. Curran, and Dan Klein. 2012. Parser showdown at the wall street corral: An empirical investigation of error types in parser output. In Proc. of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1048–1059.