
Text Segmentation and Its Applications to Aspect Based Sentiment Analysis

Karahan Şahin
Published in Artiwise NLP · 10 min read · Aug 19, 2021

Index

  1. What is text segmentation?
  2. Usage Areas
  3. Algorithms
  4. Applications to Aspect Based Sentiment Analysis
  5. Related Datasets
  6. Evaluation Metrics

Language data is now everywhere on the Internet, and most of it is messy. Think of scrolling through tweets in your feed, or reading the reviews for the tech gear you have wanted to buy for some time: they are usually full of punctuation errors. Even we humans sometimes have a hard time telling where one sentence ends and another begins while reading this type of content. Another example comes from “SEO-driven” news articles, where the majority of the content is filler like “Find about that news more in …” or “This is talked about on the Internet a lot and ..”, while what you actually want to find is only a sentence long. Text segmentation solves these problems by extracting an important subset of the content.

Text segmentation is the task of extracting relevant sub-units, such as words, phrases, sentences, and paragraphs, from text. These sub-units can be utilized for many tasks, including:

  • Word Segmentation
  • Sentence Boundary Detection
  • Text Summarization
  • Sentiment Analysis

An important point of view on text segmentation is that it is a sequence labeling task. In other sequence labeling tasks, labeling amounts to selecting a span of tokens that carries some semantic information, such as the type of entity a token refers to or the kind of phrase it belongs to. From that point of view, whether we are marking boundaries in a text with misused punctuation, a semantically dense part of a paragraph, or a sub-sentence that carries a sentiment, labeling a subsequence is exactly what text segmentation does. Since segmentation yields fine-grained, semantically dense sub-units, we will investigate this framing in detail to adapt it to Aspect Based Sentiment Analysis.
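To make the sequence labeling framing concrete, here is a minimal toy illustration of my own (not from any cited paper): each sentence gets a binary tag marking whether it closes a segment, much like B/I tags mark entity spans in NER.

```python
# Toy example: segmentation as per-sentence boundary tagging.
sentences = [
    "I like the ambiance of this restaurant.",  # segment 1
    "The decor is lovely.",                     # segment 1 ends here
    "However the food was not that good.",      # segment 2
]

# 1 = "this sentence closes a segment", 0 = "segment continues"
boundary_labels = [0, 1, 1]

def segments_from_labels(sents, labels):
    """Group sentences into segments using boundary tags."""
    segments, current = [], []
    for sent, is_boundary in zip(sents, labels):
        current.append(sent)
        if is_boundary:
            segments.append(" ".join(current))
            current = []
    if current:  # trailing sentences with no closing tag
        segments.append(" ".join(current))
    return segments

print(segments_from_labels(sentences, boundary_labels))
# ['I like the ambiance of this restaurant. The decor is lovely.',
#  'However the food was not that good.']
```

A neural model would predict `boundary_labels` from the text; the grouping step stays the same.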

Background

Research on text segmentation goes back to the ’90s. The literature starts with rule-based models, whose algorithms try to capture segments via recurring patterns in texts. Probabilistic models followed. These models usually rely on the concept of lexical cohesion: related words tend to occur within a close window of each other. Since topic segmentation is central to these models, we also see Latent Dirichlet Allocation among these algorithms. They adopt this approach to extract the largest possible cohesive spans of words.

Popular Approaches in Early Work:

  1. TextTiling
  2. C99
  3. CVS
  4. Bayesseg
  5. PLDA
  6. Bayesseg-MD
  7. MultiSeg
  8. BeamSeg

These algorithms date from 1997 to 2015. With the emerging use of neural networks, the field has shifted towards neural models over time. With the idea of sequence labeling in mind, we see architectures such as CRFs, RNNs, Transformers, and even fine-tuned BERT models.

How to Adapt Segmentation Algorithms On ABSA?

Before we dive into the different types of segmentation models, we need to talk about ABSA. There is scarce literature on the use of segmentation in ABSA, so we need to adapt segmentation to this task ourselves, as discussed before. In the Aspect Based Sentiment Analysis literature, there are three subtasks:

  1. ATE (Aspect Term Extraction) is the task of identifying the target term(s) about which the sentiment is expressed.
  2. ACD (Aspect Category Detection) is the task of identifying the target category/topic about which the sentiment is expressed. This is usually for commercial usage.
  3. APC (Aspect Polarity Classification) is the task of identifying the sentiment (positive, negative, or neutral) expressed about the target term or category.

The literature consists of two separate approaches: either a joint model which carries out ATE and APC at the same time, or a pipeline of the two (ATE, then APC). In ATE and ACD, we utilize knowledge of which topic the sentence is about. There we can use the topic segmentation literature to extract the segments about a topic, or we can use discourse segmentation to eliminate irrelevant sentences in the data. These segmentation types will be explained later. In APC, however, we need sentiment-based segmentation.
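As a rough illustration of the pipeline setup, here is a hypothetical sketch; `extract_aspects` and `classify_polarity` are placeholder names of my own, standing in for real trained ATE and APC models.

```python
def extract_aspects(segment: str) -> list[str]:
    # Placeholder ATE: a lexicon lookup instead of a trained extractor.
    lexicon = {"ambiance", "food", "service"}
    return [w.strip(".,") for w in segment.lower().split()
            if w.strip(".,") in lexicon]

def classify_polarity(segment: str, aspect: str) -> str:
    # Placeholder APC: a keyword heuristic instead of a trained classifier.
    if any(w in segment.lower() for w in ("like", "love", "great")):
        return "positive"
    if any(w in segment.lower() for w in ("not", "terrible", "hate")):
        return "negative"
    return "neutral"

def absa_pipeline(segments: list[str]) -> list[tuple[str, str, str]]:
    """Run ATE first, then APC on each extracted aspect."""
    results = []
    for seg in segments:
        for aspect in extract_aspects(seg):
            results.append((seg, aspect, classify_polarity(seg, aspect)))
    return results

print(absa_pipeline(["I like the ambiance of this restaurant."]))
# [('I like the ambiance of this restaurant.', 'ambiance', 'positive')]
```

The segmentation models discussed below decide what the input `segments` list should contain.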

Sentence Boundary Detection

These tasks look easy in the academic literature since the datasets are usually orderly, separated and labeled sentence by sentence. Real-life data is more complex than this. Since human-generated data is full of mistakes, determining where one sentence ends and another begins is a task in itself. Therefore we use Sentence Boundary Detection (SBD) on larger bodies of text which contain multiple sentences. The literature includes specialized SBD models for Chinese orthography or financial texts, but in real life, the majority of data comes from social media.

Along with erroneous use of punctuation, we also see occurrences such as “!!!”, “:)))”, or “ok…..” which cannot be dealt with by regular expressions alone. Although, contrary to the neural trend we have discussed, the majority of the SBD literature relies on rule-based algorithms, some studies utilize architectures such as CRFs, BiLSTMs, and BERT.
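To see why rules alone struggle, here is a rough regex sketch of my own (not a model from the cited studies) that normalizes repeated punctuation and emoticons before a naive split:

```python
import re

text = "great!!! :))) the battery is ok..... but the screen died :("

# Collapse runs of ., !, ? into a single terminator ("!!!" -> "!").
collapsed = re.sub(r"([.!?])\1+", r"\1", text)
# Strip simple emoticons so they are not mistaken for punctuation.
collapsed = re.sub(r"[:;]-?[)(D]+", "", collapsed)

# Naive split on sentence-final punctuation -- exactly the kind of rule
# that breaks on abbreviations, which is why learned models
# (CRF, BiLSTM, BERT) are used instead.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", collapsed)
             if s.strip()]
print(sentences)
# ['great!', 'the battery is ok.', 'but the screen died']
```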

However, for ABSA we cannot just rely on separating sentences, since a segment can contain no aspect at all, or multiple aspects at once. Therefore we need segments that are filtered by relevancy, or segments smaller than a sentence. This approach is referred to as the Compositional Approach to Sentiment Analysis [2]. For example:

I like the ambiance of this restaurant. However the food was not that good
  • In this example, two aspects are expressed: the ambiance and the food quality. The spans carrying each aspect and its sentiment are divided by a sentence boundary.
  • Here, sentence boundary detection alone is helpful for aspect-based sentiment analysis, and we can separate the text as below:
"I like the ambiance of this restaurant." 
Aspect: ambiance
Polarity: positive

"However the food was not that good"
Aspect: food quality
Polarity: negative
  • However, if we encounter a sentence like:
I like the ambiance but the food was terrible.
  • There are two aspects expressed in one sentence, with no sentence separator between them, so how can we separate them?
  • Here we must extract sub-units that are smaller than sentences:
"I like the ambiance" 
Aspect: ambiance
Polarity: positive

"but the food was terrible."
Aspect: food quality
Polarity: negative
  • This approach looks deeper and deeper into a body of text to find units composed of one aspect and one sentiment each. In short, it is a compositional approach to extracting aspect-sentiment pairs, as sketched after this list.
  • To apply the compositional approach in ABSA, we need categorical information to process the texts. We can use the two main information sources of ABSA: the aspect information and the sentiment information. We will investigate how extraction can be done with each of these sources.
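Here is a toy sketch of the compositional splitting step, using a keyword list of contrastive conjunctions; a real system would use a trained segmenter (e.g. an EDU segmenter) rather than a hand-written marker list.

```python
import re

# Contrastive conjunctions that often separate aspect-sentiment units.
CONTRAST_MARKERS = r"\b(?:but|however|although|though|while)\b"

def split_subunits(sentence: str) -> list[str]:
    """Split a sentence at contrastive markers into candidate sub-units."""
    parts = re.split(CONTRAST_MARKERS, sentence, flags=re.IGNORECASE)
    return [p.strip(" ,.") for p in parts if p.strip(" ,.")]

print(split_subunits("I like the ambiance but the food was terrible."))
# ['I like the ambiance', 'the food was terrible']
```

Each sub-unit can then be passed to the ATE/APC pipeline sketched earlier.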

Aspect Polarity Based Segmentation

There are not many studies implementing this approach to ABSA. Their focus is to extract the segment to which a sentiment is assigned, segmenting a body of text according to the polarity of its sub-units. For example:

[I don't know.] (neutral) [I like the restaurant] (positive) [but not the food.] (negative)

These studies heavily rely on the syntactic structure of sentences. There are rule-based studies that use heuristic rules to extract sentiment information, such as (a toy version of rule 1 is sketched after this list):

  1. If a “sentiment-denoting” adjective such as good appears before a noun such as food, then the segment is good food; or
  2. If a “sentiment-denoting” verb such as hate appears with objects such as service and restaurant, then the segment is I hate the service and restaurant.
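Here is a toy version of rule 1, using NLTK part-of-speech tags; the adjective lexicon is illustrative and my own, not the rule set from the cited studies.

```python
import nltk  # assumes "punkt" and "averaged_perceptron_tagger" are downloaded

# Illustrative lexicon of sentiment-denoting adjectives.
SENTIMENT_ADJECTIVES = {"good", "bad", "great", "terrible"}

def adjective_noun_segments(sentence: str) -> list[str]:
    """Emit adjective-noun pairs where a sentiment adjective precedes a noun."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    segments = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1.startswith("JJ") and w1.lower() in SENTIMENT_ADJECTIVES
                and t2.startswith("NN")):
            segments.append(f"{w1} {w2}")
    return segments

print(adjective_noun_segments("They serve good food but the service is terrible."))
# ['good food']
```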

Later deep learning studies use Recursive Neural Networks to learn this set of rules in an unsupervised manner. However, these studies have not given promising results in the long term. Therefore we mainly rely on our third approach, which extracts segments by topic.

Aspect Category Based Segmentation

This is the most prominent line of the ABSA literature. As we discussed before, Aspect Category Detection detects the category of the aspect(s) about which the sentiment is expressed. With this information, we want to extract segments that are coherent in terms of topic and discard the ones that are not. Two types of segmentation can be applied for this purpose:

  1. Topic Segmentation is the type of segmentation method to extract the topically coherent segments.
  2. Discourse Segmentation is the type of segmentation method to extract Elementary Discourse Units (EDU) and any other discourse units.

Topic segmentation is traditionally used to segment a large body of text, such as an article or a book chapter, into parts with distinguishable differences in topic. These studies interact with discourse segmentation, because discourse segmentation extracts “category-independent” linguistic sub-units from texts. To further demonstrate this:

SEGBOT Demo Output
  • This is an output from the SEGBOT model which gives clause-like units that serve as building blocks for discourse parsing and topic segmentation.
  • There we have the elementary units to evaluate the aspect information and sentiment information.
  • You can further look at the demo from here.

The literature on this started with TextTiling (Hearst, 1997) and was followed by many probabilistic, heuristic, machine learning, and deep learning approaches. The following studies might pave the way for a better understanding of topic segmentation:

  1. TextTiling is an unsupervised technique that makes use of patterns of lexical co-occurrence and distribution within texts (a runnable sketch follows this list).
  2. C99 is a method for linear text segmentation, which replaces inter-sentence similarity by rank in a local context.
  3. TopSeg is based on probabilistic latent semantic analysis (PLSA) and exploits similarities in word meaning detected by PLSA.
  4. TopicTiling modifies TextTiling with topic IDs, obtained by an LDA model, instead of words.
  5. BiLSTM-CRF is a state-of-the-art neural architecture for sequence labeling. The sequence labeling approach to segmentation is implemented in this model.
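Of these, TextTiling ships with NLTK, so the classic algorithm can be tried directly. The file name below is hypothetical, and the NLTK stopwords corpus must be downloaded first (`nltk.download("stopwords")`).

```python
from nltk.tokenize import TextTilingTokenizer

tt = TextTilingTokenizer()

# TextTiling expects a multi-paragraph document (paragraphs separated
# by blank lines); "long_article.txt" is a hypothetical input file.
document = open("long_article.txt").read()

# tokenize() returns a list of topically coherent, paragraph-aligned segments.
for i, segment in enumerate(tt.tokenize(document)):
    print(f"--- segment {i} ---")
    print(segment[:80], "...")
```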

The later studies consist of deep learning approaches:

  1. SEGBOT uses a bidirectional recurrent neural network to encode the input text sequence, then uses another recurrent neural network, together with a pointer network, to select text boundaries in the input sequence (a simplified sketch follows this list).
  2. BiLSTM-CNN uses CNNs to learn sentence embeddings. Then the segments are predicted based on contextual information by the Attention-based BiLSTM model.
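As a heavily simplified sketch of SEGBOT's encode-then-point idea (in PyTorch, with illustrative names and dimensions; this is not the authors' implementation), a BiGRU encodes the unit sequence and a dot-product "pointer" scores candidate boundary positions:

```python
import torch
import torch.nn as nn

class TinyPointerSegmenter(nn.Module):
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(emb_dim, hidden, bidirectional=True,
                              batch_first=True)
        self.query = nn.Linear(2 * hidden, 2 * hidden)

    def forward(self, embeddings):            # (batch, seq_len, emb_dim)
        states, _ = self.encoder(embeddings)  # (batch, seq_len, 2*hidden)
        # Score every position against the segment's first unit; the real
        # model decodes boundaries left to right with a second RNN.
        q = self.query(states[:, 0:1, :])     # (batch, 1, 2*hidden)
        scores = torch.bmm(q, states.transpose(1, 2)).squeeze(1)
        return scores                          # (batch, seq_len)

model = TinyPointerSegmenter()
fake_units = torch.randn(1, 10, 32)  # 10 units with random embeddings
boundary = model(fake_units).argmax(dim=-1)
print("predicted boundary index:", boundary.item())
```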

There are several tools we can utilize for our task, but how are we going to train our models? Below are the main datasets used in topic segmentation.

Related Datasets

These datasets are used for topic and discourse segmentation tasks, though we note that they can be adapted to other segmentation tasks too. While the Choi dataset and the WIKI-727K dataset are for topic segmentation, the RST-DT dataset is for discourse segmentation.

Choi Dataset: The most commonly used dataset for training a segmentation model. It consists of 700 documents, each being a concatenation of 10 segments. The corpus was generated by an automatic procedure: each segment of a document is the first n (such that 3 ≤ n ≤ 11; 4 subsets in total) sentences of a randomly selected document from the Brown corpus.
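Since the generation procedure is automatic, a comparable toy corpus can be rebuilt from NLTK's copy of the Brown corpus. This is my own reconstruction of the procedure, not the original script (requires `nltk.download("brown")`):

```python
import random
from nltk.corpus import brown

def make_choi_style_document(num_segments=10, seed=0):
    """Concatenate segments of 3-11 sentences from random Brown documents."""
    rng = random.Random(seed)
    file_ids = brown.fileids()
    segments = []
    for _ in range(num_segments):
        doc = rng.choice(file_ids)
        n = rng.randint(3, 11)               # segment length in sentences
        sents = brown.sents(doc)[:n]
        segments.append(" ".join(" ".join(s) for s in sents))
    return segments

doc = make_choi_style_document()
print(len(doc), "segments;", doc[0][:60], "...")
```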

The WIKI-727K Dataset: It is a collection of 727,746 English Wikipedia documents, and their hierarchical segmentation, as it appears in their table of contents.

RST-DT Dataset: This dataset is used in discourse segmentation models. The Rhetorical Structure Theory Discourse Treebank (RST-DT) is a publicly available corpus, manually annotated with Elementary Discourse Unit (EDU) segmentation and discourse relations according to Rhetorical Structure Theory. The RST-DT corpus is partitioned into a training set of 347 articles (6,132 sentences) and a test set of 38 articles (991 sentences), both from the Wall Street Journal.


Evaluation Metrics

The standard evaluation metrics we see throughout the machine learning literature (Precision, Recall, and F1 score) are not applicable to this task, although there is still a body of literature that uses them in its evaluation sections.

Pk is the probability that, when passing a sliding window of size k over the sentences, the sentences at the two ends of the window are incorrectly classified as belonging to the same segment (or, vice versa, to different segments).

$P_k = \frac{1}{N-k}\sum_{i=1}^{N-k} \mathbf{1}\big[\delta_{ref}(i,\,i+k) \neq \delta_{hyp}(i,\,i+k)\big]$

WindowDiff moves a fixed-sized window across the text and penalizes the algorithm whenever the number of boundaries within the window does not match the true number of boundaries for that window of text.

$\text{WindowDiff} = \frac{1}{N-k}\sum_{i=1}^{N-k} \mathbf{1}\big[\,|R(i,\,i+k) - C(i,\,i+k)| > 0\,\big]$

The important thing to know about these formulas is that lower Pk and WindowDiff scores mean higher segmentation accuracy. Both metrics are implemented in NLTK, as shown after the glossary below.

The glossary for the formulas:

  • δ(i, j): indicates whether sentences i and j belong to the same segment
  • k: the window size
  • N: the total number of sentences
  • R(i, i+k): the number of reference boundaries in the window from i to i+k
  • C(i, i+k): the number of computed boundaries in the same window
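Both metrics can be computed directly with NLTK's `nltk.metrics.segmentation` module, which represents segmentations as boundary strings where "1" marks a segment boundary:

```python
from nltk.metrics.segmentation import pk, windowdiff

reference  = "0001000100"  # true segmentation: two boundaries
hypothesis = "0010000100"  # predicted: one boundary off by one position

# For pk, k defaults to half the average reference segment size;
# for windowdiff, k must be passed explicitly.
print("Pk:        ", pk(reference, hypothesis))
print("WindowDiff:", windowdiff(reference, hypothesis, k=3))
```

As noted above, lower values of both scores indicate a better segmentation.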


References

  1. Pak, Irina, and Phoey Lee Teh. “Text segmentation techniques: a critical review.” Innovative Computing, Optimization and Its Applications (2018): 167–181.
  2. Kaur, J., & Singh, J. (2019). Deep Neural Network Based Sentence Boundary Detection and End Marker Suggestion for Social Media Text. 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). doi:10.1109/ICCCIS48478.2019.8974495
  3. Du, Jinhua, Yan Huang, and Karo Moilanen. “AIG Investments. AI at the FinSBD task: Sentence boundary detection through sequence labeling and BERT fine-tuning.” Proceedings of the First Workshop on Financial Technology and Natural Language Processing. 2019.
  4. C. R. Aydin and T. Güngör, “Combination of Recursive and Recurrent Neural Networks for Aspect-Based Sentiment Analysis Using Inter-Aspect Relations,” in IEEE Access, vol. 8, pp. 77820–77832, 2020.
  5. Kayaalp, Naime F., et al. “Extracting customer opinions associated with an aspect by using a heuristic-based sentence segmentation approach.” International Journal of Business Information Systems 26.2 (2017): 236–260.
  6. J. Li, B. Chiu, S. Shang, and L. Shao, “Neural Text Segmentation and Its Application to Sentiment Analysis,” in IEEE Transactions on Knowledge and Data Engineering.
  7. Badjatiya, Pinkesh, et al. “Attention-based neural text segmentation.” European Conference on Information Retrieval. Springer, Cham, 2018.
  8. Pevzner, L., & Hearst, M. A. (2002). A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28(1), 19–36.


Karahan Şahin
Artiwise NLP

Senior Linguistics undergraduate from Bogazici University. Currently working at Boun TabiLab as a Lab Assistant and at Artiwise as a Data Science Intern.