Source: https://pixabay.com/illustrations/network-web-digital-5113917/

Data Augmentation for ABSA

Maria Obedkova
TrustYou Engineering
10 min read · Oct 4, 2021


With the rise of foundation models in NLP, downstream NLP tasks have become very accessible for industrial use. However, this advancement also created a new demand in the development of NLP solutions: the demand for large amounts of data.

Since data is not something that is freely lying on the road, Data Augmentation (DA) came under the spotlight in recent years and became part of the NLP development cycle in many industrial applications. The variety of DA approaches, the ongoing stream of DA research and the open-sourcing of DA techniques have made it a desirable part of many ML solutions.

The Sentiment Analysis field is no exception to this trend. When data is a valuable resource, developers try to get bits of it whenever and wherever possible to ensure improvements to sentiment analysis models.

In this blog, we will talk about data augmentation in general and will discuss how to tactically approach data augmentation for the Aspect-based Sentiment Analysis task. This blog is intended for people who are already familiar with NLP and somewhat interested in the Sentiment Analysis domain.

What is Data Augmentation?

Data augmentation (DA) refers to strategies for increasing the diversity of training examples without explicitly collecting new data. [1]

As footnotes to this definition, the following important points can be added:

  • These strategies should aim at improving NLP model performance.
  • The distribution of augmented data should aim to follow the real-world distribution of data for a specific modelling task.

The first point is self-explanatory: the whole idea of performing transformations on training data came to life because it potentially helps to improve modelling performance. The NLP community consensus about DA is that it acts as regularization that helps a model generalize better and thus improves its quality.

The second point highlights a peculiarity of statistical models: on the one hand, if the augmented data distribution is too similar to the original data, models overfit to this strong data signal; on the other hand, if the distributions are way too dissimilar, it might lead to underperforming models because they fail to generalize.

On a historical note, advances in data augmentation mainly come from the Computer Vision field, which is more flexible in adopting different data transformations like flipping, cropping, colour jittering and so forth. For NLP, data augmentation solutions are not that intuitive and DA research in the NLP domain is sparse; however, the need for DA successfully motivates the research. Apart from that, not all NLP tasks are created equal: DA for some tasks is easier than for others.

In this blog, I will concentrate on DA techniques that potentially could help in Sentiment Analysis tasks, especially in Aspect-based Sentiment Analysis (ABSA). But for that, we need to first discuss the ABSA task and its important characteristics that we need to account for in DA.

What is ABSA?

Aspect-based Sentiment Analysis is an NLP task that aims at detecting and categorizing aspects in data and identifying sentiment attributed to each detected aspect. ABSA implies two main components:

  • Aspects: the specific topic being talked about; can also be seen as a category, feature, idea, theme, target, etc.
  • Sentiments: positive or negative (or neutral) opinions about a particular aspect

ABSA can be split into three sub-tasks that are normally performed as a pipeline of tasks or simultaneously:

  • Aspect detection: detection of a text fragment that covers a specific topic (and bears sentiment)
  • Aspect classification: classification of a defined aspect fragment into a predefined set of aspects
  • Sentiment classification: identifying sentiment for a defined aspect fragment, in other words, the good old sentiment analysis task

Aspect detection can be performed either implicitly or explicitly via span extraction: the extraction of a text span that outlines a specific topic and bears some sentiment.
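To make the task components concrete, here is a minimal sketch of how an ABSA training example with explicit spans could be represented. The class and field names (and the "Hotel"/"Location" aspect labels) are my own illustration, not a fixed standard:

```python
from dataclasses import dataclass

@dataclass
class AbsaSpan:
    start: int      # character offset where the span starts
    end: int        # character offset where the span ends (exclusive)
    aspect: str     # e.g. "Hotel", "Location", "Facility"
    polarity: str   # "positive" | "negative" | "neutral"

@dataclass
class AbsaSample:
    text: str
    spans: list     # list of AbsaSpan annotations

    def span_text(self, span: AbsaSpan) -> str:
        return self.text[span.start:span.end]

# The running example from this post, annotated with two aspect spans.
sample = AbsaSample(
    text="I enjoyed staying in this hotel but the location is bad",
    spans=[
        AbsaSpan(10, 31, aspect="Hotel", polarity="positive"),
        AbsaSpan(36, 55, aspect="Location", polarity="negative"),
    ],
)
```

With this representation, all three sub-tasks operate on the same object: aspect detection produces the offsets, aspect classification fills in `aspect`, and sentiment classification fills in `polarity`.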

ABSA + DA

The nature of the ABSA task suggests that DA for this kind of task should be approached cautiously. The main reason is the span-based nature of the task: when extracting continuous spans, different language phenomena come into play. The DA goal becomes harder to attain: you now have to work harder in order not to introduce unnecessary or even harmful artefacts into an augmented sample.

Another reason is the combination of different sub-tasks into one main ABSA task. When augmenting data for one task, you need to ensure that after augmentation, each specific task is still performed correctly and DA doesn't influence it in a bad way. Here we sometimes deal with three different and largely separate tasks: span extraction, aspect classification and polarity classification. And even if the tasks are performed simultaneously, you still work with three conceptually different tasks. So you need to account for the specific features of each task, which adds up to a big list of different constraints for DA.

Let’s discuss these constraints. As an example sentence, we can take “I enjoyed staying in this hotel but the location is bad”.

The first thing to care about in DA for ABSA is not to distort polarity by the data transformation. For example, when utilizing a distributional semantics approach for DA, we could accidentally augment a data sample with an antonym, getting “I disliked staying in this hotel but the location is bad” while aiming for a word with a similar meaning and similar polarity. This is just one of the possible “what can go wrong” scenarios, but it illustrates how the whole idea of sentiment analysis can fail if you don’t constrain DA properly. A bunch of additional checks to ensure meaningful DA for this task, or a careful choice of DA techniques, is required here.

It is not hard to guess that the same idea applies to aspects: DA should not change the aspect of an example by its transformation. “I enjoyed staying in this hotel but the venue is bad” could result from applying thesauri-based DA, since “venue” is a close synonym of “location”. For your end task, the difference might be huge, say, if you classify into “Location” and “Facility” aspects. As in the polarity example, aspect detection will be done poorly in this case, and additional restrictions and checks on DA examples will be required.
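A sketch of such a check could look as follows: reject any word replacement that is a known antonym of the original (polarity flip) or that maps to a different aspect label (aspect drift). The tiny lexicons here are illustrative stand-ins for real resources such as sentiment lexicons, thesauri or an aspect ontology:

```python
# Illustrative stand-in lexicons; in practice these would come from
# sentiment lexicons, thesauri or an in-domain aspect ontology.
ANTONYMS = {"enjoyed": {"disliked", "hated"}}
ASPECT_LEXICON = {"location": "Location", "venue": "Facility"}

def is_safe_replacement(original: str, replacement: str) -> bool:
    """Return False if the replacement flips polarity or changes aspect."""
    orig, repl = original.lower(), replacement.lower()
    if repl in ANTONYMS.get(orig, set()):
        return False  # polarity flip, e.g. "enjoyed" -> "disliked"
    orig_aspect = ASPECT_LEXICON.get(orig)
    repl_aspect = ASPECT_LEXICON.get(repl)
    if orig_aspect is not None and repl_aspect != orig_aspect:
        return False  # aspect drift, e.g. "location" -> "venue"
    return True
```

Running this filter over candidate replacements before accepting an augmented sample catches both failure cases from the examples above.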

And finally, if you are interested in the extraction of text spans (this mainly concerns spans of more than one word), you have another set of restrictions to consider. Text spans are quite fragile: changing them even a little might make previous assumptions about them no longer hold. Additionally, spans mainly bear the sentiment and aspects. Thus, not only should sentiment and aspect be preserved, but the linguistic features within spans should not be distorted either.

Here, much depends on your data source or data domain, which determines what kinds of spans are allowed. Should spans be grammatically correct? Then you need to take care not to distort the syntax within spans in your data and always keep spans grammatical. “I enjoyed stayed at the hotel but the location is bad” is quite a probable outcome when replacing “staying” via distributional semantics, while you could be aiming at something like “I enjoyed stopping at the hotel but the location is bad”. If your model is not used to ungrammatical structures, this DA might deteriorate its performance.

This is one of the most vivid cases, but there can also be undesired changes to the morphology and semantics of span components, span length, the presence of functional words, the internal rules based on which spans are extracted, and so on. DA might also influence dependencies within a sentence, such as coreference chains, or cause problems for the next or previous sentences.
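One cheap guard against the “staying” → “stayed” failure is to require an in-span replacement to preserve the inflectional suffix of the word it replaces, as a rough proxy for matching part of speech and agreement. The suffix list is a crude assumption of mine; a real pipeline would use a POS tagger or a parser instead:

```python
# A crude grammaticality guard for in-span replacements: require the
# replacement to share the inflectional suffix of the original word.
# A real pipeline would check POS tags or parse trees instead.
SUFFIXES = ("ing", "ed", "s", "ly")

def preserves_inflection(original: str, replacement: str) -> bool:
    for suf in SUFFIXES:
        if original.endswith(suf):
            return replacement.endswith(suf)
    return True

# "staying" -> "stopping" keeps the -ing form and passes;
# "staying" -> "stayed" changes the form and is rejected.
```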

What are the DA techniques that can be used for the ABSA task?

DA can be performed on inputs or on features; the latter is hard to control and so is less preferable. The safe strategy is to augment the actual inputs and get more input data that can be inspected manually if necessary.

Rule-based techniques

First, let’s talk about rule-based techniques. They are mainly heuristics that help to come up with more data samples.

EDA or easy data augmentation

EDA is probably the most popular and simplest DA technique. It comprises random word insertion, deletion and swapping. Even though it is a pretty simple transformation, we should take care not to influence spans much, which suggests applying it mainly to out-of-span words. We could potentially operate on the level of phrases for swapping and move a continuous span around a sentence (“I didn’t like {their not organized staff}” → “{Their not organized staff} I didn’t like”). A good point to notice: for classification tasks, swaps are generally beneficial and help to improve results even if they produce gibberish, but for span extraction tasks, swaps might turn out harmful.
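The span-aware restriction could be sketched like this: random deletion only touches tokens outside the span, and the whole span is moved around as one unit. Function names and the deletion probability are my own choices for illustration:

```python
import random

def eda_outside_span(tokens, span_start, span_end, p_delete=0.1, seed=42):
    """Randomly delete out-of-span tokens; tokens with indices in
    [span_start, span_end) are left untouched."""
    rng = random.Random(seed)
    out = []
    for i, tok in enumerate(tokens):
        in_span = span_start <= i < span_end
        if not in_span and rng.random() < p_delete:
            continue  # delete this out-of-span token
        out.append(tok)
    return out

def move_span_front(tokens, span_start, span_end):
    """Move the whole span to the front of the sentence as one unit."""
    span = tokens[span_start:span_end]
    rest = tokens[:span_start] + tokens[span_end:]
    return span + rest
```

Because the span is never broken up, the annotation can still be mapped onto the augmented sentence afterwards.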

Token manipulation

Along the same lines, different kinds of token manipulations based on external resources can be mentioned. WordNet, language dictionaries, thesauri, in-domain resources, TF-IDF scores, etc. can be used for replacing tokens with synonyms, antonyms, hyponyms, hypernyms, closely related words and so on. You can also inject some noise, either random as in EDA or “smart” noise like misspellings. Again, depending on the kind of task you are performing, you might have a bunch of constraints, like the already mentioned untouchable nature of spans. For example, if you manipulate tokens on the span level, you could inject noise into spans: “I didn’t like {their not organized staff}” → “I didn’t like {their notorganized staff}”.
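A minimal version of such “smart” noise is a single adjacent-character swap inside a word, which mimics a common typing mistake. This is a sketch of one possible noise operator, not a full misspelling model:

```python
import random

def inject_typo(word, seed=0):
    """Swap two adjacent characters -- a simple misspelling-style noise."""
    if len(word) < 2:
        return word
    rng = random.Random(seed)
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

The same filters discussed above (polarity, aspect, span integrity) should still be applied to the noised output.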

Syntax-aware DA

If you want to go hardcore, or you have a dependency-annotated dataset, syntax-aware DA can be performed. You can either split sentences into different head+dependents combinations (which can be seen as a way to crop or trim sentences in a meaningful way) or swap sentence parts while adhering to syntactic rules. For example, “I didn’t like {their not organized staff} and {dirty floors}” can result in “I didn’t like their not organized staff”, “I didn’t like dirty floors” and “I didn’t like dirty floors and their not organized staff”. This is the safest and quite an efficient strategy to augment data for the span extraction task.
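Assuming the coordinated spans have already been identified by a parser, the recombination step itself is simple string assembly. This sketch reproduces the example above; in a real setting the prefix and spans would come from the dependency annotation:

```python
def split_coordination(prefix, spans):
    """Given a shared predicate prefix and its coordinated spans,
    produce one sentence per span plus the reversed coordination."""
    results = [f"{prefix} {span}" for span in spans]
    if len(spans) > 1:
        reordered = " and ".join(reversed(spans))
        results.append(f"{prefix} {reordered}")
    return results
```

Each output sentence keeps its spans intact, which is exactly why this strategy is safe for span extraction.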

Model-based techniques

Next, we will cover model-based techniques. They are normally more scalable and produce more impact but are more problematic for span-based tasks. These techniques have proved efficient for classification tasks, but ABSA has quite a few constraints, so they should be applied with caution.

Backtranslation

For backtranslation, you choose a pivot language, translate your data into this language and then back into the original language. You will most probably get quite diverse data, since the translations likely won’t be identical to the original sentences. This is a pretty popular and successful DA technique if you can get translations easily. If you perform this DA for span extraction, you might consider backtranslating only the span part, mainly because when doing DA on the whole sentence, you will have a hard time restoring the span boundaries. For example, “I didn’t like {their not organized staff}” → “ihr unorganisiertes Personal” (de) → “I didn’t like {their disorganized staff}”. This way, you can restore the boundaries but might have issues with grammar for some languages. Thus, tradeoffs are required.
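The boundary-restoration logic can be sketched as follows. The `translate` function here is a toy lookup table standing in for a real MT model or API; only the span is pivoted, so the new boundaries are known exactly:

```python
# Toy stand-in for an MT system; in practice you would call a
# translation model or API in both directions.
TOY_MT = {
    ("their not organized staff", "de"): "ihr unorganisiertes Personal",
    ("ihr unorganisiertes Personal", "en"): "their disorganized staff",
}

def translate(text, target_lang):
    return TOY_MT.get((text, target_lang), text)

def backtranslate_span(text, span_start, span_end, pivot="de"):
    """Backtranslate only the span so its boundaries survive augmentation."""
    span = text[span_start:span_end]
    pivoted = translate(span, pivot)
    back = translate(pivoted, "en")
    new_text = text[:span_start] + back + text[span_end:]
    return new_text, (span_start, span_start + len(back))
```

The returned offsets point at the backtranslated span, so the original annotation can be carried over to the augmented example.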

Paraphrasing

Paraphrasing is also quite a popular technique for DA. Normally, it works with the help of paraphrasing models. They do not necessarily know how to paraphrase for your specific domain but have been shown to improve performance on various NLP tasks. The same considerations as for backtranslation apply: sentence-level paraphrasing is less advisable than span-level paraphrasing. Backtranslation, in fact, can be seen as an instance of paraphrasing.

Embedding-based DA

Pretrained embedding models possess a lot of knowledge about how words, phrases or sentences relate to each other in meaning. This can be used to augment data: we can use word, phrase or sentence embedding models to transform it. Luckily, such models can be easily trained for a specific domain or found on the web.

Contextualized embeddings from big LMs are also a possibility and can work out even better. Augmenting whole phrases could potentially help the span extraction task, and word replacements outside of spans could bring diversity of contexts.

Masked LMs can also be used to augment the data. You may want to mask specific parts of the input so that the model fills the gap in (“I [MASK] their dining area”); however, this is not very constrained (any probable word could go there), which might pose additional problems. If you go this way, at least try to avoid masking content words and maybe stick to functional words (“The hotel [MASK] nice”).
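The functional-word restriction can be enforced before the model is ever called: only generate masked variants at positions whose tokens appear in a functional-word list. The list below is a small illustrative subset; the masked strings would then be passed to a BERT-style fill-mask model:

```python
# Small illustrative subset of functional words; a real setup would
# use a fuller stopword list or POS tags.
FUNCTIONAL = {"the", "a", "an", "is", "was", "in", "at", "on", "but"}

def functional_masks(sentence, mask_token="[MASK]"):
    """Produce one masked variant per functional-word position."""
    tokens = sentence.split()
    variants = []
    for i, tok in enumerate(tokens):
        if tok.lower() in FUNCTIONAL:
            masked = tokens[:i] + [mask_token] + tokens[i + 1:]
            variants.append(" ".join(masked))
    return variants
```

Content words, which carry the polarity and aspect signal, are never masked, so the MLM’s free-form completions stay in low-risk positions.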

Generation-based DA

We all know about the advances in generative models like the GPT family. We could use this powerful tool to augment the data by feeding an initial sequence of words to a generative model. However, unconditional augmentation is tricky: the unconstrained context of generative models can produce almost any sequence of words. Models conditioned on a label (polarity or aspect) could cope with the task better. Still, we risk producing extra (unannotated) spans with this exercise.
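Conditioning by label is often implemented by prepending control codes to the prompt. The prompt format below is purely my own illustration of the idea, not a fixed standard of any particular model:

```python
def build_conditioned_prompt(aspect, polarity, prefix=""):
    """Prepend aspect and polarity as control codes so the generative
    model is steered towards the desired label (illustrative format)."""
    return f"aspect: {aspect} | polarity: {polarity} | review: {prefix}"

prompt = build_conditioned_prompt("Location", "negative", "The hotel was")
```

The generated continuation would still need the filters from the previous sections, since nothing prevents the model from introducing new, unannotated aspect spans.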

In the Resources section below, you can find a selected list of Python libraries that exist for DA in NLP and cover some of the DA techniques discussed above.

Final thoughts

Generally, DA can be composed of various techniques which could potentially interact with each other. This might make it difficult to handle or to obtain augmented examples, but you can perform augmentation on a data subset, e.g. on misclassified examples, an infrequent class or underrepresented data. You might also involve humans to augment some data samples or incorporate domain knowledge into your DA via dictionaries or ontologies.

Whichever DA technique you use, it is better to be cautious about how you influence your spans, polarities and aspects. Come up with a set of tests and filters to detect faulty transformations: measure the distance between representations, decide how much polarity and aspect change you allow, check for specific changes in the transformations, constantly pulse-check your model performance, do some manual quality assurance of your transformations from time to time, etc.
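As one concrete filter of this kind, augmented examples can be rejected when they are either near-duplicates of the original (too similar, the overfitting risk mentioned earlier) or drift too far (risk of label noise). Token-level Jaccard similarity is used here as a cheap stand-in for a distance between representations, and the thresholds are arbitrary assumptions to be tuned:

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def keep_augmented(original, augmented, low=0.3, high=0.9):
    """Keep the augmented example only if it is neither a near-duplicate
    nor too far from the original (thresholds are illustrative)."""
    sim = jaccard(original, augmented)
    return low <= sim <= high
```

In a real pipeline, embedding cosine distance would typically replace the Jaccard proxy, but the accept/reject band works the same way.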

Obviously, you can test your DA approach and figure out from the model performance whether DA is helpful or not. But in my opinion, it pays off to know the underlying reasons why particular DA techniques are not beneficial or are even harmful for your specific NLP task. I hope this blog was helpful for those working in the Sentiment Analysis domain and considering data augmentation.
