Words All the Way Down — Conversational Sentiment Analysis

Bineet Ranjan
The PayPal Technology Blog
Apr 25, 2022


At PayPal, as customer champions, we believe in leaving no stone unturned to delight our customers. Even when millions of contacts are established by customers day after day, either with bots or with our customer support agents, we strive to listen to the conversations deeply and from a place of empathy. By listening to what is said and what is left unsaid in those conversations, we re-imagine and reinvent our offerings to cater to our customers’ needs.

Conversations usually involve many speakers, speaker-level context, and inter-speaker dependency, rendering sentiment analysis of a conversation complex.

This blog explores challenges, methodologies, and datasets around conversation sentiments and how PayPal analyses sentiments in our customer support conversations.

People vector created by rawpixel.com — www.freepik.com

Business Use Case

Sentiment analysis is a popular Natural Language Processing (NLP) use case. It is widely used on product reviews and comments, and it is beginning to be applied to conversations and dialogues as well.

Conversational Sentiment Analysis helps to detect the polarity and emotion of speakers based on an ongoing interaction. Knowing how a customer feels during a conversation has multiple use cases in both offline and online modes.


  • Real-time or near real-time sentiment allows chat bots or customer support agents to provide appropriate and empathetic responses.
  • Timeline views of trends in the customer’s sentiments enable the customer support agents to review the customer’s prior sentiments and start a new session with adequate preparation and knowledge.


  • In offline mode, customer sentiment in a conversation helps analyze trends to measure bot or agent effectiveness by highlighting specific coaching scenarios and drawing insights about specific aspects of our products and services.


Sentiment in a conversation is more complex than in a movie or product review. A movie or product review carries all the information required to identify sentiment in a single block of text. A conversation comprises two or more speakers, and the sentiment of each speaker (or of the whole conversation) depends on what each one is saying and in what context. Unlike a review, the sentiment of a conversation keeps changing as each speaker contributes. Hence, finding the sentiment of a conversation requires considering what each speaker is saying as well as the dynamic changes that occur due to contributions from each speaker.

A typical conversation has two or more speakers, each providing one or more utterances. An utterance is a single block of information provided by a speaker before the next speaker begins.

Three important aspects need to be considered and accounted for while building a model to identify the sentiment of a conversation:

  • Utterance context: The representation of an utterance should consider what was said before and in what sequence
  • Speaker context: The model should differentiate between speakers and consider the theme of what each speaker is talking about
  • Inter-utterance dependency: Whether the current utterance is a response to a previous utterance or a general statement

Figure 1: Source: https://arxiv.org/pdf/1908.11540.pdf

In Figure 1, the sentiment of the responses provided by the bot depends on the response provided by the user. It also depends on the overall context of the conversation.

Another example to highlight the significance of context:

Agent: Are you still getting the error?

Customer: No, I do not get it now.

Without context, the customer's response may be classified as negative.

Sentiment can be understood by focusing on context, temporal aspect of utterances, and dependency on speakers. Model architectures to predict sentiment must consider these aspects.

Model Architectures

This section focuses on model architectures to predict conversational sentiment — it starts off with a simple “without context” architecture, then discusses “contextual” models, and ends with a model for asynchronous conversation.

As with most modern Natural Language Processing architectures, the models discussed in this section have the following layers:

  1. Text Encoding Layer: Text data needs to be converted to numerical values for algorithms to make sense of it. This process is called text encoding. There are multiple ways to encode text into vectors (ordered numerical representations), such as one-hot encoding, TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings, etc. (read more at Text Encoding). The latest additions to this list are Transformer-based approaches like BERT, RoBERTa, Sentence-BERT, etc. (read more about BERT). These are what have been used as the encoding layer for the following approaches.
  2. Attention Layer: Attention is a method to associate segments of input with segments of output. For example, in a translation use case, attention helps the model understand which part of an input sentence impacts which part of a translated sentence. In an analogous way, attention helps identify which part of the text contributes to negative or positive sentiment. (Read more about the Attention Mechanism).
  3. SoftMax: The SoftMax function converts a list of numbers to values that add up to one, thus enabling us to use the returned values as probabilities. This is useful in classification use cases like sentiment analysis, as the class with the highest probability is returned as the predicted class.
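
To make the SoftMax step concrete, here is a minimal sketch in plain Python. The label order and the raw scores are assumptions chosen only for illustration; in a real model the scores would come from the layers above.

```python
import math

# Hypothetical label order, assumed only for this illustration.
LABELS = ["negative", "neutral", "positive"]


def softmax(scores):
    """Convert raw scores into probabilities that add up to one."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting probabilities.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(scores):
    """Return the most probable label together with the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


label, probs = predict_label([2.0, 0.5, -1.0])  # made-up raw scores
print(label, [round(p, 3) for p in probs])
```

The predicted class is simply the position of the largest probability, which is why SoftMax is a common final step for classification.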

With the above background, let us jump into various approaches to solving conversational sentiment analysis.


Non-Contextual Model

The non-contextual model, as displayed in Figure 2, is a simple architecture that predicts the sentiment of each utterance and then aggregates these predictions to arrive at speaker-level sentiment.

Figure 2: Non-contextual architecture for conversation sentiment

Each utterance “u” is passed through a BERT-based encoder to get vector representation “v”, which is passed through SoftMax to get the sentiment score of each utterance. To arrive at speaker-level sentiment, we use aggregation logic like:

  • Last n polar – This approach is a polarity-based weighted average of the last n sentiments. Negative and positive sentiments (polar) should carry more weight as compared to neutral ones to arrive at an overall speaker sentiment.
  • Temporal Decay – This approach attributes more weight to recent utterances. Recent utterances display the latest mood of the speaker and should weigh more.
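
The two aggregation strategies above can be sketched as follows. This is an illustrative sketch, not the exact logic used at PayPal: it assumes each utterance already has a sentiment score in [-1, 1] (for example, probability of positive minus probability of negative), and the weights, the neutral band, and the decay factor are hypothetical values.

```python
def last_n_polar(scores, n=3, polar_weight=2.0, neutral_weight=1.0, neutral_band=0.2):
    """Weighted average of the last n scores, with polar scores weighted more."""
    if n < 1:
        raise ValueError("n must be at least 1")
    recent = scores[-n:]
    if not recent:
        return 0.0
    # A score outside the neutral band counts as polar (clearly positive or negative).
    weights = [polar_weight if abs(s) > neutral_band else neutral_weight for s in recent]
    return sum(w * s for w, s in zip(weights, recent)) / sum(weights)


def temporal_decay(scores, decay=0.5):
    """Weighted average in which each older utterance gets `decay` times less weight."""
    if not scores:
        return 0.0
    last = len(scores) - 1
    # The most recent utterance has weight 1; weights shrink going back in time.
    weights = [decay ** (last - i) for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Made-up scores for one speaker, oldest first.
speaker_scores = [-0.8, 0.0, 0.6]
print(last_n_polar(speaker_scores), temporal_decay(speaker_scores))
```

On this example the two strategies disagree: the polarity-weighted average stays slightly negative because of the strongly negative first utterance, while temporal decay turns positive because the latest utterance dominates.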

Key points about this approach:

  • It does not consider conversation structure, speaker, or temporal impact
  • It assumes all utterances are independent of each other
  • It needs a specific aggregation strategy to arrive at speaker-level sentiment


Contextual Models

Contextual models have the following key attributes:

  • They consider the speaker and temporal impact
  • They identify the relation between utterances at the speaker level
  • They ascertain the impact of one speaker on other speakers in the conversation

Figure 3: Architecture for contextual sentiment detection

These models start with BERT-based vector representation just like non-contextual models. In non-contextual models, we utilized specific aggregation strategies to arrive at the final sentiment of the speaker. But in contextual models, we want to encapsulate the following conversation properties and not just rely upon deterministic aggregation:

  • Utterance context
  • Speaker context
  • Inter-utterance dependency

To capture the above three contexts, the model needs much more than just text encoding. This is where techniques like Graph Networks and COMET come in (discussed in detail in the following sections).


DialogueRNN

DialogueRNN is an attentive RNN (Recurrent Neural Network) model that defines speaker-level encoding using the speaker state, the previous utterance context, and the previous utterance emotion.

It considers individual speakers by focusing on three aspects:

  • The speaker of the utterance
  • Context of the preceding utterance
  • Sentiment of the preceding utterance

The final emotion of an utterance is determined by these states:

Speaker state

  • Models the speaker’s emotional state during the conversation
  • Ensures the model is aware of the speaker of each utterance

Global state

  • Models the context of the utterances by jointly encoding the preceding utterance and speaker state
  • Represents a speaker-specific utterance

Emotion Representation

  • Combines the speaker state and global state
  • Performs the final emotional classification


DialogueGCN

The DialogueGCN approach uses a GCN (Graph Convolutional Network) to establish the relationship between utterance, speaker, and listener by forming a directed graph between the utterances while keeping track of their order.

DialogueGCN represents each conversation as a graph:

  • Nodes => Utterances
  • Edges => Based on the context window
  • Edge Weights => Importance of connection between two nodes (attention block)
  • Edge Relations => Defines the type of connection

For an edge between utterance 1 and utterance 2, we can identify:

  • Speaker dependency: who spoke utterance 1 and who spoke utterance 2
  • Temporal dependency: whether utterance 1 or utterance 2 came first
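
The graph structure described above can be sketched as follows. This is a simplified illustration, not the DialogueGCN implementation: it only builds the directed edges and their relation labels from a list of speakers, the window sizes and relation names are assumptions, and the attention-based edge weights are left out.

```python
from typing import List, NamedTuple, Tuple


class Edge(NamedTuple):
    source: int                 # index of the utterance the edge starts from
    target: int                 # index of the utterance the edge points to
    speakers: Tuple[str, str]   # speaker dependency: (source speaker, target speaker)
    direction: str              # temporal dependency: "past", "future", or "self"


def build_edges(speakers: List[str], past: int = 2, future: int = 2) -> List[Edge]:
    """Connect every utterance to its neighbours inside the context window."""
    edges = []
    for i, speaker_i in enumerate(speakers):
        first = max(0, i - past)
        last = min(len(speakers) - 1, i + future)
        for j in range(first, last + 1):
            if j < i:
                direction = "past"
            elif j > i:
                direction = "future"
            else:
                direction = "self"
            edges.append(Edge(i, j, (speaker_i, speakers[j]), direction))
    return edges


# Hypothetical three-utterance conversation; each entry is the speaker of one utterance.
for edge in build_edges(["customer", "agent", "customer"], past=1, future=1):
    print(edge)
```

Each distinct combination of speaker pair and direction corresponds to one edge relation type, which is what lets the graph network treat, say, an agent replying to a customer differently from a customer continuing their own thought.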


COSMIC

One of the latest approaches in conversational emotion detection, COSMIC uses a generative model (COMET) to generate commonsense knowledge, which is then used to create speaker-level encodings.

Commonsense feature extraction (COMET, a Transformer-based knowledge graph model) helps in finding:

  • Intent of speaker
  • Effect on speaker
  • Reaction of speaker
  • Effect on others
  • Reaction of others

Sentiment Analysis on Asynchronous Conversations

At PayPal, we handle asynchronous conversations: conversations that span bots and multiple agents over various segments. Sentiment analysis of such conversations is challenging because it involves multiple agents, and each conversation involves a different sentiment and impacts the next interaction in a unique way.

Figure 4: Asynchronous Conversation

To model the above conversation structure, we are working on a comprehensive model architecture that combines both structural and contextual information. A hierarchical attention layer is used to aggregate sentiments and representations to arrive at an overall, top-level representation.

Figure 5: Architecture to handle asynchronous conversation

As we tried to label data and build models at PayPal, we encountered some curious challenges that inspired us to reexamine our thought process. While diving deeper into sentiment analysis, we realized that our use case needed labelled data. To overcome this challenge, we employed various methods such as Active Learning, Weak Supervision, and Human Annotation. We observed considerable variation among human labelers in assigning sentiment to an utterance or conversation: the same utterance was perceived in multiple ways by different labelers, and the subjectivity of the process made the exercise even more fascinating. We tried to bridge the differences by using voting-based labels, and it made us wonder how model predictions shift shape based on who observes them. Such nuances make our projects meaningful at PayPal as we consciously create inclusive products and democratize financial services.
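
As a rough illustration of voting-based labels, the sketch below picks the majority label among several annotators and reports how strongly they agreed. The function name, the agreement threshold, and the choice to leave ties unlabelled are assumptions made for this example, not a description of the exact procedure used at PayPal.

```python
from collections import Counter


def vote_label(annotations, min_agreement=0.5):
    """Return (label, agreement); label is None when there is no clear majority."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    label, top = counts.most_common(1)[0]
    agreement = top / len(annotations)
    # More than one label sharing the top count means the vote is tied.
    tied = sum(1 for c in counts.values() if c == top) > 1
    if tied or agreement <= min_agreement:
        return None, agreement
    return label, agreement


# Made-up annotations from three labelers for the same utterance.
print(vote_label(["negative", "negative", "neutral"]))
print(vote_label(["positive", "negative", "neutral"]))
```

Keeping the agreement ratio alongside the label makes it possible to route low-agreement utterances back for another round of review instead of silently trusting a weak majority.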


Popular Data Sets

To build superior quality models, the availability of superior quality labelled data is most important. Having custom labelled data is a luxury that may not always be available, so making use of open-source data sets becomes quite important.

The following are some popular conversation data sets that have been used in research publications and that we found interesting at PayPal as well.


The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multi-speaker database. It contains 12 hours of audio-visual data annotated by multiple annotators with categorical labels such as anger, happiness, and sadness. (Link to the data set)


13,000 utterances from 1,433 dialogues from the TV series Friends. Each utterance is annotated with emotion and sentiment labels, and encompasses audio, visual and textual modalities. (Link to the data set)


Human written multi-turn dialog data set. Represents daily communication. Manually labelled with communication intention and emotional information. (Link to the data set)


Data set based on the popular TV show called Friends. Transcripts for all 10 seasons of the show as well as manual and crowdsourced annotation for sub parts of the show are provided. (Link to the data set)


Manually labelled 2,214 multi-speaker English conversations collected from various websites that provide online communication services. (Link to the data set)