Ethics Sheet for Automatic Emotion Recognition and Sentiment Analysis

Saif M. Mohammad
55 min read · Jul 5, 2021


On the left: a lighthouse in a stormy sea, with a ship in the distance. Rest of the image: a word cloud of emotions such as love, hurt, and fear.
Lighthouse illustration from Hill’s Album of Biography and Art (1882) by Thomas E. Hill. Source: Wikimedia.

Within-page navigation. Sections of the AER sheet: Modalities & Scope, Task, Applications, Ethical Considerations. Five sections of Ethical Considerations: Task Design, Data, Method, Impact, Privacy & Social Groups.

Heads up: This sheet is long! A 12 minute read gives a good overview. You can jump to individual sections or bullets as needed. A summary card is available.


Preface

Emotions play a central role in our lives. Automatic Emotion Recognition (AER) — or “giving emotional abilities to computers” as Dr. Rosalind Picard described it in her seminal book Affective Computing — is a sweeping interdisciplinary area of study exploring many foundational research questions and many commercial applications. However, some of the recent commercial and governmental uses of emotion recognition have garnered considerable criticism (see this and this). Even putting aside high-profile controversies, emotion recognition impacts people and thus entails ethical considerations (big and small). This ethics sheet is a critical reflection on this broad field of study, with the aim of facilitating more responsible emotion research and appropriate use of the technology.

An Ethics Sheet for an AI Task is a semi-standardized document that aggregates and organizes a wide array of ethical considerations for that task.
(See this post that introduces Ethics Sheets for AI Tasks and also discusses the motivations and benefits.)

Process: My own research interests are at the intersection of emotions and language — to understand how we use language to express our feelings. I created this sheet to gather and organize my thoughts around responsible emotion recognition research, and hopefully it is of use to others as well. Earlier drafts of the work were sent to scholars from computer science, psychology, linguistics, neuroscience, and social sciences. Their comments and feedback have been invaluable and have helped shape this sheet for the better. (See detailed acknowledgments at the bottom.)

Note that I do not speak for the AER community. There is no “objective” or “correct” ethics sheet. This sheet should be taken as one perspective amongst many in the community.

Before we get into the ethics sheet in earnest, let us consider a few rapid-fire questions to set the context.

Q1. The idea of ethics sheets for AI tasks seems intriguing, but why did you decide to create an ethics sheet for emotion recognition yourself? Why not invite the community?

A. This has several answers:

  • The idea for ethics sheets for AI Tasks came later. I created this sheet to gather and organize my thoughts around ethical issues in emotion recognition research (with input from various other people). Eventually, I realized this sheet may be useful to others as well; and then that such sheets would be useful for other AI tasks as well.
  • Consider this AER sheet as an example proof-of-concept (“here is what it looks like when I tried” sheet) and I invite the community to create better AER ethics sheets by engaging the stakeholders more extensively (and building on this sheet if they so desire).
  • In some ways, ethics sheets are akin to survey papers. Their scope is not individual pieces of work, but a body of literature. One can argue that survey articles should be community efforts or that they be created by all stakeholders. However, we also value the expertise of individual or small groups of researchers to create survey articles. We agree that it is their perspective and does not speak for the whole community. A similar affordance could be given to creators of ethics sheets. It is also great to have multiple ethics sheets for the same task, just as it is useful to have multiple survey articles for the same area of research — they provide different perspectives.
  • Finally, community efforts have the tendency to only include agreed upon non-controversial ideas that do not threaten existing power structures.

So, IMHO, it is better to have multiple ethics sheets about the same task with differing ideas and viewpoints. We should be wary of a world where we have a single authoritative ethics sheet per task.

Q2. A good ethics sheet makes us question our assumptions. So let’s start at the top: Should we be building AI systems for Automatic Emotion Recognition? Is it ethical to do so?

A. That is a good question. It is a big question. This sheet will not explicitly answer it, but it will help in clarifying and thinking about it. This sheet will sometimes suggest that certain applications in certain contexts are good or bad ideas, but largely it will discuss the various considerations to take into account when deciding whether and how to build or use a particular system, what is more appropriate for a given context, how to assess success, etc.

The question is also somewhat under-specified. We first need to ask…

Q3. What does it mean to do automatic emotion recognition?

A. Emotion recognition can mean many things, and it has many forms. (This sheet will get into that.) It can be deployed in many contexts. For example, many will consider automated insurance premium decisions based on automatically inferred emotions to be inappropriate. However, studying how people use language to express gratitude, sadness, etc. is considered okay in many contexts. For a human-computer interaction system, it is useful to be able to identify which utterances can convey anger, joy, sadness, hate etc. (You do not want to create an offensive question-answering system, for example.) Many other contexts are described in the sheet.

Q4. Can machines infer one’s true emotional state ever?

A. No. (This sheet will get into that.)

Q5. Can machines infer some small aspect of someone’s emotional state, in some contexts, with some likelihood?

A. Yes, but that is such a weak claim that it is not useful.

Q6. Can machines infer some small aspect of people’s emotions (or emotions that they are trying to convey or perceived emotions) in some contexts, to the extent that it is *useful* to the people?

A. In my view, yes. In a very limited way, I see this as analogous to something like machine translation or web search. The machine does not understand language, nor does it understand what the user really wants, nor the social, cultural, or embodied context, but it is able to produce a somewhat useful translation or search result with some likelihood; and it produces some amount of inappropriate and harmful results with some likelihood. However, unlike machine translation or search, emotions are much more personal, private, and complex.

People cannot fully determine other people’s emotions. People cannot fully determine their own emotional state. But we make do with our limitations and infer emotions as best we can to function socially. Yes, we have a big brain and lots of world knowledge, but we have severe moral, ethical, and emotional limitations as well. We cause harm because of our limitations, and we harbor stereotypes and biases.

If machines are to be a part of this world and interact with people in any useful and respectful way, then they must have at least some limited emotion recognition capabilities; and they will always cause some amount of harm. Thus, if we use them, it is important that we are aware of the limitations; design systems that protect and empower those without power; deploy them in the contexts they are designed for; use them to assist human decision making; and work to mitigate the harms they will perpetrate.

We need to hold AER systems to high standards, not just because it is a nice aspirational goal, but because machines impact people at scale (in ways that individuals rarely can) and emotions define who we are (in ways that other attributes rarely do).

I hope this sheet is useful in that regard.

Primary motivation for creating and sharing this sheet for AER:

to create a go-to point for a carefully compiled substantive engagement with the ethical issues relevant to emotion recognition; going beyond individual systems and datasets and drawing on knowledge from a body of past work.

The hope is that this document will be useful to anyone who wants to build or use emotion recognition systems/algorithms for research or commercial purposes. (General benefits of ethics sheets are discussed in this post.)

Note that even though this sheet is focused on AER, many of the ethical considerations apply broadly to natural language tasks in general. Thus, it can serve as a useful template to build ethics sheets for other tasks.

Abbreviations: Automatic Emotion Recognition (AER), Artificial Intelligence (AI), Machine learning (ML), Natural Language Processing (NLP)

Target audience: The primary audience for this sheet comprises researchers, engineers, developers, and educators from NLP, ML, AI, data science, public health, psychology, digital humanities, and other fields that build, make use of, or teach about AER technologies and emotion resources; however, much of the discussion should be accessible to all stakeholders of AER, including policy/decision makers and those who are impacted by AER.

After more community input, I hope we can also create a version of this sheet where non-technical stakeholders are the primary audience.

Feedback: As detailed as this sheet is, it is probably missing some important points. Some discussions can likely be framed better. Send me a note and I will be happy to incorporate feedback. Hopefully, this article will stimulate further discussion and better versions.

Contact: Dr. Saif M. Mohammad (Email: saif.mohammad@nrc-cnrc.gc.ca)

MAIN SHEET (version 1.0)

This ethics sheet for Automatic Emotion Recognition has four sections: Modalities & Scope, Task, Applications, and Ethical Considerations. The first three set the context. The fourth presents various ethical considerations of AER as a numbered list, organized in thematic groups.

MODALITIES AND SCOPE

MODALITIES: Work on AER has made use of a number of modalities (sources of input), including:

  • Facial expressions
  • Gait (how one is walking, body language), body velocity
  • Skin conductance, blood conductance, blood flow, respiration
  • Gestures
  • Force of touch, keyboards
  • Infrared emanations, haptic (sensors of force) and proprioceptive (position and movement of the body) data
  • Behavioral data collected over time
  • Speech
  • Language (esp. written text, emoticons, emojis)
    [The focus of this sheet.]

All of these modalities come with benefits, potential harms, and ethical considerations.

SCOPE: This sheet will focus on AER from written text and AER in Natural Language Processing (NLP), but many of the considerations apply broadly to various modalities and AER in Computer Vision as well. Several considerations apply to AER (regardless of modality).

TASK

Automatic Emotion Recognition (AER) from one’s utterances (written or spoken) is a broad umbrella term used to refer to a number of related tasks such as those listed below: (Note that each of these framings has ethical considerations and may be more or less appropriate for a given context.)

  1. Inferring emotions felt by the speaker (e.g., given Sara’s tweet, what is Sara feeling?);
    Inferring emotions of the speaker as perceived by the reader/listener (e.g., what does Li think Sara is feeling?);
    Inferring emotions that the speaker is attempting to convey (e.g., what emotions is Sara trying to convey?)
    These may be correlated, but they can be different depending on the particular instance. The first framing, “inferring emotions felt by the speaker”, is fairly common in the scientific literature, but is also perhaps the most often misused/misinterpreted. More on this in the ethical considerations section.
  2. Inferring the intensity of the emotions discussed above
  3. Inferring patterns of speaker’s emotions over longer periods of time, across many utterances; including the inference of moods, emotion dynamics, and emotional arcs (e.g., tracking character emotion arcs in novels; tracking impact of health interventions on a patient’s well-being)
  4. Inferring speaker’s emotions/attitudes/sentiment towards a target product, movie, person, idea, policy, entity, etc. (e.g., does Sara like the new phone?)
  5. Inferring emotions evoked in the reader/listener (e.g., what feelings arise in Li on reading Sara’s tweet?)
    This may be different among different readers because of their past experiences, personalities, and world-views: e.g., the same text may evoke different feelings among people with opposing views on an issue.
  6. Inferring emotions of people mentioned in the text (e.g., given a tweet that mentions Moe, what emotional state of Moe is conveyed in the tweet?)
  7. Inferring emotionality of language used in text (regardless of whose emotions) (e.g., is the tweet talking about happy things, angry feelings, etc.?)
  8. Inferring how language is used to convey emotions such as joy, sadness, loneliness, hate, hostility, etc. (e.g., what are the ways in which we convey sadness through language?)
  9. Inferring the emotional impact of sarcasm, metaphor, idiomatic expression, dehumanizing utterance, hate speech, etc.

Note: The term Sentiment Analysis is commonly used to refer to the task described in bullet 4, especially in the context of product reviews, etc. (and the sentiment is commonly labeled as positive, negative, or neutral). On the other hand, determining the predilection of a person towards a policy, party, issue, etc. is usually referred to as Stance Detection, and involves classes such as favour and against.

Note: Many AER systems focus only on the emotionality of the language used (bullet 7), even though their stated goal might be one of the other bullets. This may be appropriate in restricted contexts such as customer reviews or personal diary blog posts, but not always. [More on this in TASK DESIGN.]

Note: There is also a growing list of tasks whose focus is not directly the emotions, but rather associated phenomena, such as: whose emotions are being referred to in the text, who/what evoked the emotion, what types of human need was met or not met resulting in the emotion, etc.

See these surveys for more details: Mohammad (2021) examines emotions, sentiment, stance, etc.; Liu (2020) focuses on sentiment analysis tasks.

APPLICATIONS

Emotions are pervasive and play a role in all aspects of our lives. So the potential benefits of automatic emotion recognition are substantial. Below is a sample of some existing applications: (Note that this is not an endorsement of these applications. All of these benefits come with potential harms and ethical considerations. Use of AER by the military, for intelligence, and for education is especially controversial.)

  • Public Health: Assist public health research projects, including those on loneliness (Guntuku et al., 2019, Kiritchenko et al., 2020), depression (De Choudhury et al. 2013; Resnik et al. 2015), suicidality prediction (MacAvaney et al., 2021), bipolar disorder (Karam et al., 2014), stress (Eichstaedt et al., 2015), and well-being (Schwartz et al., 2013). See CL Psych workshop proceedings. See Chancellor et al., 2019 for ethical considerations on inferring mental health states from social media.
  • Commerce/Business: Track sentiment and emotions towards one’s products, track reviews, blog posts, YouTube videos and comments; develop virtual assistants, writing assistants; help advertise products that one is more likely to be interested in.
  • Government Policy and Public Health Policy: Tracking and documenting views of the broader public on a range of issues that impact policy (tracking amount of support and opposition, identifying underlying issues and pain points, etc.). Governments and health organizations around the world are also interested in tracking how effective their public health messaging has been in response to crises such as pandemics and climate change.
  • Art and Literature: Improve our understanding of what makes a compelling story, how do different types of characters interact, what are the emotional arcs of stories, what is the emotional signature of different genres, what makes well-rounded characters, why does art evoke emotions, how do the lyrics and music impact us emotionally, etc. Can machines generate art (generate paintings, stories, music, etc.)?
  • Social Sciences, Neuroscience, Psychology: Help answer questions about people. What makes people thrive? What makes us happy? What can our language tell us about our well-being? What can language tell us about how we construct emotions in our minds? How do we express emotions? How different are people in terms of what different emotion words mean to them? How different are people in terms of the emotionality in their utterances and how is that emotionality impacted by outside events?
  • Military, Policing, and Intelligence: Tracking how sets of people or countries feel about a government or other entities (especially controversial); tracking misinformation on social media and societal susceptibility to misinformation.

ETHICAL CONSIDERATIONS

The usual approach to building an AER system is to design the task (identify the emotions to capture, the process to be automated, etc.), compile appropriate data (label some of the data for emotions — a process referred to as human annotation), train ML models that capture patterns of language/vision and emotional expression from the data (the method), and evaluate the models by examining their predictions on a held-out test set (the model has not seen the correct emotion labels for this set). There are ethical considerations associated with each step of this development process.
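
As a rough illustration of this design–data–method–evaluation pipeline, here is a minimal sketch in Python using scikit-learn and a tiny set of hypothetical labeled utterances. It is an illustration only, not the method of any particular AER system: real systems use far larger datasets, richer features or neural models, and far more careful evaluation.

```python
# A minimal sketch of the design -> data -> method -> evaluation pipeline.
# The utterances and emotion labels below are hypothetical toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Data: a tiny emotion-labeled corpus (say, perceived-emotion labels).
texts = [
    "I can't wait to see you tomorrow!",         # joy
    "I miss how things used to be.",             # sadness
    "Why would they do this to me?",             # anger
    "I keep hearing strange noises at night.",   # fear
    "What a wonderful surprise party!",          # joy
    "Everything feels so empty lately.",         # sadness
    "This is absolutely unacceptable.",          # anger
    "I'm not sure what's behind that door.",     # fear
]
labels = ["joy", "sadness", "anger", "fear",
          "joy", "sadness", "anger", "fear"]

# Held-out test set: the model never sees these labels during training.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

# Method: bag-of-words features + a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation: examine predictions on the held-out test set.
# (With this toy data the scores are meaningless; the point is the workflow.)
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```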

Below are 50 considerations grouped by the associated development stage: Task Design, Data, Method, Impact, Privacy & Social Groups. Click on any bullet or section to jump to it (each bullet is ~1 minute read, each section is about 7 to 10 minute read). A recap of the tips presented in the five sections is available on the Tips and Strategies page.

TASK DESIGN

Summary: This section discusses various ethical considerations associated with the choices involved in the framing of the emotion task and the implications of automating the chosen task. Some important considerations include: Is it even possible to determine one’s internal mental state? Is it ethical to determine such a private state? Who is often left out in the design of existing AER systems? I discuss how it is important to consider which formulation of emotions is appropriate for a specific task/project, while avoiding careless endorsement of theories that suggest a mapping of external appearances to inner mental states.

A. THEORETICAL FOUNDATIONS

1. Emotion Task and Framing
2. Emotion Model and Choice of Emotions
3. Meaning and Extra-Linguistic Information
4. Wellness and Emotion
5. Aggregate Level vs. Individual Level
B. IMPLICATIONS OF AUTOMATION

6. Why Automate (Who Benefits, Shifting Power)
7. Embracing Neurodiversity
8. Participatory/Emancipatory Design
9. Applications, Dual use, Misuse
10. Disclosure of Automation

DATA

Summary: This section has three broad themes: implications of using datasets of different kinds, the tension between human variability and machine normativeness, and the ethical considerations regarding the people who have produced the data. Notably, I discuss how, on the one hand, there is tremendous variability in human mental representation and expression of emotions, while on the other hand, modern machine learning approaches have an inherent bias towards ignoring that variability. Thus, through their behaviour (e.g., by recognizing some forms of emotion expression and not recognizing others), AI systems convey to the user what is "normal", implicitly invalidating other forms of emotion expression.

C. WHY THIS DATA

11. Types of data
12. Dimensions of data
D. HUMAN VARIABILITY VS. MACHINE NORMATIVENESS

13. Variability of Expression and Mental Representation
14. Norms of Emotion Expression
15. Norms of Attitudes
16. One "Right" Label or Many Appropriate Labels
17. Label Aggregation
18. Historical Data (Who is Missing and What are the Biases)
19. Training-Deployment Differences
E. THE PEOPLE BEHIND THE DATA

20. Platform Terms of Service
21. Anonymization and Ability to Delete One's information
22. Warnings and Recourse
23. Crowdsourcing

METHOD

Summary: This section discusses the ethical implications of doing AER using a given method. It presents the types of methods and their tradeoffs, as well as considerations of who is left out, spurious correlations, and the role of context. Special attention is paid to green AI and the fine line between emotion management and manipulation.

F. WHY THIS METHOD

24. Types of Methods and their Tradeoffs
25. Who is Left Out by this Method
26. Spurious Correlations
27. Context is Everything
28. Individual Emotion Dynamics
29. Historical Behavior is not always indicative of Future Behavior
30. Emotion Management, Manipulation
31. Green AI

IMPACT AND EVALUATION

Summary: This section discusses various ethical considerations associated with the evaluation of AER systems (The Metrics) as well as the importance of examining systems through a number of other criteria (Beyond Metrics). Notably, this latter subsection discusses interpretability, visualizations, building safeguards, and contestability, because even when systems work as designed, there will be some negative consequences. Recognizing and planning for such outcomes is part of responsible development.

G. METRICS

32. Reliability/Accuracy
33. Demographic Biases
34. Sensitive Applications
35. Testing (on Diverse Datasets, on Diverse Metrics)
H. BEYOND METRICS

36. Interpretability, Explainability
37. Visualization
38. Safeguards and Guard Rails
39. Harms even when the System Works as Designed
40. Contestability and Recourse
41. Be wary of Ethics Washing

IMPLICATIONS FOR PRIVACY, SOCIAL GROUPS

Summary: This section presents ethical implications of AER for privacy and for social groups. These issues cut across Task Design, Data, Method, and Impact. The privacy subsection discusses both individual and group privacy. The idea of group privacy becomes especially important in the context of soft biometrics determined through AER that are not intended to identify individuals, but rather to identify groups of people with similar characteristics. The subsection on social groups discusses the need for work that does not treat people as a homogeneous group (ignoring group differences and implicitly favoring the majority group) but rather values disaggregation and explores intersectionality, while minimizing reification and essentialization of social constructs such as race and gender.

I. IMPLICATIONS FOR PRIVACY

42. Privacy and Personal Control
43. Group Privacy and Soft Biometrics
44. Mass Surveillance vs. Right to Privacy, Expression, Protest
45. Right Against Self-Incrimination
46. Right to Non-Discrimination
J. IMPLICATIONS FOR SOCIAL GROUPS

47. Disaggregation
48. Intersectionality
49. Reification and Essentialization
50. Attributing People to Social Groups

One can read these various sections in one go, or simply use them as a reference when needed, jumping to sections of interest.

**End of Overview. Individual Sections Begin**

Task Design

(Navigation links: Data, Method, Impact, Privacy & Social Groups)

A sketch of a truss bridge across a body of water.
Illustration from 1911 Encyclopaedia Britannica. Original caption: Quebec Bridge (original design). Source: Wikimedia.

A. THEORETICAL FOUNDATIONS

Domain naivete is not a virtue.

Lay out the theoretical foundations for the task from relevant research fields such as psychology, linguistics, and sociology, and relate the opinions of relevant domain experts to the task formulation.

#1. Emotion Task and Framing: Carefully consider what emotion task should be the focus of the work (whether conducting a human-annotation study or building an automatic prediction model). (See the TASK section above for a sample of common emotion tasks.) When building an AER system, a clear grasp of the task will help in making appropriate design choices. When choosing which AER system to use, a clear grasp of the emotion task most appropriate for the deployment context will help in choosing the right AER system. It is not uncommon for users of AER to have a particular emotion task in mind and mistakenly assume that an off-the-shelf AER system is designed for that task.

Each of the emotion tasks has associated ethical considerations. For example,

Is the goal to infer one’s true emotions? Is it even possible for any AI (or human) to comprehensively determine one’s internal mental state? (Hint: No.)
Is it ethical to try to determine such a private state? (Hint: Rarely, if ever.)

Realize that it is impossible to capture the full emotional experience of a person (even if one had access to all the electrical signals in the brain). A less ambitious goal is to infer some aspects of one’s emotional state.

Here, we see a distinct difference between AER that uses vision and AER that uses language. While there is little credible evidence of a connection between one’s facial expressions and one’s internal emotional state, there is a substantial amount of work on the idea that language is a window into one’s mind (Chomsky, 1975; Lakoff, 1987; Pinker, 2007) — which of course also includes emotions (Bamberg, 1997; Wiebe et al., 2006; Pennebaker, 2010).

That said, I doubt anyone believes one can determine the full extent (or even substantial portions) of one’s emotional state through their language. (See also the discussion in considerations #2 Emotion Model and #13 Variability of Expression ahead on the complexity of the emotional experience and the variability of expression.)

Thus, often it may be more appropriate to frame the AER task differently, for example:

  • The goal is to study how people express emotions: Work that uses speaker-annotated labeled data such as emotion-word hashtags in tweets usually captures how people convey emotions. What people convey may not necessarily indicate what they feel.
  • The goal is to determine perceived emotions (how others may think one is feeling): Perceived by everyone or by some subset? Emotion annotations by people who have not written the source text usually reveal perceived emotions. (This is most common in NLP data-annotation projects.) Annotation aggregation strategies, such as majority voting, usually only convey the emotions perceived by a majority group. Are we missing out on the perceptions of some groups? Perceived emotions are not necessarily the emotions of the speaker. (More on majority voting in DATA #17 Label Aggregation.)
  • The goal is to determine the emotionality of language used in text (regardless of whose emotions, target/stimulus, etc.): This may be appropriate in some restricted-domain circumstances, for example, when one is looking at customer reviews. Here, the context indicates that the emotionality in the language likely reflects the attitude towards the product being reviewed. However, such systems have difficulty with movie and book reviews because they then have to distinguish text expressing attitudes towards the book/movie from text describing what happened in the plot (which is likely emotional too).
  • The goal is to determine trends at an aggregate level: Emotionality of language is also useful when tracking broad patterns at an aggregate level, e.g., tracking trends of emotionality in tens of thousands of tweets or in the text of novels over time. The idea is that aggregating information from a large number of instances leads to the determination of meaningful trends in emotionality. (See these examples: Paul and Dredze (2011), Mohammad (2011), Quercia et al. (2012). See also the discussion in #5 Aggregate Level vs. Individual Level.)

In summary, it is important to identify what emotions are the focus of one’s work, use appropriate data, and communicate the nuance of what is being captured to the stakeholders. Not doing so will lead to the misuse and misinterpretation of one’s work. Specifically, AER systems should not claim to determine one’s emotional state from their utterance, facial expression, gait, etc. At best, AER systems capture what one is trying to convey or what is perceived by the listener/viewer, and even there, given the complexity of human expression, they are often inaccurate.

A separate question is whether AER systems can determine trends in the emotional state of a person (or a group) over time. Here, inferences are drawn at an aggregate level from much larger amounts of data. Studies on public health fall in this category. Here too, it is best to be cautious in making claims about mental state, and to use AER as one source of evidence amongst many (and involve expertise from public health and psychology).

Three figures: (a) a word cloud of emotions, captioned “categorical/discrete emotions”; (b) Plutchik’s wheel of basic emotions; (c) three orthogonal axes labeled valence, arousal, and dominance, captioned “Russell et al.’s core dimensions”.

#2. Emotion Model and Choice of Emotions: Work on automatic emotion recognition needs to operationalize the aspect of emotion it intends to capture: decide on the emotion-related categories or dimensions of interest.
Psychologists and neuroscientists have proposed several theories of emotion that can inform these decisions:

  • The Basic Emotions Theory (BET): Work by Dr. Paul Ekman in the 1960s galvanized the idea that some emotions (such as joy, sadness, and fear) are universally expressed through similar facial expressions, and that these emotions are more basic than others (Ekman, 1992; Ekman and Davidson, 1994). This was followed by other proposals of basic emotions by Robert Plutchik, Izard, and others. However, many of the tenets of BET, such as the universality of some emotions and their fixed mapping to facial expressions, stand discredited or are in question (Barrett, 2018).
  • The Dimensional Theory: Several influential studies have shown that the three most fundamental, largely independent, dimensions of affect and connotative meaning are valence (positiveness–negativeness / pleasure–displeasure), arousal (active–sluggish), and dominance (dominant–submissive / in control–out of control) (Osgood et al., 1957; Russell, 1980). Valence and arousal specifically are commonly studied in a number of psychological and neuro-cognitive explorations of emotion.
  • Cognitive Appraisal Theory: The core idea behind appraisal theory (Scherer, 1999; Lazarus, 1991) is that emotions arise from a person’s evaluation of a situation or event. (Some varieties of the theory point to a parallel process of reacting to perceptual stimuli as well.) Thus it naturally accounts for variability in emotional reaction to the same event since different people may appraise the situation differently. Criticisms of appraisal theory centre around questions such as: whether emotions can arise without appraisal; whether emotions can arise without physiological arousal; and whether our emotions inform our evaluations.
  • The Theory of Constructed Emotions: Dr. Lisa Barrett proposed a new theory, consistent with these experiments, on how the human brain constructs emotions from our experiences of the world around us and the signals from our body (Barrett, 2017).

Since ML approaches rely on human-annotated data (which can be hard to obtain in large quantities), AER research has often gravitated to the Basic Emotions Theory, as that work allows one to focus on a small number of emotions. This attraction has been even stronger in vision-based AER research because of BET’s suggested mapping between facial expressions and emotions. However, as noted above, many of the tenets of BET stand debunked. (NLP work on AER largely does not use facial expression information.)

Consider which formulation of emotions is appropriate for your task/project. For example, one may choose to work with the dimensional model or the model of constructed emotions if the goal is to infer behavioural or health outcome predictions. Despite criticisms of BET, it makes sense for some NLP work to focus on categorical emotions such as joy, sadness, guilt, pride, fear, etc. (including what some refer to as basic emotions) because people often talk about their emotions in terms of these concepts. Most human languages have words for these concepts (even if our individual mental representations for these concepts vary to some extent). However, note that work on categorical emotions by itself is not an endorsement of the BET.

Do not refer to some emotions as basic emotions, unless you mean to convey your belief in the BET. Careless endorsement of theories can lead to the perpetuation of ideas that are actively harmful (such as suggesting we can determine internal state from outward appearance — physiognomy).

#3. Meaning and Extra-Linguistic Information: The meaning of an utterance is not only a property of language; it is grounded in human activity, social interactions, beliefs, culture, and other extra-linguistic events, perceptions, and knowledge (Harris, 1954; Chomsky, 1965; Ervin-Tripp, 1973; Bisk et al., 2020; Bender and Koller, 2020; Hovy and Yang, 2021). Thus, one can express the same emotion in different ways in different contexts, different people express the same emotions in different ways, and the same utterances can evoke different emotions in different people. AER systems that do not take extra-linguistic information into consideration will always be limited in their capabilities, and risk being systematically biased, insensitive, and discriminatory. More on this in DATA #13, #14.

#4. Wellness and Emotion: The prominent role of one’s body in the theory of constructed emotion (Barrett, 2017) nicely accounts for the fact that various physical and mental illnesses (e.g., Parkinson’s, Alzheimer’s, cardiovascular disease, depression, anxiety) impact our emotional lives. Existing AER systems are not capable of handling this inter-subject and within-subject variability and thus should not be deployed in scenarios where their decisions could negatively impact the lives of people; where they are deployed, their limitations should be clearly communicated.

Emotion recognition is playing a greater role than ever before in understanding how our language reflects our wellness, understanding how certain physical and mental illnesses impact our emotional expression, and understanding how emotional expression can help improve our well-being. For some medical conditions, clinicians can benefit from a detailed history of one’s emotional state. However, people are generally not very good at remembering how they had been feeling over the past week, month, etc. Thus, an area of interest is to use AER to help patients track their emotional state.

See applications of AER in Public Health above. See also CL Psych workshop proceedings. Note, however, that these are cases where the technology is working firmly in an assistive role to clinicians and psychologists — providing additional information in situations where human experts make decisions based on a number of other sources of information as well. See Chancellor et al., 2019 for ethical considerations on inferring mental health states from one’s utterances.

#5. Aggregate Level vs. Individual Level: Emotion detection can be used to make inferences about individuals or groups of people; for example, to assist one in writing, to recommend products or services, etc., or to determine broad trends in attitudes towards a product, issue, or some other entity. Statistical inferences tend to be more reliable when using large amounts of data and when using more relevant data. Systems that make predictions about individuals often have very little pertinent information about the individual and thus often fall back on data from groups of people. Thus, given the person-to-person variability and within-person variability discussed in the earlier bullets, they are imbued with errors and biases. Further, errors and biases that directly impact individuals can be especially detrimental because of the direct and personal nature of such interactions. They may, for example, attribute majority-group behavior/preferences to the individual, further marginalizing those that are not in the majority.

Various ethical concerns, including privacy, manipulation, bias, and free speech, are further exacerbated when systems act on individuals.

Work on finding trends in large groups of people, on the other hand, benefits from having a large amount of relevant information to draw on. However, see #43 Group Privacy and #47 to #50 Implications for Social Groups for relevant concerns.

B. IMPLICATIONS OF AUTOMATION

What are the ethical implications of automating the chosen task?

#6. Why Automate (Who Benefits and How is this Shifting Power): When we choose to work on a particular AER task, or any AI task for that matter, it is important to ask ourselves why? Often the first set of responses may be straightforward: e.g., to automate some process to make people’s lives easier, or to provide access to some information that is otherwise hard to obtain, or to answer research questions about how emotions work. However, lately there has been a call to go beyond this initial set of responses and ask more nuanced, difficult, and uncomfortable questions such as:

Who will benefit from this work and who will not (Trewin et al., 2019)?

Will this work shift power from those who already have a lot of power to those that have less power (Kalluri, 2020)?

How can we reframe or redesign the task so that it helps those that are most in need (Monteiro, 2019)?

Left: Treviranus’ needs-of-everyone plot showing lots of dots in the center but many also spread out on the periphery. Centre: Kalluri’s call to shift power (a screenshot of her Nature publication). Right: Book cover of Monteiro’s Ruined by Design.
(a) Paraphrasing Jutta Treviranus: if the needs of everyone are captured as dots, there would be a dense centre (20% of the space taken up by 80% of the people); most task designs only work for this dense middle (the Pareto principle, or 80/20 rule). (b) Pratyusha Kalluri asks AI practitioners to go beyond "Is AI good?" and ask whether "AI shifts power?". (c) "Yes, design is political. Because design is labor, and your labor is political. Where you choose to expend your labor is a political act. Who we omit from those solutions is a political act." - Mike Monteiro

Specifically for AER, this will involve considerations such as:

  • Are there particular groups of people who will not benefit from this task: e.g., people who convey and detect emotions differently than what is common (e.g., people on the autism spectrum), people who use language differently than the people whose data is being used to build the system (e.g., older people or people from a different region)?
  • If AER is used in some application, say to determine insurance premiums, then is this further marginalizing those that are already marginalized?
  • How can we prevent the use of emotion and stance detection systems for detecting and suppressing dissidents?
  • How can AER help those that need the most help?

Various other considerations such as those listed in this sheet can be used to further evaluate the wisdom in investing our labor in a particular task.

#7. Embracing Neurodiversity: Much of the ML/NLP emotion work has assumed homogeneity of users and ignored neurodiversity, alexithymia, and the autism spectrum. These groups have significant overlap but are not identical. They are also often characterized as having difficulty in sensing and expressing emotions. Therefore, these groups hold particular significance in the development of inclusive AER systems. Existing AER systems implicitly cater to the more populous neurotypical group. At a minimum, such AER systems should explicitly acknowledge this limitation. AER systems should report disaggregated performance metrics for various groups, if possible. (More on disaggregation in IMPLICATIONS FOR SOCIAL GROUPS.)

Greater research attention needs to be paid to the neurodiverse group. When doing data annotations, we should try to obtain information on whether participants are neurodiverse or neurotypical (when participants are comfortable sharing that information), and include that information at an aggregate level when we report participant demographics. Work in Psychology has used scales such as the Toronto Alexithymia Scale (TAS-20) to determine the difficulty that people might have in identifying and describing emotions (Bagby et al., 1994). Autism symptom severity is often measured using the Autism‐Spectrum Quotient (Baron‐Cohen et al., 2001).

Greater efforts need to be made to include neuro-diverse groups in the design of systems for their benefit (see next point).

#8. Participatory/Emancipatory Design: Participatory design in research and systems development centers the people, especially marginalized and disadvantaged communities, such that they are not mere passive subjects but rather have the agency to shape the design process (Spinuzzi, 2005). This has also been referred to as emancipatory research (Humphries et al., 2000; Noel 2016; Barton and Hayhoe, 2020) and is pithily captured by the rallying cry “nothing about us without us”. These calls have developed across many different domains, including research pertaining to disability (Stone and Priestly, 1996; Seale et al., 2015), indigenous communities (Hall, 2014), autism spectrum (Fletcher-Watson et al. 2018; Bertilsdotter et al, 2019), and neurodiversity (Brosnan et al., 2017; Motti and Evmenova, 2019).

However, emotion recognition research and AI research in general has been slow to respond to these calls. See Motti and Evmenova, 2019 for specific recommendations for conducting studies with neuro-diverse participants.

#9. Applications, Dual Use, Misuse: AER is a powerful enabling technology that has a number of applications. Thus, like all enabling technologies it can be misused and abused.

Commercial Applications: Examples of troublesome commercial AER applications include:

  • Using AER at airports and misclassifying an individual as dangerous simply because their facial expressions do not conform to a certain “norm”.
  • Detecting stance towards governing authorities to identify and persecute dissidents.
  • Using deception detection or lie detection en masse without proper warrants or judicial approval. (Using such technologies even in carefully restricted individual cases is controversial.)
  • Increasing someone’s insurance premium because the system has analyzed one’s social media posts to determine (accurately or inaccurately) that they are likely to have a certain mental health condition.
  • Ads that prey on the emotional state of people, e.g., user-specific advertising to people when they are emotionally vulnerable.
  • Fake news that preys on the emotional state of people, e.g., micro-targeting fake news to people that the system has identified as being predisposed to believing it.

Socio-Psychological Applications: Applications such as inferring patterns in a speaker’s emotions to, in turn, infer other characteristics such as suitability for a job, personality traits, or health conditions are especially fraught with ethical concerns. For example, consider the use of the Myers–Briggs Type Indicator (MBTI) for hiring decisions or research on detecting personality traits automatically. Notable ethical concerns include:

  • MBTI is severely criticized by psychologists, especially for its lack of test-retest reliability (Boyle, 1995; Gerras and Wong, 2016). The Big 5 personality traits formalism (Cobb-Clark and Schurer, 2012) has greater validity, but even when using the Big 5, it is easy to overstate the conclusions.
  • Even with accurate personality trait identification, there is little to no evidence that using personality traits for hiring and team-composition decisions is beneficial. The use of such tests has also been criticized on the grounds of discrimination.
Left: Screenshot of article by Shane Snow “That article may be discriminating…” Right: Screenshot of article by Adam Grant “Say goodbye to MBTI…”
Articles talking about how personality tests are misused in the work place and how MBTI remains popular despite a much better alternative in the Big Five.

Health and Well-Being Applications: AER has considerable potential for benefit in improving our health and well-being outcomes. However, the sensitive nature of such applications requires substantial efforts to adhere to the best ethical principles. For example, how can harm be mitigated when systems make errors? Should automatic systems be used at all, given that sometimes we cannot put a value on the cost of errors? What should be done when the system detects that one is at a high risk of suicide, depression, or some other severe mental health condition? How do we safeguard patient privacy? See the shared task at the 2021 CL Psych workshop, where a secure enclave was used to store the training and test data. See these papers for ethical considerations of AI systems in health care (Yu et al. 2018; Lysaght et al. 2019; Panesar 2019).

Applications in Art and Culture: Lately there has been increasing use of AI in art and culture, especially through curation and recommendation systems. See Born et al (2021) for a discussion of ethical implications, including: are we really able to determine what art one would like, long-term impacts of automated curation (on users and artists), and diversity of sources and content.

AI is also used in the analysis and generation of art: e.g., for literary analysis and for generating poems, paintings, and songs. Since emotions are a central component of art, much of this work also includes automatic emotion recognition: e.g., tracking the emotions of characters in novels, recommending songs for people based on their mood, and generating emotional music. This raises several questions; see Hertzmann (2020) for further discussion.

Interestingly, *human* art can play a crucial role in responsible AI research through: comics, poems, infographics, posters, murals, and parody (Srinivasan and Uchino, 2021).

#10. Disclosure of Automation: Disclose to all stakeholders the decisions that are being made (in part or wholly) by automation. Provide mechanisms for the user to understand why relevant predictions were made, and also to contest the decisions. (More on this in IMPACT — Beyond Metrics.)

Artificial agents that perceive and convey emotions in a human-like manner can deceive people into thinking that they are interacting with a human. Artificial agents should begin their interactions with humans by first disclosing that they are artificial agents (Dickson, 2018), even though some studies show certain negative outcomes of such a disclosure (Mozafari et al., 2018; Cicco et al., 2021).

Data

(Navigation links: Task Design, Method, Impact, Privacy & Social Groups)

Beaks of different birds.
Variations in bird beaks. Source: Wikimedia.

C. WHY THIS DATA

What are the ethical implications of using the chosen data?

#11. Types of Data: Emotion and sentiment researchers have used text data, speech data, data from mobile devices, data from social media, product reviews, suicide notes, essays, novels, movie screenplays, financial documents, etc. All of these entail their own ethical considerations in terms of the various points discussed in this article.

AER systems use data in a number of ways. For example, a text processing AER system may use:

  • A Large Language Model: Language models such as BERT (which capture common patterns in language use) are obtained by training ML models on massive amounts of text found on the internet. See Bender et al. (2021) for ethical considerations in the use of large language models, including: documentation debt, difficulty of curation, incorporation of inappropriate biases, and perpetuation of stereotypes. Note also that using smaller amounts of data raises concerns as well: they may not have enough generalizable information; they may be easier to overfit on; and they may not include diverse perspectives. An important aspect of preparing data (big or small) is deciding how to curate it (what parts to keep and what to discard).
  • Emotion Lexicons: Emotion lexicons are lists of words and their associated emotions (determined manually by annotation or automatically from large corpora). Crowdsourced word–emotion association lexicons (such as the NRC Emotion Lexicon and the Valence, Arousal, Dominance Lexicon) are a popular type of resource used in emotion research, emotion-related data science, and machine learning models for AER (a minimal usage sketch follows this list). See Mohammad (2020) for biases and ethical considerations in the use of such emotion lexicons. Notable among these considerations is how words in different domains often convey different senses and thus have different emotion associations. Also, word associations capture historic perceptions that change with time and may differ across groups of people. They are not indicative of inherent, immutable emotion labels.
  • Labeled Training & Testing Data: AER systems often make use of a relatively small number of example instances that are manually labeled (annotated) for emotions. A portion of these is used to train/fine-tune the large language model (training set). The rest is further split for development and testing. I discuss various ethical considerations associated with using emotion-labeled instances below.
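
As a concrete illustration of the lexicon bullet above, here is a minimal sketch of lexicon-based emotion scoring. The tiny word–emotion lexicon below is hypothetical (the real NRC Emotion Lexicon is far larger and distributed as a word–emotion association file). Counts like these reflect the emotionality of the language used, not a speaker’s internal state, and they inherit the domain and word-sense caveats noted above.

```python
# A minimal sketch of lexicon-based emotion scoring.
# The tiny lexicon below is hypothetical; real lexicons such as the
# NRC Emotion Lexicon contain associations for thousands of words.
from collections import Counter
import re

TOY_LEXICON = {
    "joy":     {"happy", "delighted", "wonderful", "celebrate"},
    "sadness": {"sad", "miss", "lonely", "empty"},
    "anger":   {"furious", "unacceptable", "outraged", "hate"},
    "fear":    {"afraid", "scared", "worried", "threat"},
}

def emotion_counts(text: str) -> Counter:
    """Count tokens associated with each emotion in the toy lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for emotion, words in TOY_LEXICON.items():
        counts[emotion] = sum(1 for tok in tokens if tok in words)
    return counts

print(emotion_counts("I was so scared and worried, but the party was wonderful."))
# e.g., Counter({'fear': 2, 'joy': 1, 'sadness': 0, 'anger': 0})
```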

#12. Dimensions of Data: The data used by AER systems can be examined across various dimensions: size of data; whether it is custom data (carefully produced for the research) or data obtained from an online platform / website (naturally occurring data); less private/sensitive data or more private/sensitive data; what languages are represented in the data; degree of documentation provided with the data; and so on. All of these have societal implications and the choice of datasets should be appropriate for the context of deployment.

D. HUMAN VARIABILITY VS. MACHINE NORMATIVENESS

What should we know about emotion data so that we use it appropriately?

#13. Variability of Expression and Mental Representation: Language is highly variable — we can express roughly the same meaning in many different ways.

Expressions of emotions through language are highly variable: Different people express the same emotion differently; the same text may convey different emotions to different people.

This is true even for people living in the same area and especially true for people living in different regions, and people with different lived experiences.

  • Some cues of emotion are somewhat more common and somewhat more reliable than others. This is usually the signal that automatic systems attempt to capture.
  • We construct emotions in our brains from the signals we get from the world and the signal we get from our bodies. This mapping of signals to emotions is highly variable, and different people can have different signals associated with different emotions (Barrett, 2018); therefore, different people have different concept–emotion associations. For example, high school, public speaking, and selfies may evoke different emotions in different people.

This variability is not to say that there are no commonalities. In fact, speakers of a language share substantial commonalities in their mental representation of concepts (including emotions), which enables them to communicate with each other. However,

the variability should also be taken into consideration when building datasets and systems, and when choosing where to deploy the systems; otherwise, the systems will not work for various groups of people, or will not work well in general.

#14. Norms of Emotion Expression:

We shape our tools and thereafter they shape us.
— John M. Culkin

Whether text, speech, vision, or any other modality, AI systems are often trained on a limited set of emotion expressions and their emotion annotations (emotion labels for the expressions).

Thus, through their behaviour (e.g., by recognizing some forms of emotion expression and not recognizing others), AI systems convey to the user that it is “normal” or appropriate to convey emotions in certain ways; implicitly invalidating other forms of emotion expression.

Therefore it is important for emotion recognition systems to accurately map a diverse set of emotion instantiations to emotion categories/dimensions. That said, it is also worth noting that the variations in emotion and language expression are so large that systems can likely never attain perfection.

The goal is to obtain useful levels of emotion recognition capabilities without having systematic gaps that convey a strong sense of emotion-expression normativeness.

Normative implications of AER are analogous to normative implications of movies (especially animated ones):
  • Badly executed movies show characters expressing emotions in certain fixed stereotypical ways.
  • Good movies explore the diversity, nuance, and subtlety of human emotion expression.
  • Influential movies (bad and good) convey to a wide audience around the world how emotions are expressed or what is “normal” in terms of emotion expression. Thus they can either colonize other groups, reducing emotion expression diversity, or they can validate one's individualism and independence of self-expression.
Several pictures of Mike from Monsters Inc with various facial expressions.
Different ways in which Mike (from Monsters Inc.) expresses himself.

Since AI systems are greatly influenced by the data they train on, dataset development should pay attention to whether a truly diverse set of emotion expressions is being captured. Therefore:

  • Obtain data from a diverse set of sources. Report details of the sources.
  • Studies have shown that a small percentage of speakers often produce a large percentage of utterances (see study on tweets). Thus, when creating emotion datasets, limit the number of instances included per person (see the sketch after this list). Mohammad and Kiritchenko (2018) kept one tweet for every query term and tweeter combination when studying relationships between affect categories (data also used in a shared task on emotions). Kiritchenko et al. (2020) kept at most three tweets per tweeter when studying expressions of loneliness.
  • Obtain annotations from a diverse set of people. Report aggregate-level demographic information of the annotators.
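
A minimal sketch of the per-person cap mentioned in the second bullet above. The record fields, the cap of three, and the helper name are illustrative assumptions, not a prescribed recipe.

```python
# A minimal sketch: keep at most MAX_PER_PERSON instances per author
# when compiling an emotion dataset. Field names are hypothetical.
from collections import defaultdict

MAX_PER_PERSON = 3  # e.g., at most three tweets per tweeter

def cap_per_person(records):
    """records: iterable of dicts with 'author_id' and 'text' keys."""
    kept, counts = [], defaultdict(int)
    for rec in records:
        if counts[rec["author_id"]] < MAX_PER_PERSON:
            kept.append(rec)
            counts[rec["author_id"]] += 1
    return kept

raw = [{"author_id": "u1", "text": f"post {i}"} for i in range(10)] + \
      [{"author_id": "u2", "text": "another post"}]
print(len(cap_per_person(raw)))  # 4: three from u1, one from u2
```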

Variability is common not just for emotions but also for natural language. People convey meaning in many different ways. There is no one “correct” way of articulating our thoughts. Thus, these considerations apply to NLP in general.

#15. Norms of Attitudes: Different people and different groups of people might have different attitudes, perceptions, and associations with the same product, issue, person, social groups, etc. Annotation aggregation, by say majority vote, may convey a more homogeneous picture to the ML system. Annotation aggregation may also capture stereotypes and inappropriate associations for already marginalized groups. (For example, majority group A may perceive a minority group B as less competent, or less generous.) Such inappropriate biases are also encoded in large language models. When using language models or emotion datasets, assess the risk of such biases for the particular context and take correcting action as appropriate.

#16. One “Right” Label or Many Appropriate Labels: When designing data annotation efforts, consider whether there is a “right” answer and a “wrong” one. Who decides what is correct/appropriate? Are we including the voices of those that are marginalized and already under-represented in the data?

When working with emotion and language data, there are usually no “correct” answers, but rather, some answers are more appropriate than others. And there can be multiple appropriate answers.

  • If a task has clear correct and wrong answers, and knowing the answers requires some training/qualifications, then one can employ domain experts to annotate the data. However, emotion annotations largely do not fall in this category.
  • If the goal is to determine how people use language (and there can be many appropriate answers), or we want to know how people perceive words, phrases, and sentences, then we might want to employ a large number of annotators. This is much more in line with what is appropriate for emotion annotations — people are the best judges of their own emotions and of the emotions they perceive from utterances.

Seek appropriate demographic information from annotators (respectfully and ethically), and report it at the aggregate level: conveying annotator demographics helps users of the dataset understand whose perspectives the labels capture.

Part of conveying that there is no one “correct” answer is to convey how the dataset is situated: who annotated the data, the precise annotation instructions, what data was presented to the annotators (and in what form), when the data was annotated, etc.

#17. Label Aggregation: How should we aggregate information about an instance from multiple annotators? Majority voting has limitations: e.g., it tends to capture majority-group attitudes (at the expense of other groups). (See also Aroyo and Welty (2015), Checco et al. (2017), and Klenner et al. (2020).) As a result, researchers have sometimes released not just the aggregated labels but also the raw (pre-aggregated) data, as well as various versions of aggregated results. Others have argued in favor of not doing majority voting at all and including all annotations as input to ML systems (Basile, 2020). However, saying all voices should be included has its own problems: e.g., how should one address and manage inappropriate/racist/sexist opinions, and how can low-frequency valid opinions be disentangled from genuine annotation errors and malicious annotations?
(See also #15 Norms of Attitudes and #47 Disaggregation.)

If using majority voting, acknowledge its limitations. Acknowledge that it may be missing some/many voices.

Explore statistical approaches to finding multiple appropriate labels, while still discarding noise.
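As one illustration (not a prescription), here is a minimal sketch that keeps every label supported by at least a chosen fraction of annotators, alongside the full label distribution. The function name and the `min_support` threshold are hypothetical choices; more principled annotation models exist.

```python
from collections import Counter

def aggregate_labels(annotations, min_support=0.25):
    """Aggregate per-instance annotations into a soft label distribution,
    keeping every label chosen by at least `min_support` of the annotators.
    `annotations` is a list of labels, e.g. ['joy', 'joy', 'trust', 'joy'].
    Returns (kept_labels, distribution)."""
    counts = Counter(annotations)
    total = sum(counts.values())
    distribution = {label: n / total for label, n in counts.items()}
    kept = [label for label, share in distribution.items() if share >= min_support]
    return kept, distribution

# Example: four annotators, two appropriate labels survive
# aggregate_labels(['joy', 'joy', 'trust', 'joy'])
# -> (['joy', 'trust'], {'joy': 0.75, 'trust': 0.25})
```

Releasing the full distribution (not just the kept labels) lets downstream users make their own aggregation choices.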

Employ separate manual checks to determine whether the human annotations also capture inappropriate human biases. Such biases may be useful for some projects (e.g., work studying such biases), but not for others. Warn users about the inappropriate biases that may exist in the data, and document any strategies for dealing with them when using the dataset.

#18. Historical Data (Who is Missing and What are the Biases): Machine learning methods feed voraciously on (historical) data. Natural language processing systems often feed on huge amounts of data collected from the internet. However, the data is not representative of everyone, and seeped into this data are our biases.

Historical data over-represents people who have had power: those who are better off, mostly from the West, mostly English-speaking, mostly white, mostly able-bodied, and so on. So the machines that feed on such data often learn those perspectives at the expense of the views of those already marginalized.

When using any dataset, devote resources to study who is included in the dataset and whose voices are missing. Take corrective action as appropriate.

Keep a portion of your funding for work on marginalized communities. Keep a portion of your funding for work on less-researched languages (Ruder, 2020).

#19. Training-Deployment Data Differences: The accuracy of supervised systems is contingent on the assumption that the data the system is applied to is similar to the data the system was trained on. Deploying an off-the-shelf sentiment analysis system on data from a different domain, from a different time, or with a different class distribution than the training data will likely result in poor predictions. Systems that are to be deployed on open-domain data should be trained on many diverse datasets and tested on many datasets that are quite different from the training datasets.
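Before deploying, it can help to run simple sanity checks comparing the training data to a sample of the data the system will actually see. A minimal sketch (hypothetical helper names; whitespace tokenization is only for illustration):

```python
from collections import Counter

def vocabulary_overlap(train_texts, deploy_texts):
    """Rough check for domain shift: the fraction of deployment tokens
    never seen in training. A high out-of-vocabulary rate is one
    warning sign that the training data does not match deployment data."""
    train_vocab = set(tok for text in train_texts for tok in text.lower().split())
    deploy_tokens = [tok for text in deploy_texts for tok in text.lower().split()]
    oov = sum(1 for tok in deploy_tokens if tok not in train_vocab)
    return oov / max(len(deploy_tokens), 1)

def label_distribution(labels):
    """Class distribution, useful for comparing training priors with
    an (estimated) deployment-time class distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```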

E. THE PEOPLE BEHIND THE DATA

What are the ethical implications on the people who have produced the data?

When building systems, we make extensive use of (raw and emotion-labeled) data. It can sometimes be easy to forget that behind the data are the people that produced it, and imprinted in it are a plethora of personal information.

#20. Platform Terms of Service: Data for ML systems is often scraped from websites or extracted from large online platforms (e.g., Twitter, Reddit) using APIs. The terms of service for these platforms often include protections for the users and their data. Ensure that the terms of service of the source platforms are not violated: e.g., data scraping is allowed and data redistribution is allowed (in raw form or through ids). Ensure compliance with the robot exclusion protocol.
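For web scraping specifically, Python's standard library can check a site's robots.txt. A minimal sketch (the user-agent string here is a placeholder, and this check does not replace reading the platform's terms of service):

```python
from urllib import robotparser

def allowed_to_crawl(page_url, robots_url, user_agent="my-research-crawler"):
    """Check the site's robots.txt before scraping a page.
    Terms of service still need to be reviewed separately."""
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetches and parses robots.txt
    return parser.can_fetch(user_agent, page_url)

# Example:
# allowed_to_crawl("https://example.com/some/page",
#                  "https://example.com/robots.txt")
```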

#21. Anonymization and Ability to Delete One’s information: Take actions to anonymize data when dealing with sensitive or private data; e.g., scrub identifying information. Some techniques are much better at anonymization than others. (See for example, privacy-preserving work on word embeddings and sentiment data by Thaine and Penn (2021).) Provide mechanisms so that people can remove their data from the dataset.
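As a purely illustrative example of scrubbing the most obvious identifiers from tweet-like text (this is far from true anonymization; names, locations, and rare personal details can still re-identify someone, and stronger techniques such as the privacy-preserving methods cited above should be considered):

```python
import re

def scrub(text):
    """Minimal, illustrative scrubbing of obvious identifiers.
    Not sufficient for real anonymization on its own."""
    text = re.sub(r"https?://\S+", "<URL>", text)                # links
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)   # email addresses
    text = re.sub(r"@\w+", "<USER>", text)                       # @mentions
    return text

# scrub("Thanks @alice! Email me at alice@example.com or see https://example.com")
# -> 'Thanks <USER>! Email me at <EMAIL> or see <URL>'
```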

Choose to not work with a dataset if adequate safeguards cannot be placed.

#22. Warnings and Recourse: Annotating highly emotional, offensive, or suicidal utterances can adversely impact the well-being of the annotators. Provide appropriate warnings. Minimize the amount of such data each annotator is exposed to. Provide options for psychological help as needed.

#23. Crowdsourcing: Crowdsourcing (splitting a task into multiple independent units and uploading them on the internet so that people can solve them online) has grown to be a major source of labeled data in NLP, Computer Vision, and a number of other academic disciplines. Compensation often gets most of the attention when talking about crowdsourcing ethics, but such work involves several other ethical considerations as well: worker invisibility, lack of a learning trajectory, the humans-as-a-service paradigm, worker well-being, and worker rights. See Fort et al. (2011), Irani and Silberman (2013), Standing and Standing (2017), and Shmueli et al. (2021). See also the (public) guidelines by AI2 for its researchers:

Method

(Navigation links: Task Design, Data, Impact, Privacy & Social Groups)

Illustration of various ways bricks are used.
Illustration of various ways bricks are used. Source: Wikimedia.

F. WHY THIS METHOD

What are the ethical implications of using a given method?

#24. Types of Methods and their Tradeoffs: Different types of methods have different trade-offs:

  • less accurate — more accurate: this usually gets all the attention; value the other dimensions listed below as well. (Further discussion in IMPACT & EVALUATION.)
  • white box (we can understand why the system makes a given prediction) — black box (we do not know why it makes a given prediction): understanding the reasons behind a prediction helps identify bugs and biases, aids contestability, and is arguably better suited for answering research questions about language use and emotions.
  • less energy efficient — more energy efficient: see the discussion further below on Green AI.
  • less data hungry — more data hungry: data may not always be abundant; needing too much data about a person raises privacy concerns.
  • less privacy preserving — more privacy preserving: there is greater appreciation lately of the need for privacy-preserving NLP.
  • leads to fewer inappropriate biases — leads to more inappropriate biases: we want our algorithms to not perpetuate/amplify inappropriate human biases.

Consider various dimensions of a method and their importance for the particular system deployment context before deciding on the method. Focusing on fewer dimensions may be okay in a research system, but widely deployed systems often require a good balance across the many dimensions.

#25. Who is Left Out by this Method: The dominant paradigm in Machine Learning and NLP is to use large models pre-trained on massive amounts of raw data (unannotated text, pictures, videos, etc.) and then fine-tuned on small amounts of labeled data (e.g., sentences labeled with emotions) to learn how to perform a particular task. As such, these methods tend to work well for people who are well-represented in the data (raw and annotated), but not so well for others. (See also #18 Historical Data.)

Even just documenting who is left out is a valuable contribution. Explore alternative methods that are more inclusive, especially for those not usually included by other systems.

#26. Spurious Correlations: Machine learning methods have been shown to be susceptible to spurious correlations. For example, Agrawal et al. (2016) show that when asked what the ground is covered with, visual QA systems tend to always say “snow”, because in the training set this question was only asked when the ground was covered with snow. Winkler et al. (2019) and Bissoto et al. (2020) show spurious correlations in melanoma and skin lesion detection systems. Poliak et al. (2018) and Gururangan et al. (2018) show that natural language inference systems can sometimes decide on the prediction just from information in the hypothesis, without regard for the premise (for example, because a hypothesis with negation is often part of a contradiction pair in the training set).

Similarly, machine learning systems capture spurious correlations when doing AER: for example, marking some countries and people of some demographics with less charitable and more stereotypical sentiments and emotions. This phenomenon is especially marked in abusive language detection work, where it was shown that the data collection methods, in combination with the ML algorithm, result in the system marking any comment with identity terms such as gay, muslim, and jew as offensive.

Consider how the data collection and machine learning set ups can be addressed to avoid such spurious correlations, especially correlations that perpetuate racism, sexism, and stereotypes.
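One simple diagnostic is a perturbation (template) test: fill the same neutral sentence template with different identity terms and check whether the system's predictions shift. A minimal sketch, assuming some classifier wrapped in a hypothetical `predict` function:

```python
def perturbation_check(predict, template, identity_terms):
    """Probe a sentiment/emotion classifier for identity-term bias:
    the same neutral template should receive (roughly) the same prediction
    regardless of which identity term fills the slot.
    `predict` is any function mapping a sentence to a score or label."""
    results = {}
    for term in identity_terms:
        sentence = template.format(term=term)
        results[term] = predict(sentence)
    return results

# Example (with a hypothetical `predict` function):
# perturbation_check(predict,
#                    "I had dinner with my {term} friend yesterday.",
#                    ["gay", "muslim", "jewish", "christian", "straight"])
```

Large differences across terms in such a probe are a warning sign, though the absence of differences on a handful of templates is not proof that the system is unbiased.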

In extreme cases, spurious correlations can form the basis of pseudoscience and physiognomy. For example, there has been a spate of papers attempting to determine criminality, personality, trustworthiness, and emotions just from one’s face or outer appearance. It should be noted that sometimes systematic idiosyncrasies of the data can lead to apparently good results on a held-out test set even on such tasks. Thus it is important to consider whether the method and the sources of information used can be expected to capture the phenomenon of interest. Is there a risk that the use of this method may perpetuate false beliefs and stereotypes? If yes, take appropriate corrective action.

#27. Context is Everything: Considering a greater amount of context is often crucial in correctly determining emotions/sentiment. What was said/written before and after the target utterance? Where was this said? What was the intonation and what was emphasized? Who said this? And so on. More context can be a double-edged sword though. The more the system wants to know about a person to make better predictions, the more we worry about privacy.

Work on determining the right balance between collecting more user information and privacy considerations, as appropriate for the context in which the system is deployed.

#28. Individual Emotion Dynamics: A form of contextual information is one’s emotion dynamics (Hollenstein, 2015). The idea is that different people might have different steady states in terms of where they tend to most commonly be (considering any affect dimension of choice). Some may move out of this steady state often, while some venture out less often. Some might go very far from the steady state and some might not go that far. Some recover quickly from the deviations, and for some it may take a lot of time.

Similar emotion dynamics occur in the text that people write or the words they utter: Utterance Emotion Dynamics (Hipson and Mohammad, 2021). Utterance emotion dynamics may or may not track one’s true emotion dynamics closely, but one can argue that examining utterance emotion dynamics is valuable on its own.

Looking at the history of a person’s utterance emotion dynamics provides greater context and helps judge the degree of emotionality of new utterances by that person. Systems that make use of such detailed contextual information are more likely to make appropriate predictions for diverse groups of people. However, the degree of personal information they require warrants care, concern, and meaningful consent from the users.

Paper introducing Utterance Emotion Dynamics.
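For illustration only, here is a heavily simplified sketch of summary statistics in the spirit of emotion dynamics (home base, variability, and time spent away from the home base), computed over a chronological list of per-utterance valence scores. This is a toy version, not the Hipson and Mohammad formulation.

```python
import statistics

def simple_emotion_dynamics(valence_scores, band=0.5):
    """Very simplified utterance emotion dynamics summary for one person:
    - home_base: mean valence of their utterances
    - variability: standard deviation around that home base
    - fraction_outside: fraction of utterances more than `band` standard
      deviations away from the mean.
    `valence_scores` is a chronological list of per-utterance valence values."""
    mean = statistics.fmean(valence_scores)
    sd = statistics.pstdev(valence_scores)
    if sd == 0:
        outside = 0.0
    else:
        outside = sum(abs(v - mean) > band * sd for v in valence_scores) / len(valence_scores)
    return {"home_base": mean, "variability": sd, "fraction_outside": outside}
```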

#29. Historical behavior is not always indicative of future behavior (for groups and individuals): Systems are often trained on static data from the past. However, perceptions, emotions, and behavior change with time. Thus automatic systems may make inappropriate predictions on current data.

Black and white photo of a group of people posing for a photograph. No one is smiling.
People did not smile in early portraits and photographs. The wide availability of cameras brought on more spontaneous poses and more smiles.

#30. Emotion Management, Manipulation: Managing emotions is a central part of any human-computer interaction system (even if this is often not an explicitly stated goal). Just as in human-human interactions, we do not want the systems we build to cause undue stress, pain, or unpleasantness. For example, a chatbot has to be careful not to offend or hurt the feelings of the user it is interacting with. For this, it needs to assess the emotions conveyed by the user, in order to then be able to articulate the appropriate information with appropriate affect.

Facebook’s mood experiment.

However, this same technology can enable companies and governments to detect people’s emotions in order to manipulate their behavior. It is known, for example, that we buy more things when we are sad. So sensing when people are most susceptible to suggestion, in order to plant ideas of what to buy, whom to vote for, or whom to dislike, can have dangerous implications. On the other hand, identifying how to cater to individual needs to improve compliance with public health measures in a world-wide pandemic, or to help people give up smoking, may be seen in a more positive light. As with many things discussed in this article, consider the context to determine what levels of emotion management and meaningful consent are appropriate.

Examples of inappropriate manipulation:

  • Ads that prey on the emotional state of people, e.g., user-specific advertising to people when they are emotionally vulnerable.
  • Fake news that preys on the emotional state of people, e.g., micro-targeting fake news to people that the system has identified as being predisposed to believing it.

#31. Green AI: A direct consequence of using ever-larger pre-trained models (with enormous numbers of parameters, trained on enormous numbers of examples) for AI tasks is that these systems are now drivers of substantial energy consumption. (See papers below.)

Recent papers showing the increasing carbon footprint of AI systems and approaches to address them.

Thus, there is a growing push to develop AI methods that are not singularly focused on accuracy numbers on test sets, but are also mindful of efficiency and energy consumption (Schwartz et al. 2019). The authors encourage reporting of cost per example, size of training set, number of hyperparameters, and budget-accuracy curves. They also argue for regarding efficiency as a valued scientific contribution.
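Even without specialized tooling, it is easy to report some of these numbers alongside accuracy. A minimal sketch (the function and its arguments are hypothetical; dedicated energy and carbon trackers go considerably further than wall-clock time):

```python
import time

def report_training_cost(train_fn, num_examples, num_parameters):
    """Report simple efficiency numbers alongside accuracy:
    model size, training-set size, wall-clock training time,
    and time per example.
    `train_fn` is whatever function runs your training loop."""
    start = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - start
    print(f"parameters:        {num_parameters:,}")
    print(f"training examples: {num_examples:,}")
    print(f"training time:     {elapsed:.1f} s "
          f"({elapsed / max(num_examples, 1) * 1000:.2f} ms per example)")
```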

Impact and Evaluation

(Navigation links: Task Design, Data, Method, Privacy & Social Groups)

Illustration of a person happily coming out of the mouth of large fish with stunned onlookers.
Gustave Doré’s Illustration of Baron von Münchhausen for his tale of being swallowed by a whale. (Source: Wikimedia.) Some AI Systems tell tall tales too.

G. METRICS

All evaluation metrics are misleading. Some metrics are more useful than others.

#32. Reliability/Accuracy: No automatic emotion recognition method is perfect. However, some approaches are much less accurate than others. Deployment of approaches that essentially produce close-to-random accuracy is unethical.

Some techniques are so unreliable that they are essentially pseudoscience. For example, trying to predict personality, mood, or emotions through physical appearances has long been criticized (Arcas et al., 2017). See also this seminal paper by Barrett et al. (2019) pointing out the low reliability of recognizing emotions from facial expressions.

Screenshot of landing page for the paper “Emotional Expressions Reconsidered”.
Seminal paper by Barrett et al. challenging the use of emotion recognition through facial expression.

#33. Demographic Biases: Some approaches can be unreliable or systematically inaccurate for certain groups of people: races, genders, people with health conditions, people on the autism spectrum, people from different countries, and so on. Such systematic errors can occur no matter which modality or data source the system works with.

Determine and present disaggregated accuracies. Take steps to address disparities in performance across groups. (See more discussion in #47 Disaggregation.)
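Computing disaggregated scores requires little more than a group attribute per test instance. A minimal sketch for accuracy (the same pattern applies to other metrics; the example argument names are hypothetical):

```python
from collections import defaultdict

def disaggregated_accuracy(y_true, y_pred, groups):
    """Accuracy broken down by demographic group (or any other partition).
    `groups[i]` is the group of the i-th test instance."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(gold == pred)
    return {group: correct[group] / total[group] for group in total}

# Example:
# disaggregated_accuracy(gold_labels, predictions, annotator_reported_gender)
```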

#34. Sensitive Applications: Some applications are considerably more sensitive than others and thus necessitate the use of a much higher quality of emotion recognition systems (if used at all). Use of less accurate systems in such applications is unethical. Alternatively, automatic systems may sometimes be used in high-stakes applications if their role is to assist human experts: for example, assisting patients and health experts in tracking the patient’s emotional state, or identifying possible areas of concern for people with mental health conditions. However, care must be taken to ensure that domain experts are leading such projects and have adequate support to understand the limitations of the system.

#35. Testing (on Diverse Datasets, on Diverse Metrics): Results on any test set are contingent on the attributes of that test set and may not be indicative of real-world performance, implicit biases, or systematic errors of many kinds. Good practice is to test the system on many different datasets that explore various input characteristics. For example, see these evaluations that cater to a diverse set of emotion-related tasks, datasets, linguistic phenomena, and languages: SemEval 2014 Task 9, SemEval 2015 Task 10, and SemEval 2018 Task 1. (The last of which also includes an evaluation component for demographic bias in sentiment analysis systems.)

See Rottger et al. (2021) for work on creating separate diagnostic datasets for various types of hate speech.

See Google’s recommendations on best practices on metrics and testing.

H. BEYOND METRICS

Are we even measuring the right things?

#36. Interpretability, Explainability: As ML systems are deployed more widely and impact a greater sphere of our lives, there is a growing understanding that these systems can be flawed to varying degrees. One line of approach in understanding and addressing these flaws is to develop interpretable or explainable models. Interpretability and explainability have each been defined in a few different ways in the literature, but at the heart of the definitions is the idea that we should be able to understand why a system is making a certain prediction: what pieces of evidence are contributing to the decision and to what degree? That way, humans can better judge how valid a particular prediction is, how accurate the model is for certain kinds of input, and even how accurate the system is in general and over time.

In line with this, AER systems should have components that depict why they are making certain predictions for various inputs. Such components can be viewed from several perspectives (as described in the survey by Luo et al., 2021), including:

  • whether the explanations are meant for the scientist/engineer or for a lay person;
  • whether the explanations are faithful to what is really going on under the hood of the system;
  • whether the explanations are easily comprehensible;
  • the extent to which people trust the explanations.

Responsible research and product development entails actively considering various explainability strategies at the very outset of the project. This includes, where appropriate, specifically choosing an ML model that lends itself to better interpretability, running ablation and disaggregation experiments, running data perturbation and adversarial testing experiments, and so on.
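As a small example of a perturbation-style explanation, one can measure how much the predicted probability of an emotion label drops when each word is removed. A minimal sketch, assuming a hypothetical `predict_proba` function that returns label probabilities (crude, and not "faithful" in the formal sense, but often a useful first look):

```python
def leave_one_out_importance(predict_proba, tokens, target_label):
    """Crude perturbation-based explanation: how much does the predicted
    probability of `target_label` drop when each token is removed?
    `predict_proba` maps a sentence string to a dict of label -> probability."""
    base = predict_proba(" ".join(tokens))[target_label]
    importance = {}
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        importance[(i, tok)] = base - predict_proba(" ".join(reduced))[target_label]
    return importance

# Example (with a hypothetical `predict_proba`):
# leave_one_out_importance(predict_proba,
#                          "i am thrilled about the results".split(), "joy")
```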

Left: Word clouds of words associated with emotions. Right: Emotion trajectories in novels.
Visualizations from early work on tracking emotions in novels. See Paper.

#37. Visualization: Visualizations help convey trends in emotions and sentiments, and are common in the emotion analysis of streams of data such as tweet streams, novels, newspaper headlines, etc. Several considerations in developing visualizations determine how effective they are, how well they convey key trends, and the extent to which they may mislead.

  • It is almost always important to not only show the broad trends but also to allow the user to drill down to the source data that is driving the trend.
  • Summarize the data driving the trend, for example through treemaps of the most frequent emotion words and phrases in the data.
  • Interactive visualizations allow users to navigate different aspects of the data and even drill down to the source data that is driving some of the trends.

See work on visualizing emotions and sentiment (Mohammad, 2011; Dwibhasi et al., 2015; Kucher et al., 2018; Fraser et al., 2019; Gallagher et al., 2021).
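As a small example, an emotion arc for a long text can be drawn as a rolling average of per-word valence scores from an emotion lexicon; a minimal sketch with matplotlib, where the lexicon is simply assumed to be a word-to-valence dictionary:

```python
import matplotlib.pyplot as plt

def plot_emotion_arc(tokens, valence_lexicon, window=500):
    """Plot a simple emotion arc for a long text (e.g., a novel):
    a rolling average of per-word valence over the words found in the lexicon.
    `valence_lexicon` is a dict mapping a word to a valence score."""
    scores = [valence_lexicon[t] for t in tokens if t in valence_lexicon]
    window = min(window, max(len(scores), 1))
    arc = [sum(scores[i:i + window]) / window
           for i in range(len(scores) - window + 1)]
    plt.plot(arc)
    plt.xlabel("position in text (lexicon words)")
    plt.ylabel(f"mean valence over a {window}-word window")
    plt.title("Emotion arc")
    plt.show()
```

Pairing such a plot with the ability to click through to the underlying passages (as recommended above) helps keep the visualization honest.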

#38. Safeguards and guard rails: Devote time and resources to identify how the system can be misused and how the system may cause harm because of its inherent biases and limitations. Identify steps that can be taken to mitigate these harms.

#39. Recognize that there will be harms even when the system works “correctly”: Provide a mechanism for users to report issues. Have resources and guidelines in place to deal with unanticipated harms. Document societal impacts, including both benefits and harms.

#40. Contestability and Recourse: Mulligan et al. (2019) argue that contestability — the mechanisms made available to challenge the predictions of an AI system — is more important and beneficial for responsible research than transparency/explainability. Not only do such mechanisms allow people to challenge the decisions made by a system, they also invite participation in understanding how machine learning systems work and what their limitations are. See Google’s What-If Tool for a great example of how people are invited to explore ML systems by changing inputs (without needing to do any coding).

Developers of AER systems are encouraged to produce similar tools, for example:

  • tools that allow one to see counterfactuals (given a data point, what is the closest other data point for which the system predicts a different label?) and to try out various input conditions/features to see what helps obtain the desired classification label. For example, for a given input sentence, what are the closest sentences in the training data and what are their emotion labels? Which closest sentence has an emotion label different from the label predicted for the given sentence? (A minimal sketch follows this list.)
  • tools that allow one to see classification accuracies on different demographics and the impact of different classifier parameters and thresholds on these scores.
  • tools that allow one to see confidence of the classifier for a given prediction and the features that were primarily responsible for the decision.
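As a building block for the counterfactual-style tool mentioned above, here is a minimal sketch that, given precomputed sentence embeddings, retrieves the training sentences most similar to an input whose gold labels differ from the predicted label. The embedding model and the similarity measure are left open; cosine similarity over hypothetical precomputed vectors is used here purely for illustration.

```python
import numpy as np

def nearest_with_different_label(query_vec, train_vecs, train_labels,
                                 predicted_label, k=3):
    """Return the indices (and similarities) of the k training instances
    most similar to the query whose gold label differs from the label
    predicted for the query. `query_vec` is a 1-D embedding;
    `train_vecs` is a 2-D array of training-sentence embeddings."""
    query = query_vec / np.linalg.norm(query_vec)
    train = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = train @ query                      # cosine similarities
    order = np.argsort(-sims)                 # most similar first
    hits = [(int(i), float(sims[i])) for i in order
            if train_labels[i] != predicted_label]
    return hits[:k]
```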

See also Denton et al. (2020) for a discussion on greater participation by people in dataset creation and management.

#41. Be wary of Ethics Washing: As we push farther into incorporating ethical practices in our projects, we need to be wary of inauthentic and cursory attention to ethics for the sake of appearances. This VentureBeat article presents some nice tips to avoid ethics washing, including: “Welcome ‘constructive dissent’ and uncomfortable conversations”, “Don’t ask for permission to get started”, “Share your shortcomings”, “Be prepared for gray area decision-making”, and “Understand that ethics has few clear metrics”.

Text in large font saying “How AI companies can avoid ethics washing”
Article with tips on avoiding ethics washing.

A note on boilerplate ethics text: Sometimes boilerplate ethics statements can be seen as a mild form of ethics washing. It is better to customize the statements for the particular situation at hand. That said, what may seem boilerplate to some may still be useful information for someone new to the field, and it still conveys a sense of “these issues matter to the community, even if we do not have anything particularly novel to add”.

Implications for Privacy, Social Groups

(Navigation links: Task Design, Data, Method, Impact)

A group of shoaling surgeonfish around coral.
Shoaling surgeonfish — “swimming somewhat independently, but in such a way that they stay connected, forming a social group”. Shoaling is the phenomenon of fish staying together for social reasons. Both individuals and social groups need protections from AI systems.

I. IMPLICATIONS FOR PRIVACY

(Cuts across Task Design, Data, Method, Impact)

#42. Privacy and Personal Control: As noted privacy expert and former Information and Privacy Commissioner for the Canadian province of Ontario, Dr. Ann Cavoukian, puts it: privacy is not about hiding information or secrecy. It is about choice:

You have to be the one to make the decision. That’s why the issue of personal control is so important. — Dr. Ann Cavoukian

People might not want their emotions to be inferred. Applying emotion detection systems en masse, gathering emotion information continuously and without meaningful consent, is an invasion of privacy, harmful to the individual, and dangerous to society.

Screenshot of article titled “What if your emotions were tracked to Spy on You?”
Podcast on the privacy implications of tracking emotions. Also, a document to the members and staff of the European Parliament.

Follow Dr. Cavoukian’s seven principles of privacy by design:

  1. Proactive not Reactive; Preventative not Remedial
  2. Privacy as the Default
  3. Privacy Embedded into Design
  4. Full Functionality — Positive-Sum, not Zero-Sum
  5. End-to-End Security — Full Lifecycle
  6. Visibility and Transparency — Keep it Open
  7. Respect for User Privacy — Keep it User-Centric

#43. Group Privacy and Soft Biometrics: Floridi (2014) argues that many of our conversations around privacy are far too focused on individual privacy and ignore group privacy — the rights and protections we need as a group.

There are very few Moby-Dicks. Most of us are sardines. The individual sardine may believe that the encircling net is trying to catch it. It is not. It is trying to catch the whole shoal. It is therefore the shoal that needs to be protected, if the sardine is to be saved. — Floridi (2014)

The idea of group privacy becomes especially important in the context of soft biometrics: traits and preferences determined through AER that are not intended to identify individuals, but rather to identify groups of people with similar characteristics. See McStay (2020) for further discussion of the implications of AER for group privacy and of how companies are using AER to determine group preferences, even though a large number of people disfavour such profiling.

#44. Mass Surveillance versus Right to Privacy, Right to Freedom of Expression, and Right to Protest: Emotion recognition, sentiment analysis, and stance detection can be used for mass surveillance by companies and governments (often without meaningful consent). People often have little awareness that their information (e.g., what they say or click on an online platform) can be used against their best interests. Often people do not have meaningful choices regarding privacy when they use online platforms.

In extreme cases, as in the case of authoritarian governments, this can lead to a dramatic curtailing of the freedom of expression and the right to protest (Article 19, 2021; Wakefield, 2021).

#45. Right Against Self-Incrimination: In a number of countries around the world, the accused are given legal rights against self-incrimination. However, automatic methods of emotion, stance, and deception detection can potentially be used to circumvent such protections (Article 19, 2021, page 37).

#46. Right to Non-Discrimination: In a number of countries around the world, people have legal rights against discrimination on grounds such as race, gender, and religion. However, automatic methods of emotion, stance, and deception detection can sometimes systematically discriminate based on these protected categories. Even if ML systems are not fed race or gender information directly, studies have shown that they often pick up on proxy attributes for these categories. Evaluate and report disaggregated results as appropriate. (See #47 Disaggregation.)

J. IMPLICATIONS FOR SOCIAL GROUPS

(cuts across Task Design, Data, Method, Impact and Evaluation)

Left: Book cover of “Invisible Women”. Right: Figure from Model Cards paper showing disaggregated results.

#47. Disaggregation: Society has often viewed different groups differently (because of their race, gender, income, language, etc.), imposing unequal social and power structures (Lindsey, 2015). Even when the biases are not conscious, the unique needs of different groups are often overlooked. For example, Perez (2019) discusses, through numerous examples, how there is a considerable lack of disaggregated data for women and how that is directly leading to negative outcomes in all spheres of their lives, including health, income, safety, and the degree to which they succeed in their endeavors. This holds true (perhaps even more so) for transgender people. Thus emotion researchers should consider the value of disaggregation at various levels, including:

  • When creating datasets: Obtain annotations from a diverse group of people. Report aggregate-level demographic information. Rather than only labeling instances with the majority vote, consider the value of providing multiple sets of labels as per each of the relevant and key demographic groups.
  • When testing hypotheses or drawing inferences about language use: Consider also testing the hypotheses disaggregated for each of the relevant and key demographic groups.
  • When building automatic prediction systems: Evaluate and report performance disaggregated for each of the relevant and key demographic groups. (See Model Cards. See how sentiment analysis systems can be systematically biased.)

#48. Intersectional Invisibility in Research: Intersectionality refers to the complex ways in which different group identities such as race, class, neurodiversity, and gender overlap to amplify discrimination or disadvantage. Purdie-Vaughns and Eibach (2008) argue that people with multiple group identities are often not seen as prototypical members of any of their groups and thus are subject to what they call intersectional invisibility: omissions of their experiences in historical narratives and cultural representation, lack of support from advocacy groups, and mismatch with existing anti-discrimination frameworks. Many of the forces that lead to such invisibility (e.g., not being seen as prototypical members of a group), along with other notions common in the quantitative research paradigm (e.g., the predilection to work on neat, non-overlapping, highly populous categories), lead to intersectional invisibility in research. As ML/NLP researchers, we should be cognizant of such blind spots in our fields of study and work to address these gaps. Further, new ways of doing research that address the unique challenges of intersectional research need to be valued and encouraged.

#49. Reification and Essentialization: Some demographic variables are essentially, or in large part, social constructs. Thus, work on disaggregation can sometimes reinforce false beliefs that there are innate differences across groups or that some features are central to belonging to a social category. It is therefore imperative to contextualize work on disaggregation: for example, by impressing on the reader that even though race is a social construct, people’s perceptions and behavior around race lead to very real consequences.

#50. Attributing People to Social Groups: In order to obtain disaggregated results, researchers sometimes need access to demographic information (of the people whose data is being analyzed). This of course raises considerations such as: whether the people are providing meaningful consent to the collection of such data, and whether the data is being collected in a manner that respects their privacy, their autonomy (e.g., can they choose to delete their information later), and their dignity (e.g., allowing self-descriptions).

Challenges persist in terms of how to design effective and inclusive questionnaires (Bauer et al., 2017; The GenIUSS Group, 2014). Further, even with self-report textboxes that give the respondent primacy and autonomy in expressing their race, gender, etc., downstream research often ignores such data or combines information in ways beyond the control of the respondent.

Large text saying “Counting the Countless”
Article by Os Keyes on how data science negatively impacts trans and non-binary people.

Some work tries to infer aggregate-level group statistics automatically. For example, inferring race, gender, etc. from cues such as the type of language used, historical name-gender associations, etc. to do disaggregated analysis. However, such approaches are fraught with ethical concerns such as misgendering, essentialization, and reification. Further, historically, people have been marginalized because of their social category, and so methods that try to detect these categories raise legitimate and serious concerns of abuse, erasure, and perpetuating stereotypes.

In many cases, it may be more appropriate to perform disaggregated analysis on something other than a social category. For example, when testing face recognition systems, it might be more appropriate to test system performance on different skin tones (as opposed to race). Similarly, when working on language data, it might be more appropriate to analyze data partitioned by linguistic gender (as opposed to social gender). See Cao and Daumé III (2020) for a useful discussion of linguistic vs. social gender and also for a great example of creating more inclusive data for research.

Bonus Consideration

Outreach and Awareness: As an expert on and/or creator of a technology, an often overlooked and undervalued responsibility is to convey the broad societal impacts and the ethical considerations of that technology to those who deploy it, those who make policy decisions about it, and society at large. I hope that this sheet helps to that end for emotion recognition, and also spurs the wider community to ask and document:

What ethical considerations apply to my task?

FAQ

Q. As an academic researcher interested in AER can I ignore things that are more relevant in a deployment context?

A. Even as a researcher, you should be thinking about the ethical considerations of your system were it to be deployed. You should also communicate these issues, through documentation, to anyone who might want to deploy your system. Sometimes this can lead to you or others finding better design solutions that minimize negative implications of deployment.

Navigation Links.
Sections: Modalities & Scope, Task, Applications, Ethical Considerations.
Subsections of Ethical Considerations: Task Design, Data, Method, Impact, Privacy & Social Groups.

Acknowledgments: Huge thank you to Mallory Feldman for her belief in the need and value of the ethics sheet for emotion recognition. Discussions with her on the psychology and complexity of emotions were invaluable in shaping the ethics sheet for automatic emotion recognition. Many thanks to Annika Schoene, Roman Klinger, Rada Mihalcea, Maria Liakata, and Emily Mower Provost for discussions about ethical considerations for emotion recognition and thoughtful comments. Many thanks to Tara Small, Emily Bender, Esma Balkir, Isar Nejadgholi, Patricia Thaine, Brendan O’Connor, Cyril Goutte, and Sowmya Vajjala for thoughtful comments on early drafts of the blog posts.

Paper

Ethics Sheets for AI Tasks. Saif M. Mohammad. arXiv preprint arXiv:2107.01183. July 2021.

Feedback

The author welcomes feedback and suggestions, including: disagreeing views, additional considerations to include, and any suggestions for improving this ethics sheet.

Contact: Dr. Saif M. Mohammad: saif.mohammad@nrc-cnrc.gc.ca
Senior Research Scientist, National Research Council Canada
Webpage: http://saifmohammad.com


Saif M. Mohammad

Saif is Senior Research Scientist at the National Research Council Canada. His interests are in NLP, especially emotions, creativity, and fairness in language.