Ethics Sheet for AER: A Recap of the Tips and Strategies

Saif M. Mohammad
Jul 12, 2021 · 8 min read


Image: Logo for parkour. Source: Wikimedia.

Navigation links:
Back Home: Ethics Sheet for AER
Back to the four sections of the sheet: Modalities & Scope, Task, Applications, Ethical Considerations.

Task Design

  1. Carefully consider what emotions should be the focus of the work (whether conducting a human-annotation study or building an automatic prediction model). Different emotion tasks entail different ethical considerations.
  2. Communicate the nuance of exactly what emotions are being captured to the stakeholders. Not doing so can lead to misuse and misinterpretation of one’s work.
  3. Realize that it is impossible to capture the full emotional experience of a person (even if one had access to all the electrical signals in the brain).
  4. Lay out the theoretical foundations for the task from relevant research fields such as psychology, linguistics, and sociology, and relate the opinions of relevant domain experts to the task formulation.
  5. Do not refer to some emotions as basic emotions, unless you mean to convey your belief in the Basic Emotions Theory. Careless endorsement of theories can lead to the perpetuation of belief in ideas that are actively harmful (such as suggesting we can determine internal state from outward appearance — physiognomy).
  6. Realize that various ethical concerns, including privacy, manipulation, bias, and free speech, are further exacerbated when systems act on individuals. Take steps such as anonymizing data and releasing information only at aggregate levels.
  7. When choosing to work on a particular AER task, ask questions such as: Who will benefit from this work and who will not? Will this work shift power from those who already have a lot of power to those who have less? How can we reframe or redesign the task so that it helps those who are most in need?
  8. Think about whether the task design is explicitly or implicitly validating a theory not supported by evidence.
  9. Think about how the task can be misused or abused, and how that can be minimized.
  10. Use AER as one source of information among many.
  11. Do not use AER for fully automated decision making. AER may be used to assist humans in making decisions, coming up with ideas, suggesting where to delve deeper, and sparking their imagination.
  12. Disclose to all stakeholders the decisions that are being made (in part or wholly) by automation. Provide mechanisms for the user to understand why relevant predictions were made, and also to contest the decisions.

Data

  1. Data used by AER systems can be examined across various dimensions: size of data; whether it is custom data or data obtained from an online platform; less private/sensitive data or more private/sensitive data; what languages are represented in the data; degree of documentation provided with the data; and so on. All of these have societal implications and the choice of datasets should be appropriate for the context of deployment.
  2. Expressions of emotions through language are highly variable: Different people express the same emotion differently; the same text may convey different emotions to different people. This variability should be taken into consideration when building datasets and systems, and when choosing where to deploy systems; otherwise, the systems may not work for various groups of people, or may not work well in general.
  3. Variability is common not just for emotions but also for natural language. People convey meaning in many different ways. There is usually no one “correct” way of articulating our thoughts.
  4. Aim to obtain a useful level of emotion-recognition capability without systematic gaps that convey a strong sense of emotion-expression normativeness.
  5. When using language models or emotion datasets, avoid perpetuating stereotypes of how one group of people perceives another group.
  6. Obtain data from a diverse set of sources. Report details of the sources.
  7. When creating emotion datasets, limit the number of instances included per person. Mohammad and Kiritchenko (2018) kept one tweet for every query term and tweeter combination when studying relationships between affect categories (data also used in a shared task on emotions). Kiritchenko et al. (2020) kept at most three tweets per tweeter when studying expressions of loneliness. (A minimal sketch of such a per-person cap appears after this list.)
  8. Obtain annotations from a diverse set of people. Report aggregate-level demographic information of the annotators.
  9. In emotion and language data, often there are no “correct” answers. Instead, it is a case of some answers being more appropriate than others. And there can be multiple appropriate answers.
  10. Part of conveying that there is no one “correct” answer is to convey how the dataset is situated in many parameters, including: who annotated it, the precise annotation instructions, what data was presented to the annotators (and in what form), and when the data was annotated.
  11. Release raw data annotations as well as any aggregations of annotations.
  12. If using majority voting, acknowledge its limitations; in particular, that it may miss some or many voices.
  13. Explore statistical approaches to finding multiple appropriate labels, while still discarding noise. (A small sketch of releasing label distributions rather than only majority labels appears after this list.)
  14. Employ separate manual checks to determine whether the human annotations have also captured inappropriate human biases. Such biases may be useful for some projects (e.g., work studying such biases), but not for others. Warn users about which inappropriate biases may exist in the data, and describe any strategies for dealing with them when using the dataset.
  15. When using any dataset, devote time and resources to study who is included in the dataset and whose voices are missing. Take corrective action as appropriate.
  16. Keep a portion of your funding for work on marginalized communities.
  17. Keep a portion of your funding for work on less-researched languages (Ruder, 2020).
  18. Systems that are to be deployed to handle open-domain data should be trained on many diverse datasets and tested on many datasets that are quite different from the training datasets.
  19. Ensure that the terms of service of the source platforms are not violated: e.g., that data scraping is allowed and that data redistribution is allowed (in raw form or through IDs). Also ensure compliance with the Robots Exclusion Protocol (a minimal robots.txt check is sketched after this list).
  20. Take actions to anonymize data when dealing with sensitive or private data; e.g., scrub identifying information (a rough scrubbing sketch appears after this list). Choose not to work with a dataset if adequate safeguards cannot be put in place.
  21. Proposals of data annotation efforts that may impact the well-being of annotators should first be submitted for approval to one’s Research Ethics Board (REB) / Institutional Review Board (IRB). The board will evaluate the proposal and provide suggestions so that the work complies with the required ethics standards.
  22. An excellent jumping-off point for further information on the ethical conduct of research involving human subjects is The Belmont Report. The guiding principles it proposes are Respect for Persons, Beneficence, and Justice.
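
A minimal sketch of the per-person cap mentioned in item 7, written in Python with hypothetical field names ('author_id', 'text'); adapt the cap and the schema to your own project:

```python
from collections import defaultdict

MAX_PER_AUTHOR = 3  # e.g., the cap used by Kiritchenko et al. (2020); adjust per project

def cap_per_author(posts, max_per_author=MAX_PER_AUTHOR):
    """Keep at most `max_per_author` posts per author.

    `posts` is an iterable of dicts with hypothetical 'author_id' and
    'text' fields; substitute your own schema.
    """
    counts = defaultdict(int)
    kept = []
    for post in posts:
        author = post["author_id"]
        if counts[author] < max_per_author:
            kept.append(post)
            counts[author] += 1
    return kept

# Toy usage: four posts from one author, one from another.
posts = [
    {"author_id": "u1", "text": "so excited today!"},
    {"author_id": "u1", "text": "still excited"},
    {"author_id": "u1", "text": "very excited"},
    {"author_id": "u1", "text": "one too many from the same person"},
    {"author_id": "u2", "text": "feeling a bit low"},
]
print(len(cap_per_author(posts)))  # 4: three kept from u1, one from u2
```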
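
For items 11–13, one simple way to go beyond a single majority label is to release the full label distribution per instance alongside any aggregate label. A minimal sketch with made-up annotations (real projects may prefer aggregation models that also estimate annotator reliability):

```python
from collections import Counter

# Raw annotations (made up): instance id -> labels from individual annotators.
raw_annotations = {
    "inst-1": ["joy", "joy", "anticipation", "joy"],
    "inst-2": ["sadness", "fear", "sadness", "anger"],
}

def label_distribution(labels):
    """Return each label's proportion of the annotations, not just the top label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

for inst_id, labels in raw_annotations.items():
    dist = label_distribution(labels)
    majority = max(dist, key=dist.get)
    # Releasing `dist` alongside `majority` preserves minority judgments
    # that a majority-only label would discard.
    print(inst_id, majority, dist)
```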
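
For item 19, Python’s standard library ships a parser for the Robots Exclusion Protocol. A minimal check, with a placeholder site and user-agent string (robots.txt compliance is necessary but not sufficient; the platform’s terms of service still apply):

```python
from urllib import robotparser

# Placeholders: substitute the platform you actually collect from and the
# user-agent string your crawler identifies itself with.
ROBOTS_URL = "https://example.com/robots.txt"
USER_AGENT = "my-research-crawler"

rp = robotparser.RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()  # fetches and parses robots.txt over the network

page = "https://example.com/some/page"
if rp.can_fetch(USER_AGENT, page):
    print("Allowed to fetch:", page)
else:
    print("robots.txt disallows fetching:", page)
```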
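
For item 20, a very rough first pass at scrubbing identifiers from text can be done with regular expressions. This is only a sketch: it catches handles, email addresses, and URLs, but misses names, locations, and other indirect identifiers, so it is not sufficient anonymization on its own:

```python
import re

def scrub(text):
    """Replace a few common identifier patterns with placeholders."""
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "EMAIL", text)  # email addresses (before handles)
    text = re.sub(r"@\w+", "@USER", text)                          # platform handles
    text = re.sub(r"https?://\S+", "URL", text)                    # links
    return text

print(scrub("Thanks @alex! Write to alex@example.com or see https://example.com/profile"))
# -> "Thanks @USER! Write to EMAIL or see URL"
```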

Method

  1. Consider various dimensions of a method and their importance for the particular system deployment context before deciding on the method. Focusing on fewer dimensions may be okay in a research system, but widely deployed systems often require a good balance across the many dimensions.
  2. AI methods tend to work well for people who are well represented in the data (raw and annotated), but not so well for others. Documenting who is left out is valuable. Explore alternative methods that are more inclusive, especially of those not usually included by other systems.
  3. Consider how the data collection and machine learning setups can be designed to avoid spurious correlations, especially correlations that perpetuate racism, sexism, and stereotypes.
  4. Work on determining the right balance between collecting more user information and privacy considerations, as appropriate for the context in which the system is deployed.
  5. Systems are often trained on static data from the past. However, perceptions, emotions, and behavior change with time. Thus automatic systems may make inappropriate predictions on current data.
  6. Consider the system deployment context to determine what levels of emotional management and meaningful consent are appropriate.
  7. Consider the carbon footprint of your method and value efficiency as a contribution. Report cost per example, size of training set, number of hyperparameters, and budget-accuracy curves.

Impact and Evaluation

  1. All evaluation metrics are misleading. Some metrics are more useful than others.
  2. Some techniques are so unreliable that they are essentially pseudoscience.
  3. Some approaches can be unreliable or systematically inaccurate for certain groups of people: races, genders, people with health conditions, people who are on the autism spectrum, people from different countries, etc. Determine and present disaggregated accuracies (a small disaggregation sketch follows this list).
  4. Test the system on many different datasets that explore various input characteristics. See Google’s recommendations on best practices on metrics and testing.
  5. Consider whether the metrics are measuring the right thing.
  6. Responsible research and product development entails actively considering various explainability strategies at the very outset of the project. This includes, where appropriate, specifically choosing an ML model that lends itself to better interpretability, running ablation and disaggregation experiments, running data perturbation and adversarial testing experiments, and so on.
  7. When visualizing emotions, it is almost always important to not only show the broad trends but also to allow the user to drill down to the source data that is driving the trend. One can also summarize the data driving the trend, for example through treemaps of the most frequent emotion words and phrases in the data.
  8. Devote time and resources to identify how the system can be misused and how the system may cause harm because of its inherent biases and limitations. Identify steps that can be taken to mitigate these harms.
  9. Recognize that there will be harms even when the system works “correctly”.
  10. Provide mechanisms for contestability that not only allow people to challenge the decisions made by a system about them, but also invite participation in understanding how machine learning systems work and what their limitations are.
  11. Be wary of inauthentic and cursory attention to ethics for the sake of appearances.
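
A small illustration of the disaggregated reporting in item 3: compute accuracy overall and separately for each group in the test data (hypothetical keys 'group', 'gold', 'pred'; in practice also report group sizes and confidence intervals):

```python
from collections import defaultdict

def disaggregated_accuracy(examples):
    """Accuracy overall and per group; `examples` holds dicts with
    hypothetical 'group', 'gold', and 'pred' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        for key in ("overall", ex["group"]):
            total[key] += 1
            correct[key] += int(ex["gold"] == ex["pred"])
    return {key: correct[key] / total[key] for key in total}

# Toy example: a gap like this should be reported, not averaged away.
examples = [
    {"group": "A", "gold": "joy",   "pred": "joy"},
    {"group": "A", "gold": "anger", "pred": "anger"},
    {"group": "B", "gold": "joy",   "pred": "sadness"},
    {"group": "B", "gold": "fear",  "pred": "fear"},
]
print(disaggregated_accuracy(examples))  # {'overall': 0.75, 'A': 1.0, 'B': 0.5}
```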

Implications for Privacy

  1. Privacy is not about secrecy. It is about personal choice. Follow Dr. Cavoukian’s seven principles of Privacy by Design.
  2. Consider that people might not want their emotions to be inferred. Applying emotion detection systems en masse — gathering emotion information continuously, without meaningful consent, is an invasion of privacy, harmful to the individual, and dangerous to society.
  3. Soft biometrics also raise privacy concerns. Consider the implications of AER for group privacy, and that a large number of people disfavour such profiling.
  4. Obtain meaningful consent as appropriate for the context. Working with more sensitive and more private data requires a more involved consent process where the user understands the privacy concerns and willingly provides consent.
  5. Consider harm mitigation strategies such as anonymization techniques (beware that these can vary in effectiveness) and differential privacy (a toy differential-privacy sketch follows this list).
  6. Keep information on people secure.
  7. Obtain permission before providing data to third parties or applying the data to secondary use cases.
  8. When working out the privacy–benefit tradeoffs, consider who will really benefit from the technology. Especially consider whether those who benefit are people with power or those with less power. Also, as Dr. Cavoukian says, often privacy and benefits can both be had, “it is not a zero-sum game”.
  9. Consider implications of AER for mass surveillance and how that undermines right to privacy, right to freedom of expression, right to protest, right against self-incrimination, and right to non-discrimination.
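
The core idea behind the differential privacy mentioned in item 5 can be sketched with the Laplace mechanism applied to an aggregate count. This is a toy example that fixes the sensitivity at 1 for a counting query; real deployments should rely on a vetted differential-privacy library rather than hand-rolled noise:

```python
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Differentially private count via the Laplace mechanism.

    Noise scale is sensitivity / epsilon: smaller epsilon means more noise
    and stronger privacy. Sensitivity is 1 for a counting query because
    adding or removing one person changes the count by at most 1.
    """
    scale = sensitivity / epsilon
    # A Laplace sample is the difference of two independent exponentials.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# e.g., reporting how many posts in an aggregate were labelled as expressing loneliness
print(dp_count(true_count=128, epsilon=1.0))
```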

Implications for Social Groups

  1. When creating datasets, obtain annotations from a diverse group of people. Report aggregate-level demographic information. Rather than only labeling instances with the majority vote, consider the value of providing multiple sets of labels, one for each of the relevant and key demographic groups.
  2. When testing hypotheses or drawing inferences about language use, consider also testing the hypotheses disaggregated for each of the relevant and key demographic groups.
  3. When building automatic prediction systems, evaluate and report performance disaggregated for each of the relevant and key demographic groups.
  4. Consider and report the impact of intersectionality.
  5. Contextualize work on disaggregation: for example, by impressing on the reader that even though race is a social construct, the impact of people’s perceptions and behavior around race leads to very real consequences.
  6. Obtaining demographic information requires careful and thoughtful consideration: for example, whether people are providing meaningful consent to the collection of such data, and whether the data is being collected in a manner that respects their privacy, their autonomy (e.g., can they choose to delete their information later), and their dignity (e.g., allowing self-descriptions).

Back Home: Ethics Sheet for AER

Contact: Dr. Saif M. Mohammad: saif.mohammad@nrc-cnrc.gc.ca
Senior Research Scientist, National Research Council Canada
Webpage: http://saifmohammad.com

Saif M. Mohammad

Saif is Senior Research Scientist at the National Research Council Canada. His interests are in NLP, especially emotions, creativity, and fairness in language.