Islamophobia and Natural Language Processing update (August 12, 2020)

Ted Pedersen
7 min read · Aug 12, 2020


Back in May I had the very good fortune to participate in the LREC workshop: Data Statements for NLP — Towards Best Practices. A Data Statement is meant to provide a more complete and discursive record of a data set, up to and including information about how a data set was created and who created it. The hope is that a more complete description of a data set will make us more aware of possible biases and misrepresentations that may be found in that data.

While we still don’t have a data set to release, I prepared a Data Statement for our in-progress data set, which is tentatively named the Ilhan Omar Islamophobia data set. It is intended to be an annotated corpus of tweets that may or may not include instances of Islamophobia about or directed towards US House of Representatives member Ilhan Omar (Minnesota 5th District).

The Data Statement that I created during the workshop has been shared on the workshop web site, and I also include it below. My hope is that this statement will give some additional insight into our annotation efforts, and some of the challenges we have faced thus far. We are of course very interested in any comments or suggestions that anyone may have!

Ilhan Omar Islamophobia Data Set

This is a Data Statement for a prospective data set in the early stages of development.

This Data Statement was created as a part of the LREC 2020 Data Statements workshop (May 11–13, 2020). The contributions and suggestions of Jennifer D’Souza, Zahra Sarlak, Viviana Cotik, and Bonaventure Dossou are gratefully acknowledged.

This dataset is in development. Two pilot annotations have been carried out in which three annotators labeled the same 100 tweets. Each pilot used a different annotation scheme in an effort to find a reasonable set of labels/categories. Thus far annotator agreement has been low, so we continue to revise the annotation scheme and guidelines.

  • CURATION RATIONALE

The goal is to create a corpus of tweets annotated for Islamophobia (anti-Muslim bias/racism). This is a broad topic with many different manifestations around the world. In order to limit the scope of the corpus, the starting point for inclusion of tweets is to collect tweets that mention Ilhan Omar (D-MN) by name (Ilhan, Ilhan Omar, IlhanMN), by her Twitter handles (@ilhanmn, @ilhan), or by hashtag (#Ilhan, #IlhanOmar, #IlhanMN).

The reason for focusing on Ilhan Omar is that she is one of the few Muslims holding high elected office in the USA. She is a member of the US Congress (House of Representatives) and represents an area that includes the city of Minneapolis and nearby suburbs. In addition to being “visibly” Muslim (e.g., wearing a hijab and speaking openly about her faith), Ilhan Omar is a woman, black, Somali, a refugee, and politically progressive. This intersection of identities is inspirational to some and a target for hate among others. She is a frequent target of President Donald Trump and his allies. As such she draws many different forms of hate speech including Islamophobia.

Tweets have been collected since April 10, 2019. As of April 30, 2020, five million tweets directed at or mentioning Ilhan Omar have been collected. These tweets have been automatically filtered in various ways. The intent of the filtering is to reduce the size of the dataset and to focus it so that annotating for Islamophobia is tractable. If no filtering is done, the amount of Islamophobia is relatively small. While a small overall amount of Islamophobia is a good thing, it makes annotating enough tweets to represent the problem time-consuming. However, there is of course a danger that we artificially inflate the degree of Islamophobia by filtering on words that are typically used in Islamophobic discourse (like sharia or terrorist). Thus our goal with filtering is to use relatively neutral terms that are not strongly associated with Islamophobia.

The most recent filtering retains only tweets that include the strings muslim, islam, quran, or koran (as a separate word or as part of another word; for example, tweets that mention islamic or koranic would be retained). These are in principle neutral terms that are relatively unambiguous. This leads to a corpus of approximately 224,000 tweets. This is of course far too many to manually annotate, so thus far we have been randomly selecting 100 tweets at a time to annotate in our pilot studies.
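As a rough sketch, the substring filter and random sampling described above can be expressed with a case-insensitive regular expression; the function and variable names here are illustrative, not our actual pipeline:

```python
import random
import re

# Case-insensitive substring filter: matches "muslim", "islam", "quran",
# or "koran" anywhere in the text, so "islamic" and "koranic" are retained.
FILTER_TERMS = re.compile(r"muslim|islam|quran|koran", re.IGNORECASE)

def keep(text: str) -> bool:
    return bool(FILTER_TERMS.search(text))

tweets = [
    "A thread on Islamic history",      # kept: contains "islam"
    "Koranic scholarship resources",    # kept: contains "koran"
    "A tweet about Minnesota politics", # dropped: no filter term
]
filtered = [t for t in tweets if keep(t)]

# Pilot samples are drawn at random from the filtered corpus,
# 100 tweets at a time (2 here, for this toy list).
sample = random.sample(filtered, 2)
```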

There were four labels in our most recent pilot. These are all forms of Islamophobia, but they do not represent a complete list.

  • Traitor/Not Loyal — Muslims are not loyal to the country or culture in which they live, and are instead beholden to some external organization or government (potential overlap with Terrorist, Sharia Law).
  • Terrorist/Sympathizer — Muslims believe in terrorism and are either terrorists themselves or support those who are (potential overlap with Traitor).
  • False Religion — Islam is a false religion with strange, primitive, evil practices.
  • Sharia Law — Muslims do not believe in the existing legal system and want to replace it with Sharia Law (potential overlap with Traitor).

The overlaps mentioned above create some annotation hurdles. It has proven difficult to find mutually exclusive annotation categories.

  • LANGUAGE VARIETY/VARIETIES

The tweets are in written English as identified by Twitter. The text is noisy and includes emoticons, hashtags, and URLs.

Grammar and spelling vary considerably. Note that our filtering is based on a tweet including the correct spelling of muslim, islam, koran, or quran, and therefore eliminates some of the noisier tweets. A tweet that uses mozlem or qoran, for example, would not be collected. It may be advisable to allow some fuzziness in the matching of our filtering terms to avoid biasing too strongly in favor of correct spellings.
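One lightweight way to add that fuzziness would be a regular expression whose character classes admit common variant spellings. The specific patterns below are hypothetical illustrations, not a vetted list of variants:

```python
import re

# Hypothetical variant patterns: each character class admits a few
# common misspellings (moslem, mozlem, qoran, qur'an, ...) instead of
# requiring the exact canonical string.
FUZZY_TERMS = re.compile(
    r"m[ou][sz]l[ei]m"   # muslim, moslem, mozlem, muzlim, ...
    r"|i[sz]lam"         # islam, izlam
    r"|[qk][ou]r'?an",   # quran, koran, qoran, qur'an, ...
    re.IGNORECASE,
)

def keep_fuzzy(text: str) -> bool:
    return bool(FUZZY_TERMS.search(text))
```

The trade-off is precision: looser patterns can admit unrelated words, so any expanded pattern set would need to be checked against a sample of the corpus.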

Very informal analysis of locations (when provided) and profile information about speakers suggests a significant number of tweets originate in the United States. This is just an impression, but based on all these factors the language is primarily en-US.

  • SPEAKER DEMOGRAPHIC

We are not focusing on particular speakers, but rather randomly selecting tweets for annotation from a much larger collection. Tweets are included based on content words and there is no control for other kinds of information or demographics.

Our knowledge of the speakers is limited to anecdotal impressions based on their profile descriptions and names, but based on that it seems like a significant number of the speakers who produce Islamophobia are men who support Donald Trump and MAGA (Make America Great Again, his slogan).

We have considered seeding our corpus with users who follow certain groups or individuals. This is another way of focusing the corpus (followers of Donald Trump, etc). However, it wasn’t clear to us how to select those kinds of groups in a way that would be representative and not overly obvious. For example, if we collected tweets from the followers of a white supremacy group then we would likely observe a high incidence of Islamophobia that may not be representative of any other group.

  • ANNOTATOR DEMOGRAPHIC

There are three annotators:

  • Annotator 1, white, male, Christian by background, age 50–60, US born, PhD
  • Annotator 2, black, male, practicing Muslim, age 20–30, US born, graduate student
  • Annotator 3, white, female, non-denominational, age 20–30, US born, graduate student

Annotators were not compensated and volunteered out of personal interest and concern about hate speech and Islamophobia.

Annotators individually selected 5–10 example tweets for each category and shared them. This led to some refinement and eventual agreement about the scope of each category. Thereafter each annotator was given the same 100 tweets for annotation. Multiple labels per tweet were allowed.
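For three annotators labeling the same items, Fleiss' kappa is one standard measure of agreement beyond chance. The sketch below assumes a single label per tweet, which is a simplification of our scheme (since multiple labels are allowed, one workaround is to compute a per-label binary kappa, treating each category as its own yes/no decision):

```python
from typing import List

def fleiss_kappa(ratings: List[List[str]]) -> float:
    """Fleiss' kappa for items each rated by the same number of annotators.

    ratings: one inner list per item, with one label per annotator,
    e.g. [["Traitor", "Traitor", "Terrorist"], ...].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})

    # Per-item counts of how many annotators chose each category.
    counts = [[item.count(c) for c in categories] for item in ratings]

    # Observed agreement, averaged over items.
    p_obs = sum(
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_chance = sum(p * p for p in p_j)

    return (p_obs - p_chance) / (1 - p_chance)
```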

  • SPEECH SITUATION

Tweets are very time dependent and are often in response to world events. Attacks by Donald Trump on Ilhan Omar certainly cause a spike in activity, as does more general discussion of issues such as immigration and a Muslim ban. Tweets are also performative in that the audience is fairly clearly not limited to Ilhan Omar (even when a tweet is directed at her).

The period of tweet collection (from April 2019 until the present) has been highly polarized, with much debate and discussion about myriad issues; immigration and travel bans have been constant themes, as has the US presidential election. Only a modest portion of the tweets (10–20%) are geolocated, although a significant number of users indicate their location in their profile.

  • TEXT CHARACTERISTICS

Tweets are short and usually informal and conversational in nature. There is no consistent terminology, and there is a high degree of variation. Our filtering process eliminates some of the noisiest data, but what remains is still very noisy. Twitter language identification is used and results in mostly English tweets.

  • RECORDING QUALITY

Tweets are written. They are accessed via the Twitter API (the free version, not paid), which is at times unreliable. Connections can be lost, and the collection does not necessarily represent the complete set of tweets that match our search criteria. It seems fairly difficult to determine what we have not been able to collect.

  • OTHER / LIMITATIONS

Our method of filtering does not allow for the retention of conversations or threads on Twitter. As a result each tweet is treated as an independent utterance; however, tweets are often parts of implicit or explicit conversations.

Because Ilhan Omar is a woman there is a preponderance of feminine pronouns as well as slurs that are commonly directed at women. This might lead to a model that is overly sensitive to Islamophobia directed at women but less aware of that directed towards men.

Beyond gender, it is unclear how far this data will generalize. Ilhan Omar is a unique figure and it is possible that she may draw expressions of Islamophobia that are specific to her. However, given the wide variation in the expression of Islamophobia around the world some kind of centering anchor for collection seems necessary.

One possible extension would be to select another focus, carry out a similar annotation and compare the resulting categories and models to see what aspects generalize and which do not.

  • PROVENANCE APPENDIX

This dataset is “original” and is a sampling of tweets. The annotation scheme draws on previous work on Islamophobia, both in a computational setting and in more qualitative work.

2019 updates:

  • July (project kickoff)
  • August (background reading, Ilhan Omar, Minnesota)
  • December (background reading, Genocide)

2020 updates:

  • May (annotation scheme)

Please stay in touch!


Ted Pedersen

Computer Science professor at the University of Minnesota, Duluth. Natural Language Processing and Computational Linguistics. http://www.d.umn.edu/~tpederse