Data Against Feminicide: Active Learning for Participatory Data Annotation

Niki Karanikola
Data + Feminism Lab, MIT

--

Every year more than 80,000 women of all ages are murdered around the world. A 2021 United Nations report found that, on average, more than five women were killed every hour by a close family member [1]. “Feminicide” is a term describing fatal gender-related violence (GRV) directed at people of marginalized gender identities; the term holds the state accountable for its inaction in combating this brutality [2]. Unfortunately, government officials fail to register the actual number of feminicides [1]. In response, grassroots organizations around the world have dedicated their work to accurately reporting feminicide data and documenting cases from media sources, both to memorialize victims and to support legal recourse. To support these important efforts, D’Ignazio, Fumega, and Suárez Val founded the Data Against Feminicide project. One part of this project has been developing an email alerts system, in which a machine learning (ML) classifier is trained to detect which news articles are likely to describe a feminicide. These articles are queried daily through the Media Cloud API.

For every activist group we collaborate with in the Data Against Feminicide project, we need to train a classifier tailored to their context, needs, and language. To do this, a new labeled dataset is created by taking a random sample of approximately 500 news articles matching a query designed to capture feminicide-related keywords; members of the activist group then manually label every instance. However, this annotation procedure is time-consuming and emotionally taxing for the annotators, who must read the violent content of all these articles. For this reason, under the supervision of Harini Suresh, we explored active learning sampling methods that adaptively identify the maximally informative data for the classifier’s training phase, reducing the amount of data that must be annotated while maintaining high model performance.

What is Active Learning?

Active learning is a methodology that facilitates human-in-the-loop data annotation. The idea is to identify the data points with the highest impact on a supervised model’s training and have annotators label only those [3], so the labeled dataset can be significantly smaller. In this experiment, we leveraged pool-based sampling, a type of active learning that greedily queries data from a given set of unlabeled data (the pool) according to a measure of informativeness.
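The pool-based loop can be sketched in a few lines. This is an illustrative example only: it uses synthetic data and entropy as the informativeness measure, and the true labels stand in for the human annotator who would label each queried article in practice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for vectorized news-article features.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

labeled = [int(i) for i in rng.choice(len(X), size=10, replace=False)]  # seed set
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):  # each round: train, then query the most uncertain example
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    best = pool.pop(int(entropy.argmax()))  # greedy: highest-entropy example
    labeled.append(best)  # in practice, an annotator labels this article here

print(f"labeled set size: {len(labeled)}")  # 10 seed + 20 queried = 30
```

Because each round retrains on the newly labeled example, the model's uncertainty estimates keep improving, which is what lets active learning reach good performance with far fewer labels than random sampling.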

Experimental Design

In our experiment, we compared three active learning methods, each selecting new examples from four labeled datasets and retraining a model in every iteration. We used the Python module modAL to implement the three methods: entropy sampling, uncertainty batch sampling, and expected error reduction sampling. Their results were compared against a random sampling baseline. Our model was a Logistic Regression classifier, evaluated with the following metrics: accuracy, area under the receiver operating characteristic curve (AUROC/AUC), and precision at k.
Results

Through this simulated experiment, we observed the potential of active learning sampling methods to enhance the data annotation process. Overall, the active learning methods outperformed random sampling on the four labeled datasets. The improvement was most pronounced for precision at 25, where active learning achieved a 10% to 20% increase over random sampling using only 50 to 100 labeled articles, depending on the dataset. However, this gain was not consistent across all metrics: on some labeled datasets, no notable difference between the active learning methods and random sampling was observed. Given this variability, more research is needed to understand the full potential of active learning in the Data Against Feminicide project, but we still deem it a promising solution that could alleviate the burden on annotators.
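Since precision at k is the metric where the gains were clearest, it may help to spell out what it measures: among the k articles the classifier scores highest, the fraction that truly describe a feminicide. A small self-contained sketch (the toy labels and scores below are hypothetical):

```python
import numpy as np

def precision_at_k(y_true, scores, k=25):
    """Fraction of true positives among the k highest-scoring items."""
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return float(np.mean(np.asarray(y_true)[top_k]))

# Toy example: 3 of the 5 highest-scored articles are true feminicide cases.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(precision_at_k(y_true, scores, k=5))  # prints 0.6
```

This metric matches the project's workflow well: annotators review the top-ranked articles from the daily email alerts, so what matters most is how many of those top k are genuine cases.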

Here you can read the full report, and here you can browse a PowerPoint presentation we created.

References

  1. Dunaiski, M. et al. (2021) Gender-related killings of women and girls (femicide/feminicide). Edited by A. Me, S. Yee, and K. Mingeirou. Report. United Nations Office on Drugs and Crime and UN Women. Available at: https://www.unodc.org/documents/data-and-analysis/briefs/Femicide_brief_Nov2022.pdf (Accessed: March 24, 2023).
  2. Lagarde y de los Ríos, M. (2010) Terrorizing Women: Feminicide in the Americas. Edited by R.-L. Fregoso and C. Bejarano. Durham, NC: Duke University Press.
  3. Monarch, R. (2021) Human-in-the-Loop Machine Learning: Active Learning and Annotation for Human-Centered AI. Shelter Island, NY: Manning.
