Uncovering Hidden Patterns Between Routine Medical Appointments and Gender-Based Violence Using AI and Semantic Analysis

Published in Patrick J. McGovern Foundation · 10 min read · Apr 3, 2024

Vital Strategies’ Data Practice Accelerator project, part of the Data to Safeguard Human Rights cohort, focuses on data integration and textual analysis to identify victims of gender-based violence, with the goal of providing health services to those who need them and informing municipalities in Brazil of rates of gender-based violence.

by Olívia Guaranha, Arthur Lorenzi Almeida, Lívia Vicente Dutra, Tiago Torrent, Frederico Belcavello, Ely Matos, Marcelo Viridiano, Sofia Reinach, Renato Teixeira, and Erik dos Santos

During the Patrick J. McGovern Foundation’s Data to Safeguard Human Rights Accelerator program, Vital Strategies Brazil and FrameNet Brasil partnered to leverage data linkage and frame-based textual analysis to identify candidate cases at risk of gender-based violence, and to inform policymakers and healthcare teams about the territories and situations in which these cases can occur.

The analyses were conducted using data from Recife, the capital municipality of the state of Pernambuco, in northeast Brazil. Recife has one of the largest populations and GDPs in its region. The city’s Municipal Health Department partnered with Vital Strategies and shared data on violence, hospitalization, deaths, and digitized medical records from primary healthcare services in order to gather in-depth insights into the gender-based violence (GBV) scenario in the city.

The majority of available health data is categorical, meaning most epidemiological analyses rely on objective data and statistical insights. However, the narratives obtained in routine appointments are important parts of the doctor-patient relationship. In Brazil, this data is stored in open text fields in medical records from the primary healthcare database, called e-SUS AB.

Open text fields are rarely analyzed due to the complexity of working with language. Computational linguistics methods — FrameNet Brasil’s expertise — are used to annotate text for meaning representations in large sets of open text data. Through semantic analysis of samples of the data, FrameNet’s team modeled frames and lexical items covering the healthcare and violence domains. These frames and lexical items were used to create semantic representations of open text fields, which were then fed into a machine learning model used to look for GBV patterns.

However, to apply these findings to real-world data and build a machine learning tool, we first needed to link data from different sources to get a full picture of violence victims’ path through the healthcare system.

**Frame-to-Frame Relations in the Violence Domain**

Obtaining and Linking Sensitive Data

The Brazilian Unified Health System (Sistema Único de Saúde, SUS) has different information systems that track different types of information: hospitalization, violence notification, medical records, mortality, and others. Across these systems, there is no single ID that identifies the same individual in different databases. This creates challenges for tracking individuals at risk and helping victims before violence escalates. Moreover, not every field in those systems is parameterized.

Therefore, one important step was to link the different databases used in this project: hospitalization (SIH database), violence notification (SINAN-Violência database), medical records from primary healthcare services (e-SUS AB database), and mortality (SIM database).

Vital Strategies uses a deterministic algorithm built on rules that combine key variables, tailored to the fields available in each analyzed database. Data quality is essential for defining these rules. The first step was pre-processing the data to correct and standardize variables such as name, mother’s name, date of birth, street, and neighborhood, all of which are used in the record-comparison rules.

Next, new fields were derived from the textual variables (e.g., names and addresses) for standardization and comparison, through parsing (splitting a value into fragments such as first name, second name, and so on) and substringing (keeping part of a fragment, such as “Maria” → “Mari”; “Oliveira” → “Veira”).

The new variables were then converted to their Soundex codes. Soundex transforms words into codes that capture phonetic similarity, so that comparisons tolerate variations such as “s” versus “c” or single versus double consonants.
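The standardization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual linkage code: it uses the classic American Soundex rules (a variant adapted to Portuguese may well be more appropriate), and the helper names `strip_accents`, `soundex`, and `name_features`, as well as the four-character prefix length, are assumptions made for the example.

```python
import unicodedata

# Classic American Soundex digit groups; vowels, H, W and Y carry no digit.
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def strip_accents(text: str) -> str:
    """Remove diacritics so that 'Conceição' and 'Conceicao' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [ch for ch in strip_accents(word).upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal digits
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the run, so a repeat is coded again
    return (code + "000")[:4]


def name_features(full_name: str, prefix_len: int = 4) -> dict:
    """Parse a name into fragments and derive prefix and Soundex variables."""
    fragments = strip_accents(full_name).upper().split()
    return {
        "fragments": fragments,
        "prefixes": [frag[:prefix_len] for frag in fragments],
        "soundex": [soundex(frag) for frag in fragments],
    }


# Two spellings of the same (invented) name yield identical Soundex variables.
a = name_features("Maria Souza Mello")
b = name_features("María Sousa Melo")
print(a["soundex"], a["soundex"] == b["soundex"])
```

A deterministic rule could then require, for instance, equal Soundex codes for the name fragments plus an exact match on date of birth. The specific combination of variables is a design choice that depends on the quality of each database.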

Because the linkage is conducted using personally identifiable information (PII), it was necessary to implement a strict data protection protocol. Data was shared through a secure cloud environment only accessed by the technicians responsible for linking and anonymizing the data. All other analyses were conducted with anonymized information.

After the linkage, the anonymization process was executed through a combination of manual, automatic, and semi-automatic methods. The framework used a combination of named entity recognition (NER) AI models, regular expressions, and fuzzy search to identify potential PII. When multiple methods agreed that a text span contains PII, that part of the text was replaced with a special anonymization tag. When there was no agreement about a span, a tool was used to display the span, and a technician had to evaluate whether the span contained sensitive information. The tool also allowed for manual search and tagging of PII for the cases where the NER models and search algorithms failed.
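The agreement logic described above can be illustrated with a small sketch. It is only an approximation of the idea: a capitalization check stands in for the NER models, `difflib` provides the fuzzy search, and the `[PII]` tag, the function names, and the 0.8 similarity threshold are all hypothetical choices, not details of the actual framework.

```python
import re
from difflib import SequenceMatcher

TAG = "[PII]"  # hypothetical anonymization tag


def capitalized_words(text):
    """Detector 1: spans of capitalized words (a crude stand-in for an NER model)."""
    return {m.span() for m in re.finditer(r"\w+", text) if m.group()[0].isupper()}


def fuzzy_name_words(text, known_names, threshold=0.8):
    """Detector 2: spans of words that approximately match a known name."""
    spans = set()
    for m in re.finditer(r"\w+", text):
        word = m.group().lower()
        if any(SequenceMatcher(None, word, name.lower()).ratio() >= threshold
               for name in known_names):
            spans.add(m.span())
    return spans


def anonymize(text, known_names):
    """Tag spans both detectors agree on; return the disputed words for review."""
    by_case = capitalized_words(text)
    by_fuzzy = fuzzy_name_words(text, known_names)
    # Spans flagged by exactly one detector go to a human reviewer.
    disputed = [text[s:e] for s, e in sorted(by_case ^ by_fuzzy)]
    # Replace from right to left so earlier offsets stay valid.
    for start, end in sorted(by_case & by_fuzzy, reverse=True):
        text = text[:start] + TAG + text[end:]
    return text, disputed


# Invented example sentence and name list.
clean, review = anonymize(
    "Paciente Mariah relata dor apos visita de joana.", ["Maria", "Joana"]
)
print(clean)
print(review)
```

Here the misspelled name is flagged by both detectors and replaced automatically, while the words flagged by only one detector are returned for the technician to evaluate.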

After that, the linked database, without personal identification, was shared with the computational linguistics team, which was responsible for performing the semantic analysis.

Developing a Machine Learning Tool to Identify GBV-Related Cases

In this project, we chose to use text representations of e-SUS records as inputs to the model, while records from SINAN played a crucial role in classifying these inputs. Information from the other two systems can be used to further classify cases (e.g. Did the violence notification lead to the victim’s death?) and find unreported cases.

Because of that, we focused on building semantic features for the e-SUS records, taking their fields into consideration. However, even after basic preprocessing operations, we were left with 40,333 features, consisting of frames, frame elements, the co-occurrence of those two entities, and, in some instances, lexical units (LUs). These LUs are words or expressions that make features more fine-grained for certain frames. For the Symptoms frame, for example, instead of simply including frame elements, the actual word for the symptom is included (e.g. “pain” and “cramp”). If two of those LUs are related in FrameNet Brasil’s model (e.g. “dehydration” and “hyperthermia”) and co-occur in text, a feature is also created to represent that relation.

To reduce the number of features to a more manageable amount, we used principal component analysis (PCA). Each principal component represents a combination of the features, and by retaining 4,000 components (roughly a 10x reduction in size), we were able to preserve over 90% of the original data variance.
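The trade-off between the number of components and the variance preserved can be made concrete. The sketch below assumes the PCA eigenvalues (the variance captured by each component) are already available and finds the smallest number of leading components that reaches a target share. The function name and the toy spectrum are invented for illustration and are unrelated to the project's data.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target=0.90):
    """Smallest number of leading components whose variance share reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return k
    return len(ordered)


# Toy spectrum (made-up values): a few strong directions and a long weak tail.
toy_eigenvalues = [50, 25, 10, 6] + [1] * 9
k = components_for_variance(toy_eigenvalues, target=0.90)
print(k, "of", len(toy_eigenvalues), "components kept")
```

In this toy case, four of thirteen components carry 91% of the variance. The same reasoning applied to the real feature matrix is what justifies keeping 4,000 of the 40,333 dimensions.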

After pre-processing the data, we explored the relation between records from the two information systems, analyzing how entries are distributed over time and comparing those two distributions. These analyses produced two relevant findings. The first was that women with a violence notification tend to have more e-SUS records, whereas the majority of women have only one e-SUS record.

**Percentage of women with different numbers of e-SUS records**

The second came from comparing the time difference between records, which revealed a strong correlation between SINAN notifications and e-SUS records. The distribution peaks near the day of violence, providing evidence that GBV cases could be identified based on medical records. If the distribution were close to uniform, signs of violence in electronic medical records would be even harder to identify.

**Frequency of e-SUS records by difference (in days) to date of violence episode**
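The time-difference analysis amounts to computing, for each woman with a notification, the signed distance in days between every e-SUS record and the violence episode, and then counting records per time bin. A minimal sketch follows, with invented dates and an arbitrary 30-day bin width. The function names are hypothetical.

```python
from collections import Counter
from datetime import date


def day_offsets(record_dates, notification_date):
    """Signed distance in days from each record to the notification (negative = before)."""
    return [(d - notification_date).days for d in record_dates]


def offset_histogram(offsets, bin_width=30):
    """Count offsets per bin; bin b covers [b * bin_width, (b + 1) * bin_width) days."""
    return Counter(off // bin_width for off in offsets)


# Invented example: one notification and five primary-care records.
notification = date(2021, 6, 15)
records = [
    date(2021, 4, 1),
    date(2021, 6, 14),
    date(2021, 6, 15),
    date(2021, 6, 20),
    date(2021, 9, 1),
]
offsets = day_offsets(records, notification)
print(offsets)
print(sorted(offset_histogram(offsets).items()))
```

Pooling these offsets over all women with a notification gives the kind of distribution shown in the figure. A peak around offset zero is what distinguishes it from a uniform spread.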

These findings help validate the project’s original hypothesis that signs of GBV can be observed in medical records from primary care visits. If women with violence notifications present different patterns when compared to other women, we just need to uncover these patterns in order to develop an accurate early identification tool.

Knowing these patterns exist, we conducted a separate analysis to evaluate the feasibility of training a classification model for this type of data. First, the different classes of e-SUS records had to be formally defined in order to separate cases we know are associated with violence and cases that are not. Causality is hard to establish between the records of the same individual. For that reason, some of these definitions partially rely on approximation rules that can be expanded and updated as the project progresses. So far, e-SUS records have been separated into four different groups:

Violence: any record with an ICD code for aggression, or within two days of a SINAN notification or of a hospitalization/death with the same code;
Not violence: certain ICD codes that have a small probability of being associated with violence, e.g., COVID-19, parasitic diseases, tumors, and some congenital malformations;
Likely violence: any record within 30 days of a notification of violence that doesn’t have an ICD code for aggression;
Undefined: any record that does not fall into one of the previous categories. Not all undefined records are equally likely to be related to violence, and this should be explored in the future.
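The four definitions above can be read as a small rule-based labeler. The sketch below is one possible reading, with several assumptions flagged in the comments: aggression is taken to be the ICD-10 assault block X85 to Y09, the low-risk code prefixes are only illustrative, time windows are symmetric around the event, and rules are applied in the order listed. None of these details are confirmed by the project.

```python
from datetime import date

# Assumption: illustrative low-risk ICD-10 prefixes
# (COVID-19, some parasitic diseases, malignant tumors).
LOW_RISK_PREFIXES = ("U07", "B6", "C")


def is_aggression(icd):
    """True for ICD-10 codes in the assault block X85-Y09."""
    code = icd.upper().replace(".", "")[:3]
    if len(code) != 3 or not code[1:].isdigit():
        return False
    return "X85" <= code <= "X99" or "Y00" <= code <= "Y09"


def label_record(icd, record_date, notification_dates, event_dates=()):
    """Assign one of the four classes to an e-SUS record.

    notification_dates: SINAN notification dates for the same woman.
    event_dates: dates of hospitalizations or deaths coded as aggression.
    """
    icd = icd or ""

    def within(days, dates):
        # Assumption: the window is symmetric (before or after the event).
        return any(abs((record_date - d).days) <= days for d in dates)

    # Assumption: rules are applied in the order they are listed in the text.
    if is_aggression(icd) or within(2, list(notification_dates) + list(event_dates)):
        return "violence"
    if icd.upper().startswith(LOW_RISK_PREFIXES):
        return "not violence"
    if within(30, notification_dates):
        return "likely violence"
    return "undefined"


notifications = [date(2021, 6, 15)]  # invented example
print(label_record("J00", date(2021, 7, 1), notifications))
```

Because the rules are plain functions over codes and dates, they can be expanded or reordered as the definitions evolve.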

With those classes properly defined, we can check how well they are separated from one another by projecting the records into two dimensions with UMAP. This projection showed that e-SUS records that received the violence label are restricted to a specific region of the graph. The only significant overlap of classes is that between violence and likely violence, which is expected and shows that our inference about the relationship between records based on the date of each record is relevant. It also demonstrates how rules can be used as an initial filter for the early identification of GBV.

Those insights support the hypothesis that the e-SUS records of GBV victims differ substantially from other records and that early identification is feasible. However, the projection also makes explicit some of the challenges of working with this data.

**UMAP projection of e-SUS records color-coded by class**

The fact that data clusters are not easily separable makes sense given the text genre of an electronic medical record. Different conditions may share symptoms and treatments, and health professionals may use similar language structures to describe different situations. Naturally, some records may represent more prototypical cases of GBV or of a medical condition, but near the cluster boundaries, these linguistic and medical differences are harder to define. This highlights the fact that not all records can be easily separated into violence or not; there is a gray area that is difficult to resolve.

With those insights, we trained a Support Vector Machine (SVM) model to classify records. To understand which features are most relevant for health professionals and policymakers, we ranked features by their importance scores for the SVM model. The graph below shows the estimated importance of the 15 most relevant semantic features. The larger the bar, the more relevant a feature is for identifying GBV cases. Note, however, that the features do not necessarily correlate with the violence class: an important feature may strongly correlate with the “not violence” class, and because of that, it is also relevant.

**Top 15 features for the SVM model, ranked by importance.**
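The article does not say how the importance scores were computed. One common approach, assuming a linear kernel, is to use the absolute value of each feature's weight as its importance, with the sign indicating which class the feature pushes toward. The sketch below illustrates only that idea. The weights, feature names, and sign convention (positive means violence) are made up for the example.

```python
def rank_features(weights, top_n=15):
    """Rank features by absolute weight, keeping the sign as the favoured class."""
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    return [
        (name, abs(w), "violence" if w > 0 else "not violence")
        for name, w in ranked[:top_n]
    ]


# Hypothetical weights; these are not results from the project.
toy_weights = {"feature_a": -0.9, "feature_b": 0.4, "feature_c": -0.6}
for name, importance, favoured in rank_features(toy_weights, top_n=2):
    print(name, importance, favoured)
```

This makes the caveat above explicit: a feature can rank first purely because it is a strong indicator of the “not violence” class.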

Interestingly, the most important feature of the model was the lexical unit for “laboratory exam,” which, being a common request in routine medical care, could be related to non-GBV records. However, this needs to be explored further, considering that GBV cases could also require exams. The next four most relevant features seem to be related to medication: the lexical units for “prescription” and “administer,” the frame Medicines, and that same frame appearing as the action of another frame.

Another interesting feature among the 15 most important is the lexical unit for “anxiety,” which is the LU most likely to be related directly to violence. The lexical unit for “complaint” (“dar queixa”) requires further investigation, as the expression can refer either to complaining about a health condition or to filing a police report, since the two expressions are similar in Brazilian Portuguese.

Due to its complexity, the model needs constant improvement, with ongoing input from gender, health, and violence specialists, in order to reach its full potential in recognizing patterns of GBV. The Data Practice Accelerator was an opportunity to explore the potential of this data, but the work is ongoing. In addition, a dashboard is being built to visualize the analyses and serve as a tool for local public managers.

Main Findings and Future Work

These analyses have uncovered interesting findings that can help professionals and researchers working with GBV:

Women who are victims of GBV tend to have more records in e-SUS AB, meaning they visit primary healthcare units more often.
Looking at how records are distributed over time, we noticed a strong correlation between SINAN notifications and e-SUS records, providing evidence that GBV cases could be identified based on medical records.
There is a systematic increase in the number of visits to the doctor from around 60 days before the SINAN notification to 200 days after.
Evaluating patterns between clusters of records, we noted that e-SUS records known to be from women who are suffering violence overlap with records labeled as likely associated with violence, which supports the hypothesis that the dates of records are relevant to establishing the relationship between them.

These insights can be further explored in future research, and they indicate that early identification is feasible. This evidence will serve as a baseline for future studies on GBV by Vital Strategies and FrameNet Brasil, which will continue developing a machine learning tool that can help health professionals with the early identification of violence in routine medical appointments in primary healthcare services in Brazil. To learn more, read our full insights report.

Written by The Patrick J. McGovern Foundation