Unveiling Brazilian News: Extracting Municipality Mentions from News Articles
I enjoy working with text, and I recently started a side project built on scraped articles from a major Brazilian newspaper, Folha de S.Paulo. The collection covers daily articles from 2002 to 2012, and one of my main objectives was to extract the entities mentioned in the text. Here I share my thought process and ask for feedback on my methodology, which focuses on identifying the Brazilian municipalities mentioned in news articles.
Restricting the Sample to Brazil-Related News
First, out of the universe of articles, I keep only the news articles related to Brazil. I do this using each article's URL, which usually indicates whether the content is Brazil-related or belongs to another section such as Sports, World, BBC, Reuters, Education, Informatics, or Finance (the excluded sections could perhaps serve as a placebo sample at a later point).
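A minimal sketch of this URL-based filter is shown here. The section slug "brasil" and the example.com URLs are placeholders I made up for illustration; the real slugs depend on how the newspaper structures its URLs.

```python
from urllib.parse import urlparse

# Hypothetical section slugs that mark Brazil-related news.
# The real values must be read off the scraped URLs.
BRAZIL_SECTIONS = {"brasil"}


def is_brazil_news(url, sections=BRAZIL_SECTIONS):
    """Return True if any path segment of the URL is a Brazil-related section."""
    segments = {s.lower() for s in urlparse(url).path.split("/") if s}
    return bool(segments & sections)


urls = [
    "https://example.com/folha/brasil/story1.shtml",
    "https://example.com/folha/esporte/story2.shtml",
]
brazil_urls = [u for u in urls if is_brazil_news(u)]
```

Checking every path segment rather than a fixed position keeps the filter working if the section name moves within the path across years.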
Named Entity Recognition and Filtering
Next, I use spaCy’s Named Entity Recognition (NER) module to extract the entities recognized in each news article. I filter out entities identified as persons (PER), nationalities or religious/political groups (NORP), and events, as they are less relevant to a municipality-focused analysis. Instead, to minimize errors, I concentrate on entities classified as ORG (organizations), GPE (geopolitical entities), and MISC (miscellaneous).
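The filtering step can be sketched as follows. The model name "pt_core_news_sm" is an assumption, not necessarily the model used here, and the label inventory depends on the model: some spaCy models label places as LOC rather than GPE, so the set of kept labels should be checked against nlp.get_pipe("ner").labels before use.

```python
# Labels to keep, as named in the text. Adjust to the labels the chosen
# model actually emits (for example LOC instead of GPE).
KEEP_LABELS = {"ORG", "GPE", "MISC"}


def load_pipeline(model_name="pt_core_news_sm"):
    """Load a spaCy pipeline. The model name is an assumption."""
    import spacy  # imported lazily so the filter below runs without spaCy
    return spacy.load(model_name)


def extract_entities(nlp, text):
    """Return (entity text, entity label) pairs found by the NER component."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def keep_candidates(entities, keep_labels=KEEP_LABELS):
    """Keep entity texts whose label is in keep_labels, without duplicates."""
    seen = set()
    kept = []
    for text, label in entities:
        if label in keep_labels and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept
```

With spaCy and a model installed, the pieces combine as keep_candidates(extract_entities(load_pipeline(), article_text)).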
Matching Entities with Brazilian Municipalities
Once the relevant entities are extracted, I search for matches with Brazilian municipalities using regular expressions and the official names published by the Brazilian Institute of Geography and Statistics (IBGE). When an entity corresponds to multiple municipalities, I use the Levenshtein distance, which is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another, and keep the closest candidate as the best match.
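One possible reading of this matching step is sketched below. The accent-stripping normalization, the whole-word regular expression, and the function names are my own assumptions for illustration; the short list of names stands in for the full IBGE list.

```python
import re
import unicodedata


def normalize(name):
    """Lowercase and strip accents so 'São Paulo' and 'Sao Paulo' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower().strip()


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_municipality(entity, municipalities):
    """Return the municipality name closest to entity, or None if none match."""
    ent = normalize(entity)
    if not ent:
        return None
    # Candidate = any official name containing the entity as whole words.
    pattern = re.compile(r"\b" + re.escape(ent) + r"\b")
    candidates = [m for m in municipalities if pattern.search(normalize(m))]
    if not candidates:
        return None
    # Several candidates: keep the one with the smallest edit distance.
    return min(candidates, key=lambda m: levenshtein(ent, normalize(m)))


names = ["Santa Cruz do Sul", "Santa Cruz", "São Paulo", "Santa Cruz Cabrália"]
best = match_municipality("Santa Cruz", names)
```

Here "Santa Cruz" matches three official names, and the edit distance selects the exact one because its distance is zero.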
Addressing Potential Mismatches
It’s important to note that some municipality associations may be mismatches. To detect potential over- or underrepresentation of municipalities, I take additional steps. First, I extract any Brazilian states mentioned in the news articles, considering both full names and abbreviations (e.g., “Acre” and “AC” for Acre). By cross-referencing the matched municipalities with the mentioned states, I improve the accuracy of the associations.
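The cross-referencing idea can be sketched like this. The state table is a small illustrative subset, and the fallback of keeping every candidate when no mentioned state supports any of them is my own assumption, not necessarily the rule used in the project.

```python
import re

# Illustrative subset; a full table would cover all 26 states
# and the Federal District.
STATES = {"AC": "Acre", "RS": "Rio Grande do Sul", "SP": "São Paulo", "BA": "Bahia"}


def states_mentioned(text, states=STATES):
    """Return abbreviations of states mentioned by full name or abbreviation."""
    found = set()
    for abbr, name in states.items():
        # Full names match case-insensitively. Abbreviations must be
        # upper case so that "AC" does not match ordinary words.
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE):
            found.add(abbr)
        elif re.search(r"\b" + abbr + r"\b", text):
            found.add(abbr)
    return found


def filter_by_state(candidates, mentioned):
    """Keep (municipality, state) pairs whose state is mentioned.

    If no candidate is supported by a mentioned state, return all of them.
    """
    supported = [c for c in candidates if c[1] in mentioned]
    return supported or list(candidates)


text = "A prefeitura de Planalto (RS) e o governo do Acre assinaram o convênio."
candidates = [("Planalto", "BA"), ("Planalto", "RS"), ("Planalto", "SP")]
resolved = filter_by_state(candidates, states_mentioned(text))
```

A caveat with this approach is that some state names, such as São Paulo and Rio de Janeiro, are also municipality names, so a full-name match does not always indicate the state.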
Common Errors and Special Cases
Certain common errors and special cases require attention. For instance, the term “Planalto” typically refers to “Palacio do Planalto,” the presidential palace in Brasilia, and so stands for the federal government rather than a place. However, it is also the name of municipalities in several Brazilian states. To address this, I check whether “Planalto” co-occurs with “Brasilia” or appears jointly with states like Rio Grande do Norte, Rio Grande do Sul, or Rio de Janeiro. In such instances, I disregard these associations.
Similarly, the term “Uniao” appears in the names of several municipalities in Brazil, but in news text it often refers to the federal government. To tell the two apart, I check whether it co-occurs with “Brasilia” or “Planalto.” If it does, I exclude the mention, as it is most likely not a municipality reference.
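These co-occurrence rules can be expressed as a small lookup table, sketched below. The table and function names are hypothetical, and only the “Brasilia” and “Planalto” conditions are encoded; the state-based condition for “Planalto” is left out of this sketch.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove diacritics so 'Brasília' and 'Brasilia' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Hypothetical rule table: ambiguous term -> context words that signal
# a non-municipality sense when they appear in the same article.
CONTEXT_RULES = {
    "planalto": {"brasilia"},
    "uniao": {"brasilia", "planalto"},
}


def is_municipality_mention(term, article_text, rules=CONTEXT_RULES):
    """Return False when an ambiguous term co-occurs with a context word."""
    key = strip_accents(term).lower()
    context = rules.get(key)
    if context is None:
        return True  # term is not in the table, so no special handling
    words = set(re.findall(r"\w+", strip_accents(article_text).lower()))
    return not (words & context)
```

Keeping the rules in a table makes it easy to add further ambiguous names as they turn up during error checking.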
Conclusion
While the approach described in this article shows promise for identifying Brazilian municipalities mentioned in news articles, it is important to acknowledge the challenges and potential sources of error. The methodology outlined here may still encounter measurement errors and overrepresentation of certain municipalities.
As I strive for greater accuracy and reliability in the analysis, your suggestions and insights are invaluable. If you have expertise or ideas to enhance the methodology and address the issues of measurement errors and overrepresentation, I welcome your input.
P.S.: Related findings
One interesting finding emerged during the text extraction stage that is unrelated to entity extraction. I plotted Google searches related to corruption against mentions of corruption in the newspaper data, and the two appear to be correlated. This suggests a link between online search behavior and media coverage of corruption. Exploring this relationship further could help in understanding public sentiment and media reporting dynamics.