Analyze public discourse on refugees with cleanNLP

Continuing from a previous post, What does UN talk about when it talks about refugees, I’ll be analyzing the action-driven words in UNHCR speeches and also touching on TV news coverage, using the cleanNLP and newsflash packages in R.

Mentions in intergovernmental organization

I started by annotating each word's part of speech, then extracted direct-object dependencies where the target noun is not a very common word.
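The extraction step can be sketched roughly as follows. This is a minimal illustration in base R, not the code behind this post: it assumes a token table shaped loosely like the one cleanNLP 3.x returns from cnlp_annotate() (columns doc_id, sid, tid, lemma, upos, tid_source, relation), built here by hand from two made-up sentences so the example runs without downloading a language model. The helper name extract_verb_object_pairs and the common_nouns stoplist are hypothetical.

```r
# Toy token table for "Refugees seek asylum." and "We help the people."
# Column names are an assumption modelled on cleanNLP 3.x token output.
tokens <- data.frame(
  doc_id     = c(1, 1, 1, 1, 1, 1, 1),
  sid        = c(1, 1, 1, 2, 2, 2, 2),
  tid        = c(1, 2, 3, 1, 2, 3, 4),
  lemma      = c("refugee", "seek", "asylum", "we", "help", "the", "people"),
  upos       = c("NOUN", "VERB", "NOUN", "PRON", "VERB", "DET", "NOUN"),
  tid_source = c(2, 0, 2, 2, 0, 4, 2),  # token id of the syntactic head
  relation   = c("nsubj", "root", "obj", "nsubj", "root", "det", "obj"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pair each direct-object noun with its governing verb,
# skipping nouns listed in `common_nouns`.
extract_verb_object_pairs <- function(tokens, common_nouns = character(0)) {
  # Direct objects are tagged "obj" in Universal Dependencies v2
  # and "dobj" in older tag sets, so accept both.
  objects <- tokens[tokens$relation %in% c("obj", "dobj") &
                      tokens$upos == "NOUN", ]
  objects$tid_source <- as.character(objects$tid_source)

  # Candidate heads: every verb, keyed by its own token id.
  heads <- tokens[tokens$upos == "VERB", c("doc_id", "sid", "tid", "lemma")]
  names(heads) <- c("doc_id", "sid", "tid_source", "verb")
  heads$tid_source <- as.character(heads$tid_source)

  # Join each object to its head within the same document and sentence.
  pairs <- merge(objects, heads, by = c("doc_id", "sid", "tid_source"))
  pairs <- pairs[!(pairs$lemma %in% common_nouns), ]
  data.frame(verb = pairs$verb, noun = pairs$lemma, stringsAsFactors = FALSE)
}

pairs <- extract_verb_object_pairs(tokens, common_nouns = c("people"))

# Frequency of each verb-noun pair, most frequent first.
pair_counts <- sort(table(paste(pairs$verb, pairs$noun)), decreasing = TRUE)
print(pair_counts)
```

On real output the same join would run over the annotated token table of the whole corpus, and the stoplist would hold whichever very common nouns should be dropped.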

Zooming into the top nouns, we can see that asylum, repatriation and displacement appear often, along with much ado about persecution, plight and crisis.

As for the top verb-noun pairs, ‘seek asylum’ is the most frequent.

UNHCR speeches

Next I’ll look into what’s covered in the news. Unlike print news, TV transcripts can be expected to be more colloquial.

Mentions in News Media

I leveraged the immensely powerful GDELT project, which monitors print, broadcast, and web news media in over 100 languages from across every country in the world. I used the newsflash package in R. It works with the GDELT Television Explorer and thus captures only US TV news at the moment.

We can see CNN has the most coverage of this topic, and the more business-focused channels apparently covered it less.

In terms of verb-noun pairs, many more words are associated with refugee than with the other nouns, perhaps because TV news uses many different ways to describe situations like plight, displacement, and repatriation instead of using those words directly. Top nouns mentioned also include border, Syria, and camp.


Text processing with R

After getting the hang of classical packages like tm, I started dipping my toes into NLP when tidytext came around, which makes converting a corpus to a table easy and is handy when I need to do something simple. Then I chanced upon spacyr, which is amazingly fast and robust and provides simple syntax for named entity recognition. Today I learnt about cleanNLP (which stands on the shoulders of all these other great packages) and used it for dependency parsing. While there’s a lot more to explore, it only gets better.


This is #day41 of my #100dayprojects on data science and visual storytelling. Full code on my github. Thanks for reading. Suggestions of new topics and feedback are always welcome.