🌳📖💻#4:😶 — POS-Deletion
Unconditionally Universal Speeches
Look at the thing, check out the code, read below if you’re interested : )
People speak a lot.
Political speeches, for example, tend to be lengthy (but honestly, everyone’s speeches are).
So today, playing with the linguistic concept of Language Universals, I wrote some code that weeds through speeches, taking out everything except nouns and verbs.
Reading the speech afterwards gives you a maybe pensive, maybe revealing, but most probably just ten-seconds-of-fun digest of some past US president’s mumblings.
Hope you’ll enjoy : )
What are Language Universals?
Linguistics defines two types of Language Universals for natural human languages: unconditional ones and conditional ones.
And actually it seems their difference is smartly explained in the derivation and semantics of the two words. (Oh, those linguists… 😉 )
While conditional Language Universals rely on some conditions to hold up (e.g. “if a language has inflection, it usually also has derivation”), unconditional Language Universals are true without further prerequisites.
In my code I will focus on one of the unconditional LUs, namely:
Every language has nouns and verbs.
Ok. Easy. So, let’s think this forward…
DISCLAIMER: I’m just self-studying (and having fun), so THE REST OF THESE MUSINGS FROM HERE DOWNWARDS ARE NOTHING BUT SELF-MADE MIND-GAMES FOR FUN AND PROGRAMMING PRACTICE. Just keep that in mind. Also, if you have a comment: feedback and corrections are very welcome!
For today’s project I assumed that nouns and verbs are the essential Parts-of-speech in every language, since they are common to them all.
Next I wanted to see what would happen to my text understanding when stripping away all those “non-essential” POS.
Ready for some process talk? Here we go!
Getting to know NLTK
NLTK is full of pre-loaded corpora, and after watching an introduction I went to work on the presidential speeches.
I aimed to practice two basic concepts of NLP:
- Tokenization and
- POS tagging
NLTK has great wrappers for it all, so both can be achieved in just a few lines of code (check out tokenize_text() and tag_POS()).
After segmenting the text using word and sentence tokenizers, I attached the POS information to each word. And that’s where the serious preprocessing ends and the fields open up for applying stupid ideas. 😁
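For the curious, here is a minimal sketch of what these two steps can look like with NLTK's stock helpers. It is not the code from the repo: tokenize_and_tag() is a made-up name for illustration, and the sketch assumes the tokenizer and tagger models have already been fetched once with nltk.download() (the resource names depend on the NLTK version).

```python
from typing import List, Tuple

# One sentence = a list of (word, POS tag) pairs, as returned by nltk.pos_tag
TaggedSentence = List[Tuple[str, str]]


def tokenize_and_tag(raw: str) -> List[TaggedSentence]:
    """Split raw text into sentences, then words, then tag every word."""
    # Imported inside the function so a script using this sketch still
    # loads on a machine where NLTK is not installed.
    import nltk

    tagged_sentences = []
    for sentence in nltk.sent_tokenize(raw):   # sentence segmentation
        tokens = nltk.word_tokenize(sentence)  # word segmentation
        tagged_sentences.append(nltk.pos_tag(tokens))  # Penn Treebank tags by default
    return tagged_sentences
```

Assuming the text comes from NLTK's state_union corpus, something like state_union.raw("1962-Kennedy.txt") would provide the raw string to pass in.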
I went ahead and substituted all non-nouns/verbs with inconspicuous dots.
. <- yep, like that one.
Nouns and verbs were allowed to stay.
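The dotting step itself needs no NLTK at all once the words carry their tags. Below is a rough sketch under two assumptions: the tags follow the Penn Treebank scheme (where noun tags start with "NN" and verb tags with "VB"), and keep_nouns_and_verbs() and stitch() are hypothetical names, not the functions from the repo. Which tags the original code keeps may well differ.

```python
from typing import Iterable, List, Tuple


def keep_nouns_and_verbs(
    tagged_sentence: Iterable[Tuple[str, str]], placeholder: str = "."
) -> List[str]:
    """Replace every word that is not tagged as a noun or verb with a dot."""
    kept = []
    for word, tag in tagged_sentence:
        # Assumption: Penn Treebank tags, so NN, NNS, NNP, ... are nouns
        # and VB, VBD, VBZ, ... are verbs.
        if tag.startswith(("NN", "VB")):
            kept.append(word)
        else:
            kept.append(placeholder)
    return kept


def stitch(sentences: Iterable[List[str]]) -> str:
    """Join the words of each sentence with spaces, one sentence per line."""
    return "\n".join(" ".join(words) for words in sentences)


# Hand-tagged example, so the sketch runs without any models
tagged = [("People", "NNS"), ("speak", "VBP"), ("a", "DT"), ("lot", "NN"), (".", ".")]
print(stitch([keep_nouns_and_verbs(tagged)]))
```

Checking tag prefixes keeps the filter short, at the price of treating all noun and verb subtypes alike.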
Getting to know presidents
Then I stitched the speech back together. Here’s what the beginning of JFK’s 1962 speech looks like in its essence:
PRESIDENT JOHN F. KENNEDY . ANNUAL ADDRESS TO A JOINT SESSION OF CONGRESS ON THE STATE . THE UNION . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
. Mister Sam . Rayburn . . .
. . House . . . . . . . . .
. . . Congress . . Constitution . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . State . . Union . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . North . . South . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . ECONOMY . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . Mr. Khrushchev . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . Congress . . . First . . Manpower Training . Development Act . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Second . . Youth Employment Opportunities Act . . . . . . . . . . . . Americans . . . . . . . . . . . . . . . . Americans . . . . . . . . . Third . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . First . Presidential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Second . Presidential . . . . . . . . . . . . . . . Federal . . . . . . . Third . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Congress . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . World War II .
. . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . Government . . . . . . . . . . . . . . . . . . . . Federal Pay Reform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Federal Budget .
. . . . . . . . . . . . . . . . . . . . . First . . . . . . . . . . . . . Secondly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Third . . . . . . . . . . . . . . . . . . . . . . . . . . .
GETTING AMERICA MOVING . . . . . . . . . . . Budget .
. . . . . . . . . . . . . . .
. A . America . . . . . America . . . . . America . . .
. . . . . . . . . . . . . . . . .
I actually found it surprisingly interesting to look at different presidents’ speeches after applying the cloze deletion pipeline.
There is quite a lot of meaning that can be deduced after erasing everything but nouns and verbs — even in the above example where only nouns are left over, a certain general topic and mood of the speech can be deduced.
Check out the “full” essential version live on AWS.
It’s also an example of what the auto-generated pages from my code look like. Nice parchment, eh?
Of course, missing words open up a hallway of doors to misinterpretation, so better not use it to claim anything substantial. On the other hand, an abundance of words does the same. So I guess we’re just looping back to the fact that languages are messy. 😉
That’s it for today. If you feel like pushing some speeches to go for a run and have them return dotted and exhausted, please go ahead and fetch the code.
Let me know if you find something fun or exciting after filtering for POS!