🌳📖💻#4:😶 — POS-Deletion

Unconditionally Universal Speeches

Look at the thing, check out the code, read below if you’re interested : )

inspiring JFK-essentials
People speak a lot.

Political speeches, for example, tend to be lengthy (but honestly, everyone’s speeches are).

So today, playing with the linguistic concept of Language Universals, I wrote some code that weeds through speeches, taking out everything except nouns and verbs. 
Reading what is left of a speech gives you a maybe pensive, maybe revealing, but most probably just ten-seconds-of-fun digest of some past US president’s mumblings.

Hope you’ll enjoy : )

What are Language Universals

Linguistics defines two types of Language Universals for natural human languages: unconditional ones and conditional ones.

And actually, the difference between them is neatly spelled out by the two words themselves. (Oh, those linguists… 😉 )

While conditional Language Universals rely on some conditions to hold up (e.g. “if a language has inflection, it usually also has derivation”), unconditional Language Universals are true without further prerequisites.

In my code I will focus on one of the unconditional LUs, namely:

Every language has nouns and verbs.

Ok. Easy. So, let’s take this a step further…

DISCLAIMER: I’m just self-studying (and having fun), so THE REST OF THESE MUSINGS FROM HERE DOWNWARDS ARE NOTHING BUT SELF-MADE MIND-GAMES FOR FUN AND PROGRAMMING PRACTICE. Just keep that in mind. Also, if you have a comment: feedback and corrections are very welcome!

For today’s project I assumed that nouns and verbs are the essential parts of speech (POS) in every language, since every language has them.

Next I wanted to see what would happen to my text understanding when stripping away all those “non-essential” POS.

Ready for some process talk? Here we go!

Getting to know NLTK

NLTK is full of pre-loaded corpora, and after watching an introduction I set to work on the presidential speeches.

I aimed to practice two basic concepts of NLP:

  • Tokenization and
  • POS tagging

NLTK has great wrappers for it all, so both can be achieved in just a few lines of code (check out tokenize_text() and tag_POS()).

After segmenting the text using word and sentence tokenizers, I attached the POS information to each word. And that’s where the serious preprocessing ends and the fields open up for applying stupid ideas. 😁
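A stripped-down sketch of those two steps could look like this. It is a simplification for illustration, not the exact code from the repo: the function names echo the ones mentioned above, but the bodies are assumptions, and the `tagger` parameter is a hypothetical extra that lets you plug in a different tagger. It assumes NLTK is installed and that its tokenizer and tagger models have been fetched with nltk.download() (the exact resource names depend on the NLTK version).

```python
def tokenize_text(text):
    """Split raw text into sentences, then each sentence into word tokens."""
    # Imported here so the file still loads where NLTK is not installed.
    import nltk

    return [nltk.word_tokenize(sentence) for sentence in nltk.sent_tokenize(text)]


def tag_POS(tokenized_sentences, tagger=None):
    """Attach a POS tag to every token, sentence by sentence.

    `tagger` (hypothetical parameter) is any callable that maps a list of
    tokens to a list of (token, tag) pairs. It defaults to nltk.pos_tag,
    whose default English tagger returns Penn Treebank tags (NN, VBD, ...).
    """
    if tagger is None:
        import nltk

        tagger = nltk.pos_tag
    return [tagger(tokens) for tokens in tokenized_sentences]


# Usage (needs NLTK and its downloaded models):
#   tagged = tag_POS(tokenize_text("People speak a lot. Speeches are lengthy."))
#   for sentence in tagged:
#       print(sentence)
```

The result is a list of sentences, each one a list of (word, tag) pairs, which is a handy shape for any per-word filtering afterwards.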

I went ahead and substituted all non-nouns/verbs with inconspicuous dots.
. <- yep, like that one.
Nouns and verbs were allowed to stay.
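In code, the dotting boils down to one check per word. Here is a minimal sketch, assuming the input is a list of (word, tag) pairs with Penn Treebank tags (the format nltk.pos_tag returns by default), where noun tags start with NN and verb tags with VB. The helper names `is_noun_or_verb`, `dot_out` and `stitch` are made up for this example and may not match the repo.

```python
PLACEHOLDER = "."


def is_noun_or_verb(tag):
    """Penn Treebank noun tags start with NN, verb tags with VB."""
    return tag.startswith(("NN", "VB"))


def dot_out(tagged_sentence, placeholder=PLACEHOLDER):
    """Keep nouns and verbs, swap every other token for the placeholder."""
    return [
        word if is_noun_or_verb(tag) else placeholder
        for word, tag in tagged_sentence
    ]


def stitch(tagged_sentences):
    """Join the dotted tokens back into text, one sentence per line."""
    return "\n".join(" ".join(dot_out(sentence)) for sentence in tagged_sentences)


# A hand-tagged sentence, so the example runs without NLTK.
example = [("People", "NNS"), ("speak", "VBP"), ("a", "DT"), ("lot", "NN"), (".", ".")]
print(stitch([example]))  # prints: People speak . lot .
```

One design choice hiding in here: modal verbs such as "can" or "will" carry the tag MD, so this rule dots them out. Whether they should count as verbs is a judgment call.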

Getting to know presidents

Then I stitched the speech back together. Here’s what the beginning of JFK’s 1962 speech looks like in its essence:

PRESIDENT JOHN F. KENNEDY . ANNUAL ADDRESS TO A JOINT SESSION OF CONGRESS ON THE STATE . THE UNION . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
. Mister Sam . Rayburn . . .
. . House . . . . . . . . .
. . . Congress . . Constitution . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . State . . Union . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . North . . South . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . ECONOMY . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . Mr. Khrushchev . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . Congress . . . First . . Manpower Training . Development Act . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Second . . Youth Employment Opportunities Act . . . . . . . . . . . . Americans . . . . . . . . . . . . . . . . Americans . . . . . . . . . Third . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . First . Presidential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Second . Presidential . . . . . . . . . . . . . . . Federal . . . . . . . Third . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Congress . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . World War II .
. . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . Government . . . . . . . . . . . . . . . . . . . . Federal Pay Reform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Federal Budget .
. . . . . . . . . . . . . . . . . . . . . First . . . . . . . . . . . . . Secondly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Third . . . . . . . . . . . . . . . . . . . . . . . . . . .
GETTING AMERICA MOVING . . . . . . . . . . . Budget .
. . . . . . . . . . . . . . .
. A . America . . . . . America . . . . . America . . .
. . . . . . . . . . . . . . . . .

I actually found it surprisingly interesting to look at different presidents’ speeches after applying the cloze deletion pipeline.

Quite a lot of meaning survives after erasing everything but nouns and verbs. Even in the example above, where only nouns are left over, the general topic and mood of the speech can still be made out.

Check out the “full” essential version live on AWS.
It’s also an example of what the auto-generated pages from my code look like. Nice parchment, eh?

Of course, missing words open up a hallway of doors to misinterpretation, so better not use this to claim anything substantial. On the other hand, an abundance of words does the same. So I guess we’re just looping back to the fact that languages are messy. 😉


That’s it for today. If you feel like pushing some speeches to go for a run and have them return dotted and exhausted, please go ahead and fetch the code.

Let me know if you find something fun or exciting after filtering for POS!