Understanding Nadsat: an exercise in text mining
I’ve been a fan of Kubrick’s ‘A Clockwork Orange’ for years now, and yet I have managed to successfully avoid Burgess’s brilliant novel for just as long. I think it’s mainly because I kept on hearing how many readers found it frustrating to deal with the excessive use of Nadsat, a Russian-influenced argot invented by Burgess himself.
Finally, one fine January weekend, I picked it up (bought it from the Kindle Store, to be precise) and immersed myself in the author’s dystopian world, which was indeed… full of Nadsat. Knowing Russian allowed me to get a grip on the argot pretty fast, and I ended up enjoying the book just as much as the film, if not more.
However, it has left me thinking not only about the story itself, but also about how it was written: ‘How many Nadsat words did Burgess use in the book?’, ‘Which words appear most frequently?’, etc. And so I was on a mission to practice my basic text mining skills, using Python, and find out more about this fine piece of literature.
Book: ‘A Clockwork Orange’, A. Burgess, 1963. Published by W. W. Norton & Company in 2011. This edition includes the last chapter not published in the first edition and the author’s introduction ‘A Clockwork Orange Resucked’.
- If you haven’t already, install Python (I used v2.7). Don’t forget you’ll need nltk and pattern (though it’s possible to do everything with nltk only); io is part of Python’s standard library, so there is nothing to install for it. I’m also using Jupyter Notebook, which makes it easier to work with Python and see the results right away.
- Get your book ready. I’ve converted mine to .txt, but it must be possible to load other types of files too.
- Download an English dictionary / word list from here. Any other dictionary or word list will work too. You can probably achieve the same with nltk’s wordnet / synsets, though.
- Download my Python notebook here and use Jupyter Notebook to run everything cell by cell and enjoy the output. Alternatively, use the code below and execute it in whatever environment you’re working in.
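The steps above can be sketched roughly as follows. This is a minimal illustration rather than the code from the notebook: it uses only the standard library instead of nltk, the function names (load_word_list, tokenize, summarize) are my own, and the sample sentence and tiny word list are made-up stand-ins for the real book and dictionary. The idea is simply to split the text into lowercase words and treat every word missing from the English word list as a Nadsat candidate.

```python
import io
import re


def load_word_list(path):
    """Read one word per line into a lowercase set (the path is hypothetical)."""
    with io.open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def tokenize(text):
    """Lowercase the text and return every run of letters as a token.

    Simplification: apostrophes and hyphens split words, so "don't"
    becomes "don" and "t".
    """
    return re.findall(r"[a-z]+", text.lower())


def summarize(text, english_words):
    """Count all words, unique words, and words absent from the word list."""
    tokens = tokenize(text)
    candidates = [t for t in tokens if t not in english_words]
    return {
        "words": len(tokens),
        "unique_words": len(set(tokens)),
        "candidates": len(candidates),
        "unique_candidates": len(set(candidates)),
    }


# Made-up stand-ins for the book text and the English word list.
sample_text = "My droogs and me viddied a malenky veck. The veck had a bolshy goloss."
sample_english = {"my", "and", "me", "a", "the", "had"}

stats = summarize(sample_text, sample_english)
print(stats)
```

On the sample this counts 14 words, 12 unique words, 7 candidates and 6 unique candidates. For a real run you would pass the contents of your .txt file and the result of load_word_list on your downloaded word list instead of the stand-ins.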
Words in the book: 62157
Unique words: 6113
Nadsat words: 3556
Unique Nadsat words: around 650*
The (likely incorrect) sentiment of the book: 0.049662096335073194 (polarity) and 0.4820986485626106 (subjectivity), which means ‘A Clockwork Orange’ has a rather neutral sentiment: it’s neither very negative (-1.0) nor very positive (1.0). It also means that it’s neither very objective (0.0) nor very subjective (1.0)**. See the notes below for more info.
And finally the top 50 Nadsat words (incl. freq & translation) you should know to better understand the book***:
Nadsat — Frequency — English
veck — 155 — a human, a person
viddy — 132 — to see
horrorshow — 109 — cool, awesome, great
malenky — 99 — small
viddied — 76 — saw, have seen
glazzies — 65 — eyes
goloss — 65 — a voice
gulliver — 65 — a head
litso — 64 — a face
bolshy — 50 — big
skorry — 50 — fast
bezoomny — 45 — crazy
rookers — 44 — hands
slooshy — 43 — to listen
veshch — 41 — a thing
platties — 40 — clothes
droogs — 40 — friends
chelloveck — 39 — a human, a person
krovvy — 37 — blood
vecks — 36 — people
vonny — 35 — smelly, stinky
devotchka — 32 — a girl
ptitsa — 32 — a girl, a chick (lit. a bird in Russian)
grahzny — 31 — dirty
govoreeting — 31 — talking, speaking
malchicks — 31 — boys
smeck — 30 — a laugh
veshches — 30 — things
smecking — 30 — laughing
tolchock — 30 — a push, a kick, a thrust
millicents — 29 — policemen
slovos — 29 — words
ittied — 25 — walked
droog — 25 — a friend
lewdies — 24 — people
britva — 24 — a razorblade
nogas — 23 — legs
ptitsas — 23 — girls, chicks (lit. birds in Russian)
viddying — 22 — seeing, watching
kashl — 21 — a cough
zoobies — 21 — teeth
moloko — 20 — milk
peeting — 20 — drinking
malchick — 20 — a boy
slooshied — 19 — listened
gromky — 19 — loud
millicent — 19 — a policeman
dratsing — 18 — fighting
(oddy) knocky — 18 — alone, on one’s own
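A frequency table like the one above can be produced with a plain counter. The snippet below is only a sketch: top_words is a name I made up, the short candidates list stands in for the full list of non-dictionary words from the book, and the translations in the table were added by hand rather than generated.

```python
from collections import Counter


def top_words(candidates, n=50):
    """Return the n most frequent words as (word, frequency) pairs."""
    return Counter(candidates).most_common(n)


# Stand-in data; the real input is every non-dictionary word in the book.
candidates = ["veck", "viddy", "veck", "droog", "viddy", "veck"]

for word, freq in top_words(candidates, n=2):
    print("%s - %d" % (word, freq))
```

With the stand-in data this prints veck with a count of 3 followed by viddy with a count of 2.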
I haven’t done this type of book analysis in Python before, so the methods and code above can definitely use some love and attention. Furthermore, I only did the very basics. There’s so much more to learn about ‘A Clockwork Orange’ and the use of Nadsat, so I hope this is going to inspire someone to do a more thorough analysis of this book.
Take a look at the notes below to understand the shortcomings of this quick analysis.
*This number is not entirely precise, since it includes some English words that my English word list did not contain. However, it’s pretty safe to say that the book contains around 650 unique words that can be considered part of Nadsat (e.g. veck, malenky, but also splussshhhh, brrrrrrr, flickflickflicked, horrorshow etc.), even though some of them are simply joined-up English words or variations thereof.
**Now, this method works quite well for regular articles, but due to the peculiar language of the book, there’s a need for a custom model that could not only identify the sentiment of purely Nadsat words, but also of English words that are used as part of Nadsat, and the combinations thereof. For instance, the word horrorshow might not sound very positive in English and yet it has a positive connotation in Nadsat.
***Since Nadsat might look a bit confusing at first, I decided to keep all variations of the words (e.g. plural/singular, past/present tense, etc.) in the top 50, to make things clearer.
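To make the sentiment note concrete, here is a toy sketch of why a general-purpose lexicon misses Nadsat and how a custom one could help. It is not how pattern or nltk compute sentiment: the scoring rule (a plain average of per-word polarities), the function name polarity, and every score in both dictionaries are assumptions invented for illustration.

```python
def polarity(tokens, lexicon):
    """Average the polarity of the tokens found in the lexicon (0.0 if none)."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical scores: a generic English lexicon knows nothing about Nadsat.
generic_lexicon = {"good": 0.7, "dirty": -0.5}
# Hypothetical hand-made additions for Nadsat words.
nadsat_lexicon = {"horrorshow": 0.8, "grahzny": -0.5}

combined_lexicon = dict(generic_lexicon)
combined_lexicon.update(nadsat_lexicon)

tokens = ["a", "horrorshow", "evening"]
print(polarity(tokens, generic_lexicon))   # no known words, so neutral
print(polarity(tokens, combined_lexicon))  # horrorshow now counts as positive
```

With the generic lexicon alone the phrase scores 0.0 because none of its words are known, which is one plausible reason the book as a whole comes out looking neutral. With the hand-made entries added, the same phrase scores 0.8.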