Understanding Nadsat: an exercise in text mining

I’ve been a fan of Kubrick’s ‘A Clockwork Orange’ for years now, and yet I have managed to successfully avoid Burgess’s brilliant novel for just as long. I think it’s mainly because I kept on hearing how many readers found it frustrating to deal with the excessive use of Nadsat, a Russian-influenced argot invented by Burgess himself.
Finally, one fine January weekend, I picked it up (bought it from the Kindle Store, to be precise) and immersed myself in the author’s dystopian world, which was indeed… full of Nadsat. Knowing Russian allowed me to get a grip on the argot pretty fast, and I ended up enjoying the book just as much as the film, if not more.
However, it left me thinking not only about the story itself, but also about how it was written: ‘How many Nadsat words did Burgess use in the book?’, ‘Which words appear most frequently?’, etc. So I set out to practice my basic text mining skills in Python and find out more about this fine piece of literature.


BOOK

Book: ‘A Clockwork Orange’, A. Burgess, 1963. Published by W. W. Norton & Company in 2011. This edition includes the last chapter not published in the first edition and the author’s introduction ‘A Clockwork Orange Resucked’.


PROCESS

  1. If you haven’t already, install Python (I used v2.7). Don’t forget you’ll need nltk and pattern (though it’s possible to do everything with nltk only); io is part of the standard library and needs no installation. I’m also using Jupyter Notebook, which makes it easier to work with Python and see the results right away.
  2. Get your book ready. I’ve converted mine to .txt, but it should be possible to load other types of files too.
  3. Download an English dictionary / word list from here. Any other dictionary or word list will work too. You can probably achieve the same with nltk’s wordnet / synset, though.
  4. Download my Python notebook here and use Jupyter Notebook to run everything cell by cell, and enjoy the output. Alternatively, use the code below and execute it in whatever environment you’re working in.
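
The steps above boil down to tokenizing the book, counting words, and keeping whatever is missing from the English word list. Here is a minimal sketch of that idea, not the notebook’s actual code: it uses a plain regular expression instead of nltk’s tokenizer, and the sample sentence, the tiny word set, and the function names are all made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters as tokens.

    A simple regex stand-in for nltk's tokenizer (an assumption of this
    sketch); it splits contractions such as "don't" into "don" and "t".
    """
    return re.findall(r"[a-z]+", text.lower())

def find_unknown_words(tokens, english_words):
    """Return every token that is not in the English word list (duplicates kept)."""
    return [token for token in tokens if token not in english_words]

# Made-up sample text and a tiny stand-in for the downloaded word list.
sample_text = "The droogs viddy a malenky veck. The veck could not viddy the droogs."
english_words = {"the", "a", "could", "not"}

tokens = tokenize(sample_text)
candidates = find_unknown_words(tokens, english_words)

print("Words: %d" % len(tokens))
print("Unique words: %d" % len(set(tokens)))
print("Candidate Nadsat words: %d" % len(candidates))
print("Unique candidate Nadsat words: %d" % len(set(candidates)))
```

For a whole book, you would read the .txt file into the text variable and load the word list into a set (one lowercased word per line), so that membership checks stay fast.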

RESULTS

Words in the book: 62157

Unique words: 6113

and…

Nadsat words: 3556

Unique Nadsat words: around 650*

The (likely incorrect) sentiment of the book: 0.049662096335073194 (polarity) and 0.4820986485626106 (subjectivity). On a polarity scale from -1.0 (very negative) to 1.0 (very positive), ‘A Clockwork Orange’ comes out as rather neutral. On a subjectivity scale from 0.0 (very objective) to 1.0 (very subjective), it sits roughly in the middle**. See NOTES for more info.

And finally the top 50 Nadsat words (incl. freq & translation) you should know to better understand the book***:

Nadsat — Frequency — English

veck — 155 — a human, a person

viddy — 132 — to see

horrorshow — 109 — cool, awesome, great

malenky — 99 — small

viddied — 76 — saw, have seen

glazzies — 65 — eyes

goloss — 65 — a voice

gulliver — 65 — a head

litso — 64 — a face

bolshy — 50 — big

skorry — 50 — fast

bezoomny — 45 — crazy

rookers — 44 — hands

slooshy — 43 — to listen

veshch — 41 — a thing

platties — 40 — clothes

droogs — 40 — friends

chelloveck — 39 — a human, a person

krovvy — 37 — blood

vecks — 36 — people

vonny — 35 — smelly, stinky

devotchka — 32 — a girl

ptitsa — 32 — a girl, a chick (lit. a bird in Russian)

grahzny — 31 — dirty

govoreeting — 31 — talking, speaking

malchicks — 31 — boys

smeck — 30 — a laugh

veshches — 30 — things

smecking — 30 — laughing

tolchock — 30 — a push, a kick, a thrust

millicents — 29 — policemen

slovos — 29 — words

ittied — 25 — walked

droog — 25 — a friend

lewdies — 24 — people

britva — 24 — a razorblade

nogas — 23 — legs

ptitsas — 23 — girls, chicks (lit. birds in Russian)

viddying — 22 — seeing, watching

kashl — 21 — a cough

zoobies — 21 — teeth

moloko — 20 — milk

peeting — 20 — drinking

malchick — 20 — a boy

slooshied — 19 — listened

gromky — 19 — loud

millicent — 19 — a policeman

dratsing — 18 — fighting

(oddy) knocky — 18 — alone, on one’s own
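
A frequency ranking like the one above can be produced with collections.Counter from the standard library. This is a small sketch; the short list of candidate words is made up and stands in for the real output of the dictionary filter.

```python
from collections import Counter

def top_words(words, n):
    """Return the n most frequent words as (word, frequency) pairs."""
    return Counter(words).most_common(n)

# Made-up candidate list; in practice this would be every non-dictionary token.
candidate_words = ["veck", "viddy", "veck", "droogs", "viddy", "veck", "malenky"]

ranking = top_words(candidate_words, 2)
for word, freq in ranking:
    print("%s - %d" % (word, freq))
```

Since every inflected form is a separate key in the counter, forms such as veck and vecks are ranked separately, as in the list above.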


FINAL WORDS

I haven’t done this type of book analysis in Python before, so the methods and code above can definitely use some love and attention. Furthermore, I only did the very basics. There’s so much more to learn about ‘A Clockwork Orange’ and the use of Nadsat, so I hope this will inspire someone to do a more thorough analysis of this book.
Take a look at the notes below to understand the shortcomings of this quick analysis.


NOTES

*This number is not entirely precise, since it includes some English words that my English word list did not contain. However, it’s pretty safe to say that the book contains around 650 unique words that can be considered part of Nadsat (e.g. veck, malenky, but also splussshhhh, brrrrrrr, flickflickflicked, horrorshow etc.), even though some of them are simply joined-up English words or variations thereof.
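
One possible way to tighten that estimate is to flag sound effects separately from real Nadsat vocabulary. The sketch below is a rough heuristic of my own, not something the analysis above used: it assumes that stretched letters or an immediately repeated chunk of letters signal a sound effect, and the function name is made up.

```python
import re

# Three or more identical letters in a row, e.g. "brrrrrrr".
STRETCHED = re.compile(r"([a-z])\1{2,}")
# A chunk of three or more letters immediately repeated, e.g. "flickflick".
REPEATED_CHUNK = re.compile(r"([a-z]{3,})\1")

def looks_like_sound_effect(word):
    """Heuristic guess: True if the word has stretched letters or a repeated chunk."""
    return bool(STRETCHED.search(word) or REPEATED_CHUNK.search(word))

candidates = ["veck", "malenky", "splussshhhh", "brrrrrrr",
              "flickflickflicked", "horrorshow"]

sound_effects = [w for w in candidates if looks_like_sound_effect(w)]
likely_nadsat = [w for w in candidates if not looks_like_sound_effect(w)]

print(sound_effects)
print(likely_nadsat)
```

A heuristic like this will still miss English words that are simply absent from the word list, so the result remains an approximation.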

**Now, this method works quite well for regular articles, but due to the peculiar language of the book, there’s a need for a custom model that could not only identify the sentiment of purely Nadsat words, but also of English words that are used as part of Nadsat, and the combinations thereof. For instance, the word horrorshow might not sound very positive in English and yet it has a positive connotation in Nadsat.
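
To illustrate the idea of a custom model, here is a deliberately simplified sketch: a lexicon that averages word scores, extended with Nadsat-specific overrides. This is not how pattern computes sentiment; the scores, the lexicon contents, and the function name are all invented for the example.

```python
def average_polarity(tokens, lexicon):
    """Average the polarity of the tokens found in the lexicon (0.0 if none are found)."""
    scores = [lexicon[token] for token in tokens if token in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / float(len(scores))

# Invented scores for a few English words.
english_lexicon = {"great": 0.75, "dirty": -0.5, "horror": -0.75}
# Invented overrides that give Nadsat words the score of their English meaning.
nadsat_overrides = {"horrorshow": 0.75, "grahzny": -0.5}

custom_lexicon = dict(english_lexicon)
custom_lexicon.update(nadsat_overrides)

tokens = ["real", "horrorshow", "music"]
english_score = average_polarity(tokens, english_lexicon)
custom_score = average_polarity(tokens, custom_lexicon)

print(english_score)
print(custom_score)
```

With the English-only lexicon the phrase scores as neutral because horrorshow is unknown, while the extended lexicon picks up its positive meaning.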

***Since Nadsat might look a bit confusing at first, I decided to keep all variations of the words (e.g. plural/singular, past/present tense, etc.) in the top 50, to make it clearer.