In which I probably invented a whole new branch of mathematics and computer science
As Sammy L once said back in ’93: hold on to your butts. I hope you clicked this link looking for CONTENT because I’m about to deliver on a scale hitherto undreamt.
I was reading about randomness the other day (like one does), when I stumbled upon the claim that, and I quote, “a truly random string cannot be represented by a shorter string.” Now, faithful readers, you know I’m only a fake Computer Scientist, so my neck-beard does not contain this particular half-masticated morsel. In plain-speak: I do not know whether or not this statement is true. I strongly suspect that it is not — at least, not in such generous terms. While we’re at it, there’s the additional issue that the proverb uses the term “truly random” without explaining the difference between random and truly random.
Regardless of whether this statement is truly true or truly false, I felt the need to run a quick experiment based on these probably not-completely-true assumptions. The idea is simple: if compression tells us something useful about the randomness of a string, then can we measure the “randomness” of an author?
Stupid, right?
If it’s unclear: yes, you’ve been duped! This article is mostly about me trying a stupid idea with a semi-erroneous premise to half-prove a literary conjecture.
The experiment
This took all of twenty minutes (less time than writing this article), so don’t judge me too harshly for trying out a stupid idea.
First, I downloaded 20 classic books from Project Gutenberg in raw text. Next, I wrote a script that takes each corpus and compresses it with zlib. The script also generates a “random” string the same length as each book and compresses that string. Then, I compare the compressed book and the compressed string to output a scalar I’ve modestly called “Jordan’s Scalar”. Since I am the inventor of this exquisite scalar and get to pick its definition, I denote Jordan’s Scalar with the Hebrew character “ayin”, or ע. It looks cool and makes me seem smart. Also I think that probably all the Greek characters are taken by real mathematics and the Greeks can shove it anyway.
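For the curious, here is a minimal sketch of what such a script might look like. It is not the original twenty-minute masterpiece, and it leans on two assumptions: that the “random” string is uniformly drawn printable ASCII, and that ע is the compressed size of the book divided by the compressed size of the random string. The names compressed_size, random_text and jordans_scalar are made up for illustration.

```python
import random
import string
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib produces for `data` at its highest compression level."""
    return len(zlib.compress(data, 9))


def random_text(length: int, seed: int = 0) -> bytes:
    """Assumed stand-in for the "random" string: uniform printable ASCII."""
    rng = random.Random(seed)  # seeded so repeated runs give the same string
    return "".join(rng.choices(string.printable, k=length)).encode("ascii")


def jordans_scalar(book: bytes, seed: int = 0) -> float:
    """Assumed definition of the scalar: compressed book / compressed random string."""
    baseline = random_text(len(book), seed)  # same length as the book
    return compressed_size(book) / compressed_size(baseline)


# Tiny in-memory demo: highly repetitive text should score far below 1.
repetitive = b"It is a truth universally acknowledged. " * 500
print(jordans_scalar(repetitive))
```

Under these assumptions, a value near 1 means the text compresses about as badly as random characters do, and a value near 0 means it is very predictable. To try it on a real book, read the Project Gutenberg text file in binary mode and pass the bytes to jordans_scalar.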
The hypothesis
Here is the list of books I chose:
Alice in Wonderland
The Count of Monte Cristo
Moby Dick
Pride and Prejudice
Sherlock Holmes
A Tale of Two Cities
Ulysses
Frankenstein
Dracula
The Shakespeare Garden Club
A Picture of Dorian Grey
Crime and Punishment
War and Peace
Grimm's Fairy Tales
The Great Gatsby
The Iliad
Jekyll and Hyde
The Metamorphosis
The Adventures of Tom Sawyer
Don Quixote
A good selection, I know. These books are of varying lengths (War and Peace is 88x as long as The Shakespeare Garden Club), which is why ע must be normalized: each compressed book is compared against a compressed random string of the same length as the book. Come on folks, this is Statistics 101, which is the only stats class I’ve ever taken… about 15 years ago… and I think I got a B.
Now I know what you’re thinking: “come on, ע is probably just going to be the same for all books or statistically irrelevant” but — HAH! This is only if you’re a literary pleb that hasn’t read both Harry Potter and Ulysses. My hypothesis, because I am NOT a literary pleb and have read Harry Potter (but still not Ulysses obviously), was that of those 20 books, Ulysses would be the most random. I’ve heard stories of Episode 18, which famously has no punctuation and reads like a Lorem Ipsum generator.
At the bottom of the list — I honestly had no idea. Which of these classics would be the least random?
The results
STAGGERING. If you’re not sitting down, I’d suggest at the very least donning some kneepads because there’s more of lorem than of ipsum about this gravy train. Believe it or not, there actually IS a clear winner:
Can you believe this? The winner — the most random book, as hypothesized, and now measured given the definition of ע above — ACTUALLY IS Ulysses.
Good gravy I’m incredible.
At the bottom of the list, surprising at first but maybe less so given the other books on the list, is Jane Austen’s Pride and Prejudice.
The takeaway
The takeaway here is that you should always take the requisite 20 minutes to write a dumb script about a stupid idea to give you a probably-meaningless result. After all, if you don’t do it, who will???