xKey Anti-Stylometric Keyboard: How Your Typing Can Reveal Your Identity.

By Svilen. Published in Svilen's Realm. 7 min read. Jun 16, 2015.

The Case for Anonymity

As much as we Americans celebrate identity, anonymity is an integral part of civilized society. Anonymity is a mask behind which people can speak openly without fear of judgement, rejection, embarrassment or harm. It has fostered marvelous creations such as Reddit, Wikipedia and Bitcoin. Over the course of history, artists have produced classics behind pseudonyms, philanthropists have made charitable donations, hacktivists have banded together against entities deemed unjust and whistleblowers have exposed wrongdoing, all under the cloak of anonymity. There’s a reason why we vote anonymously, call anonymous police tip lines and answer anonymous workplace surveys.

Depending on time period, geography and government, messengers of certain ideas may face excruciating consequences. Ananta Das, Washiqur Rahman and Avijit Roy were brutally murdered in Bangladesh this year for merely expressing their secular opinions online. Raif Badawi, a Saudi Arabian blogger, was sentenced to 1,000 lashes and 10 years in prison for the same “crime”. And just recently, 17-year-old Amos Yee was arrested in Singapore for criticizing Lee Kuan Yew, Singapore’s first prime minister. We need tools that let people safely propagate ideas, opinions, views and beliefs without fear of prosecution by oppressive governments or threats from individuals with opposing views.

Considering Snowden’s revelations, it’s probably safe to assume that nearly everything you do online is being monitored; it’s just a matter of who is watching and for what reason. Although emphasis on secure connections and encryption has increased as a result, there’s an invisible threat to anonymity that very few people are aware of. Even if you spoofed your MAC and IP addresses, used public WiFi to VPN to a computer on the opposite side of the globe and then used Tor to post an anonymous message on a forum, the one thing you’ve likely overlooked which could give away your identity…

is your writing style.

Stylometry at work

The first prominent use of stylometry dates back to the 15th century. In his 1439 work, Lorenzo Valla proved that the Donation of Constantine, a decree supposedly written in the 4th century, was actually a forgery. He compared its Latin structures and vocabulary with those in use at that time and determined that the document was most likely written sometime in the 8th century. In one instance he noticed the word “satrap” being used to refer to Roman officials, when in fact the word had not acquired that meaning until much later, in the 8th century.

Double Falsehood is an 18th-century play which has long been considered a forgery, as its author Lewis Theobald claimed it was based on a lost Shakespeare play. In 2015 a stylometric analysis detected Shakespeare’s “psychological signature” in the content of the play and concluded that he was in fact a collaborator.

In 2008, the blueprint documenting the workings of the cryptocurrency Bitcoin was published online under the pseudonym Satoshi Nakamoto. For whatever reason the author(s) chose to remain unidentified. Fast-forward to December of 2013, when researcher Skye Grey published fascinating findings from a quest to identify the notorious creator. Grey ran a stylometric analysis on the original whitepaper and then searched the internet for writings that matched its style. The results? Several direct matches in blog posts written by Nick Szabo, a computer scientist and legal scholar with a law degree from George Washington University, who coincidentally had spent the past decade developing a bitcoin-like system called “bitgold.” Just to give you an idea of the level of analysis, Grey shared the following results which seem to implicate Szabo:

  • Repeated use of “of course” without isolating commas, contrary to convention (“the problem of course is”)
  • The expression “can be characterized”, frequent in Nick’s blog, is found in 1% of all crypto papers.
  • Use of “for our purposes” when describing hypotheses (found in 1.5% of crypto papers)
  • Starting sentences with “It should be noted” (found in 5.25% of crypto papers)
  • Use of “preclude” (found in 1.5% of crypto papers)
  • Expression “a level of” + noun (“achieves a level of privacy by…”) as a standalone qualifier
  • Expression “timestamp server”, central in the Bitcoin paper, used in Nick’s blog as early as January 2006
  • Repeated use of expression “trusted third party”
  • Expressions “cryptographic proof” and “digital signatures”
  • Repeated use of “timestamp” as a verb

According to Grey, the probability of finding the phrases “it should be noted”, “for our purposes”, “can be characterized” and “preclude” all used by the same researcher is 0.08%. Szabo denies the claim, and while it is possible that somebody imitated his writing style, the circumstantial evidence suggests otherwise. But I digress… the point is that a very substantial lead has emerged thanks to stylometry.
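To give a rough feel for this kind of marker-phrase comparison, here is a minimal Python sketch. It is not Grey's actual method, which isn't described in detail here; the `MARKERS` list and the function names are my own illustrative assumptions, built from the phrases quoted above.

```python
import re

# Hypothetical marker list, taken from the phrases quoted in the list above.
MARKERS = [
    "it should be noted",
    "for our purposes",
    "can be characterized",
    "preclude",
    "of course",
    "trusted third party",
]


def marker_counts(text, markers=MARKERS):
    """Count case-insensitive, whole-phrase occurrences of each marker."""
    lowered = text.lower()
    counts = {}
    for phrase in markers:
        # \b anchors the phrase at word boundaries, so "preclude" does not
        # match inflected forms such as "precluded" in this simple version.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, lowered))
    return counts


def shared_markers(text_a, text_b, markers=MARKERS):
    """Return the markers that appear at least once in both texts."""
    a = marker_counts(text_a, markers)
    b = marker_counts(text_b, markers)
    return sorted(p for p in markers if a[p] > 0 and b[p] > 0)
```

Calling `shared_markers(whitepaper_text, blog_text)` on two documents would list the phrases they have in common. A serious analysis would also have to measure how rare each phrase is across a large reference corpus, which this sketch does not attempt.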

It’s not hard to imagine that with today’s Artificial Intelligence, stylometry is not only more accurate than ever, but also provides more insight into the author’s personality and cognitive characteristics. With a minimum of just 100 words, IBM’s Watson (the first A.I. to defeat a human in the quiz show Jeopardy!) can detect an author’s personality characteristics and traits. It uses “…linguistic analytics to extract a spectrum of cognitive and social characteristics from the text data that a person generates through blogs, tweets, forum posts, and more.” It’s publicly open for anyone to try for free; we can only imagine what the premium version has to offer. Curious as always, I pasted in the content of this very article, and this is Watson’s analysis of my personality based on my writing:

You are shrewd, skeptical and tranquil.

You are imaginative: you have a wild imagination. You are philosophical: you are open to and intrigued by new ideas and love to explore them. And you are independent: you have a strong desire to have time to yourself.

Your choices are driven by a desire for self-expression.

You are relatively unconcerned with both tradition and taking pleasure in life. You care more about making your own path than following what others have done. And you prefer activities with a purpose greater than just personal enjoyment.

My personality visualization based on the text of this article

As you can see, computers are not only exceptionally good at quantifying writing styles but also have the uncanny ability to use that information in unexpected ways. An algorithm developed by researchers at Stony Brook University can predict the commercial success of a novel based solely on its writing style with an 84% success rate.

Ross Ulbricht, aka Dread Pirate Roberts, the mastermind behind the infamous Silk Road site which served as a black market for drugs, weapons and fake documents, was also well aware of the potential danger of stylometry being used against him. At the time of his arrest in a San Francisco public library, the FBI captured images of his laptop screen as evidence. Guess what he had bookmarked: “Science of Stylometry.”

“Science of Stylometry” bookmark on Ross Ulbricht’s laptop at time of arrest

The Solution

In order to defeat stylometric analysis, a piece of writing essentially needs to be scrubbed clean of its distinctive style. A tool called Anonymouth already exists and does just that. The biggest problem with it, however, is that it’s out of reach for non-developers, as it requires a vast amount of technical knowledge and additional software (I spent a good 10 minutes trying to set it up, unsuccessfully).

The second problem is that it’s PC-based. Today smartphones and tablets tremendously outnumber desktop computers. We need a solution that caters to mobile platforms and is effortless to use across different contexts: messaging apps, email clients, social networks and web browsers.
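To make the idea of scrubbing a text of its distinctive style a bit more concrete, here is a toy Python sketch. It is not how Anonymouth works and it is not the design I propose; the substitution tables and the `scrub` function are hypothetical and only flatten a few surface habits.

```python
import re

# Hypothetical tables for illustration: distinctive phrase -> plainer wording.
CONTRACTIONS = {
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
    "i'm": "I am",
}
NEUTRAL_PHRASES = {
    "it should be noted that": "note that",
    "for our purposes": "here",
    "preclude": "prevent",
}


def _replace(text, old, new):
    """Swap whole-phrase matches of old for new, ignoring case."""
    def swap(match):
        # Keep a leading capital so sentence starts stay capitalized.
        if match.group(0)[0].isupper():
            return new[0].upper() + new[1:]
        return new
    return re.sub(r"\b" + re.escape(old) + r"\b", swap, text, flags=re.IGNORECASE)


def scrub(text):
    """Flatten a few surface habits; not a real defense against stylometry."""
    # Normalize typographic quotes and ellipses to plain ASCII.
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2026", "...")
    # Collapse repeated punctuation such as "!!!" or "??" to a single mark.
    text = re.sub(r"([!?])\1+", r"\1", text)
    # Expand contractions first, then swap distinctive phrases.
    for table in (CONTRACTIONS, NEUTRAL_PHRASES):
        for old, new in table.items():
            text = _replace(text, old, new)
    # Collapse runs of whitespace, for example double spaces after a period.
    return re.sub(r"\s+", " ", text).strip()
```

Even this toy version shows why the problem is hard. Swapping a handful of phrases leaves deeper habits such as sentence length and word choice untouched, which is why the rewriting needs proper tooling and shouldn't be left to the writer.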

I designed a solution…

What I hope to achieve with this article is to shed light on this ever-increasing threat to anonymity, which most people are not aware exists. It would be great if any developers are interested in the project, but the important thing to keep in mind is that as A.I. advances, so do the possibilities for indirectly identifying dissidents, activists, whistleblowers and freethinkers. We need to think of ways to apply these principles not just to writing but to audio and video as well: software that scrambles our voices and facial features in a way that’s non-identifiable even by the most advanced algorithms. I believe we should be able to freely speak our minds without fear. Technology is a double-edged sword; one side can be used to eliminate anonymity, but it’s time we sharpen the opposite side and preserve our right to be masked.

Check out the full project.
