Stylometry, Machine Learning, Pence and The Op-Ed

garykac
Sep 6, 2018 · 5 min read
[Image: Hamilton ($10) and Madison]

The recent anonymously-authored Op-Ed piece in the New York Times has raised some general interest in the topics of stylometry (the analysis of literary style) and authorship identification (determining who wrote a particular document).

The classic test case for the authorship problem is The Federalist Papers. This is a set of essays published anonymously in 1787–88 to support the newly proposed Constitution of the United States.† Today, we know that the essays were written by Alexander Hamilton, James Madison and John Jay, but the exact authorship for some of the essays was in dispute for many years until authorship identification techniques were able to assign ownership.

Given that these authorship identification techniques exist and that there also exists a set of people who are highly motivated to unmask the author of this Op-Ed, it’s useful to consider how this investigation might be carried out.

Investigating Authorship with Machine Learning

The process of using machine learning (ML) to identify the author of a document is similar to most ML problems — you need to gather data and then train a system specifically for the problem you’re trying to solve.

Here are the general steps:

(1) Gather a list of candidates

First, you need to gather a list of potential candidates who could have authored the document. In the general case this can be problematic since the list can be quite large. Usually, however, you have information about the document (or how it was published) that provides clues that can be used to narrow the list considerably.

In the case of the NYT Op-Ed, the author is identified as “a senior official in the Trump administration”, which narrows it down considerably (if you believe the attribution).

However, it is necessary to consider that the Times may have chosen to present the anonymous author as a “senior official” to help obscure their identity. But even if we assume that’s the case, the candidate list of “any official working in the White House” is still a manageable size.

This step of choosing the list of candidates to examine is crucial since the assumption being made is that the culprit is included in this list of candidates. While ML systems typically indicate their confidence level in the reported results, it’s easy to ignore this value and accept the output as the truth.

(2) Gather info about each candidate

Once you have your candidate list, the next step is to gather information about the candidates. This entails gathering documents that are known to have been authored by each of the candidates.

This can be challenging if the candidates do not have a large body of available work that can be examined. It’s also problematic if you have lots of data for some candidates, but very little for others.

(3) Build a model for each candidate

Once you’ve gathered the raw information about the candidates, the next step is to build a model from the data. Basically, this means converting the raw data from the documents into a set of features.

For text analysis, these features commonly include unigram (single-word) and bigram (word pair) frequencies. They can also include formatting and punctuation information.

Combining these features together forms the model for each candidate.

This feature selection is a key part of any machine learning model. Including all possible features because they might be useful makes the learning process more difficult and noisy. But removing features that you “know” are not going to be useful is applying bias that may compromise the accuracy of the model.
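The unigram/bigram frequency features described above can be sketched in a few lines of Python. This is a toy illustration of the idea, not the tooling a real investigation would use, and the sample sentence is invented:

```python
from collections import Counter

def extract_features(text):
    """Turn raw text into unigram and bigram relative-frequency features."""
    words = text.lower().split()
    total = len(words)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    # Normalize counts to relative frequencies so documents of
    # different lengths are comparable.
    features = {w: c / total for w, c in unigrams.items()}
    features.update({" ".join(b): c / max(total - 1, 1)
                     for b, c in bigrams.items()})
    return features

feats = extract_features("it is a truth universally acknowledged that it is so")
print(feats["it"])     # unigram frequency of "it" (2 of 10 words → 0.2)
print(feats["it is"])  # bigram frequency of "it is" (2 of 9 pairs)
```

A real system would add many more feature types (punctuation, sentence length, function-word rates), but they all reduce to the same idea: a vector of numbers characterizing each candidate’s writing.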

(4) Apply this model to the anonymously authored document

The final step is to apply the model to the document to find the best stylometric match for the writing style of the author. There will always be a best match, so you will always have someone to point your finger at even if the real culprit was not in your candidate list (oops!).
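The “always a best match” problem can be made concrete with a toy nearest-match comparison: score the anonymous document against each candidate’s feature vector and take the highest similarity. All names and writing samples below are invented for illustration:

```python
import math
from collections import Counter

def word_freqs(text):
    """Relative word frequencies for a text."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def cosine_similarity(a, b):
    """Cosine similarity between two {feature: frequency} dicts."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical candidate writing samples (invented for illustration).
candidates = {
    "candidate_a": word_freqs("whilst the measure is sound it remains whilst untested"),
    "candidate_b": word_freqs("while the plan works while others fail"),
}
anonymous = word_freqs("whilst we deliberate the moment passes")

# max() always produces a winner -- even if the real author
# was never in the candidate list at all.
best = max(candidates, key=lambda c: cosine_similarity(anonymous, candidates[c]))
print(best)  # → candidate_a (shares "whilst" and "the")
```

The fix is to report the similarity scores alongside the winner and to treat a weak best score as “no confident match”, rather than letting `max` pick a scapegoat.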

“Lodestar”

In the Federalist Papers, there was one word that served as a very strong signal to distinguish between Madison and Hamilton. Hamilton tended to use “while”, whilst Madison tended to use “whilst”. Because “whilst” was used in the essays where the authorship was in dispute, it indicated that Madison was the author of those essays.

A similar argument is being made with regards to the term “lodestar” and Vice President Pence being the author of the Op-Ed.

It is worth pointing out the potential problems with that assignment.

(1) While “lodestar” may be an uncommon term, Pence is not the only person to use it. In fact, Pence’s fondness for the word has most likely increased its usage recently, especially among our target candidates in the White House.

(2) Pence is fairly well-known for using the term “lodestar”, so someone seeking to obscure their authorship could have easily planted that term in the document. See [1] for more information on techniques for obscuring authorship.

With the Federalist Papers, there were many other signals beyond while/whilst that were used to determine authorship. Many of the documents were still confidently assigned to Madison even when this term was removed from the document. To have confidence in attributing this document to Pence, there would need to be other positive signals in addition to this.
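The while/whilst discriminator amounts to comparing marker-word rates between authors (per-1000-word rates are a common stylometric normalization). The sketch below illustrates the idea with invented stand-in samples, not actual Federalist Papers text:

```python
import re

def marker_rate(text, marker):
    """Occurrences of `marker` per 1000 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return 1000 * words.count(marker) / len(words)

# Invented samples standing in for each author's known writings.
hamilton_like = "while the states debate while the union waits we must act"
madison_like = "whilst the states deliberate we must whilst patient remain firm"
disputed = "whilst the people consider the question we remain attentive"

# Compare the disputed text's marker rate to each author's habit.
for author, sample in [("Hamilton", hamilton_like), ("Madison", madison_like)]:
    print(author, "whilst rate:", marker_rate(sample, "whilst"))
print("disputed whilst rate:", marker_rate(disputed, "whilst"))
```

A single marker like this is a weak signal on its own (and, as noted above, an easy one to plant deliberately); a confident attribution needs many such signals pointing the same way.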

Being Practical

Even if these authorship identification tools are being used to identify one or more suspects, it is unlikely they will be used as the sole evidence against anyone they identify.

That information would most likely be used to inform a more traditional investigation: to request phone logs or to find documents that directly incriminate the suspect rather than simply providing stylistic features.

The important thing to note is that, while there is no magic “machine learning” bullet to identify with certainty who wrote a particular document, your identity can be unmasked by the idiosyncrasies of your writing style. If you want to publish anonymous documents and remain anonymous, you need to take steps to obscure your writing signature.

† In those days, writing essays like this was a common way to engage in public discourse, similar to how Twitter is used in a modern setting, but without the 280 character limit.

[1] “Obfuscating Document Stylometry to Preserve Author Anonymity,” G. Kacmarcik, M. Gamon, 2006. https://www.microsoft.com/en-us/research/publication/obfuscating-document-stylometry-to-preserve-author-anonymity/
