Authorship Attribution through Markov Chain

Teng · Geek Culture · Sep 28, 2020

Given a piece of work, how do we identify its author? This article gives a brief overview of authorship attribution using Markov chains.

Authorship attribution (AA) is the task of identifying the author of a given text. It should not be confused with authorship profiling, which is concerned with the author’s characteristics such as age, gender, or ethnicity.

Let’s begin with two assumptions:

  1. Every individual has a unique writing style and applies it consistently throughout their work.
  2. The true author is among the candidate authors. That is, the prediction model always produces a single output from the pool of authors.

Stylometry Features

The main idea behind AA is that by extracting measurable textual features (stylometric features) from a text, we can identify the candidate author whose known work shares similar features.

There are many types of stylometric features, such as lexical features (words), character features (letter frequencies, n-grams, etc.), syntactic features (errors, phrase structure), semantic features (synonyms, functional words), and application-specific features (HTML, use of greetings in emails) (Stamatatos 2009).

Here, we focus on character-level features: they have the advantage of being largely language-independent and robust to spelling errors such as typos (compared to lexical features).
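As a minimal sketch of what a character-level feature looks like, the hypothetical helper below counts character bigrams in a text (the article itself uses transition counts, covered next):

```python
from collections import Counter

def char_bigrams(text):
    """Count character-level bigrams in a text (illustrative helper only)."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

# Overlapping two-character windows: "th", "he", "e ", " t", "th", "he", "em", "me"
print(char_bigrams("the theme"))
```

Because the counts are taken over characters rather than words, the same function works unchanged for most alphabetic languages.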

Markov Chain

Markov chain is a stochastic process that models the probability of a future state solely by its current state. It was introduced by the Russian mathematician Andrey Markov.

Text can be viewed as a sequence of words or characters, where the next word or character is governed by the stylometric style of the author.

Thus, the Markov chain can be applied and the transition probabilities between characters can be used as an authorship fingerprint to distinguish between authors.
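The idea of estimating transition probabilities from text can be sketched in a few lines; `transition_probs` below is an illustrative helper, not the article’s implementation (which follows in the Python section):

```python
from collections import defaultdict, Counter

def transition_probs(text):
    """Estimate P(next | current) by counting consecutive character pairs."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    # Normalise each character's counts into probabilities
    return {cur: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
            for cur, nxts in counts.items()}

probs = transition_probs("abab")
# In "abab", 'a' is always followed by 'b', so P(b | a) = 1.0
print(probs["a"]["b"])
```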

Log-likelihood

The likelihood of a sequence of text is the product of the probabilities of its characters. For instance, suppose we have the following probabilities:

P(H|T) = 0.02, i.e. the probability of the next character being H given the current character T is 0.02

P(E|H) = 0.03, i.e. the probability of the next character being E given the current character H is 0.03

The likelihood of the text THE is then 0.02 × 0.03 = 0.0006

Hence, the total likelihood of a long sequence of text becomes extremely small, making it hard to compare. Instead, the log-likelihood is used.

The log-likelihood of the text THE is then log(0.02 × 0.03) = log(0.0006) ≈ −3.22 (base 10)
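We can check the arithmetic in Python: taking logs turns the product of probabilities into a sum of log probabilities, which stays numerically stable even for long texts.

```python
import math

# The product P(H|T) * P(E|H) becomes a sum of base-10 logs
log_likelihood = math.log10(0.02) + math.log10(0.03)
print(round(log_likelihood, 2))  # -3.22
```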

Python Implementation

We will use a few books from Project Gutenberg as our training and testing data sets.

We start by importing the Gutenberg corpus from NLTK:

import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()
Output: the file IDs of the books in the NLTK Gutenberg corpus

There are 18 books in the NLTK Gutenberg corpus. We use ‘austen-sense.txt’ by Jane Austen as the testing data and compare it against two training texts: ‘austen-emma.txt’ by Jane Austen and ‘shakespeare-caesar.txt’ by Shakespeare.

emma = nltk.corpus.gutenberg.words('austen-emma.txt')
emma = ' '.join(emma)
caesar = nltk.corpus.gutenberg.words('shakespeare-caesar.txt')
caesar = ' '.join(caesar)
sense = nltk.corpus.gutenberg.words('austen-sense.txt')
sense = ' '.join(sense)

We limit the states of the Markov chain to the 26 lowercase English letters and the space character (“ ”).

state = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
         "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", " "]

# Create a char2id dictionary mapping each character to its index
char2id_dict = {}
for index, char in enumerate(state):
    char2id_dict[char] = index

Next, let’s define a function that creates a transition matrix from a given text:

import numpy as np

def create_transition_matrix(text):
    """Build a 27x27 row-normalised character transition matrix."""
    transition_matrix = np.zeros((27, 27))

    # Count transitions between consecutive characters
    for i in range(len(text) - 1):
        current_char = text[i].lower()
        next_char = text[i + 1].lower()

        if current_char in state and next_char in state:
            current_char_id = char2id_dict[current_char]
            next_char_id = char2id_dict[next_char]
            transition_matrix[current_char_id][next_char_id] += 1

    # Normalise each row so it sums to 1
    sum_of_each_row_all = np.sum(transition_matrix, 1)
    for i in range(27):
        single_row_sum = sum_of_each_row_all[i]
        if single_row_sum == 0:
            single_row_sum = 1  # avoid division by zero for unseen characters

        transition_matrix[i, :] = transition_matrix[i, :] / single_row_sum

    return transition_matrix

Now, let’s obtain the transition matrices of the training data:

TM_emma = create_transition_matrix(emma)
TM_caesar = create_transition_matrix(caesar)

We compute the log-likelihood of the book ‘austen-sense.txt’ under the Jane Austen model and under the Shakespeare model:

log_likelihood_jane = 0
log_likelihood_shakespeare = 0

for i in range(len(sense) - 1):
    current_char = sense[i].lower()
    next_char = sense[i + 1].lower()
    if current_char in state and next_char in state:
        current_char_id = char2id_dict[current_char]
        next_char_id = char2id_dict[next_char]
        # Skip transitions unseen in either model to avoid log(0) = -inf
        if TM_emma[current_char_id][next_char_id] != 0 and TM_caesar[current_char_id][next_char_id] != 0:
            log_likelihood_jane += np.log(TM_emma[current_char_id][next_char_id])
            log_likelihood_shakespeare += np.log(TM_caesar[current_char_id][next_char_id])
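The final decision step simply picks the model with the higher (less negative) log-likelihood. The helper below and its placeholder scores are illustrative, not the actual values computed above:

```python
def attribute(scores):
    """Return the candidate with the highest log-likelihood (illustrative helper)."""
    return max(scores, key=scores.get)

# Placeholder log-likelihoods for illustration; real values come from the loop above
print(attribute({"Jane Austen": -1.95e6, "Shakespeare": -2.10e6}))  # Jane Austen
```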

Result

The given text has a higher log-likelihood under the Jane Austen model than under the Shakespeare model, so we attribute its authorship to Jane Austen. To keep this demonstration straightforward, we use only one unknown book and two candidate authors; the training and testing sets can be expanded to be much larger.

The complete first-order Markov chain code is available on GitHub: https://github.com/yuanxy33/Authorship-Attribution

Teng
Data Scientist | MSc Data Science @ University of Leeds