Analysing Perplexity and Burstiness in AI vs. Human Text

Siddharth
4 min read · Sep 27, 2024


In this blog post, we will explore various metrics to understand the differences between AI-generated content and human-written content. We will be analyzing an open dataset containing both AI and human text, focusing on perplexity, burstiness, Fano factor, and cross-entropy.

Recent Challenges in Academia

Due to the increased usage of large language models by students, it has become increasingly important to distinguish content produced by an LLM from content written by a human. Some major challenges in academia include:

  • Plagiarism and AI-assisted cheating
  • Lack of critical thinking and reduced creativity/learning
  • Factually incorrect, misleading, or biased information

Tools such as the Desklib AI Detector are used by students and universities worldwide to flag AI-generated content. Studies suggest that while such detectors are not completely reliable, they can serve as useful indicators (Elkhatat et al., 2023).

Leveraging Text Complexity Metrics

Perplexity: Measures how well a language model predicts the next word in a sequence. Lower perplexity indicates the model is less surprised by the text, suggesting it might be AI-generated. We are using GPT-2 as the reference model, but other open models can also be used (Perplexity of Fixed-Length Models, n.d.).
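
The snippets below assume that a reference model, tokenizer, and device are already in scope, along with NumPy and PyTorch. A minimal setup sketch, assuming GPT-2 loaded through the standard Hugging Face transformers classes (the original post does not show this step, so treat the exact checkpoint and variable names as assumptions):

import numpy as np
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumed setup: GPT-2 as the reference model (any causal LM could be swapped in)
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.eval()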

def calculate_perplexity_using_stride(text, stride=512):
    # Tokenize the text (truncated to the model's 1024-token context window)
    encodings = tokenizer(text, return_tensors='pt', truncation=True, max_length=1024)
    encodings = encodings.to(device)
    max_length = model.config.max_position_embeddings
    seq_len = encodings.input_ids.size(1)

    nlls = []
    prev_end_loc = 0
    total_tokens = 0

    # Slide a window over the sequence so longer texts are scored in overlapping chunks
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # number of tokens newly scored in this window
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # mask context tokens that were already scored

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            # outputs.loss is the mean NLL over the target tokens; multiply to approximate the sum
            neg_log_likelihood = outputs.loss * trg_len

        nlls.append(neg_log_likelihood)
        total_tokens += trg_len
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

    # Perplexity = exp(average negative log-likelihood per token)
    avg_nll = torch.stack(nlls).sum() / total_tokens
    ppl = torch.exp(avg_nll)
    return ppl.item()
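
For example, applying it to a short hypothetical sample string:

sample_text = "The quick brown fox jumps over the lazy dog. It was a calm, quiet afternoon."
print(calculate_perplexity_using_stride(sample_text))  # GPT-2 perplexity of the sample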

Burstiness: Captures the variation in sentence lengths, measured here as the standard deviation of sentence lengths in tokens. More uniform sentence lengths can reveal a lack of natural flow in AI-generated text.

def calculate_burstiness(text):
    # Naive sentence split on '. '; lengths are measured in tokens
    sentences = text.split('. ')
    sentence_lengths = [len(tokenizer.encode(sentence)) for sentence in sentences if sentence]
    # Burstiness here is the standard deviation of sentence lengths
    burstiness = np.std(sentence_lengths)
    return burstiness

Fano Factor: The ratio of the variance to the mean of sentence lengths, giving another view of how much sentence structure varies across a text.

def calculate_fano_factor(text):
    # Naive sentence split on '. '; lengths are measured in tokens
    sentences = text.split('. ')
    sentence_lengths = [len(tokenizer.encode(sentence)) for sentence in sentences if sentence]
    mean_length = np.mean(sentence_lengths)
    variance = np.var(sentence_lengths)
    # Fano factor = variance / mean (0 for empty input)
    fano_factor = variance / mean_length if mean_length > 0 else 0
    return fano_factor

Cross-Entropy: Measures how far the model's predicted next-token distribution is from the tokens that actually appear in the text. Like perplexity (which is the exponential of the average cross-entropy), lower values indicate text the model finds more predictable.

def calculate_cross_entropy(text):
    # Tokenize the text (truncated to the model's 1024-token context window)
    encodings = tokenizer(text, return_tensors='pt', truncation=True, max_length=1024)
    encodings = encodings.to(device)

    input_ids = encodings.input_ids
    with torch.no_grad():
        outputs = model(input_ids)
        logits = outputs.logits
    # Shift so that each position's logits are compared against the next token
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = input_ids[..., 1:].contiguous()

    # Mean cross-entropy (in nats) per token
    loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1), reduction='mean')
    return loss.item()

Analyzing the AI vs. Human Dataset

We are using a user-uploaded Hugging Face dataset, arincon/llm-detect, which contains a "text" column and a "label" column (0 = human, 1 = AI). More details are available on the dataset's Hugging Face page.

For a quick analysis, we randomly selected 5,000 human and 5,000 AI-generated texts and calculated the average value of each metric over these samples. You can find the complete code in this Google Colab Notebook.
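
A rough sketch of this sampling-and-averaging step is below; the split name, column handling, and helper function are assumptions, and the Colab notebook remains the authoritative version.

from datasets import load_dataset
import numpy as np

# Assumed loading step: a single "train" split with "text" and "label" columns
dataset = load_dataset("arincon/llm-detect", split="train").shuffle(seed=42)

human_texts = [row["text"] for row in dataset if row["label"] == 0][:5000]
ai_texts = [row["text"] for row in dataset if row["label"] == 1][:5000]

def average_metrics(texts):
    # Average each metric over a list of texts
    return {
        "perplexity": np.mean([calculate_perplexity_using_stride(t) for t in texts]),
        "burstiness": np.mean([calculate_burstiness(t) for t in texts]),
        "fano_factor": np.mean([calculate_fano_factor(t) for t in texts]),
        "cross_entropy": np.mean([calculate_cross_entropy(t) for t in texts]),
    }

print("Human:", average_metrics(human_texts))
print("AI:", average_metrics(ai_texts))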

Results

Our analysis shows distinct average values for each metric when comparing AI and human-written text.

[Table: Average values of each metric for AI and human text]

The data suggests:

  • Lower perplexity for AI text vs. human text, indicating AI content is typically more predictable.
  • Lower burstiness for AI text, indicating less variation in sentence length.
  • Lower Fano factor for AI text, again pointing to less sentence-length variation relative to the mean.
  • Lower cross-entropy for AI text, suggesting human text is harder for the model to predict.

Key Takeaway

The above results show that, on average, there are measurable differences in these metrics between AI and human content. However, while these metrics provide useful signals, none of them is, on its own, a definitive basis for classifying a text as AI-generated or human-written.

References

Elkhatat, A. M., Elsaid, K., & Almeer, S. (2023). Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity, 19, 17. https://doi.org/10.1007/s40979-023-00140-5

Perplexity of Fixed-Length Models. (n.d.). Hugging Face. Retrieved September 27, 2024, from https://huggingface.co/docs/transformers/en/perplexity
