Using Zipf’s Law To Improve Neural Language Models

Mar 10, 2019

Introduction

In this article, I will explain what Zipf’s Law is in the context of Natural Language Processing (NLP) and how knowledge of this distribution has been used to build better neural language models. I assume the reader is familiar with the concept of neural language models.

Code

The code to reproduce the numbers and figures presented in this article can be downloaded from this repository.

Dataset

The analysis discussed in this article is based on the raw wikitext-103 corpus. This corpus consists of articles drawn from Wikipedia’s Good and Featured articles and contains over 100 million tokens.

Since I lack access to powerful computing resources, I performed this analysis only on the test set. Here’s a sample of what the test set looks like:

The test set has the following properties (see the counting sketch after this list):

  • Vocabulary size: 17,733
  • Number of tokens: 196,095
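
To make these two numbers concrete, here is a minimal sketch of how they can be computed. The file path and the whitespace tokenization are assumptions on my part; the article’s repository may tokenize differently, so the exact counts could vary slightly.

```python
from collections import Counter

# Assumed local path to the raw wikitext-103 test file; adjust as needed.
TEST_PATH = "wikitext-103-raw/wiki.test.raw"

with open(TEST_PATH, encoding="utf-8") as f:
    # Whitespace tokenization is an assumption; the repository may differ.
    tokens = f.read().split()

counts = Counter(tokens)
print("Vocabulary size:", len(counts))   # number of distinct tokens
print("Number of tokens:", len(tokens))  # total token count
```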

Here’s a list of the 10 most common words and their counts (number of times they appear in the corpus):

and here’s a list of the 10 least common words and their counts:
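For completeness, here is a small sketch of how such frequency lists can be produced, continuing from the code above (so `counts` is the Counter already built). The actual words and counts shown in the article come from the repository.

```python
# Ten most frequent tokens, most frequent first.
ten_most_common = counts.most_common(10)

# Ten least frequent tokens, rarest first (slice the full ranking from the end).
ten_least_common = counts.most_common()[:-11:-1]

print("10 most common words:")
for word, count in ten_most_common:
    print(f"{word}\t{count}")

print("\n10 least common words:")
for word, count in ten_least_common:
    print(f"{word}\t{count}")
```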

Zipf’s Law
