The Gutenberg Project: Natural Language Processing

Shelvi Garg
Published in Nerd For Tech · May 3, 2021

A Complete Beginner's Guide to Natural Language Processing with the Gutenberg Project

Image Ref: Unsplash

ABOUT PROJECT GUTENBERG:

Project Gutenberg is a volunteer effort to digitize and archive cultural works, to “encourage the creation and distribution of eBooks”. It was founded in 1971 by American writer Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books.

It’s a repository of over 60,000 books.

Link to the project: https://www.gutenberg.org/

OUR TASK:

Patterns within written text are not the same across all authors or languages. This allows linguists to study the language of origin or the potential authorship of texts where these characteristics are not directly known, such as the Federalist Papers of the American Revolution era.

In this blog, we will examine the properties of individual books in a collection spanning various authors and languages. More specifically, we will look at book lengths, the number of unique words, and how these attributes cluster by language or authorship.

Let’s Start :)

Defining count_words: Counting words in a sentence

text = "Hi, welcome to the project Gutenberg."

Let’s define a function to count words in a sentence.

def count_words(text):
    """Count the number of times each word occurs in text.
    Return the counts as a dictionary."""
    word_counts = {}
    for word in text.split(" "):  # split the text on blanks and loop over the words
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts

print(count_words(text))
Output: {'Hi,': 1, 'welcome': 1, 'to': 1, 'the': 1, 'project': 1, 'Gutenberg.': 1}

Addressing the issues

Looking at the dictionary, two obvious shortcomings of our current routine stand out: it counts punctuation such as periods or commas as part of a word, and it treats capitalized and lowercase spellings as different words. Both inflate the word count.

To address these issues, we’re first going to turn the text into a lower case.

Addressing punctuation is a bit more complex. Our strategy is to first specify all the punctuation marks that we’d like to skip, and then loop over that container and replace every occurrence of a punctuation mark with an empty string.
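The replace loop below implements exactly that strategy. As a side note (not part of the original code), Python's `str.translate` together with `string.punctuation` is an idiomatic one-shot way to do the same stripping:

```python
import string

def strip_punctuation(text):
    # Map every ASCII punctuation mark to None (i.e. delete it).
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

print(strip_punctuation("Hi, welcome to the project Gutenberg."))
# Hi welcome to the project Gutenberg
```

Either approach works; the loop version is easier to read for beginners, while `translate` handles all punctuation marks in a single pass.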

text = "Hi, this is Project Gutenberg. Nice to meet you. This is a tutorial blog of Gutenberg. "

def count_words(text):
    """Count the number of times each word occurs in text.
    Skip punctuation and lowercase the text first.
    Return the counts as a dictionary."""
    text = text.lower()
    skips = [".", ",", ":", ";", "'", '"']  # punctuation marks to remove
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = {}
    for word in text.split(" "):  # split the text on blanks and loop over the words
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts

print(count_words(text))
Output: Image by Author

It’s useful to be able to write your own counting routine like we just did.

However, counting the frequency of objects is such a common operation that Python provides a Counter tool to support rapid tallies. We first need to import it from the collections module, which provides many additional high-performance data types.

Updated, faster code: using the Counter class

from collections import Counter

def count_words_fast(text):
    """Count the number of times each word occurs in text.
    Skip punctuation and lowercase the text first.
    Return the counts as a Counter (a dict subclass)."""
    text = text.lower()
    skips = [".", ",", ":", ";", "'", '"']
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

print(count_words_fast(text))
Output: Image by Author
count_words_fast(text) == count_words(text)   #True but now faster

Output: True

count_words_fast(text) returns the same counts as count_words(text), only much faster.
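Counter also offers conveniences a plain dictionary lacks, such as most_common. A small self-contained example (the sample sentence here is made up for illustration):

```python
from collections import Counter

word_counts = Counter("hi this is project gutenberg this is a blog".split())
# most_common(n) returns the n (word, count) pairs with the highest counts.
print(word_counts.most_common(2))
```

This is handy later for questions like "what are the ten most frequent words in a book?".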

READING A COMPLETE BOOK

Character encoding refers to the process of how the computer encodes certain characters. In this case, we’ll use what is called UTF-8 encoding, which is the dominant character encoding for the web.
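A quick illustration of why the encoding matters: non-ASCII characters occupy more than one byte in UTF-8, so opening a file with the wrong encoding can mangle them.

```python
# ASCII characters take one byte in UTF-8; "é" takes two.
word = "café"
encoded = word.encode("utf-8")
print(encoded)                   # b'caf\xc3\xa9'
print(len(word), len(encoded))   # 4 characters, 5 bytes
print(encoded.decode("utf-8"))   # café
```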

You can download any book from the Project Gutenberg website. In this blog, I work with books by Shakespeare.

You can also download the books I use from here: https://github.com/shelvi31/Project-Gutenberg-Language-Processing

def read_book(title_path):
    """Read a book and return it as a string."""
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
        text = text.replace("\n", "").replace("\r", "")
    return text

Input the location of the UTF-8 file downloaded from the Gutenberg site:

text1 = read_book(r"C:\Users\Shelvi Garg\Desktop\Code\data-science\BooksPython\English\shakespeare\Romeo and Juliet.txt") 
print(len(text1))

Output: 169275

Using the find method: searching for a few words I know occur in the book

ind = text1.find("in a name")
print(ind)
sample_text1 = text1[ind : ind + 1000]
sample_text1
Output: Image by Author

Computing Word Frequency Statistics

We would like to know how many unique words there are in a given book. We’d also like to return the frequencies of each word.

def word_stats(word_counts):
    """Return the number of unique words and the word frequencies."""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

text1 = read_book(r"C:\Users\Shelvi Garg\Desktop\Code\data-science\BooksPython\English\shakespeare\Romeo and Juliet.txt")
word_counts = count_words(text1)
(num_unique, counts) = word_stats(word_counts)
Image by Author
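With num_unique and counts in hand, the total word count of the book is just the sum of the frequencies. A self-contained sketch using a made-up counts dictionary in place of the real book:

```python
def word_stats(word_counts):
    """Return the number of unique words and the word frequencies."""
    return (len(word_counts), word_counts.values())

sample_counts = {"the": 3, "name": 2, "rose": 1}
(num_unique, counts) = word_stats(sample_counts)
print(num_unique)    # 3 unique words
print(sum(counts))   # 6 words in total
```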

Reading multiple book files simultaneously

import os
book_dir = (r"C:\Users\Shelvi Garg\Desktop\Code\data-science\BooksPython")
os.listdir(book_dir)

Output: [‘English’, ‘French’, ‘German’, ‘Portuguese’]

I have 4 folders, each containing books in one language: ‘English’, ‘French’, ‘German’, and ‘Portuguese’.

We first want to generate a list of the directories contained within our “BooksPython” directory. Since these directories correspond to different languages, I am going to call the loop variable language.

This reads all the books in the 4 language sub-folders under my BooksPython directory.
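The nested-directory traversal can be tried on a throwaway directory tree before pointing it at the real one. This sketch builds a tiny language/author/title layout in a temporary folder (the file name and contents here are made up):

```python
import os
import tempfile

# Mimic the layout <root>/<language>/<author>/<title>.txt
root = tempfile.mkdtemp()
path = os.path.join(root, "English", "shakespeare")
os.makedirs(path)
with open(os.path.join(path, "Romeo and Juliet.txt"), "w", encoding="utf8") as f:
    f.write("What's in a name?")

for language in os.listdir(root):
    for author in os.listdir(os.path.join(root, language)):
        for title in os.listdir(os.path.join(root, language, author)):
            print(language, author, title)
# English shakespeare Romeo and Juliet.txt
```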

Reading Multiple Files

Using pandas DataFrames for our code:

We will:

  • Learn how to navigate file directories and read in multiple files/books at once
  • Briefly use pandas, which provides additional data structures and data analysis functionality for Python
import pandas as pd

stats = pd.DataFrame(columns=("language", "author", "title", "length", "unique"))  # table with these 5 columns
title_num = 1
# The outermost loop runs over languages, the middle loop over authors,
# and the innermost loop over titles (individual books).
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + "/" + author):
            # Build the full path by concatenating the directory names.
            input_file = book_dir + "/" + language + "/" + author + "/" + title
            text = read_book(input_file)
            (num_unique, counts) = word_stats(count_words(text))
            stats.loc[title_num] = (language, author.capitalize(), title.replace(".txt", ""), sum(counts), num_unique)
            title_num += 1
Image by Author
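Once the stats table is built, pandas can summarize it per language in one line with groupby. A self-contained sketch with made-up numbers (the real values come from your books):

```python
import pandas as pd

stats = pd.DataFrame({
    "language": ["English", "English", "French"],
    "length": [25000, 32000, 41000],
    "unique": [4800, 5200, 7000],
})
# Average book length and vocabulary size per language.
print(stats.groupby("language")[["length", "unique"]].mean())
```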

Plotting our statistics

We can easily extract specific columns from our pandas table using the names that we’ve given to those columns.

import matplotlib.pyplot as plt

plt.plot(stats.length, stats.unique, "ro")
plt.savefig("gutenberg1.pdf")  # save before show(), which clears the figure
plt.show()
Output: Image by Author

Using matplotlib’s loglog (logarithmic scales on both axes):

plt.loglog(stats.length, stats.unique, "go")
plt.savefig("gutenberg2.pdf")  # save before show()
plt.show()

Output:

Image by Author
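An aside on reading the plot: a roughly straight line on log-log axes suggests a power law, unique ≈ C · length^beta (known as Heaps’ law). The exponent can be estimated with a linear fit in log space; the numbers below are illustrative, not taken from the actual books:

```python
import numpy as np

# Illustrative (length, unique-word) pairs lying exactly on a power law.
lengths = np.array([10_000, 40_000, 160_000])
uniques = np.array([2_000, 5_000, 12_500])
# Fit log(unique) = beta * log(length) + log(C).
beta, log_c = np.polyfit(np.log(lengths), np.log(uniques), 1)
print(round(beta, 2))  # ≈ 0.66
```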

Plot As Per Book Language

Let’s construct a plot using different colors for different languages.

plt.figure(figsize=(10, 10))
subset = stats[stats.language == "English"]
plt.loglog(subset.length, subset.unique, "o", label="English", color="crimson")
subset = stats[stats.language == "French"]
plt.loglog(subset.length, subset.unique, "o", label="French", color="orange")
subset = stats[stats.language == "German"]
plt.loglog(subset.length, subset.unique, "o", label="German", color="forestgreen")
subset = stats[stats.language == "Portuguese"]
plt.loglog(subset.length, subset.unique, "o", label="Portuguese", color="blueviolet")
plt.legend()
plt.xlabel("Book Length")
plt.ylabel("Number of unique words")
plt.savefig("language_plot.pdf")
plt.show()
Output: Image by Author

For the complete code, refer to my Jupyter notebook: https://github.com/shelvi31/Project-Gutenberg-Language-Processing

…. and if you like this blog, don’t forget to leave a few hearty claps :)

You can connect with me on LinkedIn
