NLP of Bible Chapters and Books — Similarity and Clustering with Python

Lucas Tavares
Published in Analytics Vidhya · 6 min read · Dec 30, 2019

The Bible is one of my favorite books. Religion aside, the intent here is to better understand the context in which the Bible was written and to check whether its traditional divisions (chapters, Old/New Testament, Gospels) make sense in NLP terms.

Web Scraping and Preprocessing — NET Bible

Data was scraped from the NET Bible portal. The NET Bible is a free modern translation produced by a team of scholars and accompanied by extensive translators' notes. I really enjoy reading this version, so I chose it for the analysis.

Regarding frameworks, Selenium was used with Python. The scraping code and the main issues encountered in that work will be published in a future story. Here, we only load the data.
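To keep each snippet below runnable as shown, here are all the imports used throughout the post, gathered in one place:

import re

import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans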

df = pd.read_csv('data\\Original\\netBible.csv', index_col=0)

The data was organized by chapter. For the analysis of books, the chapters were concatenated.

# Scraped names look like e.g. 'netText_Genesis_1' or 'netText_1_Samuel_3'
df.book = [r.replace('netText_', '') for r in df.book]
# The chapter is always the last underscore-separated token...
df['chapter'] = [r.split('_')[-1] for r in df.book]
# ...and the book name is everything before it (handles '1_Samuel' etc.)
df['book'] = ['_'.join(r.split('_')[:-1]) for r in df.book]

We also created a feature to identify the Testament as Old or New. As the rows follow the order of the Bible, we just located the first chapter of Matthew (the first book of the New Testament) and labeled it and every row below as New Testament.

df['testament'] = 0  # 0 = Old Testament
firstMat = df.loc[df['book'] == 'Matthew', :].index[0]
df.loc[firstMat:, 'testament'] = 1  # 1 = New Testament, from Matthew onward
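A quick sanity check on the labeling (not in the original post): a 66-book Protestant canon has 1,189 chapters in total, 929 in the Old Testament and 260 in the New.

# Expect roughly 929 Old Testament (0) and 260 New Testament (1) rows
print(df.testament.value_counts())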

Finally, we replaced the newline character ('\n'), stripped digits (verse numbers), and removed the English stopwords.

# Drop newlines and digits, then blank out space-delimited stopwords
df.text = [re.sub(r'\d+', '', r.replace('\n', ' ')) for r in df.text]
for t in stopwords.words('english'):
    df.text = [s.replace(' ' + t + ' ', ' ') for s in df.text]
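Note that this simple replace only catches lowercase stopwords surrounded by spaces; words at the start or end of a chapter, or capitalized ones, slip through. A set-based token filter is a common, more thorough alternative (an aside, not the approach used for the results below):

# Alternative: split on whitespace and drop any stopword, regardless of
# position or capitalization
sw = set(stopwords.words('english'))
df.text = [' '.join(w for w in s.split() if w.lower() not in sw) for s in df.text]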

Vectorization — Transforming Text into Numbers

The vectorization method used here was Term Frequency — Inverse Document Frequency (TF-IDF). It scores each term by its frequency within a document, weighted by the inverse of how many documents the term appears in, so words common to every chapter contribute little. Specifically, we used the sklearn package. The code used to create the vectorizer is below.

vec = TfidfVectorizer()
vecRes = vec.fit_transform(df.text).toarray()
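To make the weighting concrete, here is a toy illustration with three made-up sentences (idf_ is a real sklearn attribute; get_feature_names_out requires sklearn >= 1.0). The word shared by all three documents gets the lowest IDF weight:

toy = TfidfVectorizer()
toy.fit(['the lion roars', 'the lamb sleeps', 'the lion sleeps'])
# 'the' appears everywhere -> lowest weight; 'roars'/'lamb' -> highest
print(dict(zip(toy.get_feature_names_out(), toy.idf_.round(2))))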

Similarity

After transforming the texts into numbers (vectors), we wanted to calculate how similar all the books are to each other. For that, we used cosine similarity on the TF-IDF vectors.

# Similarity of Chapters
vec = TfidfVectorizer()
vecRes = vec.fit_transform(df.text).toarray()
simRes = cosine_similarity(vecRes)
## Testaments
vecResOT = vec.fit_transform(df[df.testament==0].text).toarray()
vecResNT = vec.fit_transform(df[df.testament==1].text).toarray()
simResOT = cosine_similarity(vecResOT)
simResNT = cosine_similarity(vecResNT)


# Similarity of Books: join each book's chapters with a space separator
# so words at chapter boundaries don't merge, then vectorize the books
vec = TfidfVectorizer()
vecRes_books = vec.fit_transform(df.groupby('book')['text'].agg(' '.join)).toarray()
simRes_books = cosine_similarity(vecRes_books)
## Testaments
vecRes_booksOT = vec.fit_transform(df[df.testament == 0].groupby('book')['text'].agg(' '.join)).toarray()
vecRes_booksNT = vec.fit_transform(df[df.testament == 1].groupby('book')['text'].agg(' '.join)).toarray()
simRes_booksOT = cosine_similarity(vecRes_booksOT)
simRes_booksNT = cosine_similarity(vecRes_booksNT)
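As a quick sanity check (an aside, not part of the original pipeline): cosine similarity is just the dot product of two vectors divided by the product of their norms, so computing it by hand for the first two chapters should reproduce simRes[0, 1].

import numpy as np

a, b = vecRes[0], vecRes[1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(manual, simRes[0, 1]))  # expected: True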

The results showed that, for chapters, the average similarity is around 0.68% and the maximum value found was 97.24%. For books, 0.27% was the average and 91.31% the maximum.

chaptersSim = pd.melt(pd.DataFrame(simRes)).value.drop_duplicates()
booksSim = pd.melt(pd.DataFrame(simRes_books)).value.drop_duplicates()
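One way to reproduce the figures quoted above (a sketch; it assumes the self-similarity values of exactly 1 are excluded):

for name, s in [('Chapters', chaptersSim), ('Books', booksSim)]:
    s = s[s < 1]  # drop the self-similarity entries
    print(name, 'mean:', s.mean(), 'max:', s.max())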

fig, ax = plt.subplots(nrows=1, ncols=2)

chaptersSim.hist(ax=ax[0])
ax[0].set_title('Chapters')
booksSim.hist(ax=ax[1])
ax[1].set_title('Books')
plt.show()

The analysis by testament indicated that these groupings may not influence the similarity results; however, more tests on this topic will be done and published in a future story.

chaptersSimOT = pd.melt(pd.DataFrame(simResOT)).value.drop_duplicates()
booksSimOT = pd.melt(pd.DataFrame(simRes_booksOT)).value.drop_duplicates()
chaptersSimNT = pd.melt(pd.DataFrame(simResNT)).value.drop_duplicates()
booksSimNT = pd.melt(pd.DataFrame(simRes_booksNT)).value.drop_duplicates()

fig, ax = plt.subplots(nrows=2, ncols=2)

chaptersSimOT.hist(ax=ax[0, 0])
ax[0,0].set_title('Chapters - Old Testament')
chaptersSimNT.hist(ax=ax[1, 0])
ax[1,0].set_title('Chapters - New Testament')
booksSimOT.hist(ax=ax[0,1])
ax[0,1].set_title('Books - Old Testament')
booksSimNT.hist(ax=ax[1,1])
ax[1,1].set_title('Books - New Testament')
plt.show()

These results indicate that, although all of these books belong to the same collection, they may not be so similar to one another.

Clustering

For the clustering task, the dimensionality was reduced with Principal Component Analysis (PCA). This technique makes it possible to reduce the total number of features to any desired count. In this case, the dataset was reduced to 2 components.

dr = PCA(n_components=2)
pcaDF = pd.DataFrame(dr.fit_transform(vecRes))              # chapters
pcaDF_books = pd.DataFrame(dr.fit_transform(vecRes_books))  # books
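It is worth checking how much of the variance the two components actually retain (a quick check, not in the original post; after the calls above, dr was last fitted on the book vectors):

# Fraction of total variance captured by the 2 components (books)
print(dr.explained_variance_ratio_.sum())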

After that, every chapter/book had only two numerical features, making it possible to plot the results in a scatter plot.

# Book-level testament labels (needed for the books plot; groupby sorts
# books alphabetically, matching the row order of pcaDF_books)
testament = df.groupby('book')[['testament']].agg('mean')
testament.index = range(66)

fig, ax = plt.subplots(nrows=2, ncols=1)
fig.set_size_inches(14, 12)

# Chapters
ax[0].scatter(pcaDF.loc[df.testament == 0, :].iloc[:, 0], pcaDF.loc[df.testament == 0, :].iloc[:, 1], label='Old')
ax[0].scatter(pcaDF.loc[df.testament == 1, :].iloc[:, 0], pcaDF.loc[df.testament == 1, :].iloc[:, 1], label='New')
ax[0].set_title('Chapters PCA')
ax[0].legend()

# Books
ax[1].scatter(pcaDF_books.loc[testament.testament == 0, 0], pcaDF_books.loc[testament.testament == 0, 1], label='Old')
ax[1].scatter(pcaDF_books.loc[testament.testament == 1, 0], pcaDF_books.loc[testament.testament == 1, 1], label='New')
ax[1].set_title('Books PCA')
ax[1].legend()
plt.show()

By analyzing the graph, it is possible to notice that three major groups were formed. We will need to plot the books graph alone, with names, to get a better understanding of the results.

# Create object with book names (same alphabetical order as pcaDF_books;
# the book-level testament labels were already created above)
bookNames = df.groupby('book').agg('sum').index
# Plot
fig = plt.figure()
fig.set_size_inches(14, 8)

plt.scatter(pcaDF_books.loc[testament.testament == 0, 0], pcaDF_books.loc[testament.testament == 0, 1], label='Old')
plt.scatter(pcaDF_books.loc[testament.testament == 1, 0], pcaDF_books.loc[testament.testament == 1, 1], label='New')
# Labels moved down
lbBelow = ['Luke', 'Judges', 'Daniel', '2_Kings', 'Haggai', 'Amos', '1_Thessalonians', 'Colossians', '2_Peter', 'Exodus', 'Joel', 'Zephaniah', 'Habakkuk']
# Labels moved down and left
lbBelowL = ['Lamentations', '1_Peter', '2_Timothy']
# Labels moved left
lbLeft = ['1_Chronicles', '1_Corinthians', '2_Corinthians']
for p in pcaDF_books.index:
    if bookNames[p] in lbBelow:
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(0, -18),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    elif bookNames[p] in lbBelowL:
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(-30, -18),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    elif bookNames[p] in lbLeft:
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(-15, 10),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    elif bookNames[p] == '2_Thessalonians':
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(-30, -36),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    elif bookNames[p] == 'Deuteronomy':
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(50, 10),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    else:
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(0, 10), textcoords='offset points')
plt.title('Books PCA')
plt.legend()
plt.show()

It is possible to see that the Gospels are very close to each other, indicating that these books may be distinct from all the other Bible books. Interestingly, the book of Acts is the point closest to the Gospels.

Finally, we used the k-means algorithm to create groups from these data points. As we can clearly identify 3 groups, we set n_clusters to 3.

# Train model on the book-level TF-IDF vectors (its labels are used below)
kmeans_books = KMeans(n_clusters=3).fit(vecRes_books)
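The choice of k = 3 came from eyeballing the scatter plot; a quick inertia ("elbow") check is a common way to back it up (an aside, not in the original post):

# Inertia drops as k grows; look for the bend ('elbow') near k = 3
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vecRes_books)
    print(k, round(km.inertia_, 2))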
# Plot results
fig = plt.figure()
fig.set_size_inches(14, 8)

plt.scatter(pcaDF_books.loc[kmeans_books.labels_ == 0, 0], pcaDF_books.loc[kmeans_books.labels_ == 0, 1], label='Group 1')
plt.scatter(pcaDF_books.loc[kmeans_books.labels_ == 1, 0], pcaDF_books.loc[kmeans_books.labels_ == 1, 1], label='Group 2')
plt.scatter(pcaDF_books.loc[kmeans_books.labels_ == 2, 0], pcaDF_books.loc[kmeans_books.labels_ == 2, 1], label='Group 3')
# Labels moved down
lbBelow = ['Luke', 'Judges', 'Daniel', '2_Kings', 'Haggai', 'Amos', '1_Thessalonians', 'Colossians', '2_Peter', 'Exodus', 'Joel', 'Zephaniah', 'Habakkuk']
# Labels moved down and left
lbBelowL = ['Lamentations', '1_Peter', '2_Timothy']
# Labels moved left
lbLeft = ['1_Chronicles', '1_Corinthians', '2_Corinthians']
for p in pcaDF_books.index:
    if bookNames[p] in lbBelow:
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(0, -18),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    elif bookNames[p] in lbBelowL:
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(-30, -18),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    elif bookNames[p] in lbLeft:
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(-15, 10),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    elif bookNames[p] == '2_Thessalonians':
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(-30, -36),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    elif bookNames[p] == 'Deuteronomy':
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(50, 10),
                     textcoords='offset points', arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
    else:
        plt.annotate(bookNames[p], (pcaDF_books.loc[p, 0], pcaDF_books.loc[p, 1]), xytext=(0, 10), textcoords='offset points')
plt.title('Books K-Means')
plt.legend()
plt.show()

The groups created by the k-means model were very similar to the Old/New/Gospels division, with Acts and Esther joining the Gospels group. Ecclesiastes, originally an Old Testament book, ended up among the "New Testament" books, and Revelation went the other way around.

Conclusion

If you want to start learning Python for NLP, the code in this post may help you. However, if you want to start reading the Bible, it may be interesting to begin with one big group (Old Testament, New Testament, or the Gospels) and then move to other points close to where you started. I would recommend reading the Gospels first, but it is up to you!

Happy reading and coding!
