Using Common Similarity Measures for Text Analysis and Ranking

Ang Li
Web Mining [IS688, Spring 2021]
7 min read · Apr 6, 2021

Introduction

Reading can improve us, and books play a central role in that: they allow readers to learn, imagine, and feel all kinds of emotions without leaving home. How did the first books affect humanity? Can we classify early books based on similarity, and would early readers have advanced faster if they had today’s recommendation technology?

In text analysis, the similarity of two texts can be assessed in its most basic form by representing each text as a series of word counts and calculating distance using those word counts as features. This article will focus on measuring distance among texts by describing the advantages and disadvantages of three of the most common distance measures: Manhattan distance, Euclidean distance, and cosine distance.
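Before working with the real corpus, here is a minimal sketch of the three measures on a pair of made-up word-count vectors (the counts are invented for illustration):

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, cosine

# Two toy "texts" represented as counts of the same three words
a = np.array([2, 0, 1])
b = np.array([1, 3, 1])

# Manhattan (cityblock): sum of absolute differences
print(cityblock(a, b))  # |2-1| + |0-3| + |1-1| = 4

# Euclidean: straight-line distance, sqrt(1 + 9 + 0)
print(euclidean(a, b))  # ≈ 3.162

# Cosine distance: 1 minus the cosine of the angle between the vectors
print(cosine(a, b))     # ≈ 0.595
```

Each measure answers a slightly different question: Manhattan and Euclidean respond to raw differences in counts, while cosine compares only the proportions of the features.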

Source of Data/Tools

Documents are corrected and annotated versions of archives from the Early English Books Online–Text Creation Partnership, which includes a document for almost every book printed in England between 1473 and 1700. This sample dataset includes all the texts published in 1666 — the ones that are currently publicly available. This includes texts from a variety of different genres on all sorts of topics: religious texts, political treatises, and literary works, to name a few. One thing a reader might want to know right away with a text corpus as thematically diverse as this one is: Is there a computational way to determine the kinds of similarity that one cares about? When you calculate the distances among such a wide variety of texts, will the results “make sense” to an expert? We’ll try to answer these questions in the exercise that follows.

Code for this article is written in Python 3.6 and uses the Pandas (v0.25.3) and SciPy (v1.3.3) libraries to calculate distances, though it’s possible to calculate these same distances using other libraries and other programming languages.

Counting Words

To begin, we’ll need to import the libraries (Pandas, SciPy, and scikit-learn) that we installed, as well as a built-in library called glob.

import glob
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform

From this point, scikit-learn’s CountVectorizer class will handle a lot of the work, including opening and reading the text files and counting all the words in each text. At first I didn’t know how to use this class, but then I figured out the pattern: create an instance of CountVectorizer with the parameters you choose, and then run that model on your texts.

  1. Set input to "filename" to tell CountVectorizer to accept a list of filenames to open and read.
  2. Set max_features to 1000 to capture only the 1000 most frequent words. Otherwise, you’ll wind up with hundreds of thousands of features that will make your calculations slower without adding very much additional accuracy.
  3. Set max_df to 0.7. DF stands for document frequency. This parameter tells CountVectorizer that you’d like to eliminate words that appear in more than 70% of the documents in the corpus.

We use the glob library we imported earlier to create the list of filenames that CountVectorizer needs.

# Use the glob library to create a list of file names
filenames = glob.glob("1666_texts/*.txt")
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a CountVectorizer instance with the parameters you need
vectorizer = CountVectorizer(input="filename", max_features=1000, max_df=0.7)
# Run the vectorizer on your list of filenames to create your wordcounts
# Use the toarray() function so that SciPy will accept the results
wordcounts = vectorizer.fit_transform(filenames).toarray()

We’ve now counted every word in all 142 texts in the test corpus. To interpret the results, you’ll also need to open the metadata file as a Pandas DataFrame.

metadata = pd.read_csv("1666_metadata.csv", index_col="TCP ID")

Adding the index_col="TCP ID" setting will ensure that the index labels for your metadata table are the same as the file keys saved above.
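A tiny sketch of that alignment, using invented metadata rows (the authors, titles, and keywords below are placeholders for illustration, not the real catalogue entries):

```python
import io
import pandas as pd

# Hypothetical miniature version of 1666_metadata.csv
csv_text = """TCP ID,Author,Title,Keywords
A28989,"Boyle, Robert",Placeholder title one,Science
A62436,"Thomson, George",Placeholder title two,Medicine
A43020,"Harvey, Gideon",Placeholder title three,Medicine
"""
metadata = pd.read_csv(io.StringIO(csv_text), index_col="TCP ID")

# The index labels now match the file keys parsed from the filenames,
# so rows can be looked up directly with .loc
print(metadata.loc["A28989", "Author"])  # Boyle, Robert
```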

Calculating Distance

Calculating distance in SciPy takes two steps: first compute the pairwise distances with the pdist function, then expand the results with the squareform function into a matrix that’s easier to read and process. It’s called squareform because the rows and columns are the same, so the matrix is symmetrical, or square. Euclidean distance is pdist’s default metric, so we’ll use that one first. To calculate distances, call pdist(wordcounts) on your word counts; to get the squareform results, wrap that entire call in the squareform function. To make the output more readable, I put it all into a Pandas DataFrame.

euclidean_distances = pd.DataFrame(squareform(pdist(wordcounts)), index=filekeys, columns=filekeys)
print(euclidean_distances)

The script will print a matrix of the Euclidean distances between every text in the dataset! In this “matrix,” which is really just a table of numbers, the rows and columns are the same. Each row represents a single document, and the columns represent the exact same documents. The value in every cell is the distance between the text from that row and the text from that column. This configuration creates a diagonal line of zeroes through the center of your matrix: where every text is compared to itself, the distance value is zero.
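The square, symmetric, zero-diagonal shape described above is easy to verify on a toy matrix (three invented count rows standing in for three documents):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three tiny word-count rows standing in for three documents
wordcounts = np.array([[2, 0, 1],
                       [1, 3, 1],
                       [0, 1, 4]])
dist = squareform(pdist(wordcounts))

print(dist.shape)                     # (3, 3): one row and column per document
print(np.allclose(dist, dist.T))      # True: the matrix is symmetric
print(np.allclose(np.diag(dist), 0))  # True: every text is distance 0 from itself
```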

As an example, let’s take a look at the five texts that are most similar to Robert Boyle’s Hydrostatical paradoxes made out by new experiments, which appears in this dataset under the ID number A28989. The book is a scientific treatise and one of two works Boyle published in 1666. By comparing distances, you could potentially find books that are either thematically or structurally similar to Boyle’s: either scientific texts or texts with similar prose sections.

Let’s see what texts are similar to Boyle’s book according to their Euclidean distance. We do this using Pandas’s nsmallest function.

top5_euclidean = euclidean_distances.nsmallest(6, 'A28989')['A28989'][1:]
print(top5_euclidean)

Why six instead of five? Since any text’s distance to itself is zero, the text we’re querying is guaranteed to appear at the top of its own results. We need five matches in addition to that self-match, so six total.

A62436     988.557029
A43020     988.622274
A29017    1000.024000
A56390    1005.630151
A44061    1012.873141
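The self-match behavior that motivates asking for six can be sketched on a toy DataFrame (keys and distances invented for illustration):

```python
import pandas as pd

# Toy distance column: each key's distance to document "A"
distances = pd.DataFrame({"A": [0.0, 0.3, 0.7, 0.2]},
                         index=["A", "B", "C", "D"])

# Ask for n+1 rows, then drop the first one (the zero self-distance)
top2 = distances.nsmallest(3, "A")["A"][1:]
print(top2.index.tolist())  # ['D', 'B']
```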

Next, we tell Pandas to limit the rows to the file keys from the Euclidean distance results and limit the columns to author, title, and subject keywords:

print(metadata.loc[top5_euclidean.index, ['Author','Title','Keywords']])

We calculate cosine distance in exactly the same way we calculated Euclidean distance, but with a metric parameter that specifies the type of distance we want to use:

cosine_distances = pd.DataFrame(squareform(pdist(wordcounts, metric='cosine')), index=filekeys, columns=filekeys)

top5_cosine = cosine_distances.nsmallest(6, 'A28989')['A28989'][1:]
print(top5_cosine)

The results for cosine distance should look like the following:

A29017    0.432181
A43020    0.616269
A62436    0.629395
A57484    0.633845
A60482    0.663113

Right away we notice a big difference. Because word counts are never negative, cosine distances here are scaled from 0 to 1, so we can tell not only which samples are closest but how close they are.
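Cosine’s indifference to magnitude is easy to demonstrate on invented counts: scaling a vector changes its Euclidean distance but leaves its cosine distance at zero.

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

a = np.array([2, 0, 1])
b = 10 * a  # same word proportions, ten times the counts

# Euclidean distance grows with magnitude...
print(euclidean(a, b))           # ≈ 20.12
# ...but cosine distance ignores it: the vectors point the same way
print(abs(cosine(a, b)) < 1e-9)  # True
```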

print(metadata.loc[top5_cosine.index, ['Author','Title','Keywords']])
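For completeness: the Manhattan distance named in the introduction comes from the same pdist call, via metric='cityblock' (not used elsewhere in this exercise; toy counts below are invented):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Two toy word-count rows
wordcounts = np.array([[2, 0, 1],
                       [1, 3, 1]])

# metric='cityblock' gives Manhattan distance: the sum of absolute differences
manhattan = squareform(pdist(wordcounts, metric='cityblock'))
print(manhattan[0, 1])  # |2-1| + |0-3| + |1-1| = 4.0
```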

Conclusion and Limitation

The first list suggests that our features are successfully finding texts that a human would recognize as similar, at least by Euclidean distance. The first two texts, George Thomson’s work on plague and Gideon Harvey’s on tuberculosis, are both recognizably scientific and clearly related to Boyle’s. But the next one is the other text written by Boyle, which we might expect to come up before the other two.

In the second list, only one of the closest five texts has a cosine distance less than 0.5, which means most of them aren’t that close to Boyle’s text. This observation is helpful to know and puts some of the previous results into context. We’re dealing with an artificially limited corpus of texts published in just a single year; if we had a larger set, we’d likely find texts more similar to Boyle’s. The first three texts in the list are the same as before, but their order is reversed. Boyle’s other text, as we might expect, is now at the top of the rankings, and as we saw in the numerical results, its cosine distance suggests it’s more similar than the next text down, Harvey’s. The order in this example suggests that Euclidean distance may have been picking up on a similarity between Thomson and Boyle that had more to do with magnitude than with their contents. The final two texts in this list, though it is hard to tell from their titles, are also fairly relevant to Boyle’s: they deal with natural history and aging, respectively, both topics of early modern scientific thought. As you might expect, because cosine distance focuses on comparing the proportions of features within individual samples, its results were slightly better for this corpus.

We could use a distance matrix like this one as the input for an unsupervised clustering of the texts into groups, or employ the same measures to drive a machine learning model. If you wanted to understand these results better, you could create a heatmap of the table itself, either in Python or by exporting it as a CSV and visualizing it elsewhere.
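A hedged sketch of that heatmap idea, using toy data and an assumed output filename (distance_heatmap.png) in place of the real cosine distance table:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for the cosine distance DataFrame built earlier
rng = np.random.default_rng(42)
wordcounts = rng.integers(1, 10, size=(5, 20))
keys = ["doc%d" % i for i in range(5)]
dists = pd.DataFrame(squareform(pdist(wordcounts, metric="cosine")),
                     index=keys, columns=keys)

# Plot the matrix as a heatmap with document keys on both axes
fig, ax = plt.subplots()
im = ax.imshow(dists.values, cmap="viridis")
ax.set_xticks(range(len(keys)))
ax.set_xticklabels(keys)
ax.set_yticks(range(len(keys)))
ax.set_yticklabels(keys)
fig.colorbar(im, ax=ax)
fig.savefig("distance_heatmap.png")
```

With the real data, you would pass the cosine distance DataFrame directly instead of building toy counts.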

