Using Python to Calculate Similarity Distance Measurement for Text Analysis

Jeremy Langenderfer
Web Mining [IS688, Spring 2021]

I’m not sure it counts as a hobby, but one of my favorite things to do is read about world news and events, simply because I like to stay informed. This has been an interest of mine for only the past five to seven years, and one of the strangest things I began to notice was how similar news articles from different organizations could be. Of particular interest was the opposite case: two organizations reporting on the same story in ways so dissimilar that they seemed to describe completely different series of events. It really leads to the question, “who do we believe?” That is why I chose this topic: I found it interesting that, using Python, I was able to calculate the similarity and dissimilarity between text documents.

Before getting further into the article, it is important to discuss the three types of similarity distance measurements that I came across during my research: City Block (Manhattan) Distance, Euclidean Distance, and Cosine Similarity and Cosine Distance. First, I will provide a brief example of the City Block (Manhattan) Distance. For two points (x1, y1) and (x2, y2), the formula for calculating the City Block (Manhattan) Distance is |x2 - x1| + |y2 - y1|. Below is an image representing the City Block (Manhattan) Distance.

Figure 1 (Ladd, 2020)

Next is the Euclidean Distance. “In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefore occasionally being called the Pythagorean distance. These names come from the ancient Greek mathematicians Euclid and Pythagoras, although Euclid did not represent distances as numbers, and the connection from the Pythagorean theorem to distance calculation was not made until the 18th century” (Euclidean distance, 2021). The formula for calculating the Euclidean distance is shown below.

Figure 2 (Ladd, 2020)

Last, we have the Cosine Similarity and Cosine Distance measurement. “Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. The cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. The name derives from the term “direction cosine”: in this case, unit vectors are maximally “similar” if they’re parallel and maximally “dissimilar” if they’re orthogonal (perpendicular). This is analogous to the cosine, which is unity (maximum value) when the segments subtend a zero angle and zero (uncorrelated) when the segments are perpendicular” (Cosine similarity, 2021). The formula used to calculate Cosine Similarity and Cosine Distance measurement is shown below.

Figure 3 (Ladd, 2020)
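To make the three measurements described above concrete, here is a minimal sketch in plain Python. The function names and the two sample points are chosen only for illustration and do not come from the original project.

```python
import math

def manhattan_distance(a, b):
    """City Block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance, via the Pythagorean theorem."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance is defined here as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Two illustrative points (hypothetical values).
p = (1, 2)
q = (4, 6)

print(manhattan_distance(p, q))            # 7
print(euclidean_distance(p, q))            # 5.0
print(round(cosine_similarity(p, q), 4))   # 0.9923
```

For these two points the Manhattan distance is |4 - 1| + |6 - 2| = 7, while the Euclidean distance is the hypotenuse of the same 3-by-4 triangle, which is 5.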

Now that I’ve described a little about each measurement, it is time to discuss the data that I will be using for this project. I obtained the data from the website https://programminghistorian.org/en/lessons/common-similarity-measures#distance-and-similarity. Specifically, I downloaded a folder labeled “1666_texts/” that contains one hundred and forty-two text files. I also downloaded the metadata for those files, which is contained in a CSV file named “1666_metadata.csv”. This file was obtained from the same website.

Before getting started, I will be using some Python libraries that are necessary for calculating the above-mentioned measurements. The four libraries used in this project are pandas, SciPy, scikit-learn, and glob. I was unfamiliar with scikit-learn, so I decided to research it a little further. The scikit-learn library is a “library in Python that provides many unsupervised and supervised learning algorithms. It’s built upon some of the technology you might already be familiar with, like NumPy, pandas, and Matplotlib” (What is SCIKIT-LEARN?, n.d.). I was also unfamiliar with the glob library, so I did further research on it as well. I found that glob is “a general term used to define techniques to match specified patterns according to rules related to Unix shell. Linux and Unix systems and shells also support glob and also provide function glob() in system libraries. In Python, the glob module is used to retrieve files/pathnames matching a specified pattern” (How to use glob() function to find files recursively in python?, 2020). Below is an example of the libraries that I imported into a Python file.
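To make the description of glob concrete, here is a minimal sketch using only the standard library. It builds a throwaway folder with made-up file names (they are placeholders, not the real contents of “1666_texts/”), then uses a pattern to pick out only the text files.

```python
import glob
import os
import tempfile

# Build a temporary folder with hypothetical files so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "1666_texts")
    os.makedirs(folder)
    for name in ["B0002.txt", "A0001.txt", "notes.csv"]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as fh:
            fh.write("placeholder text")

    # "*.txt" matches every file ending in .txt; sorting gives a stable order.
    filenames = sorted(glob.glob(os.path.join(folder, "*.txt")))

    # Strip the folder and extension to get a short key for each file.
    filekeys = [os.path.splitext(os.path.basename(f))[0] for f in filenames]

print(filekeys)  # ['A0001', 'B0002']
```

Sorting the file names is a design choice rather than a requirement: it keeps the order of rows predictable when the files are later turned into a table.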

Next, it will be necessary to count every word in the one hundred and forty-two text files that were downloaded as previously mentioned. Also, to interpret the results, it will be necessary to use a pandas DataFrame to read the file labeled “1666_metadata.csv”, which was previously downloaded. The below image will illustrate the Python code used for this process.
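As a rough sketch of this step, the example below counts words with scikit-learn's CountVectorizer and reads a small metadata table with pandas. Everything in it is an assumption made for illustration: the three short texts, the IDs, the metadata rows, and the choice of CountVectorizer itself, since the original code is not reproduced here.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for three text files, keyed by made-up IDs.
texts = {
    "A0001": "air pressure and air pump",
    "A0002": "air pressure experiment",
    "A0003": "sermon on faith and charity",
}
filekeys = sorted(texts)

# Each row of the result is a document; each column is a word in the vocabulary.
vectorizer = CountVectorizer()
wordcounts = vectorizer.fit_transform([texts[k] for k in filekeys])

print(wordcounts.shape)  # (3, 9)

# Look up how often "air" occurs in the first document.
air_column = vectorizer.vocabulary_["air"]
print(wordcounts[0, air_column])  # 2

# Hypothetical metadata in CSV form, indexed by a "TCP ID" column.
metadata_csv = io.StringIO(
    "TCP ID,Author,Title,Keywords\n"
    "A0001,Author One,First Title,science\n"
    "A0002,Author Two,Second Title,science\n"
    "A0003,Author Three,Third Title,religion\n"
)
metadata = pd.read_csv(metadata_csv, index_col="TCP ID")
print(metadata.loc["A0002", "Author"])  # Author Two
```

When the texts live in files on disk, CountVectorizer can be given input="filename" and a list of paths instead of a list of strings.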

The next step in this process is to use the SciPy library to begin making some calculations. The first part is to calculate the distances, and the following step is to put the results into a “squareform matrix”, which makes the results easier to read and process. The below image will illustrate the “squareform matrix”.

After running the Python file up to this point, you can see the results as displayed below.

An explanation of the above returned values is as follows: “Each row represents a single EarlyPrint document, and the columns represent the exact same documents. The value in every cell is the distance between the text from that row and the text from that column. This configuration creates a diagonal line of zeroes through the center of your matrix: where every text is compared to itself, the distance value is zero” (Ladd, 2020).
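The distance calculation and the squareform matrix described above can be sketched with SciPy's pdist and squareform functions. The word counts and document IDs below are made up for illustration; the real project would use the counts from the 142 files.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical word counts: three documents (rows) by three words (columns).
filekeys = ["A0001", "A0002", "A0003"]
wordcounts = np.array([
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
])

# pdist returns one distance per pair of rows in a flat ("condensed") array.
condensed = pdist(wordcounts, metric="euclidean")

# squareform expands that into a full symmetric matrix with zeros on the diagonal.
euclidean_distances = pd.DataFrame(
    squareform(condensed), index=filekeys, columns=filekeys
)

print(euclidean_distances.round(3))
```

Labeling both the rows and the columns with the same keys is what makes it possible to look up the distance between any two documents by ID.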

Next, I wanted to use the Euclidean distance to determine which texts were most similar to the Robert Boyle text with a TCP ID of A28989. The below example is the Python code for this calculation.
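One way such a lookup could be written is sketched below with a small, made-up distance matrix. The IDs and values are hypothetical, and this is not necessarily the code used in the project: it drops the target document from its own column first, then asks pandas for the smallest remaining distances.

```python
import pandas as pd

# Hypothetical symmetric distance matrix for four documents.
keys = ["A0001", "A0002", "A0003", "A0004"]
distances = pd.DataFrame(
    [
        [0.0, 1.0, 4.0, 2.5],
        [1.0, 0.0, 3.0, 2.0],
        [4.0, 3.0, 0.0, 5.0],
        [2.5, 2.0, 5.0, 0.0],
    ],
    index=keys,
    columns=keys,
)

target = "A0002"  # the document to compare against
top_n = 2         # how many nearest documents to return

# Take the target's column, remove its zero distance to itself,
# then keep the top_n smallest values.
closest = distances[target].drop(target).nsmallest(top_n)

print(closest)
```

Dropping the target explicitly avoids having to request one extra row and then skip it, which is an easy place to introduce an off-by-one mistake.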

Unfortunately, this is where I encountered my first bug: running my code resulted in a KeyError. Below is an example of the error that I encountered.

After spending an extensive amount of time researching and troubleshooting this issue, I have yet to determine a solution to this error. In my research I consulted the information found at https://realpython.com/python-keyerror/#what-a-python-keyerror-usually-means. I went through the troubleshooting steps provided at this link, but so far I have not been able to resolve the problem. Had this error not occurred, the results would be displayed as in the image below. In my example, there should be ten results. However, in the context of this example, there will only be six.

Figure 4 (Ladd, 2020)
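In pandas, a KeyError generally means that the requested label is not among the row or column labels. The cause of the error in this project has not been established, so the sketch below shows only one possibility, offered as an assumption: if the labels are derived from file paths by splitting on "/", then on Windows, where paths use backslashes, the folder name stays attached to the label and a lookup for "A28989" fails. The path string and the splitting approach here are hypothetical.

```python
from pathlib import PureWindowsPath

# A Windows-style path, as glob might return it on that platform (hypothetical).
windows_style = "1666_texts\\A28989.txt"

# Splitting on "/" finds no separator, so the folder name stays in the key.
naive_key = windows_style.split("/")[-1].split(".")[0]
print(naive_key)  # 1666_texts\A28989

# pathlib understands the separator and returns just the file stem.
safe_key = PureWindowsPath(windows_style).stem
print(safe_key)  # A28989

# A lookup for the bare ID would miss among the naive labels.
labels = {naive_key}
print("A28989" in labels)  # False
```

PureWindowsPath is used here only so that the example behaves the same on every operating system; in a real script, pathlib.Path(f).stem would follow the conventions of the machine it runs on. Printing the first few row and column labels of the distance table is a quick way to check whether this is what happened.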

Next, if I wanted to see details for the above results, such as “Author, Title and Keywords”, I would use Python code like the following example.

Again, I continue to receive the KeyError, so I will provide an example of how the returned results should appear.

Figure 5 (Ladd, 2020)
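The metadata lookup described above can be sketched as follows. The metadata rows, the IDs, and the distances are all made up; the only things taken from the article are the idea of indexing by TCP ID and the three column names.

```python
import pandas as pd

# Hypothetical metadata table indexed by TCP ID.
metadata = pd.DataFrame(
    {
        "Author": ["Author One", "Author Two", "Author Three"],
        "Title": ["First Title", "Second Title", "Third Title"],
        "Keywords": ["science", "science", "religion"],
        "Date": [1666, 1666, 1666],
    },
    index=pd.Index(["A0001", "A0002", "A0003"], name="TCP ID"),
)

# Hypothetical result of a nearest-documents query: distances keyed by TCP ID.
closest = pd.Series([1.0, 2.0], index=["A0003", "A0001"])

# Use the IDs from the query to pull only the columns of interest,
# preserving the ranking order of the query.
details = metadata.loc[closest.index, ["Author", "Title", "Keywords"]]

print(details)
```

Note that .loc raises a KeyError if any ID in the query result is missing from the metadata index, so the IDs in the two tables need to match exactly.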

The next step is to calculate the Cosine Distance in a similar way to the Euclidean Distance. The below image represents the Python code for this example.

Again, in this example, I am attempting to find the ten texts with the smallest Cosine distance to the Robert Boyle text with a TCP ID of A28989. Unfortunately, I continue to receive the KeyError and am unable to view any results. However, the below image represents an example of the results that should appear.

Figure 6 (Ladd, 2020)
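As a sketch of how the cosine version might look, SciPy's pdist accepts metric="cosine", so the only change from a Euclidean calculation is the metric name. The word counts and IDs below are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical word counts: three documents (rows) by three words (columns).
filekeys = ["A0001", "A0002", "A0003"]
wordcounts = np.array([
    [3, 1, 0],
    [2, 2, 0],
    [0, 1, 3],
])

# Same pipeline as for Euclidean distance, with a different metric.
cosine_distances = pd.DataFrame(
    squareform(pdist(wordcounts, metric="cosine")),
    index=filekeys,
    columns=filekeys,
)

# Nearest document to A0001, excluding A0001 itself.
closest = cosine_distances["A0001"].drop("A0001").nsmallest(1)

print(cosine_distances.round(3))
print(closest)
```

In this toy data, A0001 and A0002 use the same two words, so they end up much closer to each other than either is to A0003.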

Ladd explains the returned results illustrated in Figure 5 and Figure 6 as follows: “The first three texts in the list are the same as before, but their order has reversed. Boyle’s other text, as we might expect, is now at the top of the rankings. And as we saw in the numerical results, its cosine distance suggests it’s more similar than the next text down in this list, Harvey’s. The order in this example suggests that perhaps Euclidean distance was picking up on a similarity between Thomson and Boyle that had more to do with magnitude (i.e. the texts were similar lengths) than it did with their contents (i.e. words used in similar proportions)” (Ladd, 2020).
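The difference between magnitude and proportion can be seen in a small worked example. The three word-count vectors below are invented for illustration: one candidate uses the same words in the same proportions as the target but is ten times longer, and the other is about the same length but uses a different mix of words.

```python
from scipy.spatial.distance import cosine, euclidean

# Hypothetical word-count vectors over the same three-word vocabulary.
target = [2, 1, 0]
same_mix_longer = [20, 10, 0]        # same proportions, ten times the length
similar_length_other_mix = [2, 0, 1]  # similar length, different words

candidates = {
    "same_mix_longer": same_mix_longer,
    "similar_length_other_mix": similar_length_other_mix,
}

# Compute both distances from the target to each candidate.
results = {
    name: (euclidean(target, vec), cosine(target, vec))
    for name, vec in candidates.items()
}

for name, (euc, cos) in results.items():
    print(f"{name}: euclidean={euc:.3f}, cosine={cos:.3f}")
```

Euclidean distance ranks the similar-length document as closer, because the longer document is far away in raw counts. Cosine distance ranks the longer document as closer, because its words appear in the same proportions as the target's.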

In closing, I found this project insightful for learning about other useful Python libraries. I will continue trying to resolve the KeyError that I encountered in the middle stages of this project. Lastly, I plan to test this approach by calculating distance and similarity among other text files and to expand my experience with these methods.

Resources

Cosine similarity. (2021, April 14). Retrieved April 16, 2021, from https://en.wikipedia.org/wiki/Cosine_similarity

Euclidean distance. (2021, April 13). Retrieved April 16, 2021, from https://en.wikipedia.org/wiki/Euclidean_distance#:~:text=In%20mathematics%2C%20the%20Euclidean%20distance,being%20called%20the%20Pythagorean%20distance.

Hansen, C. (2021, February 27). Python keyerror exceptions and how to handle them. Retrieved April 17, 2021, from https://realpython.com/python-keyerror/#what-a-python-keyerror-usually-means

How to use glob() function to find files recursively in python? (2020, April 25). Retrieved April 17, 2021, from https://www.geeksforgeeks.org/how-to-use-glob-function-to-find-files-recursively-in-python/#:~:text=Glob%20is%20a%20general%20term,pathnames%20matching%20a%20specified%20pattern.

Ladd, J. (2020, May 05). Understanding and using common similarity measures for text analysis. Retrieved April 16, 2021, from https://programminghistorian.org/en/lessons/common-similarity-measures

What is SCIKIT-LEARN? (n.d.). Retrieved April 17, 2021, from https://www.codecademy.com/articles/scikit-learn#:~:text=Scikit%2Dlearn%20is%20a%20library,NumPy%2C%20pandas%2C%20and%20Matplotlib!
