Predicting Poetic Movements
Analyzing and categorizing poetry to prepare for content-based recommendation
Within written media, poetry is often regarded as enigmatic, frivolous, or too niche. As a result, poems (even by established poets) are often overlooked by larger publishers and literature-focused websites alike. (The anti-capitalist nature of poetry may play a role here as well.) There are services for rating and recommending entire books (including poetry collections, to be fair) like GoodReads, Amazon, or Bookish, but to my knowledge, there aren’t any sites or services that recommend individual poems.
With this in mind, I wondered how poem recommendation might even work. Readers often find a genre or two that they like and seek it out, but there must be elements of poetry that transcend genre. If there are, machine learning seems like a natural tool for finding them. In this article, I’ll explore some features of poetry that make it unique as a style of writing and investigate differences between four umbrella genres I’ll be referring to as “movements”. After building a model, I can create a recommendation system that recommends poetry based on a word, multiple words, or another poem.
The data
With a history that dates back to 1912, the Poetry Foundation is one of the largest purveyors of poetry in the world and a crucial resource for poets and readers alike. I scraped 4,307 poems from their website, each of which was labeled with a genre. There were a total of 13 genres, which I broke down into four movements:
- Pre-1900 (Victorian and Romantic)
- Modern (a standalone category)
- Metropolitan (New York School [1st and 2nd Generation], Confessional, Beat, Harlem Renaissance, Black Arts Movement)
- Avant-Garde (Imagist, Black Mountain, Language Poetry, Objectivist)
By using four roughly balanced categories instead of the original thirteen, I was able to more easily analyze and classify each class of poem. Modern poetry (both the genre and the movement) made up about 29% of the data. Avant-Garde, the movement with the fewest poems, made up about 22% of the data.
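As a rough illustration of the grouping above, the thirteen genres can be collapsed into four movements with a simple lookup table. The label strings here are hypothetical placeholders (the exact genre names used on the site may differ), and the sample input is a toy list, not the real dataset.

```python
from collections import Counter

# Hypothetical genre labels; the exact strings in the scraped data may differ.
GENRE_TO_MOVEMENT = {
    "victorian": "pre_1900",
    "romantic": "pre_1900",
    "modern": "modern",
    "new_york_school": "metropolitan",
    "new_york_school_2nd_generation": "metropolitan",
    "confessional": "metropolitan",
    "beat": "metropolitan",
    "harlem_renaissance": "metropolitan",
    "black_arts_movement": "metropolitan",
    "imagist": "avant_garde",
    "black_mountain": "avant_garde",
    "language_poetry": "avant_garde",
    "objectivist": "avant_garde",
}


def movement_shares(genres):
    """Map each poem's genre to its movement and return each movement's share."""
    movements = [GENRE_TO_MOVEMENT[genre] for genre in genres]
    counts = Counter(movements)
    total = len(movements)
    return {movement: count / total for movement, count in counts.items()}


# Toy input, not the real dataset.
sample_genres = ["victorian", "modern", "beat", "imagist", "romantic"]
shares = movement_shares(sample_genres)
```

Checking the shares this way is a quick test of whether the four classes are roughly balanced before any modeling.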
A note on the scraping process
The scraping process presented challenges in that poems came in two forms: HTML-text and scanned images. I was able to use BeautifulSoup to easily capture the text-based ones, but had to rely on PyTesseract for poems from scanned images. While I’m confident that a large majority have been scraped properly, there are undoubtedly some poems that are truncated, contain typos, or have extra lines, merely as a result of the inaccuracies of the image-to-text library. Still, in the name of having more data, using the scanned image poetry was a necessity.
Outline
After a lengthy scraping (and re-scraping) process, I cleaned the data by removing section headers (Roman numerals and things like Part 1, Part 2, etc.), empty lines, and any extra lines containing the poet’s name and year of publication. This allowed me to more accurately engineer several features, including the number of lines in the poem, average number of words per line, average number of syllables per word, and lexical richness. I also looked at the polarity and subjectivity of poems.
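A minimal sketch of that kind of line cleaning is shown here. The two regular expressions are illustrative assumptions rather than the project's actual rules, and removing poet-name and year lines is not covered.

```python
import re

# Illustrative patterns (assumptions): a line that is only a Roman numeral,
# or a "Part N" style header. Caveat: a genuine poem line consisting solely
# of "I" (or a word like "MIX") would also match the first pattern.
ROMAN_NUMERAL = re.compile(r"^[IVXLCDM]+\.?$")
PART_HEADER = re.compile(r"^part\s+(\d+|[ivxlcdm]+)\.?$", re.IGNORECASE)


def clean_lines(raw_text):
    """Return the poem's lines with empty lines and section headers removed."""
    cleaned = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip empty lines
        if ROMAN_NUMERAL.match(stripped) or PART_HEADER.match(stripped):
            continue  # skip section headers
        cleaned.append(stripped)
    return cleaned


# Toy poem text, not from the dataset.
raw = "I\nA gull hangs over the pier.\n\nPart 2\nThe tide pulls back,\nII.\nthe harbor lights go out"
poem_lines = clean_lines(raw)
num_lines = len(poem_lines)  # the "number of lines" feature
```

With headers and blanks gone, the line count reflects only lines that belong to the poem itself.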
After feature engineering, I explored the data alongside these new features and processed the text to investigate the most frequently used words. I created a variety of visualizations to support my findings. Finally, I ran several prediction models to provide further insights into what I looked at during my data exploration.
Feature engineering and EDA
I was very excited by the range of features that can be engineered within poetic text, most of which proved very useful in both analysis and classification. Poetry is a unique medium of writing in which structure and form are integral to the style (and sometimes even the meaning) of a poem.
As I will show, Avant-Garde poetry, often seen as a more experimental style and an outright rejection of the past, is almost always at the opposite end of the spectrum from Pre-1900 poetry, which is unsurprising from a literary criticism standpoint. In short, the formal and structural elements that I quantified in this project provide statistical confirmation of well-established literary theories and analysis.
Number of lines
This is the standard measurement of the length of a poem, as opposed to word count. That is why the data cleaning I described earlier was so crucial: it removed any lines that aren’t part of the actual poem itself.
I was surprised to find that the median values were fairly similar across all movements, with the exception of Modern poetry, which had the smallest median value.
Despite these similarities, Pre-1900 poems do tend to be much longer on average. The average length is 55 lines, whereas the next highest, Metropolitan, is only 38. The distribution of the upper quartiles in the chart above further depicts a movement that’s no stranger to a long poem. The lower whisker for Pre-1900 also shows that those poems tend to be at least a few lines long (the minimum was 4), whereas the other movements have no problems with a one-line poem.
Modern poetry tends to be the shortest with both the lowest average (33 lines) and a median that is 4 lines fewer than the next lowest. Avant-Garde and Metropolitan poetries are statistically similar to each other, as are Avant-Garde and Modern poetries.
Average line length (words per line)
Another key metric that greatly affects how a poem appears on the page, as well as how it is read, is the average number of words per line. A poem with a words-per-line average of two is going to look and feel very different from, say, a sonnet with a words-per-line average of eight.
One important discovery was how the advent of the prose poem skewed my data. A prose poem is a poem that looks much more like a piece of fiction, using paragraphs or large chunks of text as opposed to the line breaks one usually associates with poems. So some of those one-line poems discussed in the previous section may simply have been one-paragraph prose poems.
These types of poems became much more prevalent in the 20th century and are not present in my data’s Pre-1900 category. As a result, the maximum value for average line length in Pre-1900 poetry is 23, whereas the maximums for the other three movements are in the upper hundreds and even well above one thousand.
While this obviously skews the averages of Avant-Garde, Metropolitan, and Modern poetry, their median values tell a different story.
Avant-Garde tends to have the fewest words per line by far, with a median value of about 5.1 words, compared to the next lowest, Metropolitan, at about 6.6 words. Avant-Garde simultaneously happens to have the highest average at 9.3 words per line, which suggests a prevalence of prose poetry within the movement.
Pre-1900 poetry tends to have the longest lines, with a median value of 7.0 words, and also tends to be the most regular, with the smallest range of values. This makes sense given the adherence to established structures such as sonnets and villanelles. It is also worth noting that Pre-1900 poetry has the smallest average value (7.2 compared to the next lowest of 8.3), which is again most likely due to there being no examples of prose poetry.
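The words-per-line metric, and the way a single prose poem can pull the mean far from the median, can be sketched as follows. The numbers are toy values chosen for illustration, not figures from the dataset.

```python
from statistics import mean, median


def words_per_line(lines):
    """Average number of whitespace-separated words per non-empty line."""
    counts = [len(line.split()) for line in lines if line.strip()]
    return sum(counts) / len(counts) if counts else 0.0


# Toy per-poem averages: four lineated poems plus one single-paragraph
# prose poem whose only "line" holds 300 words.
toy_averages = [4.5, 5.0, 5.5, 6.0, 300.0]

toy_mean = mean(toy_averages)      # dragged upward by the prose poem
toy_median = median(toy_averages)  # barely affected by it
```

This is why a movement can have both the lowest median and the highest mean for the same feature.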
Polarity
Pre-1900 poetry is overwhelmingly positive, with a median value of 0.90! In the box-and-whisker plot below, notice the position of the red line compared to the other movements. The other three movements are all similar to each other, and their polarities have no statistically significant differences between them.
Poetry is rarely neutral and tends to be positive; as depicted in the chart below, at least 61% of the poems in each movement have a positive polarity score. 71% of Pre-1900 poems have a positive polarity score.
Avant-Garde poetry contains the most neutral poems at just below 5%, but it’s still a relatively small share.
End rhymes
I was able to use Allison Parrish’s Pronouncing package to determine the number of end rhymes a poem contains. An end rhyme occurs when the word at the end of one line rhymes with another word at the end of a different line. I divided that number by the number of total lines to get a ratio that became one of my classification model’s most important features. (Note: I counted only unique rhymes.)
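To make the ratio concrete, here is a minimal stdlib-only sketch. It makes two loud assumptions: "unique rhymes" is read as the number of distinct rhyme sounds shared by at least two different end words, and `toy_rhyme_key` is a crude spelling-based stand-in for a phonetic rhyme key. With Pronouncing installed, a key built on `pronouncing.phones_for_word` and `pronouncing.rhyming_part` could be passed in instead.

```python
import string
from collections import defaultdict


def last_word(line):
    """Lowercased final word of a line, with surrounding punctuation stripped."""
    words = line.split()
    return words[-1].strip(string.punctuation).lower() if words else ""


def toy_rhyme_key(word):
    """Crude stand-in (assumption): treat the last three letters as the rhyme."""
    return word[-3:]


def end_rhyme_ratio(lines, rhyme_key=toy_rhyme_key):
    """Distinct rhyme sounds shared by 2+ different end words, divided by line count."""
    groups = defaultdict(set)
    for line in lines:
        word = last_word(line)
        if word:
            groups[rhyme_key(word)].add(word)
    unique_rhymes = sum(1 for words in groups.values() if len(words) > 1)
    return unique_rhymes / len(lines) if lines else 0.0


# Toy stanza written for this example.
stanza = [
    "The lantern swung above the gate,",
    "and we had stayed out far too late;",
    "the road ran on beneath the moon,",
    "and morning would be coming soon.",
]
ratio = end_rhyme_ratio(stanza)
```

Under this reading, a poem written entirely in rhymed couplets scores 0.5, since each rhyme sound covers two lines.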
Unsurprisingly, there is a lot of separation between Pre-1900 poetry and the other movements.
Avant-Garde poetry tends not to use end rhymes, and they are relatively infrequent in Metropolitan poetry. End rhymes are not uncommon in Modern poetry, but they are truly at home in Pre-1900 poetry (and almost seem to be a requirement!), as shown below.
Only 8% of Avant-Garde poems had an end rhyme ratio above 0.25, compared to 85% of Pre-1900 poems.
Complexity of language (syllables per word)
Again using the Pronouncing package, I calculated the average number of syllables per word in each poem. I used this as a measure of the complexity of the language used within a poem; words with more syllables tend to be more complex than words with only one syllable.
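The per-poem average can be sketched as below. Note the swapped-in technique: instead of a pronunciation dictionary, this uses a rough vowel-group heuristic so the example runs with the standard library alone, and it will miscount some words. A dictionary-based counter (for instance one built on `pronouncing.phones_for_word` and `pronouncing.syllable_count`) could be passed in as `count_syllables`.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def estimate_syllables(word):
    """Rough heuristic (assumption): count vowel runs, dropping a silent final 'e'."""
    word = word.lower()
    if len(word) > 2 and word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(VOWEL_GROUP.findall(word)))


def syllables_per_word(text, count_syllables=estimate_syllables):
    """Average syllables per word, using whichever counter is supplied."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return sum(count_syllables(word) for word in words) / len(words)


average = syllables_per_word("stone water table")
```

Keeping the counter as a parameter means the averaging logic stays the same whichever syllable source is used.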
I had expected Pre-1900 poetry, with its flowery Victorian-era English, to have a much higher average of syllables per word. Instead, it has the simplest word usage (fewest syllables), whereas Metropolitan has the highest median value, narrowly edging out Avant-Garde.
It’s worth noting that Avant-Garde has the largest range by far, as shown in the above chart. This indicates a varied movement of poems that employ both simple and complex language.
Lexical richness
Another measure of complexity of language is lexical richness, which is calculated by dividing a poem’s vocabulary (the number of unique words) by the number of total words in a text. A repetitious poem would have a low value, whereas a poem with a high value (almost or entirely unique words) would be described as “lexically rich”. A poem in which each word appears only once would have a score of 1.0.
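The calculation described above is a ratio of unique words to total words, which can be sketched in a few lines. The tokenization rule here (lowercased runs of letters and apostrophes) is an assumption; the article does not specify how words were split for this feature.

```python
import re


def lexical_richness(text):
    """Unique words divided by total words; 1.0 means no word repeats."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


repetitious = lexical_richness("the sea the sea the open sea")  # 3 unique / 7 total
all_unique = lexical_richness("each word appears once")         # 4 unique / 4 total
```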
Pre-1900 poetry appears to be the most repetitious movement, whereas Avant-Garde poetry is the most lexically rich. In the chart below, Avant-Garde is the only movement where a whisker reaches a value of 1.0, and all of its quartiles are well above the other movements.
It’s important to combine a couple of these observations to realize that Pre-1900 is wordy and repetitive, whereas Avant-Garde tends to be concise and full of unique language.
Text processing
I processed the poems in order to get a better look at what words were most frequently used within each movement. To process the text, I:
- made the poems lowercase
- converted contractions to root words
- removed punctuation
- lemmatized
- removed stop words
My stop words included:
- NLTK stop words
- older English equivalents to those stop words (e.g. thy, doth, ere)
- poet names (because some may have gotten through in the scraping process), minus any names that may also be used as words
- HTML tags that may have gotten through the scraping process (this was an issue during my initial scrape, but I believe was corrected during the cleaning process; still, better safe than sorry)
- words of questionable value discovered in the first round of EDA (such as would, upon, and may)
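The five processing steps and the custom stop word list can be sketched as a small pipeline. Everything in the lookup tables is a tiny illustrative stand-in (an assumption): the real project draws on the NLTK stop word list and a proper lemmatizer, neither of which is reproduced here.

```python
import re
import string

# Toy stand-ins (assumptions), far smaller than the real resources.
CONTRACTIONS = {"can't": "can not", "won't": "will not", "i'm": "i am", "it's": "it is"}
STOP_WORDS = {"the", "a", "and", "is", "it", "i", "am", "not", "can", "will",
              "thy", "doth", "ere", "would", "upon", "may"}
TOY_LEMMAS = {"comes": "come", "coming": "come", "came": "come", "eyes": "eye"}


def process(text, lemmas=TOY_LEMMAS, stop_words=STOP_WORDS):
    """Lowercase, expand contractions, strip punctuation, lemmatize, drop stop words."""
    text = text.lower().replace("’", "'")  # lowercase and normalize apostrophes
    for short, full in CONTRACTIONS.items():
        # Word boundaries keep "it's" from matching inside "spirit's".
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [lemmas.get(token, token) for token in text.split()]
    return [token for token in tokens if token not in stop_words]


tokens = process("It's late, and the night comes; I can't see thy eyes.")
```

Expanding contractions before stripping punctuation matters, because once the apostrophe is gone "can't" becomes "cant" and no longer matches anything in the table.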
There were 119,285 unique words in the corpus and 1,165,726 total words. After processing, this went down to 36,443 unique words and 585,256 total words.
The 25 most frequently used words are (again, after processing and lemmatization):
There are a lot of visual (see, eye, light, look, white, face), temporal (day, night, time, old, long, never), and conceptual (love, life, man, heart, thing, still, world) terms.
I find it interesting that come just barely edged out love for the top spot. Again, this is after lemmatization, so this is a combination of come, comes, coming, came, etc. This perhaps simultaneously points to a call to action (a beckoning, a la “Come here!”), a passive observation (“He comes from a distant city…”, from Diane di Prima’s An Exercise in Love), as well as the sexual verb, which is undoubtedly more common in the post-19th-century movements.
Breaking down word frequency by movement paints a clearer picture of some of the differences in language used:
Metropolitan, Modern, and Avant-Garde poetries tend to focus more on the visual and temporal, with Avant-Garde also including some more specifically natural words like water, tree, sea, and leaf. It is also worth noting that love is only the eighth most popular term for Avant-Garde, whereas it’s in the top three of the other movements.
Pre-1900 poetry skews more conceptual and ethereal, with words like soul and god, which are unique to this movement’s top 25. I’m also surprised at the relative lack of natural terms (with the exception of sea), considering this movement includes the Romantic genre, which is known for glorifying nature.
Black is unique to Metropolitan’s list, which can presumably be explained by the Harlem Renaissance and Black Arts Movement genres, as well as the darker, gritty aesthetic of city-based poetry by Beat and New York School poets.
Finally, it is worth noting the scale of each of these graphs, which reflects the wordiness and repetition of Pre-1900 poetry and the opposite qualities in Avant-Garde poetry. As has generally been the case in my analyses, Metropolitan and Modern poetries lie somewhere in the middle.
Modeling
I ran Naive Bayes, KNN, Decision Tree, Random Forest, and SVM models using a TF-IDF vectorizer. My final implementation, however, was an SVM model using Doc2Vec document vectors instead, which provided me with a decent F1 score and the best fit by far. Although I kept them out of my final Jupyter notebook for the sake of brevity, I also ran XGBoost and LSTM models, which showed some promise but weren’t quite up to the level of my final model.
The importance of numerical data
All of my models consistently performed much better when using both the word (or document) vectors and my engineered features. Generally, a model would see around a 10% boost when including these features.
The baseline model, for which I used Bernoulli Naive Bayes on both TF-IDF vectors and my engineered features, achieved an F1 score of 42.7%. This is considerably better than just predicting the dominant class, which accounted for 29% of the data. Still, as you can see in the confusion matrix below, it did indeed overpredict on the dominant class, Modern, even though there wasn’t much of an imbalance.
I had some success with K-Nearest Neighbors, which suggests a certain amount of clustering in the data, as well as a Random Forest. The latter was extremely overfit, however.
Similarly overfit was my highest-scoring model, an SVM with the TF-IDF vectors and numerical data. Its strong score was relatively unsurprising given SVM’s general success with text classification and with data for which there are more features than datapoints. This model achieved the best F1 score, but would not generalize well on unseen data.
An SVM that combined my numerical features with Doc2Vec embeddings proved to be the model that best generalizes on unseen data, without taking too much of a hit in F1 score.
Other than the baseline, my models were consistently better at picking out Pre-1900 poetry, without much confusion between that movement and the other three. Avant-Garde, Metropolitan, and Modern proved more difficult to differentiate and were generally confused for each other. The final model seems to suggest Modern being the closest movement to Pre-1900, with 15% of Modern poems being incorrectly classified as Pre-1900 poems. Avant-Garde and Metropolitan appear very similar to each other, which makes sense from a poetry standpoint.
Run time was not an issue for most of my models, which is partially a result of having a relatively small dataset. My Doc2Vec model runs nearly instantly, having only 100 dimensions and 7 engineered features.
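The 100 + 7 feature layout mentioned above can be sketched as a simple concatenation. The feature names and values are hypothetical, the zero vector is a placeholder for a real document embedding, and the standardization step is an assumption: the article does not say whether features were scaled, though SVMs are generally sensitive to feature scale.

```python
from statistics import mean, pstdev

# Hypothetical names for the seven engineered features described in the article.
FEATURE_NAMES = ["num_lines", "words_per_line", "syllables_per_word",
                 "lexical_richness", "end_rhyme_ratio", "polarity", "subjectivity"]


def combine_features(doc_vector, engineered):
    """Concatenate a document vector with one poem's engineered features."""
    return list(doc_vector) + list(engineered)


def standardize_columns(rows):
    """Z-score each column (assumed preprocessing); constant columns become 0."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    spreads = [pstdev(column) or 1.0 for column in columns]
    return [[(value - m) / s for value, m, s in zip(row, means, spreads)]
            for row in rows]


doc_vector = [0.0] * 100  # placeholder for a 100-dimensional document vector
engineered = [14, 7.0, 1.2, 0.6, 0.5, 0.9, 0.5]  # toy values
combined = combine_features(doc_vector, engineered)

scaled = standardize_columns([[1.0, 10.0], [3.0, 10.0]])
```

Each poem ends up as a single 107-value row, which is what keeps the model small compared with a TF-IDF vocabulary.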
Final model
I trained a final model using all of the data, and the F1 score increased to 66.8%. Pre-1900 was indeed the easiest to identify, and the other three movements were fairly similar to each other, with Modern being the most difficult to correctly identify (an F1 score of 60%).
The F1 score of each individual movement increased after training on the entire dataset. Modern saw the biggest jump from 46% to 60%. Avant-Garde saw a surprisingly large boost in accuracy after training on the entire dataset, with its accuracy score moving from 51% to 65%. Being the smallest class (at about 22%) may explain this; more data is almost always a good thing. Its F1 score jumped from 53% to 62%.
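For readers less familiar with the metric, per-movement F1 scores like those above can be computed from true and predicted labels as follows. This is a generic sketch with toy labels, not the project's evaluation code, and it leaves open how the per-class scores were averaged into the overall figure.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN) for each label."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


# Toy labels for illustration only.
y_true = ["pre_1900", "pre_1900", "modern", "modern", "avant_garde", "metropolitan"]
y_pred = ["pre_1900", "pre_1900", "modern", "pre_1900", "metropolitan", "metropolitan"]
scores = per_class_f1(y_true, y_pred)
```

Because F1 balances precision and recall, a class can score poorly either by being missed or by absorbing other classes' poems.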
Top features
Except for my baseline and TF-IDF SVM models, many if not all of my engineered features ranked prominently within the top ten most important features.
In my final model, five of my seven features were in the top ten:
The ratio of end rhymes to total lines took the top spot by a healthy margin, followed by the average number of words per line, the total number of lines, and lexical richness. The average number of syllables per word was the other engineered feature that made the top ten. Polarity and subjectivity scores were the only two that didn’t carry much importance in the model.
By using document vectors instead of TF-IDF vectors, I do end up losing some interpretability, given that the other features in the above chart are merely five out of 100 mysterious dimensions. Still, by using a set of features that totals 107 as opposed to 43,053, I produced a much simpler model with similar efficacy and a better ability to generalize.
This will help me more easily produce a recommendation system as well!
Recommendation system
Tune in later this week for a breakdown on how I built PO-REC, an algorithm that can recommend poems based on one word, multiple words in any format, or another poem within my dataset.
Conclusions
The power of form and structure! Numerical data based upon the form and structure of a poem proved to be consistently effective predictors of a poem’s movement.
Pre-1900 poems tend to be long, wordy, positive, and full of rhymes, and they use simpler, repetitious language.
Avant-Garde poems tend to be short, sparsely worded, and unrhyming, and they use complex, lexically rich language.
Metropolitan and Modern poems lie somewhere in between. Metropolitan poetry is most similar to Avant-Garde poetry, whereas Modern poetry shares similarities with all of the other movements and is the only other movement to be somewhat similar to Pre-1900 poetry.
Future considerations
In the future, it would be interesting to engineer even more features, such as other types of rhyming (use of internal rhymes or slant rhymes), verb tenses (whether a poem predominantly uses present or past tense), and use of white space (e.g. whether a poem always starts on the left part of the line). Topic modeling may yield some interesting results as well.
Furthermore, I plan on trying to build this out using the actual genres (of which there are 13), as opposed to the four umbrella-like movements discussed here. This will present some notable challenges, not least of which is the large class imbalance. Modern poetry, which is its own genre and movement, accounts for over a quarter of all the poems. Although this will assuredly result in much less accurate models, it will also shed some light on the intricacies within poetic movements.
Project repo
You can check out my project repo on GitHub:
https://github.com/p-szymo/poetry_genre_classifier