Gippius Part 2: Scraping Poems from HTML Tags and Performing Textual Data Feature Engineering

Anna Chernysheva
4 min read · Nov 23, 2023

In the previous part of the project, we obtained the full list of Zinaida Gippius’ poem page URLs from ollam.ru and stored them in a MySQL database.

The objective of this phase is to retrieve all poems along with their titles and perform the feature engineering needed to build a dataset with the following information: the poem text, title, year, month, and location (if available) where the poem was written. Additionally, for future exploratory data analysis (EDA), we will include an ‘Age’ column that indicates Gippius’ age when she wrote a given poem.

1. Retrieving URLs from the MySQL Database

We will use mysql.connector to connect to the database and navigate our project data.

import mysql.connector
import pandas as pd

# open a connection to the database (placeholder credentials)
cnx = mysql.connector.connect(
    host='localhost', user='user', password='password', database='gippius'
)
cursor = cnx.cursor()

# execute the SQL query to fetch the "poems_url" table
query = "SELECT * FROM poems_url"
cursor.execute(query)

# fetch all the rows from the result set
rows = cursor.fetchall()

# create a list of URLs from the fetched data
poems = pd.DataFrame(rows, columns=cursor.column_names).url.tolist()

As a result, we obtain a list of the URLs of all of Gippius’ poem pages.

2. Scraping Poems from HTML Tags

I am deeply thankful to the developers of this site (or of the Drupal CMS it is deployed on) for its outstanding structure. Each poet has their own section with an eponymous path name. The section’s main page lists the full catalogue of works along with some filtering options, and each poem’s path is nested under the section name, as a glance at the breadcrumbs confirms.

The HTML of each page shows the same consistent organization. Every poem can be easily extracted by collecting <p> tags, since no other content or SEO text (apart from one stray paragraph) is wrapped in this tag on the page.

Now, let’s define a function that scrapes the paragraphs and the <h1> title from a page, and apply it to all poem URLs.

import requests
from bs4 import BeautifulSoup

# connect to the page and retrieve the relevant HTML tags
def retrieve_paragraphs(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.find_all('p')
    title = soup.find('h1').get_text().strip()
    return [p.get_text().strip() for p in paragraphs], title

# iterate over all URLs from the list
results = [retrieve_paragraphs(url) for url in poems]
paragraphs = ['\n'.join(paras) for paras, _ in results]
titles = [title for _, title in results]

Subsequently, we create a dataframe poems_df to store the poems. As mentioned above, there is one stray paragraph (the site footer) that needs to be removed.

# create the dataframe and strip the site footer that ends every page
# (the footer reads: "Poetic portal Ollam. Online publication of poems.
# All rights to the works belong to their authors © 2018 Russia")
poems_df = pd.DataFrame({'Poem': paragraphs, 'Title': titles})
todrop = '\nПоэтический портал Оллам. Электронная онлайн публикация стихотворений. Права на все произведения принадлежат их авторам © 2018 Россия'
poems_df['Poem'] = poems_df['Poem'].str.replace(todrop, '', regex=False)

3. Manipulating Poem Text for Feature Engineering

By convention (a non-mandatory one), a poem’s layout may note the date and the place where it was written. A quick preliminary check showed that at least one-third of the poems carry this information, so we extract it whenever possible. Since most poems that include a date carry it in their final <p> tags, we build a new list called year by taking the text between the last two newline delimiters, i.e. split('\n')[-2].

# collect the candidate year from each poem's penultimate line
year = []
for poem in poems_df['Poem']:
    last_element = poem.split('\n')[-2]
    year.append(last_element)

# add the candidates as a dataframe column
poems_df['Year'] = pd.Series(year)

However, this position sometimes holds location information or other words instead, which required manual examination and some work in Google Sheets to extract the relevant features.
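Before the manual pass, a rough split can be automated. Here is a minimal sketch (not from the original post): values that look like a four-digit year stay in Year, while everything else is moved to a candidate Location column for review.

# keep 4-digit values in 'Year'; move everything else to a candidate
# 'Location' column for manual review (illustrative, not the author's code)
is_year = poems_df['Year'].str.fullmatch(r'\d{4}', na=False)
poems_df['Location'] = poems_df['Year'].where(~is_year)
poems_df['Year'] = poems_df['Year'].where(is_year)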

In order to show the poet’s age, we subtracted her year of birth (1869) from the year each poem was written.
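A one-line sketch of this step, assuming the Year column has already been cleaned down to plain four-digit values:

# Gippius was born in 1869; non-numeric Year values become NaN
poems_df['Age'] = pd.to_numeric(poems_df['Year'], errors='coerce') - 1869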

Then, we translated and standardized the new columns:

# standardize location names (the original post reuses month_dict here;
# a separate mapping such as location_dict is presumably intended)
poems_df['Location'] = poems_df['Location'].replace(location_dict)

# standardize month names
poems_df['Month'] = poems_df['Month'].replace(month_dict)
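Neither mapping is defined in the post; a plausible sketch of what they might contain (names and entries are illustrative):

# hypothetical translation mappings from Russian to English
month_dict = {
    'Январь': 'January', 'Февраль': 'February', 'Март': 'March',
    'Апрель': 'April', 'Май': 'May', 'Июнь': 'June',
    'Июль': 'July', 'Август': 'August', 'Сентябрь': 'September',
    'Октябрь': 'October', 'Ноябрь': 'November', 'Декабрь': 'December',
}
location_dict = {
    'Санкт-Петербург': 'Saint Petersburg', 'Москва': 'Moscow',
    'Париж': 'Paris',
}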

Voilà! Now, let’s send our beautiful dataframe to the database!
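The post doesn’t show this step, but a minimal sketch using SQLAlchemy and pandas’ to_sql could look like this (connection string values are placeholders):

from sqlalchemy import create_engine

# placeholder credentials and database name
engine = create_engine('mysql+mysqlconnector://user:password@localhost/gippius')
poems_df.to_sql('poems', con=engine, if_exists='replace', index=False)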

Note that not every poem includes a year, month, and location; the missing values show up as NaN in those columns.

Thanks for reading! In the next part of the project, we will start searching for insights through EDA. Join! →

