Text Analytics With PostgreSQL — Text Statistics

A data analysis story with PostgreSQL, SQLAlchemy, FastAPI and Plotly on my Medium articles

7 min read, Aug 6, 2024

In the Django and FastAPI series, I detailed database relationship types for a small application built around articles and writers. Curious to see what my Medium data looks like from a text point of view, I downloaded it (Medium docs) and imported it into the models created in those articles. Here’s the start of the story.

Importing Article Data

My data export included a folder with all the articles (97) and comments I have ever created on the platform. The files are in HTML format and follow the naming convention date_article-title_id:

Screenshot of the export, with the list of articles
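
The naming convention can be split back into its parts with a small helper. This is only a minimal sketch: it assumes the date is the first underscore-separated token, the id is the last one, and the file carries an .html extension. The function name and the sample file name are hypothetical, made up for illustration.

```python
from pathlib import Path


def parse_export_name(file_name: str) -> dict:
    """Split a name shaped like date_article-title_id.html into its parts."""
    stem = Path(file_name).stem  # drop the .html extension
    # the date is the first token and the id is the last one;
    # everything in between is treated as the title
    date_part, rest = stem.split("_", 1)
    title_part, id_part = rest.rsplit("_", 1)
    return {"date": date_part, "title": title_part, "id": id_part}


# hypothetical file name, not taken from the real export
parts = parse_export_name(
    "2024-08-06_Text-Analytics-With-PostgreSQL_abc123def456.html"
)
print(parts)
```

Splitting once from the left and once from the right keeps the title intact even if it contains underscores of its own.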

Extract Text From HTML

To extract the text from each article, I’m using Beautiful Soup, the library that powers many web scrapers:

import bs4

# `article` is the path to one exported HTML file (assuming UTF-8 encoding)
with open(article, encoding="utf-8") as article_file:
    soup = bs4.BeautifulSoup(article_file, "html.parser")
    # remove the pre tags (used for code snippets)
    for s in soup("pre"):
        s.decompose()
    # remove the footer tag
    for s in soup("footer"):
        s.decompose()…
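
To see the effect of removing these tags, here is a self-contained sketch that runs the same kind of cleanup on a small inline HTML string and then pulls out the remaining text. The HTML sample is hypothetical, and the use of get_text is an assumption about how the text could be collected, not necessarily the step the original snippet continues with.

```python
import bs4

# hypothetical HTML, standing in for one exported article
html = """
<article>
  <h1>My title</h1>
  <p>First paragraph.</p>
  <pre>print("code")</pre>
  <footer>Exported from Medium</footer>
</article>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
# drop code snippets and the footer in a single pass
for tag in soup(["pre", "footer"]):
    tag.decompose()

# join the remaining text nodes with single spaces, skipping empty ones
text = soup.get_text(separator=" ", strip=True)
print(text)
```

Passing a list of tag names to soup(...) finds all of them at once, which avoids repeating the loop for each tag.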

Written by Petrica Leuca