Text Analytics With PostgreSQL — Text Statistics
A data analysis story with PostgreSQL, SQLAlchemy, FastAPI and Plotly on my Medium articles
In the Django and FastAPI series, I detailed database relationship types for a small application built around articles and writers. Curious to see what my Medium data looks like from a text point of view, I downloaded it (Medium docs) and imported it into the models created in those articles. Here’s the start of the story.
Importing Article Data
In the export of my data, I received a folder with all my articles (97) and comments ever created on the platform. They are in HTML format and follow the naming convention date_article-title_id.
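As a minimal sketch of how that naming convention could be split into its parts before import: the helper name, the `ArticleFile` fields, the sample file name, and the YYYY-MM-DD date format are all assumptions made for illustration, not details taken from the export itself.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class ArticleFile:
    """Parts of one exported file name (hypothetical structure)."""
    published: date
    slug: str
    medium_id: str


def parse_article_filename(path: str) -> ArticleFile:
    """Split a `date_article-title_id.html` file name into its three parts."""
    stem = Path(path).stem  # drop the directory and the .html extension
    # The date is the first underscore-separated field and the id the last one;
    # everything in between is the title, which may itself contain underscores.
    date_part, _, rest = stem.partition("_")
    slug, _, medium_id = rest.rpartition("_")
    if not (date_part and slug and medium_id):
        raise ValueError(f"unexpected file name: {path!r}")
    # Assumption: the date is written as YYYY-MM-DD.
    return ArticleFile(date.fromisoformat(date_part), slug, medium_id)


# Hypothetical file name that follows the convention.
example = parse_article_filename(
    "posts/2021-03-14_text-analytics-with-postgresql_1a2b3c4d5e6f.html"
)
print(example.published, example.slug, example.medium_id)
```

Splitting from both ends keeps the title intact even if it contains extra underscores, and raising on malformed names makes stray files in the export folder easy to spot.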
Extract Text From HTML
To extract the text from each article, I’m using Beautiful Soup, the library that powers many web scrapers:
import bs4

with open(article) as article_file:
    soup = bs4.BeautifulSoup(article_file, "html.parser")

# remove the pre tags (used for code snippets)
for s in soup("pre"):
    s.decompose()

# remove the footer tag
for s in soup("footer"):
    s.decompose()…
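For readers who want to try the idea without an export at hand, here is a self-contained sketch that runs on an inline HTML string. The sample markup, the `extract_prose` helper, and the final `get_text` call are assumptions for illustration; the original snippet is truncated, so this may differ from how the article's text is actually collected.

```python
import bs4

# Hypothetical stand-in for one exported article file.
SAMPLE_HTML = """
<html><body>
<article>
<h1>Sample Title</h1>
<p>First paragraph of prose.</p>
<pre>print("code that should be ignored")</pre>
<p>Second paragraph.</p>
</article>
<footer>Exported from Medium</footer>
</body></html>
"""


def extract_prose(html: str) -> str:
    """Return the visible text of an HTML document without <pre> and <footer>."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Calling soup(...) is shorthand for soup.find_all(...); a list matches any of the names.
    for tag in soup(["pre", "footer"]):
        tag.decompose()
    # Assumption: plain text is gathered with get_text, joining the text nodes with spaces.
    return soup.get_text(separator=" ", strip=True)


text = extract_prose(SAMPLE_HTML)
print(text)
```

Dropping the code snippets and the footer first means later word and sentence counts reflect only the prose of each article.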