Predicting the publisher’s name from an article: A case study
--
In this article, I will be walking you through my approach (with some code) for predicting the publisher’s name from an article using various Google Cloud technologies.
But before I do so, let me make the problem statement a bit more concrete:
Imagine you are the moderator of an online news forum, responsible for determining the source (publisher) of each news article. Doing this manually is tedious, since you have to read every article and then infer its source. What if you could automate the task? Stripped down, the problem statement becomes: can I predict the publisher’s name from a given article?
The problem can now be modeled as a text classification problem. In the rest of the article, I will share the steps I took to solve it. In summary, the steps are:
- Gather data
- Preprocess the dataset
- Get the data ready for feeding to a sequence model
- Build, train and evaluate the model
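To make the middle steps more tangible, here is a minimal sketch of how raw text and publisher names could be turned into padded integer sequences and label ids, the usual input shape for a sequence model. It uses only the Python standard library. The sample texts, the publisher names, and the helper names (`tokenize`, `build_vocab`, `encode`) are all hypothetical and chosen for illustration; this is not the exact code or dataset used later in the article.

```python
import re
from collections import Counter

# Toy (text, publisher) pairs: purely illustrative, not the article's dataset.
SAMPLES = [
    ("Stocks rally as markets rebound", "publisher_a"),
    ("New phone launch draws crowds", "publisher_b"),
    ("Markets slide on rate fears", "publisher_a"),
]

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary tokens


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts, max_size=10000):
    """Map the most frequent tokens to integer ids, starting at 2."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of ids (truncate, then pad)."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


texts = [t for t, _ in SAMPLES]

# Assign each distinct publisher an integer class id.
labels = sorted({p for _, p in SAMPLES})
label_to_id = {p: i for i, p in enumerate(labels)}

vocab = build_vocab(texts)
X = [encode(t, vocab, max_len=6) for t in texts]
y = [label_to_id[p] for _, p in SAMPLES]
```

Here `X` holds one fixed-length row of token ids per text and `y` holds the matching publisher ids. In practice, a deep learning library's own tokenizer and padding utilities would likely replace these helpers, but the shape of the data stays the same.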
System setup
I will be using Google Cloud Platform (GCP) as my infrastructure. It makes it easier for me to configure the system I need for this project, from the data to the libraries for building the model(s).
I started by spinning up a Jupyter Lab instance, which comes as part of GCP’s AI Platform. To spin up a Jupyter Lab instance on AI Platform, you need a billing-enabled GCP project. You can navigate to the Notebooks section of AI Platform very easily:
After clicking on Notebooks, a dashboard like the following appears: