Predicting the publisher’s name from an article: A case study

Sayak Paul
Google Developer Experts
12 min read · Aug 1, 2019

In this article, I will walk you through my approach (with some code) to predicting the publisher’s name from an article using various Google Cloud technologies.

But before I do so, let me make the problem statement a bit more concrete:

Imagine you are the moderator of an online news forum, responsible for determining the source (publisher) of each news article. Doing this manually can be very tedious, as you would have to read every article and then work out its source. So, what if you could automate this task? At its simplest, the problem statement becomes: can I predict the publisher’s name from a given article?

The problem can now be modeled as a text classification problem. In the rest of the article, I will share the steps I took to solve it. In summary, the steps are:

  • Gather data
  • Preprocess the dataset
  • Get the data ready for feeding to a sequence model
  • Build, train and evaluate the model

Photo by Web Hosting on Unsplash
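
To make the middle steps of the list above a bit more tangible, here is a minimal sketch of what "preprocess" and "get the data ready for a sequence model" can look like in plain Python. It is not the exact pipeline used later in this article: the toy `articles` list, the publisher names, and the helpers `tokenize`, `build_vocab` and `encode` are all hypothetical and exist only for illustration. The idea is simply that each text becomes a fixed-length list of integer ids and each publisher becomes an integer class label.

```python
import re
from collections import Counter

# Toy, made-up examples standing in for the real dataset (hypothetical).
articles = [
    ("Stocks rally as markets open higher", "publisher_a"),
    ("Local team wins the championship game", "publisher_b"),
    ("Markets close lower after rally fades", "publisher_a"),
]

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary words


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts, max_words=10000):
    """Map the most frequent words to integer ids, starting after PAD/UNK."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab, max_len=8):
    """Turn text into a fixed-length list of ids (truncate, then right-pad)."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


texts = [t for t, _ in articles]
vocab = build_vocab(texts)

# Each distinct publisher gets an integer class id (sorted for determinism).
label_to_id = {name: i for i, name in enumerate(sorted({p for _, p in articles}))}

X = [encode(t, vocab) for t in texts]      # model inputs: padded id sequences
y = [label_to_id[p] for _, p in articles]  # model targets: publisher class ids
```

Here `X` and `y` are the kind of arrays a sequence model would be trained on. In practice a deep learning library's own tokenizer and padding utilities would usually replace these hand-written helpers.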

System setup

I will be using Google Cloud Platform (GCP) as my infrastructure. It makes it easier for me to configure everything I need for this project, from the data to the libraries for building the model(s).

I started off by spinning up a JupyterLab instance, which comes as part of GCP’s AI Platform. To do this, you need a billing-enabled GCP project. You can navigate to the Notebooks section of the AI Platform very easily:

Navigate to the Notebooks section on the AI Platform

After you click on Notebooks, a dashboard like the following appears:
