Predicting the publisher’s name from an article: A case study

Sayak Paul
Google Developer Experts
12 min read · Aug 1, 2019

In this article, I will be walking you through my approach (with some code) for predicting the publisher’s name from an article using various Google Cloud technologies.

But before I do so, let me make the problem statement a bit more concrete:

Imagine being the moderator of an online news forum, responsible for determining the source (publisher) of each news article. Doing this manually can be a very tedious task, as you would have to read every article and then infer its source. So, what if you could automate this task? At a simplified level, the problem statement becomes: can I predict the publisher’s name from a given article?

The problem can now be modeled as a text classification problem. In the rest of the article, I will share the steps I took to solve it. The summary of the steps looks like so:

  • Gather data
  • Preprocess the dataset
  • Get the data ready for feeding to a sequence model
  • Build, train and evaluate the model
Photo by Web Hosting on Unsplash

System setup

I will be using Google Cloud Platform (GCP) as my infrastructure. It also makes it easy to configure everything I need for this project, from the data to the libraries for building the model(s).

I started off by spinning up a JupyterLab instance, which comes as part of GCP’s AI Platform. To spin one up, you need a billing-enabled GCP project. You can navigate to the Notebooks section on the AI Platform very easily:

Navigate to the Notebooks’ section on the AI Platform

After clicking on Notebooks, a dashboard like the following appears:

Notebooks dashboard on the AI Platform

I had decided to use TensorFlow 2.0 for this project and chose the instance type accordingly:

TensorFlow 2.0 Notebook instance with a Tesla K80 GPU

After clicking on With 1 NVIDIA Tesla K80, you are shown a basic configuration window. I kept the defaults, just ticked the GPU driver installation box and then clicked CREATE.

AI Platform Notebook configuration

The instance took some time to provision (~5 minutes). Once it was ready, I just had to click OPEN JUPYTERLAB to access the notebook.

Notebooks instances

I used BigQuery in this project, directly from the notebook. So, as soon as the notebook instance was ready, I opened up a terminal to install the BigQuery client library:

pip3 install --upgrade google-cloud-bigquery
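
The client library ships with a BigQuery cell magic that I use in the queries below. A minimal sketch of loading it inside a notebook cell follows; on AI Platform Notebooks the extension may already be available by default:

# Load the BigQuery cell magic used in the queries below
%load_ext google.cloud.bigquery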

That’s it for the system setup part.

BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and machine learning built-in.

Where do I get the data?

It may not always be the case that the data will be readily available for the problem you’re trying to solve. Fortunately, in my case, I found a dataset that was good enough to start with.

The dataset I’m going to use is already available as a BigQuery public dataset (link). But it needs to be shaped a bit with respect to the problem statement. We’ll come to this later.

This dataset contains all stories and comments from Hacker News from its launch in 2006 to present. Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.

To get the data right in my notebook instance, I first configured the GCP Project within the notebook’s environment:

# Set your Project ID
import os
PROJECT = 'your-project-name'
os.environ['PROJECT'] = PROJECT

I was now ready to run a query which would access the BigQuery dataset:

%%bigquery --project $PROJECT data_preview
SELECT
  url, title, score
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND score > 10
  AND LENGTH(url) > 0
LIMIT 10

Let me break down a few things here:

  • %%bigquery --project $PROJECT data_preview: %%bigquery is a cell magic that lets you run BigQuery-compatible SQL queries from your notebook. --project $PROJECT tells BigQuery which GCP project I’m using. data_preview is the name of the Pandas DataFrame to which the query results are saved (isn’t this very useful?).
  • hacker_news is the name of the BigQuery public dataset and stories is the name of the table residing inside it.
  • I selected only three columns: the url of the article, the title of the article and the score of the article. I used the article titles to determine their sources.

I included only those entries where the title is longer than 10 characters, the score is greater than 10 and the URL is non-empty. The query processed 402 MB of data.

Here are the first five rows from the DataFrame data_preview:

First five rows from the resulting query

The data collection part is now done for the project. At this stage, I was good to proceed to the next steps: cleaning and preprocessing!

Beginning data wrangling

The problem with the current data is that instead of the full url I need the source of the URL. For example, https://github.com/Groundworkstech/Submicron should appear as github. I also wanted to rename the url column to source. But before doing that, I looked at how the titles are distributed across the different sources.

%%bigquery --project $PROJECT source_num_articles
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  COUNT(title) AS num_articles
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
GROUP BY
  source
ORDER BY
  num_articles DESC

BigQuery provides a number of functions like ARRAY_REVERSE(), REGEXP_EXTRACT() and so on for tasks like this. With the above query, I first extracted the domain from each URL (the part between :// and the next /), then split it on . and reversed the parts to pick out the source. The preview of the source_num_articles DataFrame looks like so:

Number of articles grouped by their sources
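
To make the extraction logic concrete, here is a rough Python equivalent of what the query does for a single URL. This is only for illustration; in the project the whole thing happens inside BigQuery, and extract_source is a hypothetical helper:

import re

def extract_source(url):
    # Grab the domain between '://' and the next '/', e.g. 'github.com'
    match = re.search(r'://(.[^/]+)/', url)
    if match is None:
        return None
    # Reverse the dot-separated parts and take the second element,
    # mirroring ARRAY_REVERSE(SPLIT(...))[OFFSET(1)] in the query
    parts = match.group(1).split('.')[::-1]
    return parts[1] if len(parts) > 1 else None

print(extract_source('https://github.com/Groundworkstech/Submicron'))  # github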

But the project needed different data: a dataset containing the articles along with their sources. The stories table contains many article sources other than the ones shown above. So, to keep things a bit more lightweight, I decided to go with these five: blogspot, github, techcrunch, youtube and nytimes.

%%bigquery --project $PROJECT full_data
SELECT
  source,
  LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title
FROM (
  SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    title
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR
       source = 'techcrunch' OR source = 'blogspot' OR
       source = 'youtube')

Now, I have the data in good shape:

Article titles with their sources

I tend to spend a generous amount of time with the data once it is somewhat ready for modeling, and this was the perfect time to perform some EDA.

Data understanding

I start the process of EDA by investigating the dimensions of the dataset. In this case, the dataset prepared in the above step had 168,437 rows and 2 columns, as can be seen in the preview.

Following is the class distribution of the articles:

Class distribution of the articles
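
For reference, a distribution like the one above can be pulled straight from the DataFrame with a value_counts() call (a quick sketch using the full_data DataFrame created by the query):

# Number of article titles per source
full_data['source'].value_counts()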

Fortunately, there were no missing values in the dataset, and the following little snippet confirmed that:

# Missing value inspection
full_data.isna().sum()

A common question that arises while dealing with text data like this is — how is the length of the titles distributed?

Fortunately, Pandas provides a lot of useful functions to answer questions like this:

full_data['title'].apply(len).describe()

Basic statistics about the length of the article titles

We have a minimum length of 11 and a maximum length of 138. I will come to this again in a moment.

EDA is incomplete without plots! In this case, a very useful plot could be Count vs. Title lengths:

A plot capturing the length of the titles with their respective counts
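
A plot along those lines can be produced with Matplotlib. This is a minimal sketch; the styling of the original figure may differ:

import matplotlib.pyplot as plt

# Distribution of the title lengths
text_lens = full_data['title'].apply(len)
plt.hist(text_lens, bins=50)
plt.xlabel('Title length')
plt.ylabel('Count')
plt.show()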

Almost a bell curve, isn’t it? From the plot, it is evident that the counts are skewed for title lengths < 20 and > 80. So, I may have to be careful in tackling them. This led me to perform some manual inspections to figure out:

  • how many titles are at the minimum title length (11)?
  • how many titles have the maximum length (138)?

Let’s find out.

# Title lengths as a Pandas Series
text_lens = full_data['title'].apply(len)
(text_lens <= 11).sum(), (text_lens == 138).sum()

I got 513 and 1, respectively. I decided to remove the entry with the maximum title length from the dataset, since it’s just one row:

full_data = full_data[text_lens < 138].reset_index(drop=True)

The last thing I did in this step was splitting the dataset into train/validation/test sets in a ratio of 80:10:10.

# 80% for train
train = full_data.sample(frac=0.8)
full_data.drop(train.index, axis=0, inplace=True)
# 10% for validation
valid = full_data.sample(frac=0.5)
full_data.drop(valid.index, axis=0, inplace=True)
# 10% for test
test = full_data
print(train.shape, valid.shape, test.shape)

The new data dimensions are: ((110070, 2), (13759, 2), (13759, 2)). Just to be a little more certain on the class distribution, I checked it again across the three sets:

Class distribution in train/validation/test sets
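
Here is one way to do that check (a sketch; passing normalize=True returns proportions instead of raw counts):

# Compare the class proportions across the three splits
for name, split in [('train', train), ('valid', valid), ('test', test)]:
    print(name)
    print(split['source'].value_counts(normalize=True), '\n')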

The distributions are roughly the same across the three sets. I then serialized the three DataFrames to CSV files.

train.to_csv('data/train.csv', index=False)
valid.to_csv('data/valid.csv', index=False)
test.to_csv('data/test.csv', index=False)

There was still some data preprocessing left to do. Since computers only understand numbers, I needed to prepare the data accordingly before streaming it to the machine learning model:

  • Encoding the classes to some numbers (label encoding/one-hot encoding)
  • Creating a vocabulary from the training corpus — tokenization
  • Numericalizing the titles and padding them to a fixed length
  • Preparing the embedding matrix with respect to pre-trained embeddings like GloVe.

Let’s proceed accordingly.

Additional data preprocessing

First, I defined the constants that would be necessary here:

# Label encode
CLASSES = {'blogspot': 0, 'github': 1, 'techcrunch': 2, 'nytimes': 3, 'youtube': 4}
# Maximum vocabulary size used for tokenization
TOP_K = 20000
# Sentences will be truncated/padded to this length
MAX_SEQUENCE_LENGTH = 50

Now, I defined a tiny helper function which takes a Pandas DataFrame and:

  • prepares a list of titles from the DataFrame (needed for further preprocessing)
  • takes the sources from the DataFrame, maps them to integers and returns them as a NumPy array

import numpy as np

def return_data(df):
    # Titles as a list of strings, sources mapped to integer labels
    return list(df['title']), np.array(df['source'].map(CLASSES))

# Apply it to the three splits
train_text, train_labels = return_data(train)
valid_text, valid_labels = return_data(valid)
test_text, test_labels = return_data(test)
train_text[0], train_labels[0]

The result was just what I expected:

('nerds love twitch because there  they can be heroes', 2)

I used the text and sequence modules provided by tensorflow.keras.preprocessing to tokenize and pad the titles. First, I went with tokenization:

# Create a vocabulary from the training corpus
from tensorflow.keras.preprocessing import text, sequence

tokenizer = text.Tokenizer(num_words=TOP_K)
tokenizer.fit_on_texts(train_text)

I used the GloVe embeddings to represent the words in the titles as dense vectors. The embeddings file is more than 650 MB, and the GCP team has it stored in a Google Cloud Storage bucket. This was incredibly helpful since it allowed me to pull it into the notebook very quickly with the gsutil command (available in the Notebooks):

!gsutil cp gs://cloud-training-demos/courses/machine_learning/deepdive/09_sequence/text_classification/glove.6B.200d.txt glove.6B.200d.txt

I needed a helper function to map the words in the vocabulary to their GloVe embeddings:

def get_embedding_matrix(word_index, embedding_path, embedding_dim):
    # Every line of the GloVe file contains a word followed by its vector values
    embedding_matrix_all = {}
    with open(embedding_path) as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embedding_matrix_all[word] = coefs

    # Prepare embedding matrix with just the words in our word_index dictionary
    num_words = min(len(word_index) + 1, TOP_K)
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, i in word_index.items():
        if i >= TOP_K:
            continue
        embedding_vector = embedding_matrix_all.get(word)
        if embedding_vector is not None:
            # Words not found in the embedding index will be all-zeros
            embedding_matrix[i] = embedding_vector

    return embedding_matrix
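
For reference, a call like the following would build the matrix for the tokenizer’s vocabulary. This is shown only for illustration; in the model definition below the function is called inline:

# 200-dimensional GloVe vectors for the top-K words in the vocabulary
embedding_matrix = get_embedding_matrix(tokenizer.word_index, 'glove.6B.200d.txt', 200)
embedding_matrix.shape  # (min(len(word_index) + 1, TOP_K), 200)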

That is all I needed to stream the text data to the yet-to-be-built machine learning model.

Building the Horcrux: A sequential language model

I generally specify many hyperparameter values towards the very beginning of the modeling process.

# Specify the hyperparameters
filters = 64
dropout_rate = 0.2
embedding_dim = 200
kernel_size = 3
pool_size = 3
word_index = tokenizer.word_index
embedding_path = 'glove.6B.200d.txt'

I used a Convolutional Neural Network based model which would basically start by convolving on the embeddings fed to it. Locality is important in sequential data and CNNs would allow me to capture that effectively. The trick is to do all the fundamental CNN operations (convolution, pooling) in 1D.

I followed the typical Keras paradigm — I first instantiated the model, then defined the topology and then compiled the model accordingly.

import tensorflow as tf
from tensorflow.keras import models
from tensorflow.keras.layers import (Embedding, Dropout, Conv1D, MaxPooling1D,
                                     GlobalAveragePooling1D, Dense)

# Create model instance
model = models.Sequential()
num_features = min(len(word_index) + 1, TOP_K)

# Add embedding layer - GloVe embeddings
model.add(Embedding(input_dim=num_features,
                    output_dim=embedding_dim,
                    input_length=MAX_SEQUENCE_LENGTH,
                    weights=[get_embedding_matrix(word_index,
                                                  embedding_path, embedding_dim)],
                    trainable=True))
model.add(Dropout(rate=dropout_rate))
model.add(Conv1D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu',
                 bias_initializer='he_normal',
                 padding='same'))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(Conv1D(filters=filters * 2,
                 kernel_size=kernel_size,
                 activation='relu',
                 bias_initializer='he_normal',
                 padding='same'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(rate=dropout_rate))
model.add(Dense(len(CLASSES), activation='softmax'))

# Compile model with learning parameters
optimizer = tf.keras.optimizers.Adam(lr=0.001)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['acc'])

The architecture looks like so:

The architecture of the CNN-based text classification model
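
If you want to inspect the layer shapes yourself, Keras can print a textual summary of the same architecture:

# Print a layer-by-layer summary of the network
model.summary()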

One more step remaining at this point was numericalizing the titles and padding them to a fixed length.

# Preprocess the train, validation and test sets
# Tokenize and pad sentences
preproc_train = tokenizer.texts_to_sequences(train_text)
preproc_train = sequence.pad_sequences(preproc_train, maxlen=MAX_SEQUENCE_LENGTH)
preproc_valid = tokenizer.texts_to_sequences(valid_text)
preproc_valid = sequence.pad_sequences(preproc_valid, maxlen=MAX_SEQUENCE_LENGTH)
preproc_test = tokenizer.texts_to_sequences(test_text)
preproc_test = sequence.pad_sequences(preproc_test, maxlen=MAX_SEQUENCE_LENGTH)

And finally, I was prepared to kickstart the training process!

H = model.fit(preproc_train,
              train_labels,
              validation_data=(preproc_valid, valid_labels),
              batch_size=128,
              epochs=10,
              verbose=1)

Here’s a snap of the training log:

Loss and accuracy from the last three epochs

The network does overfit and the training graph also confirms it:

Training graph of the network
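
A graph like that can be reproduced from the History object returned by fit(). This is a minimal sketch; the styling of the original figure may differ:

import matplotlib.pyplot as plt

# Training vs. validation accuracy and loss per epoch
plt.plot(H.history['acc'], label='train_acc')
plt.plot(H.history['val_acc'], label='val_acc')
plt.plot(H.history['loss'], label='train_loss')
plt.plot(H.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()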

Overall, the model yields an accuracy of ~66%, which is not up to the mark given today’s state of the art. But it was a good start. I also wrote a little function to use the network for predictions on individual samples:

# Helper function to test on single samples
def test_on_single_sample(text):
    category = None
    text_tokenized = tokenizer.texts_to_sequences(text)
    text_tokenized = sequence.pad_sequences(text_tokenized, maxlen=50)
    prediction = int(model.predict_classes(text_tokenized))
    for key, value in CLASSES.items():
        if value == prediction:
            category = key

    return category

I then prepared the samples accordingly:

# Prepare the samples
github=['Invaders game in 512 bytes']
nytimes = ['Michael Bloomberg Promises $500M to Help End Coal']
techcrunch = ['Facebook plans June 18th cryptocurrency debut']
blogspot = ['Android Security: A walk-through of SELinux']

Finally, I tested test_on_single_sample() on the above samples:

for sample in [github, nytimes, techcrunch, blogspot]:
    print(test_on_single_sample(sample))

And the results:

github
techcrunch
techcrunch
blogspot

That was it for this project. In the next section, I comment on the future directions I decided to take for this project before finally wrapping up.

Future directions and references

Just as in the computer vision domain we expect models to be robust against certain transformations like rotation and translation, in the sequence domain it’s important that models be robust to changes in the length of the pattern. Keeping that in mind, here’s a list of what I would try in the near future:

  • Try other sequence models
  • A bit of hyperparameter tuning
  • Learn the embeddings from scratch
  • Try different embeddings like universal sentence encoder, nnlm-128 and so on

After I have a decent model (with at least ~80% accuracy), I plan to serve the model as a REST API and deploy it on App Engine.


A Jupyter Notebook version of the article is available here in case you want to play around.

That brings us to the end of the article. I wrote it to walk readers through the approach I generally take for a machine learning problem. Of course, there’s more to it, but the steps I showed above are the most important ones for me. Thank you for taking the time to read the article, and I will see you next time :)

I would like to thank the Google Developer Experts program team for providing us with GCP credits; this project is indeed a GCP-credit-supported activity.

About the author

Sayak loves everything deep learning. He goes by the motto of understanding complex things and helping people understand them as easily as possible. Sayak is an extensive blogger and all of his blogs can be found here. He is also working with his friends on applying deep learning to phonocardiogram classification. He is a Google Developer Expert in Machine Learning and an Intel Software Innovator. He is always open to discussing novel ideas and taking them forward to implementation. You can connect with Sayak on LinkedIn and Twitter.
