Scraping text data into BigQuery and building a Keras RNN classifier model from DOJ press releases

John Bencina
Published in Data Insights, Jul 29, 2018 (8 min read)

In this article, I will cover how you can scrape website text using Python’s BeautifulSoup; load and read the data from BigQuery; and quickly build an RNN classifier model using Keras. This post skips the theory and explanations of deep learning, and instead focuses on how to actually implement something of use.

You can check out the full repository on GitHub https://github.com/jbencina/dojreleases for greater detail and to download the code itself.

Scraping the data

The data we are pulling comes from the Department of Justice’s “Briefing Room” page located here. Each page contains 25 items which are tagged as “Press Release” or “Speech”. What we will do is download the text for each release and save it locally.

The focus of this article is not on web scraping, and for more detail you can read the commented code in scraper.py on the GitHub repo. In a nutshell we use Requests to get the webpage source code and have BeautifulSoup parse the content. In our case, the content we want consists of releases such as the ones found here https://www.justice.gov/opa/pr/kansas-city-area-laboratory-owner-convicted-illegally-storing-hazardous-waste.
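As a rough illustration of the parsing step, the sketch below runs BeautifulSoup over a small inline HTML string instead of a live page. The markup, the class names (release-title, release-body) and the parse_release helper are invented for this example; the real selectors live in scraper.py and depend on the DOJ site's HTML at the time of scraping.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded release page.
# In a real scraper the HTML would come from requests.get(url).text.
SAMPLE_HTML = """
<html><body>
  <h1 class="release-title">Lab Owner Convicted</h1>
  <div class="release-body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

def parse_release(html):
    """Extract a title and body text from one release page (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="release-title")
    body = soup.find("div", class_="release-body")
    paragraphs = body.find_all("p") if body else []
    return {
        "title": title.get_text(strip=True) if title else None,
        "contents": " ".join(p.get_text(strip=True) for p in paragraphs),
    }

release = parse_release(SAMPLE_HTML)
```

Returning None or an empty string when an element is missing mirrors the blank attributes described for releases that lack certain fields.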

Example DOJ release

The script saves each release as a JSON file with six possible fields, like the example below. The DOJ does not always provide an id, topic, or component, and in those cases the attribute is left blank. It is also possible that the CSS or location of those elements changed slightly between releases, so the scraper missed them.

Example DOJ release saved as JSON. It seems for this specific article, the ID is written inside the text contents

Uploading to Google BigQuery

There are only about 13K press releases, which come to around 50MB of data. While that is not the typical use case for BQ, the platform does make it easier to share and query this type of data, so I opted to load it into the cloud anyway.

Uploading to BigQuery is easy and can be done quickly via the web UI at https://bigquery.cloud.google.com. The Python script we used produces a combined.json file, which is a newline-delimited JSON file of all records. The BQ UI has a 10MB file limit, so I just loaded the JSON file into Google Cloud Storage and gave BigQuery the file location. I let BQ infer the schema.
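For readers unfamiliar with the newline-delimited JSON format, here is a minimal sketch of writing and reading it with the standard library. The sample records and the write_ndjson / read_ndjson helpers are hypothetical; only the general "one JSON object per line" layout is what BigQuery expects.

```python
import io
import json

# Hypothetical records; the values are made up for illustration.
records = [
    {"id": "18-001", "contents": "Example text one.", "topics": ["Health Care Fraud"]},
    {"id": None, "contents": "Example text two.", "topics": []},
]

def write_ndjson(rows, fh):
    """Write one compact JSON object per line."""
    for row in rows:
        fh.write(json.dumps(row) + "\n")

def read_ndjson(fh):
    """Parse each non-empty line as its own JSON document."""
    return [json.loads(line) for line in fh if line.strip()]

# An in-memory buffer stands in for combined.json here.
buffer = io.StringIO()
write_ndjson(records, buffer)
buffer.seek(0)
round_trip = read_ndjson(buffer)
```

Because every line is a complete JSON document, the file can be streamed or split without parsing it as a whole.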

The result is our data now ready to be used in BQ. My copy is publicly accessible here https://bigquery.cloud.google.com/table/jbencina-144002:doj.press_releases

Preprocessing for Keras Model

Full code for the Keras model can be found in the Keras Classifier Model.ipynb notebook included in the repo here.

We start by querying BQ using Pandas for all texts which have at least 1 topic assigned. We use the UNNEST function to explode the topics so we have one row per topic with the content repeating. Run pip install pandas-gbq if you have not done so first.

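import pandas as pd

# One row per (text, topic) pair; UNNEST repeats the contents for each topic
sql = """SELECT
  UPPER(contents) contents,
  UPPER(topic_item) topic_item
FROM `jbencina-144002.doj.press_releases`,
UNNEST(topics) topic_item
WHERE ARRAY_LENGTH(topics) > 0"""

# Requires pandas-gbq; assign the result so the DataFrame can be used later
df = pd.read_gbq(sql, dialect='standard')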
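To make the effect of UNNEST concrete, here is a plain-Python sketch of the same row explosion on a few made-up rows. The rows data and the unnest_topics helper are hypothetical; BigQuery performs this step server-side.

```python
# Hypothetical rows shaped like the press_releases table.
rows = [
    {"contents": "release a", "topics": ["Fraud", "Tax"]},
    {"contents": "release b", "topics": []},
    {"contents": "release c", "topics": ["Immigration"]},
]

def unnest_topics(rows):
    """One output row per (contents, topic) pair, upper-cased like the query."""
    return [
        {"contents": r["contents"].upper(), "topic_item": t.upper()}
        for r in rows
        for t in r["topics"]
    ]

exploded = unnest_topics(rows)
```

Note that the release with an empty topic list produces no output rows at all. As far as I can tell, the comma join with UNNEST behaves the same way in BigQuery, so the array-length filter mostly documents intent.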

I decided to try predicting the top 20 topics. Since each text can be a mixture of topics, I used the pd.get_dummies() function to one-hot encode each label so that the target becomes a vector like [0,0,1,0,0,1,0]. Our input feature is simply the text.
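One possible way to get from the exploded rows to a multi-hot label vector per text is sketched below. The toy frame, the top_n value of 2 (the article uses 20) and the groupby step are my assumptions about how this could be done, not necessarily what the notebook does.

```python
import pandas as pd

# Hypothetical exploded frame: one row per (text, topic) pair.
df = pd.DataFrame({
    "contents": ["A", "A", "B", "C", "D", "D"],
    "topic_item": ["FRAUD", "TAX", "FRAUD", "IMMIGRATION", "TAX", "FRAUD"],
})

top_n = 2  # the article keeps the top 20 topics
top_topics = list(df["topic_item"].value_counts().head(top_n).index)

# One-hot encode every topic row, then collapse rows belonging to the same text.
dummies = pd.get_dummies(df["topic_item"])
labels = dummies.groupby(df["contents"]).max()[top_topics].astype(int)
```

With this approach, a text whose topics all fall outside the top N (text "C" here) ends up with an all-zero label vector.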

Keras has built-in text preprocessing that does two great things for us. First, it maps each word to an integer and allows us to cap the vocabulary at the most frequent words. Words that fall outside this cap get converted to our <UNK> token. Second, Keras has the pad_sequences() function, which ensures all of our sequences are the same length. Shorter sequences are padded at the front with 0s.

Important: num_words is tricky in this context. Say we choose 5,000. Keras actually picks the top 4,998 words with [index 1] being saved for our <UNK> token and [index 0] being saved for padding.
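The index arithmetic can be checked with a small pure-Python imitation of the tokenizer. This is a simplified sketch of the behavior described above, not Keras's actual code: build_index and to_sequence are hypothetical helpers, and real tokenization also strips punctuation.

```python
from collections import Counter

def build_index(texts, oov_token="<UNK>"):
    """Rank words by frequency; index 0 is left for padding, index 1 for OOV."""
    counts = Counter(w for text in texts for w in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<UNK>"):
    """Map words to integers, sending rare or unseen words to the OOV index."""
    oov = word_index[oov_token]
    seq = []
    for w in text.lower().split():
        idx = word_index.get(w, oov)
        seq.append(idx if idx < num_words else oov)
    return seq

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_index(texts)
num_words = 4
seq = to_sequence("the dog sat on cat", word_index, num_words)
kept_words = [w for w, i in word_index.items() if 2 <= i < num_words]
```

With num_words set to 4, only two real words survive ("the" and "cat"), which is the same num_words minus 2 relationship as 5,000 giving 4,998.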

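from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def text_to_seq(x_train, x_test, max_len, num_words):
    # Fit the vocabulary on the training texts only
    t = Tokenizer(num_words=num_words, oov_token='<UNK>')
    t.fit_on_texts(x_train)
    train_seq = t.texts_to_sequences(x_train)
    train_pad = pad_sequences(train_seq, maxlen=max_len)
    test_seq = t.texts_to_sequences(x_test)
    test_pad = pad_sequences(test_seq, maxlen=train_pad.shape[1])

    # Map '<PAD>' to 0, then keep the first num_words-1 entries of word_index
    vocab = {k: t.word_index.get(k, 0) for k in ['<PAD>'] + list(t.word_index)[:num_words-1]}
    return train_pad, test_pad, vocab, t

# n_words is passed as the sequence length, n_vocab as the vocabulary cap
train_pad, test_pad, vocab, t = text_to_seq(x_train, x_test, n_words, n_vocab)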

For the RNN to work, the inputs for a batch must be the same size. The easiest approach is to make all inputs the same length which is why we use padding. The network should learn that 0s are padded values and ignore them. You can also specify that 0 represents a masked value and gradients will not be computed for them.
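To see what the padding step produces, here is a small NumPy sketch of front-padding and front-truncation, which to my knowledge are the defaults of pad_sequences(). The pad_pre helper is a hypothetical stand-in, not the Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # keep only the last maxlen tokens
        if trimmed:
            out[i, -len(trimmed):] = trimmed
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
```

The short sequence becomes [0, 0, 5, 6], while the long one loses its first token and becomes [2, 3, 4, 5], so every row in the batch has the same length.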

Leveraging transfer learning

We only have several thousand documents, which is typically well below what deep learning requires for text training. To get around that, we are going to leverage pre-trained word embeddings made available by Stanford’s GloVe project https://nlp.stanford.edu/projects/glove/. This should help us shortcut training by giving our network a head start on understanding word representations. I ended up using the 100D vectors trained on Wikipedia and Gigaword (glove.6B.100d.txt).

Loading the embeddings into Keras takes some preprocessing. Our Keras embedding layer expects a weight matrix of size N x D where:

N = <UNK> + <PAD> + 4,998 words

D= Embedding dimensions (100 in our case)

Not all of the words in our corpus exist in the GloVe weights file. What we do instead is load all of the GloVe weights where a match is found. Then for each column, compute the mean and standard deviation, filling in the holes with a random sample from that distribution.

import numpy as np
import pandas as pd

def load_glove(path, vocab):
    # quoting=3 (QUOTE_NONE) keeps quote characters as ordinary tokens
    glove = pd.read_table(path, index_col=0, header=None, sep=' ', quoting=3, na_values=None, keep_default_na=False)
    glove.columns = range(glove.shape[1])

    # Limit GloVe to our vocab and fill in missing words from a normal distribution
    col_means = glove.mean()
    col_std = glove.std()
    word_vec = glove.reindex(list(vocab))
    for col in word_vec.columns:
        mask = word_vec[col].isna()
        word_vec.loc[mask, col] = np.random.normal(col_means[col], col_std[col], size=mask.sum())
    return word_vec

glove = load_glove('glove.6B.100d.txt', vocab)

Here is a sample of the cleaned-up GloVe data. Note that <PAD> and <UNK> don’t exist in the GloVe weights file; however, we see they have been populated with values sampled from each column’s normal distribution. The same was done for other unseen tokens.
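The same idea can be checked on a toy scale without downloading the GloVe file. The sketch below parses three made-up 4-dimensional vectors from an in-memory string and builds an N x D matrix whose row order follows the vocabulary indices. TOY_GLOVE, toy_embedding_matrix and the tiny vocab are all hypothetical, and NumPy's default standard deviation differs slightly from pandas's.

```python
import io
import numpy as np

# Made-up vectors in GloVe's "word v1 v2 ..." text layout.
TOY_GLOVE = "the 0.1 0.2 0.3 0.4\ncat 0.5 0.6 0.7 0.8\nsat 0.9 1.0 1.1 1.2\n"

def toy_embedding_matrix(glove_text, vocab, seed=0):
    """Build an N x D matrix; row i holds the vector for the token with index i."""
    vectors = {}
    for line in io.StringIO(glove_text):
        word, *values = line.rstrip().split(" ")
        vectors[word] = np.array([float(v) for v in values])
    all_vecs = np.stack(list(vectors.values()))
    mean, std = all_vecs.mean(axis=0), all_vecs.std(axis=0)
    rng = np.random.default_rng(seed)
    matrix = np.empty((len(vocab), all_vecs.shape[1]))
    for word, idx in vocab.items():
        # Known words copy their vector; unknown ones get a per-column random draw.
        matrix[idx] = vectors[word] if word in vectors else rng.normal(mean, std)
    return matrix

vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3}
emb = toy_embedding_matrix(TOY_GLOVE, vocab)
```

The key property to verify in the real pipeline is the same as here: the row position of each word in the weight matrix must equal the integer the tokenizer assigns to it.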

Building the model

Building a Keras model is super easy. We create a few layers using the Sequential() model:

  1. Embedding: Create an Embedding() layer that has the same shape as our GloVe weights, and initialize it with those weights. Masking is set to False because I’m using CuDNNLSTM, which doesn’t support masking in Keras.
  2. LSTM: Use CuDNNLSTM() or LSTM() (for those without a GPU) to create the recurrent layer. I wrapped it in Bidirectional(), which causes the RNN to run in both directions, concatenating the forward and backward pass into one long feature vector.
  3. Dense: Add a Dense() layer with ReLU activation followed by dropout and another Dense() for our output predictions.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, CuDNNLSTM, Dense, Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from keras.metrics import top_k_categorical_accuracy

model = Sequential([
    # Rows of the GloVe matrix become the initial embedding weights
    Embedding(glove.shape[0], glove.shape[1], input_length=train_pad.shape[1], mask_zero=False, weights=[glove.values]),
    Bidirectional(CuDNNLSTM(128, return_sequences=False)),
    Dense(32, activation='relu'),
    Dropout(0.2),
    # Sigmoid gives an independent probability per topic
    Dense(n_topics, activation='sigmoid')
])
optimizer = Adam(lr=0.001)
es = EarlyStopping(patience=5)

def top_k(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=2)

I chose the Adam() optimizer as it seemed to give the best results, and added EarlyStopping() to end training once the validation loss has gone 5 epochs without improving. To measure success, I’m using top_k_categorical_accuracy() to check whether our top 2 predictions contain the true result.
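To clarify what the top-2 metric measures, here is a NumPy sketch of the computation as I understand it: take the argmax of the true vector as the target and check whether it appears among the k highest predictions. The top_k_accuracy function and the toy arrays are illustrative assumptions, not the Keras source.

```python
import numpy as np

def top_k_accuracy(y_true, y_pred, k=2):
    """Fraction of rows whose target label is among the k highest scores."""
    target = np.argmax(y_true, axis=-1)          # one target label per row
    top_k = np.argsort(y_pred, axis=-1)[:, -k:]  # indices of the k largest scores
    hits = (top_k == target[:, None]).any(axis=-1)
    return hits.mean()

y_true = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 1],
                   [0, 0, 1, 0]])
y_pred = np.array([[0.1, 0.7, 0.2, 0.0],
                   [0.2, 0.1, 0.3, 0.9],
                   [0.4, 0.3, 0.2, 0.1]])
score = top_k_accuracy(y_true, y_pred, k=2)
```

One caveat worth knowing: with multi-hot targets, argmax picks only the first positive label. The second row above counts as a miss even though its other true label (index 3) received the highest score, so this metric can understate multi-label performance.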

Training the model

Training the model is pretty simple. I use binary_crossentropy loss because each text could have more than one label. A smaller batch size seemed to work well too. On my NVIDIA GeForce GTX 1070, this training took 10–12 seconds per epoch.

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=[top_k])
train_history = model.fit(x=train_pad, y=y_train, batch_size=32,
                          validation_data=(test_pad, y_test), epochs=100, verbose=2, callbacks=[es])
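For intuition on why binary cross-entropy fits the multi-label setup, the sketch below computes it by hand in NumPy: each label is scored as its own yes/no problem and the per-label losses are averaged. The binary_crossentropy function here is a hand-written illustration with a clipping constant I chose, not the Keras implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean per-label log loss for each row; labels are treated independently."""
    p = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    per_label = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return per_label.mean(axis=-1)

# A text with two true labels out of three.
y_true = np.array([[1.0, 0.0, 1.0]])
good = binary_crossentropy(y_true, np.array([[0.9, 0.1, 0.8]]))
bad = binary_crossentropy(y_true, np.array([[0.2, 0.9, 0.1]]))
```

Because the sigmoid outputs do not have to sum to 1, the model can assign high probability to several topics at once, which a softmax with categorical cross-entropy would discourage.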

You can see that in the beginning we have around 37% top-2 accuracy.

But by the 30th epoch, we are around 83%. Early stopping terminated training at this point.

It stopped because the validation loss had plateaued by the 20th epoch while the training loss kept decreasing. This is a sign we are beginning to overfit and should not train further.
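The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the callback's logic under my assumptions (monitoring validation loss, no minimum delta); early_stop_epoch and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # A new best resets the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [0.50, 0.40, 0.35, 0.36, 0.35, 0.37, 0.36, 0.38, 0.40]
stopped_at = early_stop_epoch(losses, patience=5)
```

In this toy run the best loss occurs at epoch 3 and training stops at epoch 8, after five consecutive epochs without a strict improvement. Note that merely matching the best value (epoch 5) does not reset the counter.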

Testing the model

Let’s run through some random tests and see how we did.

Test 1

Our first test shows that the model correctly identified this release was talking about Human Trafficking though the confidence was low. Immigration was also in there which I suspect comes from words like “Mexican, alien, United States” contained in the release.

Test 2

Our second test did great and shows the model is pretty confident that this is about Health Care Fraud. It also slightly picked up on Stopfraud, even though that prediction was very low.

Test 3

Our last test has very low predictions for all values. This is actually good, because the true label was not in the top 20. Thus its label vector would’ve been all 0s.
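A small helper like the one below is one way to turn the model's sigmoid outputs into a readable ranking for spot checks such as these. The top_predictions function, the topic names and the probabilities are hypothetical; the real values would come from model.predict().

```python
import numpy as np

def top_predictions(probs, topic_names, k=2):
    """Return the k highest-scoring (topic, probability) pairs, best first."""
    order = np.argsort(probs)[::-1][:k]
    return [(topic_names[i], float(probs[i])) for i in order]

topic_names = ["HEALTH CARE FRAUD", "IMMIGRATION", "HUMAN TRAFFICKING"]
probs = np.array([0.05, 0.20, 0.35])  # made-up scores for one release
ranked = top_predictions(probs, topic_names, k=2)
```

Looking at the raw probabilities alongside the ranking helps distinguish a confident prediction from a case where every score is low.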

Next steps

Hopefully this article and the repository show some practical end-to-end uses for scraping web data and building a text classifier. Given the low volume of data, you are likely better off trying simpler ML models like Logistic Regression or SVMs against TF or TF-IDF representations with n-grams of each text. You could also ditch the RNN/Embedding aspect of the model and simply train a fully connected network against a TF/TF-IDF matrix.
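For readers who have not used TF-IDF with n-grams, here is a dependency-free sketch of the idea. It uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is my understanding of a common default; the ngrams and tfidf helpers and the three toy documents are assumptions for illustration only.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """All contiguous word n-grams of length 1..n_max."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_max=2):
    """Return one {term: weight} dict per document, L2-normalized."""
    term_lists = [ngrams(d.lower().split(), n_max) for d in docs]
    n_docs = len(docs)
    df = Counter(t for terms in term_lists for t in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for terms in term_lists:
        tf = Counter(terms)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["health care fraud", "tax fraud", "health care"]
vectors = tfidf(docs)
```

Terms that appear in fewer documents get larger weights: in the second document "tax" outweighs "fraud", and bigrams such as "health care" become features of their own.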

You can find the code for this exercise here https://github.com/jbencina/dojreleases

You can find the data for this exercise here https://bigquery.cloud.google.com/table/jbencina-144002:doj.press_releases

Edit Bonus: Linear Model

I closed by saying that a linear model may perform better given the lack of data. Rather than end on a cliffhanger, I updated the notebook to include a simple Logistic Regression + TF-IDF pipeline. As anticipated, we’re seeing slightly better results with much less work (~4–5% top-2 accuracy improvement). It is likely possible to achieve similar results with the RNN by further exploring the hyper-parameter space and network design. Sometimes keeping it simple has the best payoff.

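Linear model results

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

def build_linear_model(x_train, x_test, y_train, y_test):
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=5000, ngram_range=(1,2))
    # loss='log' gives logistic regression; newer scikit-learn releases call it 'log_loss'
    lr = OneVsRestClassifier(SGDClassifier(loss='log', max_iter=500))
    pipeline = Pipeline([
        ('tfidf', tfidf_vectorizer),
        ('lr', lr)
    ])
    pipeline.fit(x_train, y_train)
    pred_train = pipeline.predict_proba(x_train)
    pred_test = pipeline.predict_proba(x_test)

    return pipeline, pred_train, pred_test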
