Text Classification Using BigQuery ML and ML.NGRAMS

10 min readFeb 15, 2020

In this post, we’ll build a logistic regression classifier (binary) to predict positive or negative sentiment, using the IMDB movie reviews dataset.

The problem isn’t novel and the dataset isn’t particularly interesting, however the approach is new. We’ll be using BigQuery ML and their new feature processing function ML.NGRAMS to build a text classification model.

Using BQML allows us to prototype models quickly, without having to download data or script solutions using Pandas and Scikit (which are excellent libraries!). Turning our predictions into outputs other teams can use is also quite easy if both the data and models live in BQ.

Dataset: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data

Feel free to skip straight to the SQL if loading the data isn’t of any interest!

Getting the IMDB Dataset into BigQuery

I’ll upload the file (50k row csv file) to a bucket on Google Cloud Storage (GCS), then load into BigQuery. If you’d like to follow along, make sure to have the gcloud CLI installed. One way, if you’re on a Mac:

brew cask install google-cloud-sdk

Then…

gcloud init

and select the account you’re using BigQuery with.

Make a bucket

gsutil mb gs://imdb-bigquery-textclassifier

Upload your file to the bucket (after a quick local rename)

mv IMDB\ Dataset.csv data.csv
gsutil cp data.csv gs://imdb-bigquery-textclassifier

Now, you can load the table directly into BQ using the UI by clicking on “create table” from within your dataset and specifying the bucket URI (gs://your-bucket-name), or you can use the gcloud CLI. I’m using the CLI and asking BQ to automatically detect the schema, but you could also pass the location of a schema.json file. My bucket is not public, so you won’t be able to read it.

bq load \
--autodetect \
--source_format=CSV \
bqml.imdb_reviews \
gs://imdb-bigquery-textclassifier/data.csv

Inspecting the Table in BigQuery

inspecting the bigquery dataset and seeing that the columns are poorly named

Since the data had no columns, the field_names aren’t very good. We’ll fix that by creating a new table with better column names, while also randomly shuffling the data. Generally, shuffling data before ML is a good thing.

CREATE OR REPLACE TABLE bqml.imdb_reviews_clean AS
SELECT 
  string_field_0 as review_text,
  string_field_1 as review_label
  
FROM bqml.imdb_reviews
ORDER BY RAND()

That looks better!

Building our Features using ML.NGRAMS

BigQuery ML includes utility functions to help build features without leaving the familiar SQL environment. Here’s a handy link.

Preprocessing functions | BigQuery ML | Google Cloud

Whether your business is early in its journey or well on its way to digital transformation, Google Cloud's solutions…

cloud.google.com

Try it out real quick:

SELECT ML.NGRAMS(['the', 'dog', 'ran', 'quickly'], [1,2])

ML.NGRAMS takes an array, plus an array of integers to denote the range of “n” in n-gram. It returns an array or repeated field. Had we specified [1,3] as the 2nd argument, it would have built trigrams as well. In our case, a trigram is a string consisting of 3 words. From this string, “jeff is writing on medium”, a trigram would include “jeff is writing”, “is writing on”, “writing on medium”. Had we removed stop-words during a pre-processing step when we were generating the words_array, “is” wouldn’t have showed up :)

Ngrams are easy to generate using a scripting language like Python, but now it’s easy in SQL!

The next question is, “how do we get our review text into an array?”. We can use a simple regex pattern, along with REGEXP_EXTRACT_ALL. Here’s the SO post. Is this a perfect solution…who knows, but it was answered by someone who is Developer Advocate at Google. Shout out to Felipe Hoffa!

SELECT REGEXP_EXTRACT_ALL(LOWER('Running with the dog'), '[a-z]+')

Putting everything together: we convert the review text to an array, then keep the label, which the model will learn from.

SELECT 
  REGEXP_EXTRACT_ALL(LOWER(r.review_text ), '[a-z]+'),
  r.review_labelFROM bqml.imdb_reviews_clean r

Below, we extract the ngrams from the array of words we created above. The WITH statement is called a “common table expression” and helps us organize our SQL. In the “create model” section later on, I’ll use a plain old subquery. ‘base_table’ is just a convention I use, it has no special meaning — just a variable name. You could have called it “moose”.

WITH base_table AS (
  SELECT 
    REGEXP_EXTRACT_ALL(LOWER(r.review_text ), '[a-z]+') as words_array,
    r.review_labelFROM bqml.imdb_reviews_clean r
)SELECT ML.NGRAMS(bt.words_array, [1,2]), bt.review_label
FROM base_table bt

Creating the logistic regression classification model using BigQuery ML

Notice the relationship between the OPTIONS.input_label_cols and my SQL query beneath. If you do not pass in your input_label_cols, it will assume the column is named “label”. Better to be explicit.

I’ve also highlighted a few options related to class balance and data splitting methods. Other options are documented in this article, including regularization and some handy features to speed-up the retraining of existing models. There’s quite a bit you can control, but the defaults appear sensible. I would encourage a random data split if your data might be ordered according to the label. Recall we ordered by RAND() when we created “bqml.imdb_reviews_clean”.

CREATE OR REPLACE MODEL
  `bqml.imdb_reviews_model`OPTIONS
  ( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    data_split_method='RANDOM',
    DATA_SPLIT_EVAL_FRACTION = .10,
    input_label_cols=['review_label']
  ) ASSELECT 
  ML.NGRAMS(bt.words_array, [1,2]) as ngrams, 
  bt.review_label
FROM  (
  SELECT 
    REGEXP_EXTRACT_ALL(LOWER(r.review_text ), '[a-z]+') as words_array,
    r.review_labelFROM bqml.imdb_reviews_clean r
  -- there's a header field in the table :(
  WHERE r.review_label != 'sentiment'
) bt

IMPORTANT: An additional note regarding the above ^^ SQL, notice the WHERE clause on the 2nd to last line. I neglected to remove CSV header row before uploading and ingesting into BQ. We could have taken care of this when we created our table! The model will interpret this as data, which we don’t want.

After several minutes of training, our model is ready!

It lives in the dataset where you saved it, just like any other tables or views that might be in there. Clicking on the “Evaluation” tab of the model, you get a nice interface to view performance stats.

Reviewing the above, our ROC AUC was ~ 94%, which is very good. In brief, this says that as our true positive rate increases (y-axis on the shaded blue chart above, titled ROC curve), we don’t see a large overall “cost”, by way of an increasing false positive rate. The tradeoff between these two metrics (TPR and FPR) is generally specific to the problem at hand. In layman’s terms, as you identify more of what you want, you end up inadvertently picking more of what you don’t.

Another nice feature is the “Positive class threshold” slider. This is useful for scenario modeling. Imagine the cost of a false positive was quite high:

Marking an important email from your doctor as spam
sending a large discount to a customer who is about to purchase a $3,200 computer

Setting the threshold to 90%, we end up with very good precision, but poor recall. You the practitioner need to identify the tradeoffs associated with your specific problem and weigh your options. Even seasoned ML practitioners have to pause and reflect on this. It’s easy to make mistakes! If you’d like to learn more, here’s a good crash course by Google.

Model Diagnostics using BQML

At a glance, our model did well…quite a bit better than random chance (50% since the data was balanced). How can we learn more? Let’s use the ML.FEATURE_INFO function to explore the model’s features. NOTE — this has nothing to do with feature importance, which we’ll review momentarily.

SELECT
  *
FROM
  ML.FEATURE_INFO(MODEL `bqml.imdb_reviews_model`)

We have a single feature called “ngrams”. The distributional stats are null, since you can’t take the mean of ngrams or find its variance, etc…This raises the question: how did BQML actually process the NGRAM feature? The input was a repeated field, in BigQuery terminology.

This article covers how BQML handles inputs:

The CREATE MODEL statement | BigQuery ML | Google Cloud

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to theCREATE TABLE…

cloud.google.com

Numeric features are standardized
Categorical features are one-hot encoded

Here’s their statement on transforming repeated fields, or arrays. In our case, our array type is decidedly non-numeric (just words, which are strings).

For all non-numerical ARRAY columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the ARRAY.

My read on this, perhaps naive, is that the model is in effect creating a “bag of words” feature matrix. I could be mistaken (very much a non-zero probability!). I was experimenting with whether the ML.NGRAMS feature would actually provide a trainable input, capable of discriminating between my labels, and it did. If anyone knows please comment!

Model Feature Importance — Find words corresponding to the positive and negative classes

We can analyze feature importance using ML.WEIGHTS. There’s a lot to unpack in the code below, but in a nutshell:

We’re asking for the weights of our model.
We’re asking BQ to standardize them, as to render them comparable to one another (STRUCT true as standardize).
We’re also only unpacking the value from the multi-hot encoded feature we made using ML.NGRAMS, called ‘ngrams’ (processed_input = ‘ngrams’). Had you named your ngram feature something else, you’d need to change that.
Lastly, we sort the weights DESC, LIMIT 8 and UNION with the weights ASC, LIMIT 8. This lets us view the top 8 features (words) most associated with the positive and negative labels (16 rows in total). For further clarification, note my use of the static ‘label weight’ column to discriminate between the tables we’re UNION-ing together.

(SELECT
  category,
  weight, 
  'positive_weight' as label_weight
FROM
  UNNEST((
    SELECT
      category_weights
    FROM
  ML.WEIGHTS(MODEL `bqml.imdb_reviews_model`,
    STRUCT(true AS standardize)) 
    WHERE
      processed_input = 'ngrams'))
ORDER BY weight desc  
LIMIT 8)UNION ALL
(
  SELECT
    category,
    weight, 
    'negative_weight' as label_weight
  FROM
    UNNEST((
      SELECT
        category_weights
      FROM
    ML.WEIGHTS(MODEL `bqml.imdb_reviews_model`,
      STRUCT(true AS standardize)) 
      WHERE
        processed_input = 'ngrams'))
  ORDER BY weight asc  
  LIMIT 8
)

Above, we can see that the negative weight features (mainly bigrams) include “this turkey”, “this stinker”, “was laughable”, etc….Intuitively, these correspond to negative reviews. The weights associated with the positive class aren’t as obviously positive.

Using the trained model to generate predictions

Amazing, we have a decent model and analyzed its features. Let’s generate some predictions! Below, we’re calling ML.PREDICT, then passing in a query which contains some ngrams, which the model expects to process, since it learned from that feature.

Understand that I’m selecting everything from the temporary validation (evaluation) table BQML creates as part of the training process. The model has not learned from these labels, but it has used them to evaluate accuracy during training.

SELECT
  *
FROM
  ML.PREDICT(MODEL `bqml.imdb_reviews_model`,
( 
  SELECT * EXCEPT(review_label)
  -- If you inspect your model and view the temporary tables for training and evaluation
  -- this is the temporary evaluation set
  FROM `_37862ad76a3f35f96b1dd8ecaa40bf37a7f4d5d7.anond497aa8b_cff1_4330_aec4_725b8d70b290_imported_data_split_eval_data`
))

The output is below. Notice how each row contains a STRUCT field called “predicted_review_label_probs” which contains a value (.label) and associated probability for each label (prob.). A struct is like a mini-table within a larger table. In this case, it’s a negative label, with 90% probability. Apparently they used to love the movie….oh, how things change.

Querying the STRUCT field from ML.PREDICT

Unless you’re a wizard, it may not be obvious how to only keep predictions WHERE the the label is positive with a probability of more than X%. Here’s one way:

Let’s make at table to keep our predictions for now:

CREATE OR REPLACE TABLE bqml.eval_preds AS 
SELECT
  *
FROM
  ML.PREDICT(MODEL `bqml.imdb_reviews_model`,
( 
  SELECT * EXCEPT(review_label)
  -- If you inspect your model and view the temporary tables for training and evaluation
  -- this is the temporary evaluation set
  FROM `_37862ad76a3f35f96b1dd8ecaa40bf37a7f4d5d7.anond497aa8b_cff1_4330_aec4_725b8d70b290_imported_data_split_eval_data`
))

SELECT * EXCEPT(ep.ngrams)
FROM bqml.eval_preds ep, 
  UNNEST(ep.predicted_review_label_probs ) AS probs
WHERE probs.label = 'positive' 
AND probs.prob > .55

Above, we select everything EXCEPT for the ngrams from our table of predictions on the hold-out (eval) set. The key to access the RECORD (or STRUCT) which contains the labels and probabilities is to UNNEST the RECORD, alias it as probs, then we can access the keys within the STRUCT, ‘label’ and ‘prob’ from within the WHERE clause. Easy right? :) It’s like a python dictionary within each row, sort of.

We could have also specified a threshold within ML.PREDICT by passing in a STRUCT. (insert easy button).

SELECT
  *
FROM
  ML.PREDICT(MODEL `mydataset.mymodel`,
    (
    SELECT
      custom_label,
      column1,
      column2
    FROM
      `mydataset.mytable`),
    STRUCT(0.55 AS threshold))

Hopefully you enjoyed the post! Feel free to connect on LinkedIn or be in touch. https://www.linkedin.com/in/jeffjames/