NLP with SpaCy, Dataflow ML and BigQuery ML

Cloud + NLP + ML for Text Clustering

Aniket Agrawal
Google Cloud - Community
11 min read · Mar 15, 2023


Text Clustering: SpaCy for NLP and Google Cloud (Dataflow + BigQuery) for ML

Chatbots and Artificial Intelligence (AI) have swiftly risen to prominence in the technology industry. Conversational AI leverages advanced Natural Language Processing (NLP) models trained on enormous datasets. In this blog, we will explore the implementation of one of the text clustering strategies premised on SpaCy and the Machine Learning (ML) capabilities of Google Cloud Platform (GCP). This may be utilised to assist other exciting applications such as document plagiarism detection, robust question-answering (QA) systems, and so on.

Architecture Diagram for the Overall Process

To implement the aforementioned approach in accordance with the preceding architecture roadmap, the following actions are performed:

  1. Data Source: Around 9311 sentences from 11 public articles are segregated and placed row-by-row in a BigQuery table via a CSV upload.
  2. Pipeline + Processing: Then, we’ll see how a trained model from an industrial-strength NLP library called SpaCy generates document vectors for each row in the pipeline. To run this model within its Dataflow pipeline, an ML model handler is required. For this, Dataflow ML integrates the power of Dataflow with Apache Beam’s RunInference (RI) API to smoothly update the model properties and pass these to the RI PTransform.
  3. Data Sink + Clustering: All the vectors are written to the BigQuery table. So, BigQuery not only serves as a data source and sink but is also used to cluster the vectors using its ML powers.
  4. Visualize: The pipeline data transmission (Step 2) is visualized in the Vertex AI Workbench user-managed notebooks. Also, the resulting model and cluster metrics (Step 3) are visualized in the console as well as Looker Studio.

Note: The emphasis will be on hard clustering, which ensures that each sentence belongs to exactly one document cluster. The softer or fuzzy counterpart will be explored later; in an NLP context, it is a great approach for determining the likelihood of a sentence belonging to more than one cluster.

Source: Left Image (Fig Tree: Fuzzy Fig Clusters) and Right Image (Berries Segregated: Hard Berry Clusters) Soft Cluster: Orange figs grouped with red/yellow figs. Hard Cluster: Each berry is placed in 1 cluster only.

Implementation Time

For ease of comprehension of the technical details in the demo, only 45 sentences are shown and analyzed in the notebook (less than 0.5% of the actual data). As the full outputs are too large to display here, all the code cells, along with their output, can be viewed readily in my Colab notebook.

Prerequisites: Follow the steps below before proceeding further:

  1. Create a BigQuery dataset and tables under it for read/write of sentences and vectors.
  2. To load the test CSV file into BigQuery, either create a Google Cloud Storage (GCS) bucket or upload it directly through a Google Sheet.
  3. Enable all the relevant APIs, including BigQuery and Dataflow APIs.

The following sections describe the overall procedure in further depth.

1. Install the required packages

SpaCy’s trained models are available as Python packages, out of which the largest English model (‘en_core_web_lg’) is employed throughout. For efficient communication in the Apache Beam pipeline, Protocol Buffers (protobuf) are needed to serialize the data into a binary stream.

! pip install spacy
! python -m spacy download en_core_web_lg
! pip install 'apache-beam[gcp,interactive,dataframe]'
! pip install protobuf==3.12.2

2. Working with SpaCy and Beam’s RI API

Here is where and when the real work and fun begin. Let’s learn how this special NLP library works and vectorizes text.

2.1. Elementary SpaCy Operations

Here, the model ‘en_core_web_lg’ is loaded and used to create a 300-dimensional vector ‘vec’ for the text in ‘example_string’.

import spacy

# Load the large English model and vectorize an example sentence.
nlp = spacy.load("en_core_web_lg")
example_string = "This medium blog deals with text clustering mechanisms premised on GCP and NLP."
doc = nlp(example_string)
vec = doc.vector  # 300-dimensional document vector
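
A quick way to sanity-check the output (a small sketch, not part of the original post) is to confirm the vector’s dimensionality and compare two documents:

import numpy as np

print(vec.shape)  # (300,) for the large English model

# Cosine similarity between the example and another sentence.
other = nlp("This article covers grouping sentences with NLP on Google Cloud.")
cosine = float(np.dot(vec, other.vector) /
               (np.linalg.norm(vec) * np.linalg.norm(other.vector)))
print(round(cosine, 3))  # spaCy offers the same idea via doc.similarity(other)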

2.2. Implement SpaCy model handler with Beam’s RI API

Let’s now define a model handler class called ‘SpacyInfer’ to leverage SpaCy for inference. In the function ‘run_inference’, a vector is inferred for each sentence and returned, paired with its sentence as a PredictionResult, via the ‘infers’ list.

import apache_beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.base import ModelHandler
from apache_beam.ml.inference.base import PredictionResult
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner


# Type parameters: example type, prediction type, model type.
class SpacyInfer(ModelHandler[str, PredictionResult, spacy.Language]):
    def __init__(self, model_name):
        self._model_name = model_name

    def load_model(self):
        # Load the spaCy model once per worker.
        return spacy.load(self._model_name)

    def run_inference(self, sents, model, inference_args=None):
        infers = []
        for sent in sents:
            doc = model(sent)
            # Pair each input sentence with its 300-dimensional vector.
            infers.append(PredictionResult(sent, doc.vector))
        return infers
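
Before wiring this handler into the BigQuery pipeline, it can be smoke-tested locally on a couple of in-memory sentences. Here is a minimal sketch (not from the original post; the sentences and transform labels are illustrative) that runs on the default DirectRunner:

# Quick local test of the handler (illustrative).
with apache_beam.Pipeline() as p:
    _ = (p
         | "CreateSentences" >> apache_beam.Create([
             "BigQuery clusters the sentence vectors.",
             "SpaCy turns each sentence into a 300-dimensional vector."])
         | "RunInferenceSpacy" >> RunInference(SpacyInfer("en_core_web_lg"))
         | "PrintShapes" >> apache_beam.Map(lambda r: print(r.example, len(r.inference))))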

3. Run the Beam Pipeline

Here, BigQuery IO is used for reading from and writing to a BigQuery table in the Beam pipeline. For this, inside the GCP project with ID ‘cluster-project-id’, a ‘clustering’ dataset is created that will contain the table ‘sents’ for sentence read and vector write operations.

3.1. Create the Pipeline

By setting PipelineOptions, the resource usage and execution of the pipeline can be easily configured. The runner, name, and regional endpoint for the Dataflow job can be specified as shown in the below example:

from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    runner='DataflowRunner',
    project='cluster-project-id',
    job_name='vectorjob',
    temp_location='gs://clusterbucket/temp',
    region='us-central1')

Here, we’ll rely on InteractiveRunner() for iterative development and inspection of pipelines.

pipeline = apache_beam.Pipeline(InteractiveRunner())
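
Note that ‘beam_options’ above targets the DataflowRunner, while the interactive pipeline here is created without it. When the same pipeline is submitted as an actual Dataflow job, the options would instead be passed to the Pipeline constructor, roughly as in this sketch (an assumption for illustration, not shown in the original post):

# Sketch: submit the job to Dataflow instead of exploring it interactively.
dataflow_pipeline = apache_beam.Pipeline(options=beam_options)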

3.2. Read from BigQuery + RI + Verify

Using ‘apache_beam.io.ReadFromBigQuery’, sentences can be extracted from the table ‘sents’ by executing a simple SQL query: “SELECT sentence FROM `cluster-project-id.Clustering.sents`”. Then, the RI PTransform is used to perform inference on these sentences within the Beam pipeline.

# This will generate a 300-dimensional vector for each of the 45 sentences as the inference results.

with pipeline as p:
    spacy_df = (p | "FetchSentences" >> apache_beam.io.ReadFromBigQuery(
        query='SELECT sentence FROM `cluster-project-id.Clustering.sents`',
        use_standard_sql=True, gcs_location='gs://clusterbucket/temp')
        # ReadFromBigQuery yields dict rows; keep only the sentence string.
        | "ExtractSentence" >> apache_beam.Map(lambda row: row['sentence']))
    spacy_df1 = spacy_df | "RunInferenceSpacy" >> RunInference(SpacyInfer("en_core_web_lg"))

Note: Alternatively, the operation can be performed using the streaming API or the ‘Streaming Insert’ method, eliminating the need to specify ‘gcs_location’.

3.3. Write to BigQuery

The inference vector results are written to the same BigQuery table in the form of arrays. Hence, the resulting schema has two fields, one for the sentence string and the other to store multiple floating-point values in the form of an array (indicated by ‘REPEATED’).

# Schema object (JSON)

json_schema = {'fields': [
{'name': 'Sentence', 'type': 'STRING', 'mode': 'NULLABLE'},
{'name': 'Vector', 'type': 'FLOAT', 'mode': 'REPEATED'}]}

# Alternatively, a TableSchema object can be built and the 'REPEATED' field appended to it

from apache_beam.io.gcp.internal.clients import bigquery

table_schema = bigquery.TableSchema()

sentence_schema = bigquery.TableFieldSchema()
sentence_schema.name = 'Sentence'
sentence_schema.type = 'STRING'
sentence_schema.mode = 'NULLABLE'
table_schema.fields.append(sentence_schema)

array_schema = bigquery.TableFieldSchema()
array_schema.name = 'Vector'
array_schema.type = 'FLOAT'
array_schema.mode = 'REPEATED'
table_schema.fields.append(array_schema)


# Schema string (doesn't support nested/repeated fields)

vector_schema = 'Sentence:STRING'
# Write to BigQuery
# 'spacy_df1' represents the previous pipeline step; the emptied 'sents' table is re-written

spacy_df1 | "Write to BigQuery" >> apache_beam.io.WriteToBigQuery(
    'cluster-project-id:Clustering.sents',
    schema=table_schema,
    write_disposition=apache_beam.io.BigQueryDisposition.WRITE_TRUNCATE,
    create_disposition=apache_beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
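
One practical detail worth flagging: with the handler above, RunInference emits PredictionResult objects (sentence in ‘example’, vector in ‘inference’), whereas WriteToBigQuery expects dictionaries matching the table schema. A small, hypothetical conversion step (the ‘to_bq_row’ helper and ‘ToBQRow’ label are illustrative, not from the original post) could be applied to ‘spacy_df1’ before the write:

def to_bq_row(result):
    # result.example is the original sentence; result.inference is the spaCy vector.
    return {'Sentence': result.example,
            'Vector': [float(x) for x in result.inference]}

bq_rows = spacy_df1 | "ToBQRow" >> apache_beam.Map(to_bq_row)
# 'bq_rows' would then be the PCollection handed to WriteToBigQuery above.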

3.4. Visualize the Pipeline Data

The pipeline can be visualized under the ‘Interactive Beam’ tab of a Vertex AI JupyterLab notebook with the ‘apache-beam-jupyterlab-sidepanel’ extension pre-installed. In a Colab notebook, this extension can be installed with a few ‘npm’ commands.

Instead of using the pre-installed extension, you can also run the following code to observe and visualize what the ‘spacy_df1’ pipeline outputs.

import apache_beam.runners.interactive.interactive_beam as abib
abib.show(spacy_df1, visualize_data=True, include_window_info=True, duration=3)

Similarly, the following code can be run to inspect the ‘spacy_df’ PCollection, i.e. the sentences read from BigQuery before inference. Fascinating, isn’t it?

import apache_beam.runners.interactive.interactive_beam as abib
abib.show(spacy_df, visualize_data=True, include_window_info=True, duration=3)

Fun Part: You can play around with the notebook by adjusting various parameters like window size, time, etc. for a better visual data display.

4. Clustering by BigQuery ML

From this section onward, you can run the SQL queries right from the BigQuery console. The same can be done in the Colab notebook as well by using the BigQuery client IPython magic command “%%bigquery --project project-id” in the concerned cells.
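
For instance, a quick sanity-check cell along these lines (illustrative, not from the original post; in Colab the magic is provided by the google.cloud.bigquery IPython extension, loaded with ‘%load_ext google.cloud.bigquery’ if it is not already active) confirms the row count of the ‘sents’ table:

%%bigquery --project cluster-project-id

SELECT COUNT(*) AS total_sentences
FROM `cluster-project-id.Clustering.sents`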

Weren’t we waiting for this most interesting and important step? For clustering, the sentences selected primarily belong to documents on ‘judiciary’, ‘industry’, ‘employment’, and ‘tourism’.

Without requiring data transfer, BigQuery ML allows you to train and perform batch inference with ML models via normal SQL queries.

4.1. Make BigQuery Table ‘ML-Compatible’

For the BigQuery table ‘sents’ to be suitable for clustering, the vector needs to be processed properly. All 300 array entries need to be segregated into 300 different columns. Sounds crazy, right?

%%bigquery --project cluster-project-id

CREATE TABLE `cluster-project-id.Clustering.vector_splits`
AS
WITH data AS (
SELECT
Sentence as string_field_0,
Vector[OFFSET(0)] as double_field_1, Vector[OFFSET(1)] as double_field_2, Vector[OFFSET(2)] as double_field_3,
Vector[OFFSET(3)] as double_field_4, Vector[OFFSET(4)] as double_field_5, Vector[OFFSET(5)] as double_field_6,
Vector[OFFSET(6)] as double_field_7, Vector[OFFSET(7)] as double_field_8, Vector[OFFSET(8)] as double_field_9,
. . . . .
. . . . .
. . . . .
. . . . .
Vector[OFFSET(291)] as double_field_292, Vector[OFFSET(292)] as double_field_293, Vector[OFFSET(293)] as double_field_294,
Vector[OFFSET(294)] as double_field_295, Vector[OFFSET(295)] as double_field_296, Vector[OFFSET(296)] as double_field_297,
Vector[OFFSET(297)] as double_field_298, Vector[OFFSET(298)] as double_field_299, Vector[OFFSET(299)] as double_field_300
FROM `cluster-project-id.Clustering.sents`
)
SELECT * FROM data

Well, just observe the resultant tabular output ‘vector_splits’ which has 45 x 301 cells in the Colab notebook.

Tabular output in the BigQuery console for the query “SELECT * FROM `Clustering.vector_splits`"
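
Typing out all 300 OFFSET expressions by hand is tedious and error-prone. One possible shortcut (a sketch, not part of the original post; it assumes the google-cloud-bigquery client library and the project/table names used above) is to generate the column list in Python and submit the assembled query:

from google.cloud import bigquery as bq_client

# Build the 300 'Vector[OFFSET(i)] AS double_field_{i+1}' expressions programmatically.
cols = ",\n".join(f"Vector[OFFSET({i})] AS double_field_{i + 1}" for i in range(300))
query = (
    "CREATE OR REPLACE TABLE `cluster-project-id.Clustering.vector_splits` AS\n"
    "SELECT Sentence AS string_field_0,\n" + cols +
    "\nFROM `cluster-project-id.Clustering.sents`"
)
bq_client.Client(project="cluster-project-id").query(query).result()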

4.2. Create and train the model

We’ll select all the columns other than the sentence one using ‘EXCEPT’. By specifying the below parameters, the model ‘hard_clustering’ is created.

%%bigquery --project cluster-project-id

CREATE OR REPLACE MODEL Clustering.hard_clustering
OPTIONS (model_type='kmeans', NUM_CLUSTERS = 4,
DISTANCE_TYPE = 'cosine', kmeans_init_method = 'KMEANS++') AS
SELECT * EXCEPT(string_field_0) FROM `Clustering.vector_splits`

Here, under OPTIONS, the model type is chosen as ‘kmeans’, where ‘k’, the number of document clusters, is set to four. As we are much more interested in the dot product between the vectors, DISTANCE_TYPE is chosen as ‘cosine’ rather than ‘euclidean’. Smarter centroid initialization in ‘kmeans++’ introduces computational overhead but may compensate for it via faster convergence.
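
To peek at what the model actually learned, the centroids themselves can be inspected with ML.CENTROIDS. An illustrative query (not from the original post, assuming the same model name):

%%bigquery --project cluster-project-id

SELECT centroid_id, feature, numerical_value
FROM ML.CENTROIDS(MODEL `Clustering.hard_clustering`)
ORDER BY centroid_id, feature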

4.3. Evaluate the model

Clustering metrics such as the Mean Squared Distance and Davies-Bouldin Index can be used for validation and can be figured out using ‘ML.EVALUATE’ as follows:

%%bigquery --project cluster-project-id

SELECT *
FROM ML.EVALUATE(MODEL Clustering.hard_clustering)

Evaluate the cluster assignment to a sentence using ML.PREDICT.

%%bigquery --project cluster-project-id

CREATE TABLE `cluster-project-id.Clustering.sent_cluster`
AS
WITH data AS (
SELECT
string_field_0,
centroid_id
FROM ML.PREDICT(MODEL `Clustering.hard_clustering`,
(SELECT *
FROM Clustering.vector_splits))
)
SELECT * FROM data

Using the above result, aggregate the sentences cluster-wise using ARRAY_AGG.

%%bigquery --project cluster-project-id

SELECT
centroid_id,
ARRAY_AGG(string_field_0)
FROM Clustering.sent_cluster
GROUP BY centroid_id
Clustered sentences output (Tabular and JSON Format) in the BigQuery console

4.4. Fuzzier Clustering

The hard clustering mechanism places almost 95% of the sentences in their correct document clusters. To increase the accuracy, a fuzzier approach can be adopted by also taking the nearest centroids into account.

%%bigquery --project cluster-project-id

SELECT
string_field_0,
NEAREST_CENTROIDS_DISTANCE
FROM ML.PREDICT(MODEL `Clustering.hard_clustering`,
(SELECT * FROM Clustering.vector_splits))
Nearest centroids computation output (Tabular and JSON Format) in the BigQuery console
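
Since NEAREST_CENTROIDS_DISTANCE is returned as an array of (CENTROID_ID, DISTANCE) pairs, a softer view of membership can be obtained by unnesting it and comparing how close each sentence is to its nearest centroids. An illustrative query (not from the original post, assuming the same model and table names):

%%bigquery --project cluster-project-id

SELECT
string_field_0,
ncd.CENTROID_ID,
ncd.DISTANCE
FROM ML.PREDICT(MODEL `Clustering.hard_clustering`,
(SELECT * FROM Clustering.vector_splits)),
UNNEST(NEAREST_CENTROIDS_DISTANCE) AS ncd
ORDER BY string_field_0, ncd.DISTANCE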

4.5. Visualize the Metrics in the GCP Console and Looker Studio

Phew! Enough SQL and Python coding. In this subsection, we’ll just visualize different metrics.

As expected, the loss decreases as the number of iterations increases, until the law of diminishing returns kicks in. The pipeline and all the metrics related to clustering, model training, and evaluation can be viewed graphically in the BigQuery console itself. A few are shown below:

Model training and evaluation metrics from the BigQuery Console

Finally, 4 clusters are created with the sentences related to ‘employment’, ‘judiciary’, ‘industry’, and ‘tourism’. Clusters 1, 3, and 4 contain twelve sentences each, leaving cluster 2 with the remaining nine. Cluster 2, containing the nine sentences related to ‘judiciary’, can be visualized in Looker Studio in tabular and graphical format.

Visualizing 9 sentences of cluster 2 (Judiciary) in Looker Studio

Challenges and Applications

The clustering mechanism can be used for speeding up document similarity score computation, plagiarism detection, and QA systems. However, the technique will not necessarily assign correct clusters to all the sentences, resulting in two dissimilar sentences being placed in the same cluster (False Positive) and similar sentences being put in different clusters (False Negative). There is a need to establish a better trade-off between the two desirable metrics: precision and recall.

To achieve higher overall accuracy, advanced techniques need to be integrated so that the sentence provides concrete information for its correct mathematical encoding and processing as per our use case. In future blogs about clustered text, we’ll discuss a plethora of challenges while creating even a rudimentary makeshift QA system, let alone a viable chatbot. The below meme somewhat tells the story :)

Background: https://www.andyreynolds.com/image/I0000rdLqiFYoXlw Question asks “Where is my answer?” Finding the needle from the clustered haystack is a bit less challenging.

Conclusion

In this blog, we witnessed and visualized how Dataflow ML runs a SpaCy model through a custom model handler for smooth data transfer in the pipeline. The ‘k-means’ model is known to work well on smaller datasets, resulting in only minimally inaccurate cluster assignments, and the powerful integration of SpaCy and GCP is observed to perform clustering with high accuracy (>94%), especially on larger datasets.

While we depended entirely on SpaCy for all the NLP-related tasks, we relied just as completely on GCP for the ML, making use of the best of both worlds. I will soon create blogs where GCP handles the NLP tasks as well, via AutoML Natural Language and Dialogflow as primary products, along with the Contact Center AI (CCAI) Platform.

Stay tuned till then, take care, and have a wonderful march ahead!

Note: Should you have any queries about this post or my Colab notebook implementation, please feel free to connect with me on LinkedIn! Thanks!

New Colab notebook link:

Related Google Cloud Blog and YouTube Video links:

Dataflow ML for Word Clustering (Parts 1 and 2)

