Google Cloud for Education: Strategy Segmentation using Generative AI

Rubens Zimbres
Google Cloud - Community
10 min read · Jul 19, 2023

In Colombia and in Brazil, the Ministry of Education is responsible for defining the general guidelines and frameworks for education, including pedagogical projects. However, the implementation of these projects at the local level, specifically in municipalities, is the responsibility of the respective Municipal Education Secretariats or Departments of Education.

These pedagogical projects aim to improve the quality of education, promote student learning, and enhance the overall educational experience in each municipality. They may include strategies for curriculum development, teacher training, assessment methodologies, educational resources, and community engagement, among other elements.

A municipality often receives dozens or even hundreds of pedagogical projects, each one 60–140 pages long, and it is very difficult to read and organize them all. Automation is a good solution to “read” this material and organize it into clusters, making it easier to implement educational strategies across the state.

According to Michael Porter (1998), “to increase productivity, factor inputs must improve in efficiency, quality and ultimately specialization to particular cluster areas. A cluster is a critical mass of companies in a particular location (a country, state, region or even a city)”. Thus, governments “have significant roles in creating an environment to support rising productivity”. He also states that being part of a cluster allows participants to “operate more productively in sourcing inputs” and in accessing information and technology. Besides enhancing productivity, clusters increase companies’ ability to innovate.

Once a cluster begins to form, a self-reinforcing cycle promotes its growth and strengthens the influence of government, which can also implement human resources strategies to ease the hiring of students graduating from a specific cluster of schools that share a similar educational strategy.

Although Porter focuses mainly on spatial clusters (geographic ones, like the California wine cluster, the Italian shoe cluster, or the computer companies in Silicon Valley), here I will use the notion of conceptual clusters, based on the semantic meaning of the pedagogical projects.

The solution

One could think of using Topic Modeling (Latent Dirichlet Allocation) for the task of grouping similar schools according to their pedagogical projects, but LDA does not consider semantic meaning, word order, or grammatical roles, and may generate uncorrelated topics.

Here, I will adopt a generative and semantic approach. I will OCR the documents (pedagogical projects), translate them to English, summarize them (LLM), generate embeddings (LLM) and use unsupervised learning to generate groups of similar content. This allows the Departments of Education to organize and segment different educational strategies in clusters, according to the semantic similarity of their pedagogical projects, in order to make educational management easier and more effective.

Example of a Pedagogical Project

How-to Guide

Document AI

For OCR, we will use Google Cloud Document AI (remember to enable the APIs first: Document AI, Translation API and Vertex AI). We will create a processor (instance) in the Google Cloud Console in the ‘us’ region and get the processor_id needed to define the variables for batch OCR processing.
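If you prefer to create the processor programmatically instead of in the Console, here is a minimal sketch (the display name is an arbitrary choice; “OCR_PROCESSOR” is the processor type for plain document OCR):

from google.cloud import documentai_v1 as documentai

# Sketch: create an OCR processor with the client library instead of the Cloud Console.
docai_client = documentai.DocumentProcessorServiceClient(
    client_options={"api_endpoint": "us-documentai.googleapis.com"}
)
parent = docai_client.common_location_path("your-project", "us")
processor = docai_client.create_processor(
    parent=parent,
    processor=documentai.Processor(display_name="edu-ocr", type_="OCR_PROCESSOR"),
)
print(processor.name)  # the last path segment is the processor_id used below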

There are some important details to notice here. First, we will not use the Form Parser, because it is 43 times more expensive than plain OCR and it is not necessary. Second, we will not use synchronous OCR processing, because it handles at most 15 pages per request, and we have documents as long as 140 pages.

Let’s import all necessary libraries for the project:

import os
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai
from google.api_core.operation import Operation
from google.cloud import storage
import vertexai
from vertexai.preview.language_models import TextGenerationModel
from vertexai.preview.language_models import TextEmbeddingModel
import re
from nltk import FreqDist
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import time
import matplotlib

Next, we define the Google Cloud credentials of the service account that will call the APIs (if you are working locally). If you don’t have one, go to IAM / Service accounts / Create service account / Keys / JSON / Download:

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/user/key.json'

Now we iterate over the directory containing the PDFs:

for filename in os.listdir(directory):
    if os.path.isfile(os.path.join(directory, filename)):
        print(filename)
        ## add code in this indentation

Here, instead of a loop, you can use a map function with multiprocessing to parallelize the API calls.
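For instance, a minimal sketch with multiprocessing.Pool, assuming the per-file steps shown in the rest of this post are wrapped in a hypothetical process_pdf(filename) function:

from multiprocessing import Pool

# Sketch: run the per-PDF pipeline (OCR, translation, summarization, embedding)
# in parallel; process_pdf is a hypothetical wrapper around the steps below.
pdfs = [f for f in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, f))]

with Pool(processes=4) as pool:
    results = pool.map(process_pdf, pdfs)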

We create a Storage bucket called edu-pdfs and a folder called output, and define some variables:

project_id = 'your-project'
location = 'us' # Format is 'us' or 'eu'
processor_id = '667a8787bec8ba' # Create processor in Cloud Console

# Format 'gs://input_bucket/directory/file.pdf'
gcs_input_uri = "gs://edu-pdfs/"+filename
input_mime_type = "application/pdf"

# Format 'gs://output_bucket/directory'
gcs_output_uri = "gs://edu-pdfs/output"

Now, let’s create an asynchronous function to OCR the PDFs, as well as a function to rebuild the text from a .json output located at a specific folder in Google Cloud Storage bucket:

def batch_process_documents(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_input_uri: str,
    input_mime_type: str,
    gcs_output_uri: str,
) -> Operation:

    opts = {}
    if location == "us":
        opts = {"api_endpoint": "us-documentai.googleapis.com:443"}

    # Instantiates a client
    documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    resource_name = documentai_client.processor_path(project_id, location, processor_id)

    # Cloud Storage URI for the input document
    input_document = documentai.GcsDocument(
        gcs_uri=gcs_input_uri, mime_type=input_mime_type
    )

    # Load GCS input URI into a list of document files
    input_config = documentai.BatchDocumentsInputConfig(
        gcs_documents=documentai.GcsDocuments(documents=[input_document])
    )

    # Cloud Storage URI for the output directory
    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=gcs_output_uri
    )

    # Load GCS output URI into an OutputConfig object
    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

    # Configure the process request
    request = documentai.BatchProcessRequest(
        name=resource_name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # Future for long-running operations returned from Google Cloud APIs
    operation = documentai_client.batch_process_documents(request)

    return operation

The output of each OCR is saved as one or more .json files in the folder output and its subfolders. We will retrieve them and build our text:

# RETRIEVE JSON FROM OUTPUT BUCKET

def get_documents_from_gcs(
    gcs_output_uri: str, operation_name: str
) -> [documentai.Document]:

    # The GCS API requires the bucket name and URI prefix separately
    match = re.match(r"gs://([^/]+)/(.+)", gcs_output_uri)
    output_bucket = match.group(1)
    prefix = match.group(2)

    # The output files will be in a new subdirectory with the Operation ID as the name
    operation_id = re.search(r"operations/(\d+)", operation_name, re.IGNORECASE).group(1)

    output_directory = f"{prefix}/{operation_id}"

    storage_client = storage.Client()

    # List all of the files in the directory `gs://gcs_output_uri/operation_id`
    blob_list = list(storage_client.list_blobs(output_bucket, prefix=output_directory))

    output_documents = []

    for blob in blob_list:
        # Document AI should only output JSON files to GCS
        if ".json" in blob.name:
            document = documentai.types.Document.from_json(blob.download_as_bytes())
            output_documents.append(document)
        else:
            print(f"Skipping non-supported file type {blob.name}")

    return output_documents

Then we create an operation for the batch processing of documents:

# OCR the PDFs

operation = batch_process_documents(
    project_id=project_id,
    location=location,
    processor_id=processor_id,
    gcs_input_uri=gcs_input_uri,
    input_mime_type=input_mime_type,
    gcs_output_uri=gcs_output_uri,
)

# Format: projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID
operation_name = operation.operation.name

# Continually polls the operation until it is complete.
print(f"Waiting for operation {operation_name} to complete...")
result = operation.result(timeout=300)

Now we will build the OCR’d text from the JSON files in the output folder. Also, as we iterate over the whole folder of PDFs, I will assign the OCR’d text to a variable doc and delete the current JSON outputs from the output folder. To use gsutil you will need to have it installed.

# BUILD TEXT

document_list = get_documents_from_gcs(
    gcs_output_uri=gcs_output_uri, operation_name=operation_name
)

out = ''

for document in document_list:
    out += document.text

doc = out.replace('\n', ' ')

os.system('gsutil rm -r gs://edu-pdfs/output/*')
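If you would rather avoid a gsutil dependency, the same cleanup can be done with the storage client that is already imported, for example:

# Sketch: delete the Document AI JSON outputs using the Cloud Storage client.
storage_client = storage.Client()
for blob in storage_client.list_blobs("edu-pdfs", prefix="output/"):
    blob.delete()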

(This codelab has a detailed explanation on how to do OCR in Google Cloud)

Google Cloud Translate

Excellent. However, the text we obtained is in Spanish, and we need to submit English text to the PaLM API for summarization. Let’s translate it:

# TRANSLATE TO ENGLISH

def translate_text(target, text):
    time.sleep(1)
    import six
    from google.cloud import translate_v2 as translate
    translate_client = translate.Client()
    if isinstance(text, six.binary_type):
        text = text.decode("utf-8")
    result = translate_client.translate(text, target_language=target)
    return result

time.sleep(2)

translated = translate_text('en', doc)["translatedText"]
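Note that the Translation API rejects requests that are too large, so a project with more than a hundred pages may need to be split before translation. A minimal sketch, assuming a simple character-based split (the 20,000-character chunk size is an illustrative value, not an API constant):

# Sketch: translate a long text in chunks and join the partial results.
def translate_long_text(target, text, chunk_size=20000):
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return ' '.join(translate_text(target, p)["translatedText"] for p in pieces)

translated = translate_long_text('en', doc)
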
Generative AI

After that, we will summarize the text that was OCR’d and translated so that we can create embeddings in a further step:

# SUMMARIZATION

def predict_large_language_model_sample(
    project_id: str,
    model_name: str,
    temperature: float,
    max_decode_steps: int,
    top_p: float,
    top_k: int,
    content: str,
    location: str = "us-central1",
    tuned_model_name: str = "",
):
    vertexai.init(project=project_id, location=location)
    model = TextGenerationModel.from_pretrained(model_name)
    if tuned_model_name:
        model = model.get_tuned_model(tuned_model_name)
    response = model.predict(
        content,
        temperature=temperature,
        max_output_tokens=max_decode_steps,
        top_k=top_k,
        top_p=top_p,
    )
    return response.text


summary = predict_large_language_model_sample(
    "your-project", "text-bison@001", 0.2, 255, 0.95, 40,
    '''Provide a summary of 20 items for the following text with emphasis \
on pedagogical principles adopted: {}. Summary:'''.format(translated),
    "us-central1")

Students are motivated and empowered in the following values and attitudes:

  • Ethical, citizen and humanist values and attitudes
  • Research skills and attitudes
  • The exercise of reason
  • The aesthetic sense and creativity
  • Leadership and collaborative work
  • The purposeful and proactive sense
  • Respect for diversity and pluralism
  • Mastery of the mother tongue
  • The development of competences in a foreign language
  • Business vision

The institution seeks to clearly guide the institutional values and principles through:

  • Development of critical thinking
  • Competence-based training
  • Entrepreneurship development
  • Continuous improvement in search of high quality standards

Here, you may want to tune hyperparameters, adapt the prompt to your goals, and also translate back to Spanish to check the summarizations:

Los estudiantes son motivados y potenciados en los siguientes valores y actitudes:

  • Valores y actitudes éticas, ciudadanas y humanistas
  • Capacidades y actitudes para la investigación
  • El ejercicio de la razón
  • El sentido estético y la creatividad
  • El liderazgo y el trabajo colaborativo
  • El sentido propositivo y proactivo
  • El respeto por la diversidad y el pluralismo
  • El dominio de la lengua materna
  • El desarrollo de competencias en una lengua extranjera
  • La visión empresarial

La institución busca orientar claramente los valores y principios institucionales a través del:

  • Desarrollo del pensamiento crítico
  • Formación por competencias
  • Desarrollo del emprendimiento
  • Mejoramiento continuo para la búsqueda de altos estándares de calidad

It is necessary to apply some regex to the text to clean it:

summary_clean = re.sub(r'[^a-zA-Z0-9\s]', '', summary.replace('\n', ' '))

Embeddings

Finally, we create the embeddings for each one of the summaries (with the textembedding-gecko@001 model), so that we can group them according to the similarity of their educational strategy:

# EMBEDDINGS

def text_embedding(text_input):
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    embeddings = model.get_embeddings([text_input])
    for embedding in embeddings:
        vector = embedding.values
    return vector

embed = text_embedding(summary_clean)

This generates, for each summarized PDF, an embedding of 768 floats defining a vector in a high-dimensional space. Since we want to visualize the results, PCA (Principal Component Analysis) is used to reduce the dimensionality to X,Y coordinates:

pca = PCA(n_components=2)
pca.fit(pd.DataFrame(embed).T)
print(pca.explained_variance_ratio_)
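Keep in mind that PCA and K-Means operate on the whole collection of documents, so the embeddings generated inside the loop need to be gathered into a single matrix, one row per summarized project. Below is a sketch, assuming a list all_embeddings (and a parallel list names with the school identifiers) filled during the loop; in the formulation above, which fits PCA on the transposed matrix, pca.components_.T plays the role of the two-dimensional coordinates:

# Sketch: stack one 768-dimensional embedding per project and reduce to 2D.
# all_embeddings and names are assumed to be filled inside the PDF loop.
embedding_matrix = pd.DataFrame(all_embeddings)    # shape: (n_documents, 768)
pca = PCA(n_components=2)
coords = pca.fit_transform(embedding_matrix)       # shape: (n_documents, 2)
print(pca.explained_variance_ratio_)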

Finally, we will use K-Means clustering to group strategically similar pedagogical projects:

from sklearn.cluster import KMeans
k_means=KMeans(n_clusters=3,random_state=42)
k_means.fit(pca.components_.T)
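The choice of three clusters is a modeling decision; with more schools, a quick silhouette sweep can help pick n_clusters. A sketch using scikit-learn:

# Sketch: compare silhouette scores for different numbers of clusters.
from sklearn.metrics import silhouette_score

X = pca.components_.T
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))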

By plotting the results of K-Means clustering we are able to see the clusters of schools whose pedagogical projects are similar:

# names holds one school identifier per document, in the same order as the embeddings
colors = ['red', 'blue', 'green']
fig, ax = plt.subplots()
ax.scatter(pca.components_[0], pca.components_[1], c=k_means.labels_,
           cmap=matplotlib.colors.ListedColormap(colors), s=65)
for i, txt in enumerate(names):
    ax.annotate(txt, (pca.components_[0][i], pca.components_[1][i]))
plt.title('Clusters of Pedagogical Projects')
plt.show()

Now the Department of Education can manage groups of schools instead of managing hundreds of them individually. Here are some advantages of cluster management:

  • increased efficiency
  • increased productivity
  • fostered innovation
  • decreased labor costs

Developing a segmented strategy across different clusters, instead of a single global strategy, seems to be beneficial. Ghemawat (2005) states that “regionally focused strategies are not just a halfway house between local (country-focused) and global strategies but a discrete family of strategies that, used in conjunction with local and global initiatives, can significantly boost a company’s performance”.

“Better results come from strong regional strategies, brought together into a global whole.”

We can also use NLTK to get the frequency of words in order to create wordclouds for each school or cluster, for visualization purposes:

# df holds one summarized project per row; m indexes the school (or cluster) of interest
fdist5 = FreqDist([i for i in df.iloc[m, 1].split() if len(i) > 3])
wordcloud = pd.DataFrame(fdist5.items(), columns=['Word', 'Frequency']).sort_values(by='Frequency', ascending=False)
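To render the clouds themselves, the wordcloud package (installed separately with pip install wordcloud) can consume these frequencies directly, for example:

# Sketch: draw a word cloud from the NLTK frequency distribution.
from wordcloud import WordCloud

wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(dict(fdist5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()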

Pricing

The costs (USD) involved in this solution are:

Document AI — 1,000 pages — $1.50 (OCR)

Translation API — First 500,000 characters per month — Free (Translation)

Vertex AI:

  • PaLM 2 Text Bison — 1,000 characters — $0.0010 (Summarization)
  • Embeddings Gecko — 1,000 characters — $0.0010 (Embeddings)

Cloud Storage — 1 GB/month — $0.020 (PDFs and jsons)
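As a rough illustration of the OCR line: 100 projects of 100 pages each amount to 10,000 pages, or about $15 of Document AI OCR (10,000 × $1.50 / 1,000). The Vertex AI items scale with the number of characters sent to and returned by the models.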

If you are just testing or replicating this solution, be sure to delete the Document AI processor afterwards; otherwise you will incur unnecessary charges.

If you want to use this solution in a large-scale project, you can package this script with a Dockerfile and requirements.txt as a Flask application on Cloud Run (or Kubernetes), and add a Cloud Function that is triggered every time a PDF is uploaded to the Storage bucket and calls the Flask application.
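A minimal sketch of such a trigger, written as a first-generation background Cloud Function; the Cloud Run URL and the /process endpoint are hypothetical names for this example:

import requests

# Hypothetical Cloud Run endpoint exposed by the Flask application
CLOUD_RUN_URL = "https://edu-pipeline-xxxxx-uc.a.run.app/process"

def on_pdf_uploaded(event, context):
    # Triggered by google.storage.object.finalize on the edu-pdfs bucket
    filename = event["name"]
    if filename.lower().endswith(".pdf"):
        requests.post(CLOUD_RUN_URL, json={"bucket": event["bucket"], "name": filename})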

Have fun !

REFERENCES

Ghemawat, P. Regional Strategies for Global Leadership. Harvard Business Review, 83(12), p. 98, 2005.

Porter, M.E. Location, Clusters, and the “New” Microeconomics of Competition. Business Economics, pp. 7–13, 1998. Available at: https://www.jstor.org/stable/23487685

Porter, M.E. Clusters and the New Economics of Competition. Harvard Business Review, 1998. Available at: https://hbr.org/1998/11/clusters-and-the-new-economics-of-competition

What Are Clusters? Harvard Business School, Institute for Strategy and Competitiveness. Available at: https://www.isc.hbs.edu/competitiveness-economic-development/frameworks-and-key-concepts/Pages/clusters.aspx
