Developing an End-to-End NLP text generator application (part 5) - Develop a CI/CD architecture to continuously train the model with new data.

Kevin MacIver
11 min read · May 1, 2020

This is part 5 of a series of stories showing the steps to develop an end-to-end product that helps write news articles by suggesting the next words of a text.

In part 1 we focused on generating a Bidirectional LSTM model. Check out this link if you haven't seen it yet:

In part 2 we focused on creating a full-stack application with Flask. Check out this link if you haven't seen it yet:

In part 3 we focused on containerizing our application using Docker and deploying it locally. Check out this link if you haven't seen it yet:

In part 4 we focused on deploying our containerized application through Kubernetes using Google Cloud. Check out this link if you haven't seen it yet:

In this story we'll focus on building a CI/CD pipeline using several Google Cloud services, in order to frequently update our model with new data.

If we recall the architecture shown in part 1, we'll see six major blocks. We'll break down each block to understand its objective and how to implement it.

Architecture for CI/CD using Google Cloud

In part 4 we covered the App Deployment block.

Ingest Block


The ingest block is responsible for gathering news article data from the web on a daily basis and saving it to a Cloud Storage bucket as a text file we'll call weekdata.txt.

The weekdata.txt file will later be used by the Train Models block to retrain the model with this new data.

To explain the ingest block we'll start at its core, which is the Compute Engine instance called Scrapper, and work our way backward.

Compute Engine — Scrapper

This Compute Engine instance has a simple job: scrape a defined webpage, do some text processing, and save the data to a text file.

Since this job doesn't require much computing power, we can use a simple machine, in this case a shared-core f1-micro with 614 MB of memory.

During the creation we’ll add a label “env:dev” which later on we’ll use for the Cloud Functions.

We also select our project's App Engine default service account in the Service account box.

You can also allow HTTP traffic, which we may use later on. Don't worry, we can change this setting later if we need to.

Once the VM instance is created, we need to prepare it to run our Python script. Click on SSH and select Open in browser window.

We’ll start by running the following commands:
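They will look roughly like this (a sketch following Google's Python setup guide; the environment name env is an assumption):

```bash
# Install pip and the venv module, create a virtual environment, and install the
# Cloud Storage client library inside it
sudo apt update
sudo apt install -y python3-pip python3-venv
python3 -m venv env
source env/bin/activate
pip install google-cloud-storage
```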

Note: If in doubt, check the following Google Cloud tutorial

With the commands above we've installed pip, created a virtual environment, and installed google-cloud-storage in that environment.

Now we need to install the other libraries to run the scrapper script. Those are beautifulsoup (bs4) and requests.

requirement.txt — scrapper
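The file itself would contain little more than the packages just mentioned, for example (versions unpinned here):

```
google-cloud-storage
bs4
requests
```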

Now we can upload the scrape.py script.

For our project, we'll scrape the business section of the google.news site. Later on we could add other scrappers to fetch other topics, or include other sources too.

Based on the scrapper.py we’ll also need to create a /tmpdir directory.
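The scraping script itself is short; here is a minimal sketch of what it might look like (the URL, tags and helper names are illustrative, not the exact ones used in the project):

```python
# scrape.py - fetch the business news page, extract the article text,
# do light cleaning, and save everything to /tmpdir/newdata.txt
import requests
from bs4 import BeautifulSoup

URL = "https://news.google.com/topics/business"  # illustrative URL

def scrape():
    page = requests.get(URL)
    soup = BeautifulSoup(page.text, "html.parser")

    # The tags/classes to extract depend on the page layout
    articles = [tag.get_text(separator=" ", strip=True)
                for tag in soup.find_all("article")]

    # Light text processing: drop empty entries, one article per line
    text = "\n".join(article for article in articles if article)

    with open("/tmpdir/newdata.txt", "w") as f:
        f.write(text)

if __name__ == "__main__":
    scrape()
```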

As we can see, the scrapper creates a file called newdata.txt. We need to join this newdata.txt with weekdata.txt, which is the collection of the week's news.

To achieve this, we need to copy the weekdata.txt from the Cloud Storage bucket. In order to do that we’ll run the following command:

copy weekdata.txt file from the data-updates bucket
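A hedged version of that command, assuming the bucket is called data-updates and downloading the file locally as bucketdata.txt (which matches the next step):

```bash
gsutil cp gs://data-updates/weekdata.txt bucketdata.txt
```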

We'll need to modify this data, so to avoid any permission issues we'll copy the bucketdata.txt that now resides in our VM, rename the copy weekdata.txt, and then delete bucketdata.txt.
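In shell terms, something like:

```bash
cp bucketdata.txt weekdata.txt   # a writable local copy we can append to
rm bucketdata.txt
```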

Now we'll use another script to join the data from newdata.txt into weekdata.txt.
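This can be as simple as appending the new text to the weekly file; a minimal sketch (joindata.py is a hypothetical name):

```python
# joindata.py - append today's scraped articles to the week's collection
with open("/tmpdir/newdata.txt") as new_file:
    new_text = new_file.read()

with open("weekdata.txt", "a") as week_file:
    week_file.write("\n" + new_text)
```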

Okay, now that our weekdata.txt is updated, we'll save this updated data back to the bucket.

Copy weekdata.txt to bucket data-updates
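Again assuming the bucket is named data-updates:

```bash
gsutil cp weekdata.txt gs://data-updates/weekdata.txt
```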

That’s it. We’ve done it! 😀

Well, not quite. We need a way for the instance to run all these commands every time it is turned on. Here is where startup-scripts come in handy.

If we stop our instance and go to edit, we’ll find a custom metadata section.

Here we can add a startup-script that will run every time the instance is booted.

The startup-script runs all the same steps we took before.

scrape startup-script

One important note is that the startup-script does not run from your home directory, where we were working before. Therefore, the first step is to cd to that directory.

Our final metadata will look like this:

Note: You may notice I added a history.txt file to keep the logs of the startups and shutdowns of the instance.
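Putting all of the steps together, a hedged sketch of the scrape startup-script could look like this (the user name, environment name and script names are the ones assumed above):

```bash
#!/bin/bash
# Log the boot, run the scraping job, sync with the bucket, then power off
cd /home/<user>                      # startup-scripts don't run from the home directory
echo "started: $(date)" >> history.txt

source env/bin/activate              # the virtual environment created earlier

python3 scrape.py                    # writes /tmpdir/newdata.txt

gsutil cp gs://data-updates/weekdata.txt bucketdata.txt
cp bucketdata.txt weekdata.txt       # writable local copy
rm bucketdata.txt

python3 joindata.py                  # appends newdata.txt to weekdata.txt

gsutil cp weekdata.txt gs://data-updates/weekdata.txt

echo "finished: $(date)" >> history.txt
shutdown -h now                      # the instance powers itself off when the job is done
```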

Cloud Functions

Cloud Functions is Google Cloud’s event-driven serverless compute platform. Run your code locally or in the cloud without having to provision servers.

We'll use Cloud Functions to execute code that starts up or shuts down our compute instance.

The Cloud Functions will be triggered by a Pub/Sub event.

Let’s create the startScrapper function.

In the source code part, we’ll select the inline editor, Node.js, and in the index.js area paste the following code.

In the package.json area we’ll paste the following code.

This code is all available in the following tutorial:

Cloud Scheduler

Cloud Scheduler is a cron job scheduler. It allows you to schedule virtually any job, including batch jobs, big data jobs, cloud infrastructure operations, and more.

Here we'll use Cloud Scheduler to periodically send a message to a specific Pub/Sub topic, which will then pass this data along to trigger the corresponding Cloud Function.

start scrapper cloud scheduler

The important aspects here are:

  • Frequency — In this case it’s scheduled to run at midnight every day.
  • Topic — Which should be the same as the trigger topic of your desired Cloud Function.
  • Payload — Which includes the zone of your scrapper instance and its label. This way, if we later add more scrapper instances, all instances sharing the same label will be targeted (see the example payload below).
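For example, a payload along these lines (the zone is illustrative, and the exact label format depends on how the Cloud Function parses it; Google's instance-scheduling tutorial uses key=value):

```json
{"zone": "us-central1-a", "label": "env=dev"}
```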

A stop-instance function and scheduler were also designed in this project, even though the startup-script of the scrapper instance already ends with a shutdown command. Since the job is quite simple, the instance can complete it within a minute, so by the time the stop function fires the instance will most likely already be off. It works as a safeguard so the instance won't keep running all day if something fails during startup.

Train Models Block

The Train Models block is responsible for fetching the weekdata.txt prepared by the ingest block and retraining the model with this new data.

The services and architecture of the train models block are very similar to the ingest block. The main differences being:

  • Frequency of Cloud Scheduler — Which for this block will be once a week.
  • Type of Compute Engine — Since in this block we are dealing with model training, we'll need a more powerful instance than the one used in the ingestion block.

Compute Engine — Train Model

Follow the same steps we took for preparing the scrapper instance. We start by choosing a Machine Type.

For this project an n1-standard-4 (4 vCPUs, 15 GB memory) was chosen. Retraining the model took about 13 hours.

Note: Later on, as the data grows, we'll probably need to move to a more powerful machine (maybe GPU or TPU), and/or limit the amount of data used to retrain the model.

After setting up the machine for Python and creating our environment, we'll install the required libraries, ending up with the following packages:

train model — requirement.txt

The important libraries here are numpy, pandas, sklearn, pickle-mixin, and tensorflow (2.0.0).
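As a rough guide, the requirement.txt for this instance would contain something like the following (only the TensorFlow version is pinned here; google-cloud-storage carries over from the same setup steps as the scrapper):

```
google-cloud-storage
numpy
pandas
scikit-learn
pickle-mixin
tensorflow==2.0.0
```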

We'll also create two folders in our data-updates bucket, as follows:

data-updates bucket

The docsToLoad folder will hold the following documents:

  • masterdata.txt — File containing the data collection used by the model currently running in deployment.
  • vocab.data — List of the model's current vocabulary, based on the most frequent words in masterdata.txt.
  • embedded.npy — NumPy array with the embeddings of the words in vocab.data.
  • glove.6B.100d.txt — GloVe embeddings.
  • model_metric.txt — Log containing the metric for each model training run.

In case you're in doubt about these files, check out part one of this series.

The trained_models folder will receive the model.h5, new embedded.npy and new vocab.data files.

The sequence of events to be performed by the instance is the following:

  • Copy the files from the data-updates/docsToLoad folder, along with weekdata.txt.
  • Create local copies to enable writing permissions.
  • Run the Python code to train the model.
  • Copy the updated versions of masterdata.txt, embedded.npy, vocab.data and model_metric.txt to the data-updates/docsToLoad folder.
  • Erase the content of weekdata.txt and upload the empty file to the data-updates bucket.
  • Copy the updated versions of model.h5, embedded.npy and vocab.data to the data-updates/trained_models folder, adding the date to the file names.

Following the same procedure as for the scrapper instance, these steps are translated into a startup-script for the train model instance.
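A hedged sketch of that startup-script, mirroring the steps listed above (user, environment and script names are illustrative):

```bash
#!/bin/bash
cd /home/<user>
echo "training started: $(date)" >> history.txt
source env/bin/activate

# 1. Fetch the documents prepared by the ingest block
gsutil cp gs://data-updates/docsToLoad/* .
gsutil cp gs://data-updates/weekdata.txt .

# 2. Retrain the model on the combined data (see the sketch below)
python3 retrain.py

# 3. Push the refreshed documents back for the next run
gsutil cp masterdata.txt embedded.npy vocab.data model_metric.txt gs://data-updates/docsToLoad/

# 4. Reset weekdata.txt so the next week starts from an empty file
> weekdata.txt
gsutil cp weekdata.txt gs://data-updates/weekdata.txt

# 5. Archive the new artifacts with the date in their names
DATE=$(date +%Y%m%d)
gsutil cp model.h5 gs://data-updates/trained_models/model_$DATE.h5
gsutil cp embedded.npy gs://data-updates/trained_models/embedded_$DATE.npy
gsutil cp vocab.data gs://data-updates/trained_models/vocab_$DATE.data

echo "training finished: $(date)" >> history.txt
shutdown -h now
```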

The Python script for retraining the model works as follows.
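In broad strokes, it merges the week's data into masterdata.txt, rebuilds the vocabulary and the GloVe-based embedding matrix, retrains the Bidirectional LSTM from part 1, and logs the run's metric. Below is a condensed, illustrative sketch; the sequence length, vocabulary size, layer sizes and training settings are assumptions, not the project's actual values.

```python
# retrain.py - condensed sketch of the weekly retraining job
import pickle
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

SEQ_LEN = 20        # length of the input word sequences (assumption)
VOCAB_SIZE = 10000  # number of most frequent words kept (assumption)
EMB_DIM = 100       # glove.6B.100d vectors

# 1. Merge the new week of data into the master data set
with open("weekdata.txt") as f:
    week_text = f.read()
with open("masterdata.txt") as f:
    master_text = f.read()
master_text = master_text + "\n" + week_text
with open("masterdata.txt", "w") as f:
    f.write(master_text)

# 2. Rebuild the vocabulary from the updated master data
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")
tokenizer.fit_on_texts([master_text])
vocab = list(tokenizer.word_index.keys())[:VOCAB_SIZE]
with open("vocab.data", "wb") as f:
    pickle.dump(vocab, f)

# 3. Build the embedding matrix from the GloVe vectors
glove = {}
with open("glove.6B.100d.txt") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")
embedding_matrix = np.zeros((VOCAB_SIZE + 1, EMB_DIM))
for word, idx in tokenizer.word_index.items():
    if idx <= VOCAB_SIZE and word in glove:
        embedding_matrix[idx] = glove[word]
np.save("embedded.npy", embedding_matrix)

# 4. Turn the text into (sequence -> next word) training pairs
ids = tokenizer.texts_to_sequences([master_text])[0]
sequences = np.array([ids[i:i + SEQ_LEN + 1] for i in range(len(ids) - SEQ_LEN)])
X, y = sequences[:, :-1], sequences[:, -1]

# 5. Rebuild and train the Bidirectional LSTM from part 1
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE + 1, EMB_DIM,
                              weights=[embedding_matrix],
                              input_length=SEQ_LEN, trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(VOCAB_SIZE + 1, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
history = model.fit(X, y, batch_size=128, epochs=10)

# 6. Save the model and append this run's metric to the log
model.save("model.h5")
with open("model_metric.txt", "a") as f:
    f.write("accuracy: {}\n".format(history.history["accuracy"][-1]))
```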

Test Environment Block

Now that new models are being saved to the Cloud Storage bucket every week, we can test them locally before deploying them live.

This block acts as a gatekeeper, preventing the deployment of models that are not performing as expected. It's also the place to perform other updates, such as changes to the frontend.

Running on Cloud Shell, the following steps will be taken (a rough sketch of the commands follows the list):

  • Copy the updated models from the data-updates/trained_models folder.
  • Deploy the Docker containers locally with the updated files (see part 3).
  • Perform additional modifications (optional).
  • If the updated app performs as expected, commit the new files to the Cloud Source Repository (or to your GitHub repository).
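A rough sketch of what those steps look like from Cloud Shell (the date suffix, paths, image name and port are all illustrative, and the Docker setup is the one from part 3):

```bash
# Fetch the latest artifacts produced by the train-model instance
gsutil cp gs://data-updates/trained_models/model_<date>.h5 app/model.h5
gsutil cp gs://data-updates/trained_models/embedded_<date>.npy app/embedded.npy
gsutil cp gs://data-updates/trained_models/vocab_<date>.data app/vocab.data

# Rebuild the image with the updated files and try the app locally
docker build -t news-generator:test .
docker run -p 5000:5000 news-generator:test
```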

Manage Container Version Block

Okay. So we tested the new models, or did some updates in the webpage, and we are ready to deploy our new version.

One way to do it is manually, following some of the steps we took in part 4, i.e.:

  • Generate new Docker images based on the updated files.
  • Push the new Docker images to Container Registry.
  • Update the image of the containers inside the pods.

By doing that, Kubernetes will launch a new deployment, creating new pods, and once everything is running it will kill the previous pods, thereby updating your app with zero downtime. This process is called a rolling update.
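For reference, the manual route boils down to commands along these lines (project, image, deployment and container names are illustrative):

```bash
# Build and push a new image version to Container Registry
docker build -t gcr.io/<project-id>/news-generator:v2 .
docker push gcr.io/<project-id>/news-generator:v2

# Point the deployment at the new image; Kubernetes handles the rolling update
kubectl set image deployment/news-generator news-generator=gcr.io/<project-id>/news-generator:v2
```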

Another way is to automate the rolling-update process using Google Cloud Build. With Cloud Build we can turn the steps above into instructions that will be executed after a trigger.

In this project we'll set up a Cloud Source Repository that serves as a mirror of the GitHub repository. Whenever changes are pushed to the repository, this will trigger the Cloud Build.

We'll start by creating a Cloud Source Repository for the project and synchronizing it with the project's GitHub repository. Once that is set, we can view our files as follows:

Now we'll create a trigger in Cloud Build.

We’ll connect the trigger to our repository and define the event that will set up the trigger.

In this project, any push to any branch of the project will set off the trigger.

Once the trigger is activated Cloud Build will look for a cloudbuild.yaml file, so we need to specify the location of that file.

The cloudbuild.yaml file contains the set of instructions Cloud Build will follow.
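A hedged example of such a cloudbuild.yaml, using the substitutions mentioned below (the image and deployment names are illustrative):

```yaml
steps:
  # Build the application image, tagged with the version substitution
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/news-generator:$_VERSION', '.']
  # Push the new image to Container Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/news-generator:$_VERSION']
  # Roll the new image out to the existing GKE deployment
  - name: 'gcr.io/cloud-builders/kubectl'
    args: ['set', 'image', 'deployment/news-generator',
           'news-generator=gcr.io/$PROJECT_ID/news-generator:$_VERSION']
    env:
      - 'CLOUDSDK_COMPUTE_ZONE=$_ZONE'
      - 'CLOUDSDK_CONTAINER_CLUSTER=$_GKE_CLUSTER'
substitutions:
  _ZONE: us-central1-a
  _GKE_CLUSTER: my-cluster
  _VERSION: v2
```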

Notice that _ZONE and _GKE_CLUSTER must match the current cluster that is deploying the app (see part 4).

Also, it is important to remember that before a commit is made to the GitHub repository, the _VERSION substitution in the .yaml file must also be updated.

This guarantees that the new images are deployed to the cluster.

Conclusion of Part 5 🏁

Hurray!!! 👏👏👏👏

In this story we created a CI/CD architecture that allows us to frequently update our model with new data.

I hope you've enjoyed the ride. You can check my GitHub if you want to get into more details.

I would also like to thank my mentor in this project, Arman Didandeh. 👏👏

Next Steps

As mentioned in part 1 of this series, this is an on-going project and there is still much more that can be done.

Here is a list of possible next actions:

  • Include MLflow to keep track of the models and parameters used
  • Add other news topics for the application
  • Rewrite the Frontend with React
  • Test deploying the application with more complex models (e.g. GPT-2)

Thanks for reading!
