Develop Secure End-to-End Machine Learning Solutions in Google Cloud

Rubens Zimbres
11 min read · Mar 13, 2022


In this month’s article I present how to secure Machine Learning solutions in Google Cloud. Some of the points covered:

  • how to handle batch and streaming data for training and inference
  • how to anonymize batch and streaming data
  • how to connect securely to your Jupyter/VSCode environment via internal IP
  • how to integrate Compute Engine and Vertex AI with Kubernetes and Cloud Run
  • how to run cryptographically signed container images on Kubernetes for increased security
  • how to integrate load balancers and secure the interface exposed to external users

As data scientists, we work with data daily: good and bad data in various formats such as CSVs, JSONs, text, images, audio and equipment telemetry. As we know, these data often undergo transformations before we can analyze them or develop a Machine Learning model. Think of an audio file that is transcribed to JSON using Speech-to-Text, parsed into text that will later be tokenized and encoded, or even turned into word embeddings as input for a Natural Language Processing algorithm.

Suppose a client has call audio recordings and wants to generate strategic insights from them. Google has an offering called CCAI (Contact Center Artificial Intelligence) that addresses exactly this and provides the tools to operationalize it. What do we do? How do we collect data from this client? Will we create an SFTP server, or will audio files be uploaded directly to storage? Will we develop a direct integration to capture audio in real time? Is that necessary? Does the client have audio metadata in their Avaya/Genesys system? What format is the audio: 8-bit or 16-bit, mono (one channel) or stereo (two channels)? Moreover, if the project goes into production, do we have to add additional cloud components, such as Cloud Functions that react to any new file uploaded to storage, a Kubernetes cluster for inference, or a Load Balancer to protect the endpoint? Does the client want real-time (streaming) analytics? There are tons of questions you must ask yourself (and the client) before proposing a proof of concept or making a commercial offer. This is at the heart of consulting: understanding the client’s needs, expectations and constraints like money, time, skilled personnel and legacy systems. This way you can offer the best possible solution, one that scales according to the client’s needs, is cost-effective and adds value to the client’s operations, which means customer satisfaction and profit.


All these questions must be addressed before you start working, so you must plan an infrastructure solution for development and production, along with the costs involved. For development, the audio files will require a bucket in Cloud Storage so that they can be read by the Python script, the transcriptions (JSONs) must be stored in a non-relational database or a bucket, and the Machine Learning output then goes to a relational database for analytics. Add a final layer for the dashboard. Simplifying: you will store audio in Google Cloud Storage, JSON transcriptions in Datastore/Storage, the Python script will be hosted on a Compute Engine instance or in Vertex AI (Google Cloud’s Machine Learning platform, which runs R and Python scripts), and we will use BigQuery for relational data. BigQuery is a data warehouse focused on analytics that integrates directly with Data Studio (the dashboard).
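
For instance, a minimal sketch of the transcription step with the Speech-to-Text Python client might look like the following; the bucket, file name, sample rate and language code are placeholder assumptions:

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,       # assumed mono, 16 kHz LINEAR16 audio
    language_code="en-US",
)
audio = speech.RecognitionAudio(uri="gs://my-audio-bucket/call-0001.wav")  # hypothetical bucket/file

# long_running_recognize is suited to audio longer than about one minute
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

transcript = " ".join(r.alternatives[0].transcript for r in response.results)
print(transcript)
```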

Because cloud infrastructure is modular, you are free to create your own solution; there is always more than one solution to a problem. You can run Jupyter notebooks in Vertex AI (with an Anaconda environment), locally on your computer, or even create a Compute Engine instance (a VM), SSH into it and work from your own machine with Jupyter, VSCode, Spyder or PyCharm. For dashboards, you can use Data Studio or Looker. For Machine Learning inference, you can use Cloud Run, a Kubernetes cluster, Vertex AI or Dataproc. Of course, there is always a best solution in terms of cost, efficiency and scalability that follows Google’s best practices. Figure 1 shows an example of a Machine Learning architecture made with the Google Cloud drawing tool, available here.

Figure 1. Machine Learning architecture.

Once you have the architecture planned, you can use the Pricing Calculator to estimate the infrastructure costs involved in your solution (https://cloud.google.com/products/calculator). What is the size of the data and the complexity of the algorithm? Do you need GPUs? How many, and for how long? How many API calls will be made? What is the cost per 1,000 API calls? Since we are talking about the development phase, you can easily integrate the Google Cloud client libraries into your Python, Java, Go or Node.js code so that you can call different services and APIs, like Cloud Storage, BigQuery, the Natural Language API (named entity recognition, sentiment analysis, classification), the Vision API (classification, object detection, face detection, metadata generation), Document AI (OCR), Speech-to-Text, Text-to-Speech and others.
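
As an illustration, calling one of these pre-trained APIs from Python takes only a few lines. Here is a sketch using the Natural Language API for sentiment analysis; the input text is just an example:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="The agent solved my problem quickly, great service.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# Returns an overall sentiment score (-1 to 1) and magnitude for the document
sentiment = client.analyze_sentiment(document=document).document_sentiment
print(f"score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")
```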

As we may collect data from on-premises facilities, it’s important to have a secure connection. For production environments, Dedicated Interconnect is a GCP service that connects internal IPs from an on-premises installation to Google Cloud Points of Presence (PoPs), with link capacities from 10 Gigabits/second up to 100 Gigabits/second. If you cannot reach one of Google’s PoPs, you can use Partner Interconnect, which offers lower bandwidth.

Regarding data ingestion we have a few options: extract-and-load (EL), extract-load-transform (ELT) or extract-transform-load (ETL). Choose extract-and-load (EL) when you have a .csv, .avro or .parquet file that needs no transformation before being loaded. Extract-load-transform (ELT) applies when you need to make small transformations to your data, whose source may be a storage bucket or a SQL database. Extract-transform-load (ETL) is necessary when heavier transformations must be applied to your data. In my view, this last option is especially elegant, because you combine different components that work together in one solution, and it can be applied to streaming analytics.
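
As a sketch of the extract-and-load case, the BigQuery Python client can load a CSV from a bucket straight into a table with no transformation; the bucket, dataset and table names below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the file
)

load_job = client.load_table_from_uri(
    "gs://my-raw-data/calls.csv",          # hypothetical source file
    "my-project.analytics.calls_raw",      # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
```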

One good example of streaming analytics in production is the use of Pub/Sub with Dataflow. Pub/Sub is an asynchronous message delivery service where a publisher (at the edge, for instance) publishes a message to a Pub/Sub topic and subscribers consume it. It allows one-to-many, many-to-one and many-to-many relations: one publisher can publish to more than one topic, one topic may receive messages from various publishers, and so on. It is an alternative to Apache Kafka, has latency on the order of 100 milliseconds, and can either prioritize latency with asynchronous message delivery or deliver messages in batches, considering for instance a window of time. Pub/Sub handles different protocols and schemas, is fault-tolerant with at-least-once delivery (the reason we can see duplicate messages), scales fast with low latency and has high throughput.
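
A minimal sketch of the publishing side, assuming a hypothetical project, topic and telemetry payload, could look like this:

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "device-telemetry")  # hypothetical names

# Pub/Sub messages are raw bytes; attributes (like origin) must be strings
message = json.dumps({"device_id": "sensor-42", "temperature": 71.3}).encode("utf-8")

# publish() is asynchronous; result() blocks until the server acknowledges the message
future = publisher.publish(topic_path, message, origin="edge-gateway")
print("Published message id:", future.result())
```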

Some of the applications of this solution include real-time collection and analysis of equipment telemetry, IoT, predictive maintenance, mobile gaming and video streaming. So why is Dataflow necessary? Well, since Pub/Sub can deliver duplicate messages, it’s necessary to deduplicate them. Besides, depending on the source of the messages, it may be necessary to anonymize data in real time to be GDPR compliant. Anonymization can be done within Dataflow itself, or by using the Data Loss Prevention (DLP) API, which identifies and redacts sensitive data such as names, emails, SSNs (social security numbers), account numbers and credit card numbers.
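
As an illustration, here is a sketch of de-identification with the DLP Python client; the project id, the chosen info types and the sample text are assumptions:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical project id

item = {"value": "Call John Doe at john.doe@example.com about card 4111-1111-1111-1111"}

# Which kinds of sensitive data to look for
inspect_config = {
    "info_types": [
        {"name": "PERSON_NAME"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "CREDIT_CARD_NUMBER"},
    ]
}
# Replace each finding with its info type label, e.g. [EMAIL_ADDRESS]
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

response = client.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)  # e.g. "Call [PERSON_NAME] at [EMAIL_ADDRESS] about card [CREDIT_CARD_NUMBER]"
```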

Dataflow is based on Apache Beam and runs pipelines written in Java, Python or Go. It’s serverless and perfect for streaming solutions, but the same code also runs batch jobs. It uses MapReduce-style computation (Figure 2), in parallel. A pipeline starts and ends with PCollections (parallel collections), whose elements are serialized as byte strings between transformations. ParDo implements the parallel processing, acting on each element of a PCollection. You can use GroupByKey on shuffled data, CombineGlobally and CombinePerKey to aggregate, and Partition to split data, among other transforms.

Figure 2. MapReduce parallel processing.
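
To make these primitives concrete, here is a minimal sketch of a batch pipeline that runs locally with the DirectRunner; the sensor readings and the filtering threshold are purely illustrative:

```python
import apache_beam as beam
from apache_beam.transforms import combiners

with beam.Pipeline() as pipeline:  # DirectRunner by default
    (
        pipeline
        | "Create" >> beam.Create(
            [("sensor-1", 70.2), ("sensor-2", 68.9), ("sensor-1", 71.5)]
        )
        # Element-wise (ParDo-style) transform: keep readings above a threshold
        | "FilterHot" >> beam.Filter(lambda kv: kv[1] > 69.0)
        # Aggregate all values sharing a key, in the spirit of CombinePerKey
        | "MeanPerSensor" >> combiners.Mean.PerKey()
        | "Print" >> beam.Map(print)
    )
```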

As Dataflow handles streaming data, you must define what kind of windowing you want: fixed window (telemetry data, for instance), sliding window (mobile gaming) or session-based window (e-commerce interaction), as seen in Figure 3.

Figure 3. Different types of windows used by Dataflow.
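
As a sketch of how windowing is declared in code, the pipeline below applies a 60-second fixed window to messages read from a hypothetical Pub/Sub subscription; it assumes the streaming flag is set and an appropriate runner (e.g. Dataflow) is configured:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import combiners, window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/telemetry-sub"  # hypothetical
        )
        | "FixedWindow60s" >> beam.WindowInto(window.FixedWindows(60))
        # Count messages per 60-second window; without_defaults() is required
        # for a global combine over a windowed, unbounded collection
        | "CountPerWindow" >> beam.CombineGlobally(
            combiners.CountCombineFn()
        ).without_defaults()
        | "Print" >> beam.Map(print)
    )
```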

After data is ingested by Pub/Sub and preprocessed by Dataflow, what happens? Well, as you may remember from Figure 1, you can load it into BigQuery, a data warehouse with a columnar, relational structure that runs SQL queries very fast and supports streaming inserts. Here you have several options. First, you can develop a Machine Learning model such as logistic regression, gradient boosting, decision trees or even an NLP model with word embeddings using SQL syntax inside BigQuery (BigQuery ML). Second, you can connect BigQuery directly to a Data Studio dashboard to generate strategic insights. Third, the data that arrived in BigQuery from Dataflow can be used as input for model training and inference in Vertex AI, and the results can then be displayed in the dashboard or returned as feedback to a mobile user via App Engine. Fourth, you can send BigQuery data to AutoML, which trains custom models without requiring you to write model code, for tasks like object detection, image classification, object tracking in videos, named entity extraction, sentiment analysis, regression on tabular data and time-series forecasting. Fifth, you can send data to a container-based service for inference, like Cloud Run or Kubernetes (GKE).
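
Here is a sketch of the first option, training a logistic regression model with BigQuery ML from the Python client; the dataset, table and column names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

create_model_sql = """
CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT
  call_duration,
  sentiment_score,
  churned
FROM `my-project.analytics.call_features`
"""

# The query job trains the model inside BigQuery; result() blocks until it finishes
client.query(create_model_sql).result()
```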

Besides Vertex AI, Kubernetes and Cloud Run can run your Python code to automate inference. You probably have the saved weights of your NLP model that will be used to make predictions. Which service will you choose for inference? It depends on the computational requirements of your model and on scalability. Cloud Run is a container-based serverless component of Google Cloud and can be used with Python, Scala, Java or Go natively. Since it runs containers, you can run anything you want on it, as long as it fits within the memory and disk limits.

Kubernetes (Google Kubernetes Engine) is a more robust solution for Machine Learning inference, also based on containers, with autoscaling, auto-healing, a cluster-based infrastructure, security scanning of containers, patched (up-to-date) images, nodes (which are VMs with an operating system), pods and containers (Figure 4). In both cases (Cloud Run and Kubernetes) you define a Dockerfile with the environment, the files to be copied, the libraries to be installed or updated in the operating system, the script to be executed and the port to be exposed; for GKE you also write a .yaml manifest defining the application, network, container image, port, authentication and other properties.

Figure 4. A Kubernetes cluster, showing containers in green.

The Python code running on Kubernetes will have a Flask interface to handle GET and POST requests at the endpoint. You can add this Python code to Cloud Source Repositories via git commands and build it into container images pushed to the Container Registry using Cloud Build. To improve security, the container images can be signed with a cryptographic key pair stored in Cloud KMS (Key Management Service) and validated and allowed through a Binary Authorization policy. This way, GKE will only run safe, validated images, decreasing the possibility of attacks and intrusions. The images will then be available to Google Kubernetes Engine (GKE). Following this methodology, when you want to update your Kubernetes deployment you simply change the code and commit the changes; the new pods are then rolled out with the updated image gradually or switched over all at once, depending on whether you choose a rolling update (the GKE default) or a blue/green deployment.
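
Going back to the Flask interface mentioned above, a minimal sketch of such an inference service is shown below; the pickled model file, the feature format and the port are assumptions, and in practice you would load your own saved weights:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical serialized model packaged into the container image
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/healthz", methods=["GET"])
def health():
    # Simple GET endpoint for readiness/liveness probes
    return "ok", 200

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [0.1, 0.2, 0.3]}
    payload = request.get_json()
    prediction = model.predict([payload["features"]]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    # The Dockerfile would EXPOSE this port for Cloud Run or the GKE service
    app.run(host="0.0.0.0", port=8080)
```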

Once the code and packages are deployed to Kubernetes and available for inference, there are two options for retraining, depending on the infrastructure chosen: you can use an updated BigQuery dataset to retrain your model in Vertex AI, or you can use Cloud Functions to watch for new data in Cloud Storage. When a new dataset is uploaded to Storage, the Cloud Function triggers a Vertex AI training job. This step is especially important to mitigate data drift and improve the generalization properties of your model.
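
A sketch of such a Cloud Function (1st gen, triggered by google.storage.object.finalize) that launches a Vertex AI custom training job; the project, region, training container image, machine type and argument names are placeholders:

```python
from google.cloud import aiplatform

def trigger_training(event, context):
    """Background Cloud Function triggered when a new object lands in the bucket."""
    new_file = f"gs://{event['bucket']}/{event['name']}"

    aiplatform.init(project="my-project", location="us-central1")  # hypothetical

    job = aiplatform.CustomContainerTrainingJob(
        display_name="nlp-retrain",
        container_uri="us-central1-docker.pkg.dev/my-project/training/trainer:latest",  # hypothetical image
    )
    # sync=False starts the training job without blocking the function;
    # in production you may prefer to wait for job creation before returning
    job.run(
        args=["--train-data", new_file],
        machine_type="n1-standard-8",
        sync=False,
    )
```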

This whole infrastructure will live inside Google Cloud (Figure 5) in a VPC (Virtual Private Cloud) and, following the Google Cloud security best practice of minimal external exposure, machines and services will connect to each other through internal IPs. The only service using an external IP will be the Load Balancer (where internet users connect). You may connect to the Compute Engine instance through SSH (port 22) on its internal IP using IAP (Identity-Aware Proxy) tunnels, through which you can safely attach your Jupyter or VSCode via TCP forwarding. APIs are called from internal IPs once you activate Private Google Access at the subnetwork level. The final Kubernetes application or App Engine endpoint will be protected by an HTTP(S) Load Balancer, which helps absorb DDoS (distributed denial of service) attacks. You can also attach Cloud Armor to the Load Balancer, in front of the application (GKE or App Engine), to provide additional protection against Cross-Site Scripting (XSS), SQL injection (SQLi) and remote code execution. Remember that firewall rules must always be present. The whole infrastructure sits inside a secure perimeter defined by VPC Service Controls, with API access governed by Access Context Manager. So there is only one point of exposure to external users.

Two important tools for monitoring the infrastructure and debugging are Cloud Monitoring and Cloud Logging, which are part of the Google Cloud Operations Suite. Monitoring is the foundation of Site Reliability Engineering (SRE): it is useful for creating alerts and event dashboards, and it helps a lot during incident response and root cause analysis. Cloud Logging allows you to analyze application logs, query them, and export them to Cloud Storage (long-term storage at low cost), BigQuery (for analytics) and Pub/Sub. Besides that, we can also count on Security Command Center for threat detection, web security scanning, anomaly detection, and checks for open ports (SSH and RDP), exposed IPs, public buckets (risky), malware, cryptomining, XSS (Cross-Site Scripting), SQLi (SQL injection), clear-text passwords and much more.
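
For example, a sketch of querying recent error logs from GKE containers with the Cloud Logging Python client; the project id and the log filter are assumptions:

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")  # hypothetical project id

# Pull the most recent error-level entries emitted by GKE containers
log_filter = 'resource.type="k8s_container" AND severity>=ERROR'
for entry in client.list_entries(
    filter_=log_filter,
    order_by=cloud_logging.DESCENDING,
    max_results=20,
):
    print(entry.timestamp, entry.payload)
```

The following diagram shows a global view of a Machine Learning solution on Google Cloud, properly secured.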

Figure 5. Global Machine Learning solution in GCP. Full size image available here.

End-to-end solutions require extensive planning, attention to application security, time, effort and a well-synchronized team of project managers, architects, machine learning engineers, data scientists and business intelligence analysts, among others. As data scientists, we don’t need to know every tiny detail of architecture and data engineering. But if we have a global understanding of how things work, we will be better prepared to interact with other specialists, become competent managers and grow in our careers.


Rubens Zimbres

I’m a Senior Data Scientist and Google Developer Expert in ML and GCP. I love studying NLP algos and Cloud Infra. CompTIA Security +. PhD. www.rubenszimbres.phd