Achieving confidential and risk-averse AI/ML with advances in ubiquitous data encryption

Dr. Richard Ramyar
Google Cloud - Community
17 min read · Aug 25, 2022

(This article was originally published on LinkedIn, where an explanatory video is also linked)

Cloud scale AI/ML solutions give unparalleled capacity and speed on demand, but some organisations do not free their engineers to consider them, due to privacy, data sovereignty or security concerns. The need for reassurance can even exceed regulatory requirements, whether for business or cultural reasons. More rigorous, ubiquitous end-to-end data encryption can allow engineers to offer that reassurance, by uniquely combining solutions from organisations such as Google Cloud Platform, AMD and Thales. These security gains are underexploited and little understood, let alone their potential for risk-averse AI/ML contexts.

There is an ironic block to achieving this confidentiality. Cloudnative, secured containerisation is incompatible with the hardware access required to prove machine memory encryption. We suggest a new middleware API approach to resolve this, and also consider the role of vTPM cryptographic attestation at boot time, Linux kernel options and T/GPU availability.

  • Privacy, data sovereignty and business concerns artificially reduce AI/ML appetite
  • Becoming the encrypted AI/ML provider to your organisation offers secured control
  • Pinpointing the missing links in the cryptographic story
  • Bringing AI/ML to confidential data using ubiquitous data encryption
  • Cloudnative catch 22: containerisation is a security block on ubiquitous data encryption
  • Getting the confidential AI/ML architecture right for cloudnative workloads
  • Further reading

The business promise of AI/ML solutions is already well proven in a range of areas, from predictive analytics and document processing to language and video analysis. Cloud scale solutions give unparalleled capacity and speed on demand, but some organisations do not feel free to consider their potential. #Privacy, #DataSovereignty or #Security concerns create sticky barriers: the implicit trust required of the cloud service provider (CSP) itself and the choice of jurisdiction that houses communal #AIML compute in the cloud. The increasing reach of #MachineLearning increases the number of groups who stand to benefit but need reassurance first. Where concerns are unspoken or consideration of AI/ML has barely begun, potential usage of AI/ML may not even come out into the open.

Advances in CPU level memory encryption and conditional encryption key access by #GoogleCloudPlatform (#GCP) have opened up new possibilities for AI/ML usage that is risk averse or jurisdiction sensitive. This underexploited and little understood opportunity is already being used in financial services and offers more thorough #EndToEndEncryption of data than can normally occur in the cloud. Key access can be kept outwith CSP control and improved jurisdictional control can force #cloudnative AI/ML to meet your data sovereignty needs. All while maintaining the unparalleled cloud scalability and on demand speed that non-batch workloads require.

The architectural challenge we address below is how to apply these technologies to modern containerised workloads, working around an unavoidable limitation in the solution provided by GCP. The limitation stems from the involvement of hardware that your containers should not be able to talk to. We propose API middleware and touch on why GCP Ops Logging of vTPM cryptographic attestations is insufficient. We also briefly explore Linux kernel parameters that should provide ample reassurance, even where instance memory #encryption is not available (for example, alongside T/GPU hardware).

(The architecture diagram that originally appeared here is explained later in this article and can undoubtedly be improved. Any suggestions, observations and identification of gaps or errors will be gratefully received.)

Privacy, data sovereignty and business concerns artificially reduce AI/ML appetite

National security related, financial services, healthcare and other industries can still be naturally conservative about trusting third party systems with their data or analysis. The cloud is ‘someone else’s computer’ after all. This discomfort can even be cultural to an organisation, department or team, persisting even when strict compliance standards are regularly met by providers such as GCP, including ISO, FedRAMP, HIPAA, PCI DSS, EBA, ESMA, PRA and other cloud outsourcing related regulations and standards. API provided CSP services are not immune to this, which is very relevant to machine learning. Pre-canned, state of the art GCP AI/ML services are accessed programmatically via public APIs, which are served from secured ‘communal’ compute that is not allocated to you.

Becoming the encrypted AI/ML provider to your organisation offers secured control

Acting as the cloudnative AI/ML service provider to your organisation is the viable hands-on alternative: models built, tested, served and curated from your private clusters. These could be in a hybrid or multi-cloud #Anthos environment or on Google Kubernetes Engine (#GKE). Although more onerous than consuming Google’s out-of-the-box APIs, this increasingly common approach offers efficient and scalable control to your machine learning engineers and infrastructure administrators via standard open source platforms such as #Kubeflow (which builds on technologies such as Knative for a private cloudnative experience). A quick look at the broader security posture of this private GKE cluster solution shows that it addresses many of the concerns that we opened this article with. But not quite all, as we shall see.

Even apart from GCP’s $10 billion #cybersecurity plan, encryption and #cryptography are absolutely inherent to the platform. Your own #cloud clusters can benefit from machine integrity tests based on cryptographic signing and hashing of boot firmware and sequence (which come together in Shielded VM instances, which can also be deployed as GKE Shielded Nodes). You then know your virtual machine itself has not been interfered with, even by a compromised hypervisor. Data is always encrypted at rest in storage. And there are options for increasing degrees of on and off-cloud control, such as integration of external #keymanagement systems or supplying your own keys directly. Plus, stored data is chunked across different locations, each chunk separately encrypted, protecting against breaches of Colossus, Google’s distributed file system. Data is also encrypted in flight at the network layer from your machine’s perspective (layer 3), all of which occurs on GCP’s Andromeda software defined network (SDN). The rotating keys underlying this are complemented by authentication security tokens, which spoofed packets cannot impersonate. #ZeroTrustSecurity discipline can further encrypt between microservices, for example using mTLS within Istio or other service mesh frameworks (not limited to containerised workloads). GCP can do this at up to three times the network throughput of other CSPs (in part because custom NIC hardware offloads cryptography from the CPU). And on your cluster in the GCP region and jurisdiction of your choice (versus GCP AI/ML API locations, where ‘Global’, ‘US’ or ‘EU’ may be too broad for your data sovereignty needs).
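For illustration, both of these come down to a few flags at instance or cluster creation time. A minimal sketch, with placeholder zone, region and names rather than recommendations:

    # Standalone Shielded VM: secure boot, vTPM and integrity monitoring
    gcloud compute instances create shielded-demo \
        --zone=europe-west2-a \
        --shielded-secure-boot \
        --shielded-vtpm \
        --shielded-integrity-monitoring

    # GKE cluster whose nodes boot as Shielded VMs
    gcloud container clusters create private-ml-cluster \
        --region=europe-west2 \
        --enable-shielded-nodes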

Pinpointing the missing links in the cryptographic story

All the above control and encryption is critical reassurance and best practice. Bullet proof for regulated scenarios. But there are two holes in the conventional chain of cloud encryption outlined above.

Firstly, machine memory itself is unencrypted. This leads to the highly improbable possibility of malicious actors gaining access to unencrypted data that is being processed, which had been carefully encrypted in storage and in transit. This could hypothetically occur because of a breach of the myriad security measures that Google deploy. Or because a new CPU level or KVM/QEMU hypervisor vulnerability is exploited before it can be identified and patched, allowing someone to explore outside their own virtual machine into the bare metal host.

In #Linux terms, let us set out the improbability of free-range exploration of your virtual machine’s or communal services’ memory. Direct access to /dev/mem or /dev/kmem is not enabled by default in mainstream Linux distributions. Respectively, raw main memory access would require CONFIG_STRICT_DEVMEM to be disabled in the #kernel on purpose, and access to the entire kernel virtual memory space would require CONFIG_DEVKMEM to be enabled. These best practices are also requirements for bootable Linux images on GCP. Page attribute table support should also not be disabled with the nopat kernel parameter on x86 systems, as this would ease malicious memory exploration. GCP’s Container-Optimized OS used for GKE nodes has a hardened approach which exceeds these standards. Given that the bootloader is the most immediate way to manipulate kernel boot options, it is also worth noting that the integrity of all these #LinuxKernel options is protected by UEFI measured-boot hashing in a GRUB context. One can reasonably assume that all the above and more is the default practice on Google’s bare metal machines as well, upon which everything else runs.
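A quick sanity check of those defaults is possible from a shell on any node. A minimal sketch, assuming common config file locations (Container-Optimized OS may expose the kernel configuration at /proc/config.gz instead of under /boot):

    # nopat should not appear on the kernel command line
    grep -qw nopat /proc/cmdline && echo "WARNING: nopat is set" || echo "OK: PAT enabled"

    # STRICT_DEVMEM should be set; DEVKMEM should be unset or absent
    KCONF=/boot/config-$(uname -r); [ -r "$KCONF" ] || KCONF=/proc/config.gz
    zgrep -E 'CONFIG_STRICT_DEVMEM|CONFIG_DEVKMEM' "$KCONF"

    # /dev/kmem should not exist at all on a hardened image
    ls -l /dev/mem /dev/kmem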

Secondly, there is still a residual level of trust required of the CSP relating to management of and access to keys for decryption or signing. That required trust applies even if you integrate externally managed encryption keys into GCP Cloud KMS. At its most basic, if GCP asks their or your on-premises systems for a key, GCP will get a key. The permissioning controls on top of that come from role based GCP Identity and Access Management (IAM) and GCP metadata. A hypothetical breach of IAM itself and metadata provision would need to be sufficiently specific or broad to achieve access to 1) GCP’s Cloud Key Management Service (Cloud KMS) and/or 2) the on-premises #KMS, via Cloud External Key Management (EKM) integration into Cloud KMS.

In short, exploring memory across virtual machine instances is extremely unlikely. Some forms of attack would also require physical access to premises, which is even more unlikely. Many would imply a serious level of preplanned complexity, perhaps needing to be state sponsored. The reality is that major cloud providers exceed the level and consistency of security that nearly any organisation could achieve by itself. Plus, there are a number of unspoken ‘ifs’ in the hypothetical possibilities above.

Nonetheless, zero trust security exists as a discipline for a reason. From Meltdown and Spectre to hypervisor holes, ‘the unknown unknowns’ do not pre-announce themselves. This means that there is demand for encrypted memory compute to further enhance cloud security, not just for #artificialintelligence workloads. A need that has been met…

Bringing AI/ML to confidential data using ubiquitous data encryption

The silicon giants have met this market demand with encrypted memory solutions, which also open up a new key management paradigm that can lock the CSP out, enabling more ubiquitous, end-to-end encryption. These gains are underexploited and little understood, let alone their potential for risk-averse AI/ML contexts. Too often, GCP-provided API based services remain the only first class citizens in corporate AI/ML conversations. Initial consideration of machine learning possibilities then does not dig past the shared communal CSP compute options.

Virtual machine memory encryption reduces the implicit trust required in CSPs, because reading your system’s memory from outside the guest becomes essentially impossible. This can be dropped directly into most stacks with no change to code and little to no impact on execution speed, because it is a CPU level feature. GCP makes this available via #AMD EPYC Rome and Milan generation CPUs in GCP N2D and C2D machine types. Confidential VM instances can be used as standalone Google Compute Engine (GCE) virtual machines or as Confidential Nodes within a GKE cluster (including within Anthos). These processors encrypt memory using AMD Secure Encrypted Virtualization (SEV) technology, which opens up their Secure Memory Encryption (SME) capability to KVM instances. Memory encryption keys are generated during VM creation and are not stored outside of AMD silicon, meaning they are not accessible to Google by design.
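For illustration (machine type, zone and names are placeholders, and an SEV-capable guest image is assumed), a Confidential VM or a GKE cluster with Confidential Nodes is a flag away, and the guest kernel can confirm that SEV is active:

    # Standalone Confidential VM on AMD EPYC (N2D); confidential instances
    # cannot live-migrate, so host maintenance must terminate them
    gcloud compute instances create confidential-demo \
        --zone=europe-west2-a \
        --machine-type=n2d-standard-4 \
        --confidential-compute \
        --maintenance-policy=TERMINATE

    # GKE cluster running Confidential Nodes
    gcloud container clusters create confidential-ml-cluster \
        --region=europe-west2 \
        --machine-type=n2d-standard-4 \
        --enable-confidential-nodes

    # From inside the guest: memory encryption should be reported as active
    dmesg | grep -i sev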

The exciting development from GCP is that existing technologies can then be combined innovatively, going beyond end-to-end encryption to achieve what Google terms #UbiquitousDataEncryption (#UDE). External on premises KMSs could already be connected to Cloud KMS via Cloud EKM, meaning GCP services can be made dependent on externally managed encryption keys. When confidential instances are combined with EKM, GCP cannot see ephemeral keys that are held in encrypted memory and requested for data de/encryption operations (which might be for block, object or file level storage). External KMS access can also be locked down to specific instances, in specific zones, in specific projects. This reduces the blast radius of GCP IAM breaches and is complemented by external KMS logging (available for SRE automation and alerting). VPC Service Controls could also play a part in limiting access to Cloud EKM. But the benefits of memory encryption key concealment could not previously be enforced as a requirement by the external KMS. Instance configuration choices are decided on GCP, with nothing to stop a non-confidential VM from successfully accessing the external KMS. A gap that GCP has now addressed.
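On the Cloud KMS side, an externally protected key looks roughly like the sketch below. The key ring, key name and external key URI are purely illustrative, and an EKM connection to the external manager is assumed to exist already:

    # Create a key whose material stays in the external KMS
    gcloud kms keyrings create ude-ring --location=europe-west2

    gcloud kms keys create ude-key \
        --keyring=ude-ring --location=europe-west2 \
        --purpose=encryption \
        --protection-level=external \
        --skip-initial-version-creation

    # Point a key version at the externally held key material
    gcloud kms keys versions create \
        --key=ude-key --keyring=ude-ring --location=europe-west2 \
        --external-key-uri="https://ekm.example.com/v1/keys/ude-key" \
        --primary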

The new magic comes from locking down the external KMS further, with EKM access made conditional on encrypted memory hardware making that request (initially only supported by #Thales CipherTrust Cloud Key Manager). Ephemeral storage-key access can thus be locked to encrypted memory instances, further reducing the implicit trust required of GCP. Key access policy can also be controlled separately by your on premises KMS for encryption and decryption operations (perhaps on premises workloads with unencrypted memory are trusted, for example). GCP has made a client tool available for such key operations, which is also directly integrated into gsutil for Google Cloud Storage (GCS) objects. Data lakes for AI/ML model generation are often ideally kept on #ObjectStorage, making this feature of direct importance to the private AI/ML clusters discussed above. For additional reassurance, a split trust approach can be deployed, where the secret is split into parts and authorisation must be granted by both GCP Cloud KMS and the external KMS via EKM. This overall UDE approach has been successfully deployed in the field, an example being Deutsche Bank.

We can see that we have solved the problem that we started with. Cloud scale AI/ML possibilities need not be held back by privacy, data sovereignty or security concerns. Scalable, on demand private GKE node pools can be deployed with encrypted memory in a specific national jurisdiction of your choice, replacing communal AI/ML compute in the cloud. Cloud best practice can already meet regulatory expectations, but additional encryption protections offer 24 carat security for the culturally risk averse or for specific business needs. The implicit trust required of the CSP can be reduced substantially and control increased with UDE.

The next step is surmounting an unavoidable technological limitation of the solution.

Cloudnative catch 22: containerisation is a security block on ubiquitous data encryption

The complication we resolve below is that the GCP tool must be run on the confidential virtual machine as a root user. But running containers with that kind of root access to the host is not acceptable on GKE or Anthos implementations: best practice does not allow privileged containers, for good security reasons.

To resolve this, AI/ML workloads could instead be deployed directly on an autoscaled managed instance group (MIG), without containerisation. This is a reasonable solution, but a real loss in the world of cloud native computing. #Kubernetes is required for #MLops workflows and orchestration by the likes of Kubeflow, offering substantial practical gains throughout the model lifecycle. Unlike conventional software that is merely built and deployed, models have to be curated constantly to retain predictive power. Sprawl of models and versions must also be managed carefully.

For the curious: these root privileges are required because the hardware environment must be interrogated, especially to cryptographically attest that the requesting virtual machine is a confidential instance. We presume that this requires access to /proc/cpuinfo, /dev/tpm0 and/or /dev/tpmrm0. It was not immediately clear which of these are used for the ad hoc checks performed during remote attestation (which are carried out by Secure Nested Paging (SNP) after measured boot’s capture of SEV status on startup). There is the option of using GCP Ops Logging, which contains the vTPM attestation made at boot time. This is not appropriate in our confidential computing scenario: we need ad hoc attestation on request without reboot, and GCP Ops still depends on implicit trust of CSP systems.
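For the equally curious, the candidate interfaces can at least be inspected from a shell on a confidential instance. A sketch only; as noted above, it is not certain which of these the attestation flow actually reads:

    # vTPM character devices that the attestation tooling may rely on
    ls -l /dev/tpm0 /dev/tpmrm0

    # CPU feature flags as exposed to the guest (the sev flag may or may not
    # be passed through; the dmesg check shown earlier is more reliable)
    grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -x sev \
        || echo "sev flag not visible in this guest"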

Getting the confidential AI/ML architecture right for cloudnative workloads

Secure containerisation is an obstacle to achieving UDE, so what can we do? As mentioned above, we could abandon the cloudnative gains of Kubeflow type #architecture and merely deploy models directly to MIGs. Where this is unacceptable, the answer we propose is to assemble very simple middleware, running an uncontainerised API on raw virtual machines in an autoscaling MIG. The MIG endpoints can be firewalled to answer only via GCP Internal HTTPS Load Balancing (‘ing and not ‘er, because GCP LBs are part of the SDN stack rather than balancer appliances in themselves). This will allow other containerised workloads running on the private cluster to run UDE operations, or to request UDE protected object data on GCS, via REST or gRPC. It also adds a level of ‘security through obscurity.’

This approach maintains cloud scalability: the number of machines in the MIG will scale up and down depending on actual load. It is important to note that non-batch data analysis scenarios are more sensitive to network throughput. Instance templates have to account for this through the choice of machine shape if results are to be timely, as maximum bandwidth ranges from 10 to 100 Gbps depending on configuration. The configuration of the middleware API machines can be deployed by cloud-init, whether running bash, Ansible, Puppet or other scripts. The YAML configuration of the GCP UDE tool for encryption and decryption actions is held at ~/.config/stet.yaml on each managed instance (including whether split keys are used).
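The MIG plumbing itself is conventional. A minimal sketch (machine shape, region, names and the cloud-init file are illustrative, and an SEV-capable image is assumed; the template reuses the confidential flags shown earlier):

    # Instance template: confidential N2D machines configured by cloud-init
    gcloud compute instance-templates create ude-api-template \
        --machine-type=n2d-standard-8 \
        --confidential-compute \
        --maintenance-policy=TERMINATE \
        --metadata-from-file=user-data=cloud-init.yaml

    # Regional managed instance group with load-based autoscaling
    gcloud compute instance-groups managed create ude-api-mig \
        --region=europe-west2 \
        --template=ude-api-template \
        --size=2

    gcloud compute instance-groups managed set-autoscaling ude-api-mig \
        --region=europe-west2 \
        --min-num-replicas=2 --max-num-replicas=10 \
        --target-cpu-utilization=0.6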

A ‘quick and dirty’ route to creating an API to UDE on this MIG is to open two HTTPS endpoints listening for POST requests. Logging middleware requests to each virtual machine’s syslog will ensure that they are automatically collected into GCP Logging for manual analysis or #SecOps automation. The command line tool itself must be invoked in the background by the API, as UDE functionality is not integrated into GCP SDKs at the time of writing:

  1. The /stet endpoint would run the stet command from the GCP tool for direct encryption and decryption using UDE protected keys, which the client libraries on the encrypted memory virtual machine would access via Cloud EKM. The endpoint would have three parameters: an API key, the operator type (encrypt/decrypt) and the operand (a file in base64).
  2. The /gsutilcpstet endpoint would run the gsutil cp command with the --stet flag, which automates object transfer from GCS and the UDE operation in a single command. The endpoint would have three parameters: an API key, the source and the destination. For decryption, the source is a gs:// object location and the API response will contain the file in base64. For encryption, the source is a file in base64 and the destination is a gs:// object location. Base64 encoding accommodates blobs, such as images. Navigation of buckets and objects can be executed on GKE nodes themselves, as keys are not necessary to view GCS names or metadata; only appropriate IAM privileges or Access Scopes are required. A usage sketch against these hypothetical endpoints follows below.
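By way of illustration only, a workload on the cluster might call the decryption path as below. The host name, JSON shape and the assumption that the response body is raw base64 are all properties of this hypothetical middleware, not of any published API:

    # Fetch and decrypt a UDE-protected object via the middleware, then decode it
    curl -s -X POST "https://ude-middleware.internal.example/gsutilcpstet" \
        -H "Content-Type: application/json" \
        -d '{
              "api_key": "'"$MIDDLEWARE_API_KEY"'",
              "source": "my-datalake-bucket/training/archive-0001.tar.gz",
              "destination": "archive-0001.tar.gz"
            }' \
        | base64 -d > archive-0001.tar.gz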

There are some additional technical points for the engineer to consider. gsutil’s STDOUT currently cannot be piped, so it could be called in a bash sequence along the lines of:

    # Stage the decrypted object locally, then emit it as base64
    # (urlencode is the author's URL-encoding helper; EPOCHREALTIME needs bash 5+)
    destination="$(urlencode "$source")"$EPOCHREALTIME
    gsutil cp --stet "gs://$source" "$destination"
    base64 "$destination"

STDOUT should be streamed to the HTTP response while output is generated, not blocked until completion. Otherwise memory usage will balloon, because the base64 encoded version will be held in memory until fully generated and ready to serve (data archives may be very large). Relevant streaming modules appear to exist for Node and Go, for example (these should be assessed for appropriate memory behaviour, i.e. that they do not quietly buffer the whole stream themselves). You also need to make sure that the daemon serving the endpoint does not impose a maximum HTTP response size. An additional endpoint or another approach could also be considered to allow for multiple file transfers or gsutil rsync (wildcard download and decryption in the above approach would fail at the base64 encoding stage). You could also consider creating an external service entry in a service mesh such as #Istio/Anthos SM (pointing at the GCP load balanced endpoints). Or you could replace GCP load balancing entirely, install Istio on the VMs and take advantage of the increased security of Istio load balancing with #mTLS (which is not available on GCP LBs).

The API keys to access the middleware can be stored and managed in GCP Secret Manager, to which read access can be locked down to only the service accounts used by 1) the API middleware managed instances to be protected and 2) the GKE node pools that are allowed to access the middleware. Care must be taken to make sure that policies do not allow impersonation of these service accounts, and VPC Service Controls can again be used. There is a balancing act here: it is the only point where implicit trust of GCP is re-escalated to normal levels by use of a CSP service. But it comes with the real bonus of locking down access to only the allowed GKE node pools via their service accounts. Put bluntly, it is also worth noting that the security of GCP’s secret and key managers is absolutely existential to the credibility of Google’s platform. Alternatively, third party secret managers could be integrated, perhaps including hardware security modules (HSMs).
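A sketch of that lockdown, with illustrative secret, project and service account names:

    # Store the generated middleware API key
    printf '%s' "$GENERATED_API_KEY" | \
        gcloud secrets create ude-middleware-api-key --data-file=-

    # Grant read access only to the two service accounts that need it
    for SA in ude-middleware-mig@my-project.iam.gserviceaccount.com \
              gke-ml-nodepool@my-project.iam.gserviceaccount.com; do
        gcloud secrets add-iam-policy-binding ude-middleware-api-key \
            --member="serviceAccount:${SA}" \
            --role="roles/secretmanager.secretAccessor"
    done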

An additional factor to architect for is that GKE confidential nodes do not currently offer T/GPU hardware. This is less of a problem than it seems. Not all machine learning models benefit heavily from hardware acceleration for training (deep learning is the prime exception). This means confidential nodes remain appropriate for training and serving many types of models.

Where T/GPUs are unavoidably needed, non-confidential preemptible node pools with accelerators could be used for deep learning model training. This use of unencrypted memory instances is not a deal breaker for a number of reasons: dilution of UDE gains by non-confidential nodes can be minimal, temporary and with a limited blast radius. Firstly, preemptible nodes are as ephemeral as possible (lasting a maximum of 24 hours, compared to spot or normal nodes, which can last longer). You cannot predict in advance which GCP bare metal machines the preemptible instances will be booted on, and they will constantly change with machine preemption. This undermines many forms of attack. To enhance this feature, nodes could also be drained on a random regular schedule and their instances then directly deleted, causing them to be recreated automatically by the GKE MIG. Secondly, T/GPU non-confidential node memory is still protected by the Linux kernel memory protections outlined above, which Container-Optimized OS running on GKE nodes exceeds. Thirdly, training data rarely needs to contain personally identifiable or other confidential information. If necessary, such fields can be redacted with GCP’s Cloud Data Loss Prevention (DLP) API, which meets gruelling regulatory and industry standards, or with third party offerings or open source solutions. Fourthly, even models that benefit most from T/GPU training might be viably served from non-accelerated, high performance confidential nodes. Models served from confidential nodes are unaffected, in that they are still served from encrypted memory.
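For example (pool, cluster, region, machine shape and accelerator type are placeholders), such a pool can sit alongside the confidential pools and scale to zero when idle:

    # Preemptible, GPU-accelerated node pool for deep learning training only;
    # accelerators currently require non-confidential machine families
    gcloud container node-pools create preemptible-gpu-pool \
        --cluster=confidential-ml-cluster \
        --region=europe-west2 \
        --preemptible \
        --machine-type=n1-standard-8 \
        --accelerator=type=nvidia-tesla-t4,count=1 \
        --num-nodes=1 \
        --enable-autoscaling --min-nodes=0 --max-nodes=8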

The above can undoubtedly be improved. Any suggestions, observations and identification of gaps or errors will be gratefully received.

(This article was originally published on LinkedIn, where an explanatory video is also linked)

Further reading

© Dr. Richard Ramyar, 2022 (trademarks and original images are property of their respective owners)

See https://www.wipro.com/cloud and https://www.wipro.com/cybersecurity for associated services

Visiting Scholar (Honorary) @ The London Institute of Banking and Finance / LIBF Centre for Governance, Risk and Regulation / LIBF Centre of Digital Banking and Finance

Views are mine and not necessarily those of any employer, client or other associated organisations
