Some misadventures in the cloud.
- I’m a former Googler and ran product for the Datacenter Software team, so I’ve had an inside view of Google Cloud’s inner workings. Even so, I find the current set of serverless offerings frustratingly limited when evaluating options for my startup’s backend services.
- Having run a colo facility serving hundreds of clients and having built my first company with servers I hand-assembled, I understand the pros and cons of running my own servers. And, generally speaking, I’d like to avoid having to get paged in the middle of the night with on-call iron issues. I trust Google’s security team deeply to help keep my users’ data safe.
- Can’t keep functions hot. Cold startup performance with Google Cloud Functions is pretty bad (~8s in our experience) and functions cool very quickly: one third-party developer estimated that 50% of functions are cold 15 minutes after a call. For a small app without a huge number of users exercising every function regularly, that results in a crappy user experience. There’s some work we could do on our side to reduce this, but there is no option to “pay to keep hot”. Note that AWS Lambda has a “Provisioned Concurrency” option whereby you can pay a small premium to keep some instances hot. There’s no equivalent in Google-land. This leads people to develop wacky hacks like “once every X minutes pubsub ticklers” to keep functions hot wherever startup latency matters.
- Can’t run long operations serverless. There aren’t good serverless options for long-running operations. It’s unclear to me why Cloud Run and GCF have such severe caps on process duration (15min and 9min maxes, respectively); this makes an architecture that involves long-running operations (e.g. audio streaming) impractical to build serverless. Note that Fargate on AWS can run serverless containers without execution time limits, and further allows for Fargate Spot (for fault-tolerant workloads, at a discount of up to 70%) and EKS (for serverless auto scale-out Kubernetes clusters on Fargate). Some have apparently found a fascinating workaround by using Cloud AI Platform Training to run containers with no maximum runtime cap(!), but this seems like a “hack” and a loophole likely to be closed if it sees abuse.
- Low RAM Caps. Cloud Run, App Engine Standard, and Cloud Functions all cap processes at 2GB of RAM, and there is no way to go over this. That’s an enormous gap from Fargate, which offers serverless containers with up to 30GB.
- Low CPU Caps. Cloud Run, App Engine Standard, and Cloud Functions don’t allow more than two vCPUs of compute. (App Engine offers a “B8” instance with “4.8GHz CPU Limit” but it’s unclear how this maps to vCPUs.) Fargate offers up to 4 vCPUs.
- Can’t Mount Cloud Storage. Cloud Run, GCF, and App Engine don’t support mounting Cloud Storage buckets via a FUSE endpoint. This means that you have to use the client libraries to open read/write streams, or pull the file from Storage into the local /tmp tmpfs, which eats into your very limited RAM quota. Fargate supports FUSE, allowing you to mount S3, and supports EFS mounts out-of-the-box. As of two days ago(!), even Lambda offers EFS mounts.
- No APIs for common long running operations. The above would be mitigated if some of the common long-running operations (stream processing, audio redistribution, transcoding) were offered as first-class services under Cloud. Google offers audio streaming APIs with Cloud ML Speech for streaming transcription, but handing clients a transcription API key (which is what the LiveTranscribe and InfiniteStreaming Google source demos do) seems like a terrible idea. That means we need to proxy the audio, and Google doesn’t offer a way to do this without firing up an instance with GCE. Given Google’s serverless limitations around task length described above, it becomes especially important to “make up” for that with services that can perform long-running operations. AWS offers Elastic Transcoder and MediaConvert for transcoding, for instance. (There’s a Google Cloud Solutions for Media & Entertainment which claims to offer transcoding, but this isn’t behind an obvious API short of engaging an enterprise solution.)
- Uneven Polling vs PubSub: Google Cloud ML Speech in batch mode (longRunningRecognize) requires polling and doesn’t use PubSub to update you on progress and completion.
- RealtimeDB & HIPAA. Firebase RealtimeDB is still best-practice for mobile app backends vs Firestore, but isn’t HIPAA-Compliant. While Medcorder is entirely patient-driven and therefore outside HIPAA jurisdiction, we’d like to follow the same security guidelines as if we were subject to HIPAA. This means we’re going to need to port to Firestore to achieve HIPAA compliance, which is going to be a pain. I don’t know why Google wasn’t able to get RealtimeDB HIPAA-certified, and that lack of effort is going to cost my startup precious time. Boo.
- I’m all-in on Google, but I wish Google were all-in on serverless and had better offerings to help startups like ours thrive.
- If I missed anything here (very likely!) and it turns out Google does have good solutions for my problems, please give me a shout at firstname.lastname@example.org — I’d love to get the correction (and will update here)!
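To make the polling complaint above concrete, here is a minimal sketch of the loop a client ends up writing when a batch job reports completion only through polling. It assumes the job is exposed as a Google-style long-running operation, i.e. a dict with `done`, `error`, and `response` fields. `fetch_operation` is a hypothetical callable standing in for whatever client call returns the operation's current state, and the delay and timeout defaults are illustrative rather than recommended values.

```python
import time


def poll_until_done(fetch_operation, initial_delay=1.0, max_delay=30.0,
                    timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll a long-running operation with exponential backoff.

    fetch_operation is a hypothetical zero-argument callable returning the
    operation's current state as a dict with "done", "error", "response".
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        operation = fetch_operation()
        if operation.get("done"):
            # A finished operation carries either an error or a response.
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation.get("response")
        if clock() + delay > deadline:
            raise TimeoutError("operation did not finish before the timeout")
        sleep(delay)
        # Double the wait each round, capped so we never sleep too long.
        delay = min(delay * 2, max_delay)
```

With a PubSub completion notification, none of this loop (or the wasted requests it makes while the job is still running) would be needed.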
UPDATE June 19, 2020
- A number of Google Cloud Product Managers reached out with some fantastic news: almost all of the above are on the roadmap already. Hooray! If you’re interested in signing up, there are forms to enroll in alphas for min-instances, 60 minute operations, 4GB RAM / 4 vCPU Cloud Run instances, Cloud Run Streaming, and Cloud Run Graceful Shutdown. I feel like a kid in a candy store!
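Until min-instances is generally available, the “pubsub tickler” hack mentioned earlier might look roughly like the sketch below. It assumes a Pub/Sub-triggered background function in the Python runtime, which receives `(event, context)` with the message body base64-encoded in `event["data"]`, and a scheduler job that publishes a marker message to the same topic every few minutes. The `WARMUP_MARKER` payload and the `do_real_work` helper are hypothetical names used only for illustration.

```python
import base64

# Hypothetical payload that the scheduled "tickler" publishes.
WARMUP_MARKER = "keep-warm"


def is_warmup(event):
    """Return True if the Pub/Sub message is a keep-warm ping."""
    data = event.get("data")
    if not data:
        return False
    try:
        payload = base64.b64decode(data).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    return payload == WARMUP_MARKER


def do_real_work(event):
    """Placeholder for the function's actual logic (hypothetical)."""
    return "processed"


def handle_message(event, context=None):
    """Entry point: answer pings immediately, otherwise do the real work."""
    if is_warmup(event):
        # Returning early keeps the ping cheap; the point is only that an
        # instance was started (or kept alive) to serve it.
        return "warm"
    return do_real_work(event)
```

Each ping still counts as an invocation and only keeps a single instance warm, which is part of why this is a hack rather than a substitute for a real “pay to keep hot” option.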