Offloading compute from my laptop to the cloud

Boris Berenberg
9 min read · Sep 23, 2019


I was disappointed to learn that offloading compute to a cloud provider, as a cost-saving and flexibility measure, introduces a fair amount of work and saves very little money (if any).

Most of my work is on Atlassian products like Jira and Confluence. We’re either testing functionality on behalf of a consulting client, or building apps (plugins) which we sell on the Atlassian Marketplace. Depending on the task I am performing, the memory on my local device can be a limiting factor, and I may be thrashing CPU and disk I/O as well. My laptop is a 2017 MacBook Pro with 16 GB of RAM. Even after Slack improved their memory utilization and I switched to Safari, the system still bogs down at times.

I have been interested in the idea of switching to a Chrome OS laptop (or an iPad Pro with a keyboard) for a while now, for the sake of security, cost, battery life, lower requirements around fleet management and backups, size and weight, and cool factor. However, I have been wary of the dependency on an internet connection and of the various issues associated with remote compute.

There have been many articles written on the topic.

And while they all seem to work well for environments that have robust CI/CD pipelines, microservices, and all the latest sexiness, that isn’t what my day-to-day looks like, so I wanted to try it out for myself.

Why am I doing this?

The point of this is to simulate a Chrome OS-style (or similar) environment on my primary machine: offload the compute to a cloud provider, and then evaluate this over the span of a few months. I received a $500 credit to Google Cloud Platform (GCP) earlier this year, and as it will expire in February, I thought this would be a good way to use it.

Concerns I had:

  • From unlocking my computer to having compute ready to go, how long would it take?
  • From the moment I finish work, till everything is cleaned up, how long would it take?
  • Can we set up something to defend against my forgetting to turn it off?
  • What if I run out of storage space?
  • How do I deal with variable IP addresses?
  • How do I deal with variable port usage?
  • How do I securely get my SSH keys to this box as it will likely need to act as a jump host to other machines I normally work with?
  • How will my IDEs (VSCode and IntelliJ) work with this?
  • What does security look like for a system like this?
  • How do network latency and speed affect my ability to use this?

How does this actually work day to day?

I have set up an on-demand, variable compute power, remote workstation on GCP. Once you know what you’re doing, this is actually surprisingly simple, but Google did not make it easy to figure out the best approach.

However, I did get to the point where all I need to do to set up and tear down the environment I have designed is:

rd30
exit

But what exactly does this do, and how does it do this?

This spins up a virtual machine with 30 GB of memory for me to use, and on exit, it cleans everything up for me.

Let’s jump back to the questions I listed earlier:

  • From unlocking my computer to having compute ready to go, how long would it take?

From the time I run the start command until I am SSHed into the box is typically 50–70 seconds. If I switch my disk to SSD, it seems to save anywhere from 7–15 seconds.

Periodically, the startup time of the VM is longer than expected, which results in the SSH command timing out and then the delete command running. This seems to be an issue on the GCP side, as it happens for an extended period (a week) and then resolves itself (capacity? bugs? who knows?). Here is the error:

Created [https://www.googleapis.com/compute/v1/projects/gcp-desktop/zones/us-east4-c/instances/remote-desktop].
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
remote-desktop us-east4-c n1-standard-2 10.150.0.54 35.199.57.192 RUNNING
ssh: connect to host 35.199.57.192 port 22: Connection refused
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
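One guard I have been considering (a sketch of my own, not something the alias does today) is to retry the SSH step a few times before falling through to the delete:

# Retry gcloud ssh a few times so a slow boot doesn't fall straight through to deletion.
# Note: a session that exits non-zero also triggers a retry, which is fine for my use case.
for attempt in 1 2 3 4 5; do
  gcloud compute ssh --ssh-flag="-A" remote-desktop-30 && break
  echo "SSH not ready (attempt $attempt), retrying in 15s..."
  sleep 15
done
gcloud compute instances delete --quiet remote-desktop-30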
  • From the moment I finish work, till everything is cleaned up, how long would it take?

Teardowns take ~2 minutes and are effectively a hard shutdown. The shutdown command is issued near-instantly, so the 2 minutes is the time it takes my terminal to get confirmation of the deletion back from Google. In practice, I can close my laptop immediately and let the deletion continue in the background.

My workload is an interesting one in that nothing really matters too much in it. Everything can be rebuilt very easily. If data gets corrupted, it’s unlikely to cause any major issues for me. This means that I can tear the machine down and not care about the status of PostgreSQL, Lucene, or any number of other bits of code that are running on my system.

If I need to disconnect from the system without shutting down the whole instance, I can kill the terminal and the delete command will never execute.

  • Can we set up something to defend against my forgetting to turn it off?

I have opted not to do this via autoscaling and custom metrics, due to the complexity involved. Following on from my SSH auto-teardown approach, I have configured an SSH timeout that is reasonable for my use case.

# Send a keepalive every 2 minutes; drop the connection after 120 missed
# keepalives, intended as roughly 4 hours of inactivity (120 x 120s)
ClientAliveInterval 120
ClientAliveCountMax 120
TCPKeepAlive yes

Unfortunately… This doesn’t seem to work, and I don’t know why. #tofix
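An alternative I may try instead, sketched here rather than actually deployed: a small cron job on the VM itself that powers it off when nobody is logged in. Stopping the instance ends the compute billing even though it does not delete it (the script path and cron schedule below are my own placeholders).

#!/bin/bash
# /usr/local/bin/idle-shutdown.sh: illustrative sketch, not in use yet.
# Run from root's crontab, e.g.: */15 * * * * /usr/local/bin/idle-shutdown.sh
# If no interactive users are logged in, power the VM off.
if [ -z "$(who)" ]; then
  /sbin/shutdown -h now
fi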

  • What if I run out of storage space?

Google makes it easy to resize my persistent volume.
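For reference, the resize itself is a single gcloud command, and then the filesystem needs to be grown from inside the VM. This is a sketch assuming the disk name used later in this article and an ext4 root partition on /dev/sda1:

gcloud compute disks resize persistent-desktop-disk --size=200GB --zone=us-east4-c
# Then, on the instance itself:
sudo growpart /dev/sda 1     # grow the partition (from cloud-guest-utils)
sudo resize2fs /dev/sda1     # grow the ext4 filesystem to fill it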

  • How do I deal with variable IP addresses?

To make this easy to use in terms of web tech, I registered a domain with Google Domains, set up dynamic DNS using ddclient (make sure to use the protocol=googledomains example), and forced ddclient to run both on startup and on login. I also set up SSL using Let’s Encrypt with Certbot.
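For the curious, the relevant part of /etc/ddclient.conf looks roughly like this; the hostname is a placeholder, and the credentials are the generated ones Google Domains hands out for the dynamic DNS record:

ssl=yes
protocol=googledomains
use=web
login=<generated-dynamic-dns-username>
password=<generated-dynamic-dns-password>
dev.example.com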

  • How do I deal with variable port usage?

I have decided to proxy everything through Nginx. I am sure you could open ports or do something else, but within moments of bringing this up, Nginx was being hammered by exploit attempts. I believe this was triggered by the Let’s Encrypt registration (new certificates show up in public certificate transparency logs, which bots watch), but I like knowing that Nginx is taking the brunt of it instead of the various apps I am testing with.
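As an illustration, a single proxied app looks roughly like this; the hostname and the upstream port are placeholders (8080 happens to be Jira’s default):

# /etc/nginx/sites-available/jira.example.com (placeholder names)
server {
    listen 443 ssl;
    server_name jira.example.com;
    ssl_certificate     /etc/letsencrypt/live/jira.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/jira.example.com/privkey.pem;
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
    }
}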

  • How do I securely get my SSH keys to this box as it will likely need to act as a jump host to other machines I normally work with?

The --ssh-flag="-A" parameter to the gcloud compute ssh command handles this for me by turning on SSH agent forwarding.
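One detail that is easy to miss: agent forwarding only passes along keys that are already loaded into the local ssh-agent, so something like this runs on the laptop first (the id_rsa path is just the usual default; load whichever key your downstream hosts expect):

ssh-add ~/.ssh/id_rsa                                    # load the key into the local agent
gcloud compute ssh --ssh-flag="-A" remote-desktop-30     # -A forwards the agent to the VM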

  • How will my IDEs (VSCode and IntelliJ) work with this?

VSCode needs nothing special. I followed the instructions and it worked easily after setting up my SSH config like so:

Host <HOSTNAME>
User <username>
IdentityFile ~/.ssh/google_compute_engine
StrictHostKeyChecking no

For IntelliJ I have not figured out how to do this.

  • What does security look like for a system like this?

I took a fairly lax approach: using a modern OS, enabling automated updates, keeping the system offline most of the time, keeping most network ports blocked, only allowing key-based SSH access, and avoiding putting much sensitive material on the machine. It does host some of our source code that I use in testing, but I honestly don’t see much risk in this being attacked. We’re a niche vendor selling in a small marketplace, and it would be trivial to identify someone who had stolen our code.

If you know that you’re going to only be accessing the system from a set of IPs, then you can add that to your firewall rules.
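A sketch of what that could look like (I have not applied this myself; 203.0.113.4 is a placeholder address, and default-allow-ssh is the stock rule name on the default network):

gcloud compute --project=gcp-desktop firewall-rules delete default-allow-ssh
gcloud compute --project=gcp-desktop firewall-rules create allow-ssh-from-home --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:22 --source-ranges=203.0.113.4/32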

  • How do network latency and speed affect my ability to use this?

Some tasks are made much faster, for example pulling down dependencies, because the instance is on a much better connection than my home system.

Riding the Caltrain with the laptop tethered to my iPhone XS Max on AT&T was terrible, and I couldn’t get the VM started. I am considering switching from SSH to mosh but have not tried this out yet; a rough sketch of what that would involve is below.
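Untested on my side, but roughly (the firewall rule mirrors the HTTP ones from the setup section, and the hostname is a placeholder):

sudo apt install mosh                         # on the VM
gcloud compute --project=gcp-desktop firewall-rules create allow-mosh --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=udp:60000-61000 --source-ranges=0.0.0.0/0
mosh <username>@dev.example.com               # from the laptop, using the dynamic DNS name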

How to set it up

The following assumes you have the gcloud utility already installed. I found it easier to work with the CLI than the UI in many cases. We start by creating a project; mine is called gcp-desktop, which is the ID you will see in the --project flags below.

gcloud projects create gcp-desktop
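Depending on the state of your account, you may also need to point gcloud at the new project and enable the Compute Engine API (and link a billing account in the console) before the instance commands below will work:

gcloud config set project gcp-desktop
gcloud services enable compute.googleapis.com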

Then we create our base VM. The important things here are to pick the right region, set the disk size you want, and use the base image you care about. In my case, this meant a 100 GB Ubuntu 18.04 LTS configuration located in US East.

gcloud compute --project=gcp-desktop instances create remote-desktop --zone=us-east4-c --machine-type=n1-standard-1 --subnet=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --tags=http-server,https-server --image=ubuntu-minimal-1804-bionic-v20190723 --image-project=ubuntu-os-cloud --boot-disk-size=100GB --boot-disk-type=pd-standard --boot-disk-device-name=instance-1

The next step is to list all disks, snapshot the boot disk that was created with this instance, and then create a persistent disk from that snapshot.

gcloud compute disks list
gcloud compute disks snapshot remote-desktop   # "remote-desktop" is the boot disk's default name
# Pick the zone you used above when prompted
# Look for the snapshot id in the output that looks like:
#   Creating snapshot(s) t3fy25enr88j...done
gcloud compute disks create persistent-desktop-disk --source-snapshot t3fy25enr88j                 # for a standard disk volume
gcloud compute disks create persistent-desktop-disk --source-snapshot t3fy25enr88j --type=pd-ssd   # for an SSD volume

With this completed, it’s time to tear down the previous VM and the intermediate snapshot you created.

gcloud compute snapshots delete t3fy25enr88j
gcloud compute instances delete remote-desktop

And now start your instance with your newly minted persistent disk, this time with a larger instance size.

gcloud compute --project=gcp-desktop instances create remote-desktop-30 --zone=us-east4-c --machine-type=n1-standard-8 --subnet=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --tags=http-server,https-server --disk name=persistent-desktop-disk,boot=yes

Because I work with web technologies, I also needed to ensure that network traffic is allowed to the instance on the standard HTTP and HTTPS ports.

gcloud compute --project=gcp-desktop firewall-rules create default-allow-http --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:80 --source-ranges=0.0.0.0/0 --target-tags=http-server
gcloud compute --project=gcp-desktop firewall-rules create default-allow-https --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:443 --source-ranges=0.0.0.0/0 --target-tags=https-server

But this doesn’t address the various issues we had above, so I am going to show my current approach to this, and then break it down for you:

gcloud compute --project=gcp-desktop instances create remote-desktop-30 --zone=us-east4-c --machine-type=n1-standard-8 --subnet=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --tags=http-server,https-server --disk name=persistent-desktop-disk,boot=yes; gcloud compute ssh --ssh-flag="-A" remote-desktop-30 ; gcloud compute instances delete --quiet remote-desktop-30

This single command is aliased to start an 8-core VM with 30 GB of RAM, automatically SSH to the newly created instance, and pass along my local SSH keys using agent forwarding. Then, on exit, it automatically tears the machine down.

I have set up similar aliases with varying instance sizes; a sketch of how they are wired up follows. This means I am also reducing my spend when I don’t need to run big or complex jobs.
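For what it is worth, the aliases boil down to a small shell function in my shell config. The function name and the trimmed flag list below are my own shorthand (the full create command is the one shown above), and rd7 is just an illustrative second size:

# ~/.bashrc: a sketch of how the aliases are wired up (shorthand, not the full flag list)
remote_desktop() {
  gcloud compute --project=gcp-desktop instances create remote-desktop-30 \
    --zone=us-east4-c --machine-type="$1" \
    --tags=http-server,https-server --disk name=persistent-desktop-disk,boot=yes
  gcloud compute ssh --ssh-flag="-A" remote-desktop-30
  gcloud compute instances delete --quiet remote-desktop-30
}
alias rd30='remote_desktop n1-standard-8'   # 8 vCPUs, 30 GB RAM
alias rd7='remote_desktop n1-standard-2'    # 2 vCPUs, 7.5 GB RAM, for lighter days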

Costs

My estimated monthly costs are:

  • 100 GB general storage $4 (I can upgrade to SSD for $17/month)
  • 22 working days per month * 4 hours per day * n1-standard-8 = ~$37.19
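(For context, that second line is 22 × 4 = 88 hours of runtime a month, so $37.19 / 88 works out to roughly $0.42 per n1-standard-8 hour.)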

August’s numbers in practice were $33.45 (and that included me sometimes not shutting things down properly, and often using smaller instance sizes).

But could this work for a company at scale? If you gave your engineers $1000 Chromebooks, and remote dev systems like this, how much could you save vs a Macbook Pro?

Napkin math:

  • $3,000 (MBP) / 3 Years (My estimate on how long corporate MBPs last) = $1,000 / year
  • $999 (Pixelbook) / 3 = $333 / year
  • $1,000 - 333 = $666 margin remaining for GCP
  • $40 (monthly GCP spend) * 12 months = $480 per year per employee for remote compute

So this works out to roughly $186 per year in savings, at the cost of staff having a reduced ability to work when they don’t have access to a strong internet connection.

However, you do gain all the benefits I mentioned earlier in the article regarding security, battery life, being less tied to one device, and requiring less robust backups.

Next Steps

Some things I am interested in exploring over the coming months include trimming the instance’s boot time; systemd-analyze blame shows where the startup seconds currently go:

systemd-analyze blame
5.251s postgresql@10-main.service
1.778s cloud-init-local.service
1.458s dev-sda1.device
1.396s apport.service
1.365s nginx.service
1.332s snapd.service
1.241s systemd-networkd-wait-online.service
1.210s postgresql.service
1.203s lvm2-monitor.service
1.198s cloud-init.service
1.105s ddclient.service
1.038s google-instance-setup.service
