Reducing costs with Storage Transfer Service from Amazon S3 to Cloud Storage (S3 to GCS)

Aman Puri
Google Cloud - Community
9 min read · Nov 26, 2022

A very important aspect of the cloud is migration. And it is not only about what data is getting migrated, but also how it is getting migrated. The major challenges that come with this are how the data is sent out of a cloud data center like AWS and the costs incurred because of it. With this in mind, we will dive deeper into how we can migrate objects from sources like Amazon S3 to Google Cloud’s Cloud Storage solution.

What is Storage Transfer Service?

Storage Transfer Service helps transfer data quickly and securely between object and file storage across Google Cloud, Amazon, Azure, on-premises, and more.

What is new in Storage Transfer Service?

One of the newest source options added is called S3-compatible object storage. It is meant for any object storage source that exposes an API similar to the Amazon S3 API. Here, instead of mentioning just the Amazon S3 bucket name, you also need to provide the endpoint where the bucket is located. This lets us use S3 interface endpoints, which give us private access to S3, provided we have a VPN/Interconnect connection between the AWS and GCP networks. To use this as a source, you need to set up agent pools, whose agents run as containers, for example on a Google Compute Engine instance with Docker installed, on Cloud Run, or on GKE. The recommendation is to deploy the agents close to the source, so deploying agents on Amazon EC2 works as well.

GCP portal showing the S3-compatible object storage as a Source type while creating a transfer job

Why do we need this when we have Amazon S3 as a source?

Currently, Amazon S3 source-based transfers move the data to Cloud Storage over the internet. There are two caveats to this:

  1. While the throughput is strong and the transfer is encrypted, internet-based transfers always raise concerns. This option cannot tunnel the transfer through VPN/Interconnect channels.
  2. AWS egress costs during data transfer also have a major impact when migrating large amounts of objects. These costs add up quickly on the AWS side, and a solution is needed to mitigate them. An Interconnect setup helps by providing discounts on AWS egress charges, depending on the vendor chosen, provided the data leaves AWS via the interconnect to GCP.

Considering the above scenarios and use cases, we can leverage S3-compatible object sources.

How does it work?

Storage Transfer Service accesses your data in S3-compatible sources using transfer agents deployed on VMs close to the data source. These agents run in Docker containers and belong to an agent pool, which is a collection of agents that use the same configuration and collectively move your data in parallel.

Because of this, you can create an interface endpoint for Amazon S3 in AWS, which is a private link to S3, and migrate data from the S3 bucket to GCS. Doing so requires interconnectivity between AWS and GCP using a VPN tunnel or an Interconnect. The S3 bucket transfers the data over the private channel to the agent, which lives in GCP, and the agent then writes the data to the GCS bucket. Optionally, we can enable Private Google Access in the VPC to ensure the traffic from the agent VM to Cloud Storage also runs over a private connection.

Flow of data transfer using Storage Transfer Service with S3-compatible object source using Amazon S3
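To keep that last hop private, Private Google Access is enabled per subnet. A minimal sketch, assuming a hypothetical subnet named agent-subnet in us-central1 (adjust the names and region to your environment):

# Enable Private Google Access on the agent VM's subnet so traffic to Cloud Storage stays on Google's network
gcloud compute networks subnets update agent-subnet --region=us-central1 --enable-private-ip-google-access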

Steps required to setup the environment and initiate data transfer

The prerequisites are as follows:

  1. Ensure there is a connection between the AWS and GCP environments using a VPN or Interconnect.
  2. Enable the Storage Transfer API.
  3. Have a GCP user account or service account with the Storage Transfer Admin role. If you are using a GCP service account, you need to export the service account key to a JSON file and store this key file on the Compute Engine instance where the agents will be installed (a gcloud sketch for this follows after the Docker commands below).
  4. Have your AWS Access Key ID and Secret Access Key handy. They should at least have read permissions on the S3 bucket.
  5. Set up a Compute Engine instance in the GCP VPC that is connected to the AWS environment via VPN/Interconnect, and run the following commands (ensure gcloud is installed on the instance):
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo systemctl enable docker
sudo docker run -ti --name gcloud-config google/cloud-sdk gcloud auth application-default login

The above commands install Docker on the Compute Engine instance and let you authenticate with your GCP credentials from within a gcloud container.
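For reference, prerequisites 2 and 3 can also be done from the command line. A sketch, assuming a hypothetical project my-project and service account transfer-sa (replace with your own names and key path):

# Enable the Storage Transfer API
gcloud services enable storagetransfer.googleapis.com --project=my-project
# Grant the Storage Transfer Admin role to the service account
gcloud projects add-iam-policy-binding my-project --member="serviceAccount:transfer-sa@my-project.iam.gserviceaccount.com" --role="roles/storagetransfer.admin"
# Export the service account key to a JSON file and keep it on the agent VM
gcloud iam service-accounts keys create /home/user/sa.json --iam-account=transfer-sa@my-project.iam.gserviceaccount.com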

6. Create an interface endpoint for Amazon S3 in AWS. This gives us an endpoint that looks like vpce-xxxx-xxxx.s3.*aws-region*.vpce.amazonaws.com. You can use this link to create a VPC endpoint. Note that we need an endpoint for the S3 service, and it must be an interface endpoint, not a gateway endpoint, attached to the AWS VPC that is tunneled to GCP.
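If you prefer the AWS CLI over the console, the interface endpoint can be created along these lines. This is only a sketch: the VPC, subnet and security group IDs are placeholders, and the service name must use your bucket’s region:

# Create an S3 interface endpoint in the AWS VPC that is tunneled to GCP
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 --vpc-endpoint-type Interface --service-name com.amazonaws.us-east-1.s3 --subnet-ids subnet-0123456789abcdef0 --security-group-ids sg-0123456789abcdef0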

7. Test whether the endpoint is accessible from GCP on port 443. You can use telnet as follows:

telnet *bucketname*.bucket.vpce-xxxx-xxxx.s3.*region-name*.vpce.amazonaws.com 443

You should immediately get a response similar to this:

Trying 172.31.35.33...
Connected to *bucketname*.bucket.vpce-xxxx-xxxx.s3.*region-name*.vpce.amazonaws.com.
Escape character is '^]'.

This verifies that the established tunnel is working fine and that you can use the private link to access the S3 bucket from GCP. If you do not receive this output, check that the tunnel connection is working properly.

Note: Notice that the endpoint has the word bucket between the bucket name and the rest of the endpoint URL. You will need to include it when you input the endpoint during transfer job creation, so the endpoint will look like bucket.vpce-xxxx-xxxx.s3.*region-name*.vpce.amazonaws.com.
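Optionally, if the AWS CLI is available on the agent VM, you can also confirm that your AWS credentials can list the bucket through the private endpoint (a sketch; substitute your own bucket name, endpoint and region):

# List the bucket via the interface endpoint instead of the public S3 endpoint
aws s3 ls s3://bucketname --region region-name --endpoint-url https://bucket.vpce-xxxx-xxxx.s3.region-name.vpce.amazonaws.com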

Installing the Transfer agents

  1. In the GCP console, navigate to Data Transfer > Agent Pools. Choose to install agents in the default pool, or create another agent pool. Creating another pool lets you set custom details and a bandwidth limit for the transfer if required. However, I would recommend going with the default pool, as it helps with creating the Pub/Sub topics for communication in step 2.
Data Transfer page to create agent pools

2. After selecting the pool, click on Install Agent. You will see a pop-up on the right. Click the Create button in the section about creating Cloud Pub/Sub topics and subscriptions, which allow the Storage Transfer Service to provide updates on job creation and progress. Note: This option is currently only visible if you choose the transfer_service_default pool, but since this is a one-time task it will also help later if you create new pools.

Create button to create the Pub/Sub topics and subscriptions for the transfer jobs

3. You will also see options for parameters below. Set Storage Type to S3-compatible object storage. Number of agents to install sets the number of Docker containers that will run on the agent VM; note that the higher the number of containers, the higher the CPU utilisation will be. Agent ID prefix sets a prefix for the container names. You will also see parameters for inputting the Access Key ID and Secret Access Key for the AWS credentials. Expanding the advanced settings, you can choose the default GCP credentials or select Service account file and provide the absolute path of the service account key file. All these options generate a set of commands, for example:

export AWS_ACCESS_KEY_ID=XXX 
export AWS_SECRET_ACCESS_KEY=XXX
gcloud transfer agents install --pool=transfer_service_default --id-prefix=demo- --creds-file="/home/user/sa.json" --s3-compatible-mode

Note: I have added export before the AWS environment variables, because without the export keyword the agents were not able to use the AWS credentials. Since this feature is pre-GA as of now, this issue can occur for some users.

4. Run the above commands on the VM where you installed Docker. This will start the containers and the agents inside them. You can check the containers using the following command:

sudo docker ps
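If the agents do not show up as connected in the console, the container logs usually reveal credential or connectivity problems. For example (use a container name from the docker ps output above):

# Show the most recent log lines from one agent container
sudo docker logs --tail 50 CONTAINER_NAME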

With this, your environment is ready to start data transfer from Amazon S3.

Creating a transfer job

  1. On the Data Transfer page, create a transfer job, choose S3-compatible object storage as the Source type, and click Next step.

2. In Choose a source, select the agent pool where the agents are installed. In Bucket or folder, type the name of the S3 bucket. In Endpoint, enter the endpoint in the format bucket.vpce-xxxx-xxxx.s3.*region-name*.vpce.amazonaws.com. In Signing region, type the AWS region where the bucket resides, and then go to the next step.

3. In Choose a destination, choose your destination bucket and the folder path.

4. In the next steps, you can choose whether to run this job immediately or on a schedule. Optionally, you can configure settings similar to normal transfer jobs, like overwriting objects or deleting source objects, etc. Once you have your desired settings, click Create to initiate the transfer.

5. You can click on the job and check the status of the transfer. You will notice that the objects are discovered and checksummed, and the data transfer progresses depending on the size of the objects in the source bucket.
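The same job can also be created from the command line instead of the console. A sketch, assuming the gcloud transfer flags documented for S3-compatible sources at the time of writing, with placeholder bucket names, endpoint and region:

# Create a transfer job from an S3-compatible source to a Cloud Storage bucket
gcloud transfer jobs create s3://source-bucket gs://destination-bucket --source-agent-pool=transfer_service_default --source-endpoint=bucket.vpce-xxxx-xxxx.s3.region-name.vpce.amazonaws.com --source-signing-region=region-name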

Any current limitations?

  1. The speed of the transfer depends on the maximum bandwidth set for the pool, the size of the Compute Engine instance and the maximum network throughput it supports, and the maximum bandwidth capacity of the VPN/Interconnect tunnel.
  2. While AWS egress costs can come down provided the Interconnect setup offers AWS egress discounts, the trade-off is the running cost of the agent VMs. However, that cost is not as high as the egress cost.
  3. Object reads can limit how fast the transfer goes. The recommendation here is to split large data transfers across multiple jobs covering different prefixes (or folder paths); see the sketch below.
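For example, a large transfer can be split by prefix into separate jobs so that object listing and reads run in parallel. A sketch, assuming the --include-prefixes flag and a hypothetical logs/2022/ prefix (create one job per prefix):

# Create one transfer job scoped to a single prefix; repeat with a different prefix for each job
gcloud transfer jobs create s3://source-bucket gs://destination-bucket --source-agent-pool=transfer_service_default --source-endpoint=bucket.vpce-xxxx-xxxx.s3.region-name.vpce.amazonaws.com --source-signing-region=region-name --include-prefixes=logs/2022/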

Conclusion

Using the S3-compatible object storage source option with Amazon S3 buckets, we can now tunnel the transfer via VPN/Interconnect and transfer or migrate objects to Cloud Storage at large volumes. Using an interconnect helps save a lot of AWS egress cost and makes the transfer more secure.

Thanks for Reading! 😊

References:

You can read about the S3-compatible object sources in the official doc

Storage Transfer Service Release notes

AWS PrivateLink for Amazon S3

How to setup VPN between AWS and GCP
