Skyplane: 110x faster data transfers on any cloud

Oct 5, 2022

Data transfers in the cloud are slow and expensive. Typical CLI tools like aws s3 cp or rsync can run as slow as 20MB/s (slower than the US average broadband speed). They are also costly: copying a 100GB dataset can cost up to $14 due to the cloud’s egregious egress fees.

When transferring a 64GB dataset, Skyplane is 160x faster than rsync and 110x faster than AWS DataSync

Skyplane is an open-source developer tool for transferring data across cloud object stores. Skyplane is 164x faster than rsync and 113x faster than AWS DataSync. What would have taken nearly half of a working day now takes less than two minutes. That’s less time than it takes to read this blog!

How can we make transfers faster and cheaper?

Over the last decade at Berkeley, our research on optimizing data-intensive workloads led to Apache Spark and the Ray project. These systems were designed at a time when datasets predominantly lived in a single region of a single cloud. Increasingly, the bottleneck is data transfer between cloud regions and cloud providers.

We think data transfer should be faster, cheaper and universal across any cloud. Our lab’s research in cloud computing and networking led us to build Skyplane, which is:

  • up to 110x faster than the best cloud transfer services like AWS DataSync
  • up to 3.8x cheaper than existing free tools
  • universally supported across all three public clouds (AWS, Azure and GCP)

Moving 70GB with AWS S3’s CLI runs at 22MB/s.

Moving 70GB with Skyplane runs at 35Gb/s.
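The two rates above use different units (megabytes per second versus gigabits per second), so it helps to convert both into wall-clock time. The following is a rough back-of-the-envelope sketch, assuming decimal units (1 GB = 10^9 bytes) and that each quoted rate is sustained for the whole transfer. The function name `transfer_seconds` is ours, purely for illustration.

```python
def transfer_seconds(size_gb: float, rate: float, unit: str) -> float:
    """Seconds needed to move size_gb gigabytes at a sustained rate.

    Assumes decimal units: 1 GB = 1e9 bytes, "MB/s" = 1e6 bytes per second,
    "Gb/s" = 1e9 bits per second.
    """
    if unit == "MB/s":
        bytes_per_second = rate * 1e6
    elif unit == "Gb/s":
        bytes_per_second = rate * 1e9 / 8  # 8 bits per byte
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return size_gb * 1e9 / bytes_per_second


cli_seconds = transfer_seconds(70, 22, "MB/s")       # AWS S3 CLI rate from the text
skyplane_seconds = transfer_seconds(70, 35, "Gb/s")  # Skyplane rate from the text

print(f"AWS S3 CLI: {cli_seconds / 60:.0f} min")
print(f"Skyplane:   {skyplane_seconds:.0f} s")
```

Under these assumptions, the CLI transfer takes about 53 minutes and the Skyplane transfer about 16 seconds.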

Moving data across clouds today

Several tools help transfer data across the internet between cloud regions or cloud providers; however, each comes with limitations:

  1. rsync offers a simple CLI tool that copies data from one VM to another while applying delta compression to avoid redundant copies of data already found at the destination. Similarly, cloud providers offer CLIs like aws s3 cp or gsutil cp.
  2. Cloud data transfer services like AWS DataSync and GCP Data can leverage cloud elasticity to run high throughput transfers. These services typically prioritize supporting data transfers into the provider’s cloud, lacking universality. They may also charge high egress and service fees (e.g. $1.25 per 100GB moved for DataSync).

Our goal with Skyplane is to provide a cloud data transfer tool which is (1) fast with real-world high transfer rates, (2) low-cost to minimize the impact of egregious egress fees and (3) universal by supporting all major public clouds.

🔥 Blazing fast: 110x faster transfers between clouds

Let’s move a 64GB dataset as a single large file between two regions in AWS (ap-southeast-3 to eu-west-3). We can do this using the Skyplane CLI with:

$ skyplane cp -n 8 s3://srcbucket/64gb.tar.gz s3://dstbucket/

The result? Transfers that are 113x faster than AWS DataSync and 164x faster than rsync (shown in the first figure).

Skyplane optimizes transfer speed using networking research from UC Berkeley. Skyplane creates an overlay network on top of the clouds so it can automatically route around congested cloud network links. It also exploits parallelism by striping data transfers over many pipelined TCP connections as well as multiple VMs. We’ll present a deep dive into the internals of Skyplane in an upcoming blog post. Learn more about Skyplane’s architecture in our docs.
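To make the idea of striping concrete, here is a minimal sketch of how one large object could be cut into chunks and dealt out across many connections on several VMs. This is not Skyplane's actual scheduler: the names `Chunk`, `split_into_chunks` and `stripe`, the round-robin policy, the chunk size, and the VM and connection counts are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Chunk:
    offset: int  # first byte of the chunk within the object
    length: int  # number of bytes in the chunk


def split_into_chunks(total_bytes: int, chunk_bytes: int) -> List[Chunk]:
    """Cut an object into consecutive chunks of at most chunk_bytes each."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    return [
        Chunk(offset, min(chunk_bytes, total_bytes - offset))
        for offset in range(0, total_bytes, chunk_bytes)
    ]


def stripe(
    chunks: List[Chunk], n_vms: int, conns_per_vm: int
) -> Dict[Tuple[int, int], List[Chunk]]:
    """Deal chunks round-robin across every (vm, connection) lane."""
    lanes = [(vm, conn) for vm in range(n_vms) for conn in range(conns_per_vm)]
    plan: Dict[Tuple[int, int], List[Chunk]] = {lane: [] for lane in lanes}
    for i, chunk in enumerate(chunks):
        plan[lanes[i % len(lanes)]].append(chunk)
    return plan


# Hypothetical numbers: a 64 GB object, 64 MB chunks, 8 VMs with 32 connections each.
chunks = split_into_chunks(64 * 10**9, 64 * 10**6)
plan = stripe(chunks, n_vms=8, conns_per_vm=32)
print(len(chunks), "chunks over", len(plan), "lanes")
```

Because every lane carries only a small share of the object, no single TCP connection or VM network interface has to sustain the full transfer rate on its own.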

💸 Low cost: save up to 6.2x cost on egress

Data egress fees dominate cloud transfer costs, and all three public clouds charge egregious egress rates. Moving a 220GB dump of English Wikipedia directly with rsync from a single VM costs $4.47 in egress charges alone. Beyond egress fees, tools like AWS DataSync charge a service fee for transfers, raising the total cost to $7.27.

When transferring a 220GB Wikipedia dump, Skyplane is 3.8x cheaper than rsync and 6.2x cheaper than AWS DataSync

With Skyplane, the same 220GB transfer costs just $1.17, an average of about $0.005/GB. That is 6.2x cheaper than AWS DataSync.
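The ratios in this section follow directly from the three quoted totals. A quick check, using only the dollar figures and dataset size stated in the text (the variable names are ours):

```python
SIZE_GB = 220  # English Wikipedia dump size from the text

# Total transfer cost in USD, as quoted in the text.
cost_usd = {"rsync": 4.47, "AWS DataSync": 7.27, "Skyplane": 1.17}

# How many times more each tool costs than Skyplane, and its effective price per GB.
ratio_vs_skyplane = {tool: cost / cost_usd["Skyplane"] for tool, cost in cost_usd.items()}
per_gb_usd = {tool: cost / SIZE_GB for tool, cost in cost_usd.items()}

for tool in cost_usd:
    print(
        f"{tool:13s} ${cost_usd[tool]:.2f}  "
        f"${per_gb_usd[tool]:.4f}/GB  "
        f"{ratio_vs_skyplane[tool]:.1f}x Skyplane's cost"
    )
```

This reproduces the 3.8x (rsync) and 6.2x (AWS DataSync) figures, and an effective Skyplane rate of roughly $0.005/GB.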

Skyplane is free and open-source, so unlike services like AWS DataSync it adds no service fee. Skyplane reduces egress cost by transparently compressing your data with LZ4, without slowing down your transfer. In addition, Skyplane’s optimizer automatically selects the best network route to minimize cost while ensuring good throughput.
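Egress is billed on the bytes that actually cross the network, which is why compressing before sending lowers the bill. The sketch below illustrates that idea only. It uses `zlib` from the Python standard library at its fastest level as a stand-in for LZ4, and the helpers `encode_chunk` and `decode_chunk` are hypothetical names, not Skyplane functions.

```python
import zlib
from typing import Tuple


def encode_chunk(raw: bytes) -> Tuple[bool, bytes]:
    """Compress a chunk, but send the raw bytes if compression does not shrink it."""
    packed = zlib.compress(raw, level=1)  # fastest zlib level; stand-in for LZ4
    if len(packed) < len(raw):
        return True, packed
    return False, raw


def decode_chunk(is_compressed: bool, payload: bytes) -> bytes:
    """Undo encode_chunk on the receiving side."""
    return zlib.decompress(payload) if is_compressed else payload


# Repetitive, text-like data compresses well; the size on the wire is what gets billed.
text_like = b"the quick brown fox jumps over the lazy dog\n" * 10_000
is_compressed, wire_bytes = encode_chunk(text_like)
restored = decode_chunk(is_compressed, wire_bytes)
print(f"{len(text_like)} bytes in, {len(wire_bytes)} bytes on the wire")
```

The fallback to raw bytes matters for data that is already compressed (archives, images, video), where a second compression pass would only add overhead.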

🌐 Universal support for any cloud

Skyplane is designed to be cloud agnostic, and currently supports transfers across AWS, GCP, and Azure. Existing data transfer platforms offered by the clouds do not support all sources and destinations.

If Skyplane doesn’t support your cloud of choice, additional cloud providers can be added by integrating with their authentication and object store APIs. Supporting a new cloud is easy: a pull request adding a provider takes fewer than 500 lines of code.
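As a rough picture of what "integrating with authentication and object store APIs" can mean, here is a hypothetical minimal provider interface with a toy in-memory implementation. None of these class or method names come from Skyplane's codebase; they are assumptions made for illustration, and a real provider would call the cloud's SDK where the toy version touches a dict.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Optional


class ObjectStore(ABC):
    """Hypothetical surface a transfer tool needs from one cloud's object store."""

    @abstractmethod
    def authenticate(self) -> None: ...

    @abstractmethod
    def list_keys(self, prefix: str) -> Iterator[str]: ...

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...


class InMemoryStore(ObjectStore):
    """Toy provider backed by a dict, standing in for a real cloud SDK."""

    def __init__(self, objects: Optional[Dict[str, bytes]] = None) -> None:
        self._objects: Dict[str, bytes] = dict(objects or {})
        self.authenticated = False

    def authenticate(self) -> None:
        self.authenticated = True  # a real provider would load credentials here

    def list_keys(self, prefix: str) -> Iterator[str]:
        return iter(sorted(k for k in self._objects if k.startswith(prefix)))

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def copy_prefix(src: ObjectStore, dst: ObjectStore, prefix: str) -> int:
    """Copy every object under prefix from src to dst; return how many were copied."""
    src.authenticate()
    dst.authenticate()
    copied = 0
    for key in src.list_keys(prefix):
        dst.write(key, src.read(key))
        copied += 1
    return copied


src = InMemoryStore({"data/a": b"1", "data/b": b"2", "other/c": b"3"})
dst = InMemoryStore()
n_copied = copy_prefix(src, dst, "data/")
print(n_copied, "objects copied")
```

The point of the abstraction is that the copy logic never mentions a specific cloud, so a new provider only has to fill in the four methods.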

Skyplane supports transfers from any major public cloud to any other major public cloud

Get started with Skyplane in 5 minutes

⚠️ Note: Skyplane is under active development. Expect sharp edges!

Skyplane is open-source on GitHub and is a community-supported project. After logging into clouds via the AWS, GCP or Azure CLIs, Skyplane takes less than 5 minutes to set up.

# install skyplane and set it up with clouds
$ pip install skyplane
$ skyplane init
# copy files from an AWS S3 bucket to a GCP GCS bucket
$ skyplane cp -r s3://… gs://…
# copy changed files from S3 to GCS
$ skyplane sync s3://… gs://…

Let us know what you think

Do you have challenges working with big datasets in the cloud? We are actively interviewing data teams already working in the cloud to learn about the challenges around data gravity. Send us an email at hi@skyplane.org or join our Slack.

Help us solve data gravity in the cloud

Skyplane is actively being developed, and we’d love community contributions that improve performance and security or add support for additional cloud providers.

We also have additional projects for:

  • Supporting on-prem to cloud data transfers
  • Supporting local transfers (VM => VM)
  • Enabling persistent bucket synchronization
  • Adding an API interface for programmatic transfers

Join our Slack to get involved!

Thanks to Andy Konwinski, Ion Stoica, and Daniel Kang for providing feedback on earlier drafts, and Jason Ding for running benchmarks for this blog.

https://github.com/skyplane-project/skyplane
