Low Latency File Transfers with PayPal DropZone

Subramaniajeeva K
The PayPal Technology Blog
6 min read · May 4, 2021

Reducing latency for inter-continental file transfers

Imagine you are in a critical video call with people around the globe, but the moment you start speaking, your teammates hear nothing but a broken, robotic voice on your line. How frustrating is that, both for you and for the folks at the other end of the call?

These jitter and latency issues are not new to us. Often they have nothing to do with the underlying software and everything to do with how the network is set up.

At PayPal, where we always want to deliver the best customer experience, we have solved quite a few latency issues by optimizing the network setup using CDNs and Edge computing. We now have Edge servers deployed around the globe serving web pages with low latency.

This article covers how PayPal DropZone, a secure file transfer platform, improved the latency and availability of file transfers using the existing PayPal Edge infrastructure.

Demystifying speed

Speed is a feature, and it comes with a high cost. To deliver high speed, we need to understand the physical limitations at play.

Multiple factors determine the network’s speed, and bandwidth and network latency play a significant role.

Bandwidth

Bandwidth is the maximum amount of data that can be put on the wire at a time.

Bandwidth, also referred to as the network data rate, is typically measured in bits per second (bps), whereas data rates for non-network equipment are usually shown in bytes per second (Bps). This is a common source of confusion; pay close attention to the units.

For example, to put 10 megabytes (MB) of data “on the wire” over a 1 Mbps link, we need 80 seconds: 10 MB equals 80 Mb because there are 8 bits in every byte, and 80 Mb at 1 Mbps takes 80 seconds.
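As a quick sanity check of that arithmetic, here is a small Python sketch (the function name and example values are purely illustrative):

```python
# Transfer time on a link is payload size (in bits) divided by bandwidth
# (in bits per second). Note the unit conversion: bytes -> bits.

def transfer_time_seconds(size_megabytes: float, link_mbps: float) -> float:
    size_megabits = size_megabytes * 8   # 8 bits per byte
    return size_megabits / link_mbps     # Mb / (Mb/s) = seconds

print(transfer_time_seconds(10, 1))     # 80.0 -> 10 MB over a 1 Mbps link
print(transfer_time_seconds(10, 100))   # 0.8  -> the same 10 MB over 100 Mbps
```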

Network Latency

Network latency is the time taken for a packet to travel from its source to its destination. Since most protocols are built on top of TCP, where packets make a round trip, latency is generally measured in terms of round-trip time (RTT).

Network latency is closely related to RTT but is not quite the same thing, and RTT is not always simply double the one-way latency. The forward path of a packet from source to destination may differ from the return path, leading to different latencies in each direction. Hence RTT plays a critical role in defining performance.

RTT = forward latency + return latency + processing time

Dissecting network latency

The speed of data transfer has a physical limit: it cannot exceed the speed of light. The fastest practical medium for data transfer is optical fibre, where the signal propagates at a large fraction of the speed of light.

Speed of light in vacuum: 3 × 10⁸ m/s

Speed of light within optical fibre: ~2 × 10⁸ m/s

Dissecting this further: 2 × 10⁸ m/s → 200,000 km/s → 200 km/ms

Distance from Sydney to LA (shortest route): ~12,500 km

One-way latency for a packet: 62.5 ms (12,500 / 200)

Round-trip time: 125 ms

Transferring a single TCP packet from Sydney to Los Angeles therefore takes at least 125 ms of round-trip time over optical fibre. With a maximum TCP packet size of 64 KB and, for simplicity, one round trip per packet, transferring a 1 GB file would require 15,625 round trips; in other words, it would take at least 33 minutes.
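The numbers above can be reproduced with a few lines of Python. This is only a back-of-the-envelope model using the article’s rough figures (decimal units and one 64 KB packet per round trip), not a real TCP simulation:

```python
# Back-of-the-envelope estimate for Sydney -> Los Angeles over optical fibre.

FIBRE_SPEED_KM_PER_MS = 200            # ~2 * 10^8 m/s inside optical fibre
DISTANCE_KM = 12_500                   # shortest Sydney -> LA route
PACKET_SIZE_BYTES = 64 * 1000          # simplifying assumption: 64 KB per round trip
FILE_SIZE_BYTES = 1_000_000_000        # 1 GB (decimal)

one_way_ms = DISTANCE_KM / FIBRE_SPEED_KM_PER_MS      # 62.5 ms
rtt_ms = 2 * one_way_ms                               # 125 ms
round_trips = FILE_SIZE_BYTES / PACKET_SIZE_BYTES     # 15,625
total_minutes = round_trips * rtt_ms / 1000 / 60      # ~32.6 minutes

print(one_way_ms, rtt_ms, round_trips, round(total_minutes, 1))
```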

Last Mile Latency

It is often the last few miles, not crossing the oceans or continents, where significant network latency is introduced.

Imagine a truck carrying goods from New York to San Francisco. For most of the journey, the truck can travel on the interstate at highway speeds. Once it reaches San Francisco, it has to exit the highway and navigate congested local roads to reach its destination, which can be the most time-consuming part of the journey.

Applying the same analogy to networking: even though packets travel at high speed across the ocean, the transfer rate drops considerably as they approach the destination. This can be due to factors such as congestion, routing methods, network topology, and an increased number of hops.

For example, here are the traceroute results connecting to Google from one of our servers:

(i) Traceroute to a Google server

The packets took 20 ms to reach their final destination a little over 740 miles away. In the same amount of time, they could have crossed most of the continental USA.
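To put that observation in perspective, the following sketch compares the observed 20 ms with the theoretical fibre minimum for a roughly 740-mile path (all figures are rough approximations):

```python
# Compare the observed latency with the theoretical minimum over fibre.

FIBRE_SPEED_KM_PER_MS = 200            # ~2 * 10^8 m/s inside optical fibre
OBSERVED_MS = 20                       # traceroute observation from the article
DISTANCE_KM = 740 * 1.609              # ~1,190 km between the two endpoints

fibre_minimum_ms = DISTANCE_KM / FIBRE_SPEED_KM_PER_MS   # ~6 ms one way
overhead_factor = OBSERVED_MS / fibre_minimum_ms          # ~3.4x the fibre minimum
reachable_km = OBSERVED_MS * FIBRE_SPEED_KM_PER_MS        # 4,000 km (~2,500 miles) in 20 ms

print(round(fibre_minimum_ms, 1), round(overhead_factor, 1), reachable_km)
```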

Amdahl’s Law, loosely paraphrased: a system’s overall speed is limited by its slowest component.

Though we cannot increase speed between continents because of many physical limitations, we can make critical improvements to reduce the last mile latency and the number of round trips.

DropZone Network Re-Architecture

PayPal DropZone is a secure file transfer platform used to move files within and outside PayPal. It is built on top of the open-source Apache Mina framework, which supports protocols like SFTP and SCP.
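From a client’s perspective, a transfer to an SFTP endpoint like DropZone looks roughly like the sketch below (using the third-party Paramiko library; the username, key path, and remote directory are placeholders, not actual DropZone onboarding details):

```python
# A rough client-side sketch of an SFTP upload with Paramiko. The credentials
# and paths below are hypothetical placeholders.
import paramiko

HOST = "dropzone.paypal.com"
KEY = paramiko.RSAKey.from_private_key_file("/path/to/private_key")   # placeholder path

transport = paramiko.Transport((HOST, 22))
transport.connect(username="partner_user", pkey=KEY)                  # placeholder user

sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("daily_report.csv", "/inbound/daily_report.csv")             # upload a local file

sftp.close()
transport.close()
```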

As a part of our data center transformation (DCX) journey, we migrated DropZone from one data center (DC1) to another (DC2). We wanted to enhance our architecture to provide low latency and better availability in file transfers.

(ii) DropZone old architecture

As the diagram above shows, in the old architecture file transfers went through our servers in DC1. Files from various locations had to travel over the public internet to DC1, which incurs considerable last-mile latency at both the source and the destination.

(iii) Last-mile latency from the user to the ISP and from the ISP to the PayPal server (DC1)

Introducing Edge and POP servers

PayPal’s Edge infrastructure has POPs and Edge servers distributed across the globe, backed by a private ISP/backbone that connects our data centers worldwide. Private ISPs are generally highly efficient, providing better reliability with fewer hops and reduced network latency.

To leverage the benefits of the private ISP, we started routing DropZone’s internet traffic via the POPs and Edge servers as well. Now, when a user connects to dropzone.paypal.com, the request first goes to the nearest Edge server and then traverses our backbone and the border LTM of the active data center to reach the DropZone hosts.

(iv) DropZone current architecture

When you look up dropzone.paypal.com on dnschecker.org, DNS resolves to different IPs around the world, each pointing to the nearest PayPal Edge server.

(v) DNS resolution for dropzone.paypal.com across the globe
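You can observe the same behaviour with a plain DNS lookup from your own machine; the addresses returned depend on where your resolver sits (a minimal sketch):

```python
# Resolve dropzone.paypal.com via the local resolver. With Edge-based routing,
# clients in different regions receive different (nearby) addresses.
import socket

infos = socket.getaddrinfo("dropzone.paypal.com", 22, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})
print(addresses)
```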

We observed significant latency improvements for both inter- and intracontinental use cases: our upload/download speed nearly doubled once we routed the traffic through the Edge servers.

The following graph shows the trend in the file transfer speed for the past six months:

(vi) Speed (in bytes/ms) over the past six months

And now, it’s not just reduced latency but also better availability with our Disaster Recovery system in place!

Another critical change in our architecture was implementing a Disaster Recovery (DR) strategy to ensure high availability and remove the single point of failure visible in architecture diagram (ii). We added redundancy for DropZone by setting up active-passive instances in DC2 and DC3, respectively. DC2 serves as our primary data center with DC3 as the backup, and data is periodically replicated from the primary to the secondary.

Transient failures can be tolerated by switching the active traffic to DC3 while continuing to write to DC2 storage. During a permanent failover, DC3 storage starts serving both reads and writes.
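Conceptually, the routing decision can be sketched as simple health-check-driven logic like the following (this is an illustration only, not DropZone’s actual implementation; the hostnames are hypothetical):

```python
# A conceptual sketch of active-passive failover between two data centers.
# Hostnames are hypothetical placeholders.
import socket

PRIMARY = "dropzone-dc2.example.internal"     # active: serves reads and writes
SECONDARY = "dropzone-dc3.example.internal"   # passive: receives replicated data

def is_healthy(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Treat a successful TCP connect to the SFTP port as a health signal."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_active() -> str:
    # Transient failure: route traffic to the secondary while the primary recovers.
    # A permanent failover would also promote the secondary's storage for writes.
    return PRIMARY if is_healthy(PRIMARY) else SECONDARY

if __name__ == "__main__":
    print("Routing SFTP traffic to:", pick_active())
```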

We have been switching traffic between DC2 and DC3 from time to time with no interruption to the users, keeping the availability of the DropZone system intact.

Conclusion

Enhancing our architecture with the private ISP backbone contributed to a 2X performance gain by reducing last-mile latency.

Implementing the DR capability improved our overall system availability by making it resilient to any transient or permanent network failures.

Glossary

POP: Point of Presence. Physical servers connecting to ISPs with basic routing capabilities.

Edge server: Minimal data center present in different geographic locations with computing and routing capabilities.

Backbone Network: Dedicated physical circuit interconnecting POPs.

LTM: Local Traffic Manager (Load balancer)
