Open-Sourcing SFTP Throughput Improvements in Alpakka

Conor Griffin
Mar 4 · 7 min read
Photo by Khamkéo Vilaysing on Unsplash

I have used more open source libraries than I can count, and hardly a month goes by without my finding a new library or a new way to use a familiar one. There are so many widely used and hugely useful libraries in the Java ecosystem that we often take them for granted, but behind them is a community of people who dedicate their free time (or in many cases, all their time) to enhancing and fixing these libraries so that we all benefit. Contributing back to open source can become complicated when you work as a Software Engineer, due to issues around intellectual property or possible unintended exposure of credentials or other sensitive data.

While researching solutions to a problem at work I found a bottleneck in Alpakka’s SFTP connector and took the opportunity to open-source a solution. This was my first time contributing back to the open source community that I have gained so much from, and I really enjoyed the experience. In this post I’ll go into the technical details of the issue, how I improved Alpakka to get around it, and how easy it was to contribute my changes back to the Alpakka project.

Identifying a problem

I was looking at ways we could use streaming when fetching files via SFTP. At Workday we run many thousands of these jobs each day. The general pattern is to download the data from the remote SFTP server, encrypt and compress it on the fly, then forward the downloaded files to an internal object storage service akin to S3 or to an internal HDFS system. These files are then further processed by integrations or other services. As our customers process ever more data, the volume transferred by these jobs has grown substantially over time. To stay on top of our customers’ requirements we continually evaluate the performance and resiliency of our services and look for opportunities to improve them. One such recent effort prompted me to research the benefits of a streaming approach.

I was experimenting with Alpakka, which describes itself as “an open source initiative to implement stream-aware, reactive, integration pipelines for Java and Scala”. Alpakka offers an FTP connector which also handles SFTP. During my testing I found the SFTP streaming to be incredibly slow, substantially slower than the command-line tool sftp. When I looked at the internals I discovered that Alpakka uses SSHJ as its SFTP client, so I set up a simple SFTP transfer using SSHJ directly to see if the problem was in that library. SSHJ on its own turned out to be faster than Alpakka, so the bottleneck had to be somewhere in Alpakka or in how it was using SSHJ. I set about investigating why it was slow and how I could improve it.

What can make SFTP slow?

SFTP stands for SSH File Transfer Protocol, a protocol for sending and receiving binary packets over a secure channel provided by SSH. SFTP does not operate on whole files (like FTP/FTPS) but rather on file handles and offsets. To download a file, for example, the SFTP client requests a chunk of that file at a given offset, then the next chunk, and so on until it has the whole file. Under the covers, some clients wait for each chunk to be returned before they request the next one, serially requesting every chunk until the whole file is retrieved. If there is close to zero latency between client and server this may give acceptable performance, but as latency increases each read request incurs a delay equal to the round-trip time between the client and the server. As a result of this one-packet-at-a-time approach, clients can experience significant slowdowns when latency increases. To mitigate this, some clients offer a ‘pipelining’ feature (it may be called something else depending on the client) where the client sends more than one read request without waiting for the response to the previous one. This allows more than one in-flight read request and therefore reduces the impact of latency on throughput.
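
A rough way to see the effect is to count round trips. The sketch below is a back-of-the-envelope model, not code from any SFTP client: it ignores bandwidth and assumes that with N requests in flight, each round trip delivers up to N chunks. The class name, the 32 KiB chunk size and the 50 ms round-trip time are illustrative assumptions.

```java
/** Back-of-the-envelope model of how round-trip latency limits an SFTP download. */
final class SftpLatencyModel {

    /**
     * Estimated seconds spent waiting on round trips alone (bandwidth is ignored).
     * With inFlight requests outstanding at once, each round trip is assumed to
     * deliver up to inFlight chunks instead of one.
     */
    static double latencySeconds(long fileBytes, int chunkBytes, int inFlight, double rttMillis) {
        long chunks = (fileBytes + chunkBytes - 1) / chunkBytes; // ceiling division
        long roundTrips = (chunks + inFlight - 1) / inFlight;    // ceiling division
        return roundTrips * rttMillis / 1000.0;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        int chunkBytes = 32 * 1024; // assumed chunk size, for illustration only
        double rttMillis = 50.0;    // assumed round-trip time, for illustration only
        for (int inFlight : new int[] {1, 16, 64}) {
            System.out.printf("in-flight=%d: %.1f s waiting on latency%n",
                    inFlight, latencySeconds(oneGiB, chunkBytes, inFlight, rttMillis));
        }
    }
}
```

Under these assumptions a 1 GiB file is 32,768 chunks, so a serial client spends about 27 minutes waiting on latency alone, while 64 in-flight requests bring that down to under half a minute. Real transfers are also bounded by bandwidth and server behaviour, so treat the numbers as an intuition aid rather than a prediction.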

Why was Alpakka’s SFTP connector slow?

Due to how Alpakka was using SSHJ, a single read request was sent, then Alpakka would wait for the response to that request before sending the next one. As latency between client and server increases, this results in a linear increase in the duration taken to download a given amount of data. The snippet below shows how Alpakka was using SSHJ before my changes. You can also view it on the Alpakka repository. In this snippet Handler is a class in SSHJ.

Alpakka’s original SFTP client code

SSHJ also provides some higher-level APIs, which I had been using when comparing SSHJ and Alpakka throughput. When I dug into the source code I found that when SFTPFileTransfer fetches a file it uses a different API to parallelise read requests: ReadAheadRemoteFileInputStream. See it on GitHub here.

SSHJ’s built-in SFTP client code

The difference is Alpakka’s use of RemoteFileInputStream versus SSHJ’s use of ReadAheadRemoteFileInputStream. The latter takes a parameter that sets the number of read requests that can be sent without waiting for acknowledgement. The key to improving Alpakka’s throughput was making use of this API.

How did I improve Alpakka’s SFTP throughput?

To improve Alpakka’s throughput for SFTP I amended the retrieveFileInputStream method on the SftpOperations class to optionally make use of additional read requests when a new parameter, maxUnconfirmedReads, is set to a value greater than 1.

Enabling multiple read requests
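
To illustrate the idea rather than reproduce the actual change, here is a small JDK-only simulation. It is not the Alpakka or SSHJ code: the class and method names are hypothetical, and it uses a virtual clock instead of a real network. It mirrors only the opt-in rule described above (a value greater than 1 enables read-ahead, anything else keeps the serial behaviour) and shows how a bounded window of unconfirmed reads shortens the transfer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Virtual-time simulation of serial versus read-ahead chunk requests (hypothetical names). */
final class UnconfirmedReadsSketch {

    /** Simulates downloading the given number of chunks; returns elapsed virtual milliseconds. */
    static long downloadMillis(int chunks, int maxUnconfirmedReads, long rttMillis) {
        // Opt-in switch: values of 1 or less keep the one-request-at-a-time behaviour.
        int window = maxUnconfirmedReads > 1 ? maxUnconfirmedReads : 1;

        Deque<Long> inFlight = new ArrayDeque<>(); // arrival time of each pending response
        long now = 0;
        int requested = 0;
        int received = 0;

        while (received < chunks) {
            // Send read requests until the window is full or every chunk has been requested.
            while (requested < chunks && inFlight.size() < window) {
                inFlight.addLast(now + rttMillis);
                requested++;
            }
            // Wait for the oldest outstanding response (if it has not arrived yet), then consume it.
            now = Math.max(now, inFlight.removeFirst());
            received++;
        }
        return now;
    }

    public static void main(String[] args) {
        System.out.println("serial:     " + downloadMillis(32_768, 1, 50) + " ms");
        System.out.println("read-ahead: " + downloadMillis(32_768, 64, 50) + " ms");
    }
}
```

The simulation only accounts for latency, so the gap it shows is the best case; in a real transfer, bandwidth and server-side limits cap the gain.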

Testing

To verify that I was getting the improvements I hoped for, I set up a test harness using a Docker image based on the LinuxServer.io openssh-server image and ran an SFTP server on it. I then used tc to simulate various latencies on the SFTP server and used Alpakka as the client. I saw significant benefits when using the ReadAheadRemoteFileInputStream API, especially as latency grew. Below I’ve shown some of the results from my testing. Each result is the median duration of 10 runs with the same settings for latency and unconfirmed reads. Each run downloaded a single 1 GB binary file filled with random data.
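
The median-of-runs measurement can be sketched as follows. This is a generic timing helper with hypothetical names, not the harness used for the results in this post; the Runnable stands in for whatever performs one download.

```java
import java.util.Arrays;

/** Generic helper for reporting the median duration of repeated runs (hypothetical names). */
final class MedianOfRuns {

    /** Median of the given durations; for an even count, the mean of the two middle values. */
    static double median(long[] durationsMillis) {
        if (durationsMillis.length == 0) {
            throw new IllegalArgumentException("need at least one run");
        }
        long[] sorted = durationsMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Runs the transfer the given number of times and returns the median wall-clock time in ms. */
    static double medianRunMillis(Runnable transfer, int runs) {
        long[] durations = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            transfer.run();
            durations[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return median(durations);
    }
}
```

Reporting the median rather than the mean keeps a single unusually slow or fast run from skewing the result.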

The command-line tool sftp uses 64 outstanding read requests by default, and I found this to be optimal under most conditions. Going much beyond 64 brought diminishing improvements, although there may be situations where different settings work better.
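
One plausible way to reason about the diminishing returns (an explanation I am offering as a sketch, not something measured in the tests above) is the bandwidth-delay product: once enough requests are in flight to cover the bytes the link can carry during one round trip, extra requests have nothing left to hide. The helper below estimates that number; the class name, link speed and 32 KiB chunk size are illustrative assumptions.

```java
/** Estimates how many in-flight read requests are needed to keep a link busy (hypothetical names). */
final class InFlightEstimate {

    /**
     * Bandwidth-delay product divided by the chunk size, rounded up.
     * Always returns at least 1, since a transfer needs one request outstanding.
     */
    static long requestsToFillLink(double bytesPerSecond, double rttMillis, int chunkBytes) {
        double bytesInFlight = bytesPerSecond * rttMillis / 1000.0; // bandwidth-delay product
        return Math.max(1, (long) Math.ceil(bytesInFlight / chunkBytes));
    }

    public static void main(String[] args) {
        double hundredMbit = 12_500_000.0; // 100 Mbit/s expressed in bytes per second (assumed link)
        int chunkBytes = 32 * 1024;        // assumed chunk size
        for (double rtt : new double[] {10.0, 50.0, 100.0}) {
            System.out.printf("rtt=%.0f ms: about %d requests in flight%n",
                    rtt, requestsToFillLink(hundredMbit, rtt, chunkBytes));
        }
    }
}
```

With these assumed figures, a 100 ms round trip needs roughly 39 requests in flight, so a setting of 64 already covers it, while a faster link or a longer round trip would push the useful number higher.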

Open-Sourcing My Changes

I had originally reported the performance issue on the Alpakka project but discovered the probable solution before anyone commented on it, so I looked into what I needed to do to open-source the changes and get them into the Alpakka project. As a Software Engineer at Workday Technology there are intellectual property and security considerations, so I contacted the appropriate internal teams to ensure my changes were okay to contribute back. The high-level steps were as follows:

  • Check the project uses a compatible licence (Apache 2.0)
  • Ensure I had been added to the existing Workday Corporate CLA for Lightbend and confirm that fact
  • Have a security scan performed on my changes before making them public
  • Get a peer review of my changes
  • Push my changes to a fork of the repository under my public GitHub account
  • Open the public pull request against the Alpakka project

All of the steps above were done within a couple of days, and the process was really simple and collaborative, with a review from Sean Glover. All that was left to do was respond to feedback on the pull request, which consisted of some basic documentation updates, and my PR was merged within a week or so. My changes were tagged for the next milestone release, which means that by some time in early 2021 they should be available for all Alpakka users to benefit from.

What I learned

It was interesting to discover during my research that the SFTP specification was never finalised as an RFC and so exists only as a series of drafts. For such a ubiquitous protocol it’s pretty strange that the specification is not clearly defined.

Fixing the throughput issue in Alpakka didn’t require a lot of code changes (you can see a diff here). Once I understood the technical problem, most of the work was understanding the conventions and styles used in the Alpakka code base, the tooling, how to run the tests and so on. I’m used to Gradle when building projects at work, so using sbt was new to me.

The Docker image I created for testing my solution in Alpakka will also help us set up better performance harnesses for our existing SFTP client code and evaluate replacements or upgrades in future. I plan to spend time building it into a more general-purpose tool for testing any of our services that are latency-sensitive.

I also learned just how easy it is to contribute back to open source when the need arises and I would be happy to do it again.

Research

I’ve included some links below that I found useful during my research for tackling this issue.

Written by Conor Griffin, Software Engineer at Workday