The number of open source libraries I have used is impossible to count, and hardly a month goes by without my finding a new one, or a new way to use one I already know. There are so many widely used and hugely useful libraries in the Java ecosystem that we often take them for granted, but there is a community of people who dedicate their free time (or in many cases, all their time) to enhancing and fixing these libraries so that we all benefit. Contributing back to open source can become complicated when you work as a Software Engineer, due to issues around intellectual property or the possible unintended exposure of credentials or other sensitive data. While researching solutions to a problem at work I found a bottleneck in Alpakka’s SFTP connector and took the opportunity to open-source a solution. This was my first time contributing back to the open source community I have gained so much from, and I really enjoyed the experience. In this post I’ll go into the technical details of the issue at hand, how I improved Alpakka to get around it, and how easy it was to contribute my changes back to the Alpakka project.
Identifying a problem
I was looking at ways we could use streaming when fetching files via SFTP. At Workday we run many thousands of these jobs each day, and the general pattern is to download the data from the remote SFTP server, encrypt and compress it on the fly, then forward the downloaded files to an internal object storage service akin to S3 or to an internal HDFS system. These files are then further processed by integrations or other services. With the ever-growing amount of data that our customers process, we have seen the volumes of data transferred for these jobs grow substantially over time. In order to stay on top of our customers’ requirements we continually evaluate the performance and resiliency of our services and look for opportunities to improve them. It was one such effort recently that prompted my research into the benefits of a streaming approach.
I was experimenting with Alpakka, which declares itself as “an open source initiative to implement stream-aware, reactive, integration pipelines for Java and Scala”. Alpakka offers an FTP connector which also handles SFTP. During my testing I found the SFTP streaming to be incredibly slow, substantially slower than the command-line tool sftp. When I looked at the internals to figure out what was happening, I discovered that Alpakka uses SSHJ as its SFTP client, so I set up a simple SFTP transfer using SSHJ directly to see if the problem was in that library. SSHJ on its own turned out to be much faster than Alpakka, so the bottleneck had to be somewhere in Alpakka or in how it was using SSHJ. I set about investigating why it was slow and how I could improve it.
What can make SFTP slow?
SFTP stands for SSH File Transfer Protocol, a protocol for sending and receiving binary packets over a secure channel provided by SSH. SFTP does not operate on whole files (like FTP/FTPS) but rather on file handles and offsets. When you download a file, for example, the SFTP client requests a chunk of that file at a given offset, then the next chunk, and so on until it has the whole file. Under the covers, some clients wait for each chunk to be returned before requesting the next one, serially requesting every chunk until the whole file is retrieved. If there is close to zero latency between client and server this may give you acceptable performance; as latency increases, however, each read request incurs a delay equivalent to the round-trip time between client and server. As a result of this one-packet-at-a-time approach, clients can experience significant slowdowns when latency increases. To mitigate this, some clients offer a ‘pipe-lining’ feature (it may be called something else depending on the client) where the client sends more than one read request without requiring an acknowledgement of the previous one first. Allowing multiple in-flight read requests reduces the impact of latency on achievable throughput.
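To make the cost concrete, here is a small back-of-the-envelope model of how round-trip time dominates a chunked download. This is my own illustration, not code from Alpakka or SSHJ, and the 32 KiB chunk size and 50 ms round trip are assumptions chosen for the example; real transfers are also bounded by bandwidth and server behaviour.

```java
// Illustrative model: time spent waiting on SFTP read-request round trips.
public class SftpLatencyModel {

    /**
     * Estimated seconds spent waiting on round trips when downloading
     * fileBytes using chunkBytes-sized read requests, with up to
     * `window` unacknowledged requests in flight at once.
     */
    static double roundTripSeconds(long fileBytes, int chunkBytes,
                                   int window, double rttSeconds) {
        long chunks = (fileBytes + chunkBytes - 1) / chunkBytes; // ceil division
        long rounds = (chunks + window - 1) / window;            // ceil division
        return rounds * rttSeconds;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        int chunk = 32 * 1024; // 32 KiB per read request (assumed)
        double rtt = 0.05;     // 50 ms round trip (assumed)

        double serial = roundTripSeconds(oneGiB, chunk, 1, rtt);
        double pipelined = roundTripSeconds(oneGiB, chunk, 64, rtt);
        System.out.printf("serial: %.1fs, 64 in flight: %.1fs%n", serial, pipelined);
    }
}
```

With these assumptions a serial client spends roughly 64 times longer waiting on the network than one allowing 64 in-flight requests, which matches the intuition that latency, not bandwidth, was the limiting factor.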
Why was Alpakka’s SFTP connector slow?
Due to how Alpakka was using SSHJ, a single read request was sent, and Alpakka would wait for the response to that request before sending the next one. As latency between client and server increases, this results in a linear increase in the time taken to download a given amount of data. The code in question, which you can view on the Alpakka repository, shows how Alpakka was using SSHJ before my changes; the Handler it uses is a class in SSHJ.
SSHJ provides some higher-level APIs, which I was using when comparing SSHJ and Alpakka throughput. When I dug into the source code, I found that when SftpFileTransfer creates a connection to fetch a file, it makes use of a different API to parallelise read requests: ReadAheadRemoteFileInputStream. See it on GitHub here.
The difference is Alpakka’s use of RemoteFileInputStream versus SSHJ’s use of ReadAheadRemoteFileInputStream; the latter takes a parameter that sets the number of concurrent read requests that can be sent without acknowledgement. The key to improving the throughput performance of Alpakka was making use of this API.
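As a rough sketch of the difference, both streams are obtained from the same SSHJ RemoteFile; this is not Alpakka’s actual code, it assumes SSHJ on the classpath, and the host, credentials and path are placeholders:

```java
import java.io.InputStream;
import net.schmizz.sshj.SSHClient;
import net.schmizz.sshj.sftp.RemoteFile;
import net.schmizz.sshj.sftp.SFTPClient;

public class ReadAheadSketch {
    public static void main(String[] args) throws Exception {
        try (SSHClient ssh = new SSHClient()) {
            ssh.loadKnownHosts();
            ssh.connect("sftp.example.com");      // placeholder host
            ssh.authPassword("user", "password"); // placeholder credentials
            try (SFTPClient sftp = ssh.newSFTPClient();
                 RemoteFile file = sftp.getSFTPEngine().open("/data/big.bin")) {

                // What Alpakka was doing: one read request at a time,
                // each waiting for the previous response.
                InputStream serial = file.new RemoteFileInputStream();

                // What SSHJ's own file transfer uses: multiple read
                // requests in flight before any acknowledgement arrives.
                InputStream pipelined = file.new ReadAheadRemoteFileInputStream(64);
            }
        }
    }
}
```

Both are inner classes of RemoteFile, so swapping one for the other is a small, localised change; the argument to ReadAheadRemoteFileInputStream is the maximum number of unconfirmed read requests.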
How did I improve Alpakka’s SFTP throughput?
To improve Alpakka’s throughput for SFTP, I amended the retrieveFileInputStream method on the SftpOperations class to optionally make use of additional read requests when a new parameter, maxUnconfirmedReads, is set to a value greater than 1.
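Assuming the new parameter is exposed through SftpSettings as in my pull request, client code would opt in along these lines; this is a sketch, and the host, credentials and path are placeholders:

```java
import java.net.InetAddress;

import akka.stream.alpakka.ftp.FtpCredentials;
import akka.stream.alpakka.ftp.SftpSettings;
import akka.stream.alpakka.ftp.javadsl.Sftp;

// Opt in to pipelined reads by allowing up to 64 unconfirmed read requests.
SftpSettings settings =
    SftpSettings.create(InetAddress.getByName("sftp.example.com")) // placeholder host
        .withCredentials(FtpCredentials.create("user", "password")) // placeholder credentials
        .withMaxUnconfirmedReads(64); // values > 1 enable read-ahead

// Sftp.fromPath materialises a Source of ByteString chunks that
// streams the remote file rather than buffering it in memory.
Sftp.fromPath("/data/big.bin", settings);
```

Leaving maxUnconfirmedReads at its default keeps the old one-request-at-a-time behaviour, so existing users are unaffected unless they opt in.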
To verify that the improvements I was hoping for were actually observed, I set up a test harness using a Docker image based on the LinuxServer.io openssh-server image and ran an SFTP server on it, then used tc to simulate various latencies on the SFTP server with Alpakka as the client. I saw significant benefits while using the ReadAheadRemoteFileInputStream API, especially as latency grew. Below I’ve shown some of the results from my testing. Each test result represents the median duration of 10 tests with the same settings for latency and unconfirmed reads; each test downloaded a single 1GB binary file filled with completely random data.
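For reference, latency injection with tc’s netem qdisc looks like the following; these commands need to run inside the SFTP server container with the NET_ADMIN capability, and eth0 is an assumption about the container’s interface name:

```shell
# Add 50 ms of delay to all outgoing packets on eth0.
tc qdisc add dev eth0 root netem delay 50ms

# Inspect, change, or remove the rule between test runs.
tc qdisc show dev eth0
tc qdisc change dev eth0 root netem delay 200ms
tc qdisc del dev eth0 root
```

Because netem only shapes egress traffic, applying it on the server delays the read responses, which is exactly the direction that matters for a downloading client.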
The command-line tool sftp uses 64 read requests, and I found this to be optimal under most conditions; going much beyond 64 brought diminishing improvement, though there may be situations where different settings work better.
Open-Sourcing My Changes
I had originally reported the performance issue on the Alpakka project, but discovered the probable solution before anyone commented on it. So I set about investigating what I needed to do to open-source the changes and get them into the Alpakka project. Because I work as a Software Engineer at Workday Technology, there were intellectual property and security considerations, so I contacted the appropriate internal teams to ensure my changes were okay to contribute back. The high-level steps were as follows:
- Check the project uses a compatible licence (Apache 2.0)
- Ensure I had been added to the existing Workday Corporate CLA for Lightbend, and confirm that this was the case
- Have a security scan performed on my changes before making them public
- Get a peer review of my changes
- Push my changes to a forked repository on my public GitHub repo
- Open the public pull request against the Alpakka project
All of the steps above were completed within a couple of days, and the process was really simple and collaborative, with a review from Sean Glover. All that was left was to respond to feedback on the pull request, which consisted of some basic documentation updates, and my PR was merged within a week or so. My changes were tagged for the next milestone release, which means they should be available for all Alpakka users to benefit from some time in early 2021.
What I learned
It was interesting to discover during my research that the SFTP RFC Specification was never completed and so only exists as RFC drafts. For such a ubiquitous protocol it’s pretty strange that the specification is not clearly defined.
To fix the throughput issue in Alpakka there really wasn’t a lot of code to change (you can see a diff here), but once I understood the technical problem, it was a matter of learning the conventions and styles used in the Alpakka code base, as well as the tooling: how to run the tests, and so on. I’m used to Gradle for building projects at work, so using sbt was new to me.
The Docker image I created for testing my solution in Alpakka will also help us set up better performance harnesses for our existing SFTP client code and for evaluating replacements or upgrades in future. I plan to spend time building it into a more general-purpose tool for testing any of our latency-sensitive services.
I also learned just how easy it is to contribute back to open source when the need arises and I would be happy to do it again.
I’ve included some links below that I found useful during my research for tackling this issue.
- A nice article by cURL’s lead developer, Daniel Stenberg, detailing how pipe-lining was implemented in cURL and why
- A great paper by Chris Rapier on the inefficiencies inherent in the SFTP protocol and his approach to tackling them with patches to the OpenSSH client