GSoC 2020 Final Report — Wikimedia: Transferpy Improvements

Ajumal
Aug 25, 2020


Introduction

Transferpy is a database backup and recovery tool intended to move large files between WMF production hosts and back up MariaDB remotely in an efficient way. It runs with root privileges and can open firewall ports and perform compression, encryption, and checksumming.

The aim of this proposal was to make the service more usable and faster. The project proposal comprised three significant changes:

1. Automatic free port detection: transferpy needs a port to perform transfers between hosts, and previously the user had to specify it explicitly. Automating this frees the user from the burden of finding a free port on the machine.

2. Enable parallel checksum: transferpy is intended to transfer huge files, with sizes on the order of terabytes. The checksum computation can take as long as the transfer itself, so calculating it in parallel with the data transfer improves the execution time.

3. Enable multiprocessing for multiple destination transfers: transferpy can transfer data from one source to multiple destinations. Running these transfers in separate processes should shorten the turnaround time, provided the network has enough bandwidth.

Automatic port detection was completed, and the concurrency issues introduced as a consequence of this functionality were solved. The parallel checksum was completed and gives the user two options: a fully parallel checksum, which is faster but less reliable, and a source-only parallel checksum, which is slower but more reliable. Enabling multiprocessing for transfers is still underway due to time constraints.

Possible improvements were identified while working on the first task, so two more tasks were taken up: packaging transferpy for Debian and generating the documentation with Sphinx.

Work

Initially, the task was to refactor the repository. Formerly, transferpy was part of the wmfmariadbpy project; as the two projects were logically separate, transferpy was moved into its own repository to make development easier and cleaner. The structure of transferpy was improved by modularising it into Firewall, Transferer, and MariaDB components with the required attributes and functions, making future development more straightforward. The logging was also improved by using the Python logging module and adding more error messages. A new parameter, verbose, was introduced to let the user choose the level of logging.
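As a rough illustration, a verbose flag can be mapped onto Python logging levels as below; the flag name and log format here are hypothetical, not necessarily the exact ones transferpy uses:

```python
import argparse
import logging

# Hypothetical sketch: map a --verbose flag to the logging level.
parser = argparse.ArgumentParser(description="transferpy-like CLI sketch")
parser.add_argument("--verbose", action="store_true",
                    help="print debug-level messages")
args = parser.parse_args()

logging.basicConfig(
    level=logging.DEBUG if args.verbose else logging.INFO,
    format="%(asctime)s %(levelname)s: %(message)s",
)
logger = logging.getLogger("transfer")

logger.info("starting transfer")       # always shown
logger.debug("remote command output")  # shown only with --verbose
```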

An integral part of the initial work was finding a way to detect a free port on the destination machine for netcat to listen on. The netstat utility proved useful for this. However, two transferpy processes running at the same time could detect the same port as free (a concurrency issue); the resolution was an interprocess lock, as sketched below.
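The following is a minimal sketch of that idea, not the actual transferpy implementation: it parses netstat output for ports already in use and picks a free one from a range while holding a file-based lock, so two concurrent instances cannot choose the same port. The lock path and port range are hypothetical:

```python
import fcntl
import re
import subprocess

LOCK_FILE = "/tmp/transfer_port.lock"   # hypothetical lock path
PORT_RANGE = range(4400, 4500)          # hypothetical port range

def used_ports():
    """Return the set of TCP ports currently listening, according to netstat."""
    out = subprocess.run(["netstat", "-lnt"],
                         capture_output=True, text=True).stdout
    ports = set()
    for line in out.splitlines():
        match = re.search(r":(\d+)\s", line)
        if match:
            ports.add(int(match.group(1)))
    return ports

def find_free_port():
    """Pick a free port while holding an interprocess lock to avoid races."""
    with open(LOCK_FILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks other transferpy instances
        busy = used_ports()
        for port in PORT_RANGE:
            if port not in busy:
                # In practice the lock would be held until the port is bound.
                return port
        raise RuntimeError("no free port available in range")

if __name__ == "__main__":
    print(find_free_port())
```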

While working on the concurrency issue, I came to understand the project better, and after further discussions a few more tasks beyond the ones proposed initially were taken up. The first was related to output generation: transferpy used to print output for every command it executed on the remote machines because of Cumin. Changes were made to suppress this Cumin output, providing a better experience for the user.

The second was about distribution of the code. There was consensus that a Debian package distributed via the Wikimedia repository would serve users well and make it easier to maintain and roll out updates, so the packaging was done.

The third was related to the duplicated documentation (in the wiki and in the code). Enough information for the user was already present in the code itself, as comments and argument `help` strings, which led to the decision to reuse that content rather than writing wiki pages separately. So Sphinx, a documentation tool that automatically indexes the code using the comments present in it, was adopted. Sphinx also picks up the information from the `help` options and generates the documentation as needed. These modifications changed the look of transferpy to a great extent. The documentation has been published to doc.wikimedia.org.
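As a rough sketch, a Sphinx `conf.py` can enable autodoc for docstrings and, assuming the sphinx-argparse extension is used to render the argparse `help` text, document the CLI as well; the exact extensions and paths in transferpy may differ:

```python
# conf.py (sketch) -- assumes sphinx and sphinx-argparse are installed
import os
import sys

# Make the package importable so autodoc can read its docstrings.
sys.path.insert(0, os.path.abspath(".."))

project = "transferpy"

extensions = [
    "sphinx.ext.autodoc",   # pull documentation from docstrings
    "sphinxarg.ext",        # render argparse --help output in the docs
]
```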

The next task was the parallel checksum. Since the files are on the order of terabytes in size, the checksum calculation itself takes a significant amount of time, so an implementation using pipes and the tee utility was considered. The transfer command was modified to calculate the md5sum of the stream and write it to a file; when the transfer is complete, the tool compares the saved md5sum with the checksum of the transferred file. A new parameter named parallel-checksum was created to enable this feature. The existing checksum calculation was also improved by computing the source checksum during the transfer itself. Parallel-checksum is faster but can only detect network errors, while the standard checksum is slower but can also detect disk errors, so both options were kept. As I was given two cloud machines with enough storage (2 TB) for testing, I benchmarked these options and observed a significant improvement in the performance of transferpy's checksum calculation.
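A minimal sketch of how such a sender pipeline can be assembled is shown below; the paths, netcat options, and function name are hypothetical, and the real transferpy command is more involved:

```python
import subprocess

def send_with_parallel_checksum(path, target_host, port, md5_file):
    """Stream a tar of `path` to the destination while md5sum runs on the
    same stream via tee, so the checksum costs no extra pass over the data."""
    cmd = (
        f"tar -cf - {path} "
        f"| tee >(md5sum > {md5_file}) "   # checksum branch, runs in parallel
        f"| nc -q 0 {target_host} {port}"  # transfer branch
    )
    # Process substitution >(...) needs bash, not sh.
    subprocess.run(["bash", "-c", cmd], check=True)

# After the transfer, the destination's checksum is compared against the
# contents of md5_file to detect network corruption.
```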

The number of parameters was growing, and it would be inconvenient for the user to remember all of them, so a configuration file was introduced: the user tweaks it once, and the tool takes the options from it in subsequent transfers. The feature was implemented with the configparser Python module. Command-line arguments are given the highest priority, followed by the configuration file, with the default options having the least priority. Since the tool needs to create temporary files for locking and for the checksum, the necessary changes were made to clean them up properly at the end of execution, even when errors or exceptions occur.
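A minimal sketch of that precedence (command line over configuration file over built-in defaults), using hypothetical option and file names rather than transferpy's actual ones:

```python
import argparse
import configparser

DEFAULTS = {"port": 0, "compress": True}  # lowest priority

def load_options(argv=None, config_path="transferpy.conf"):
    # 1. Configuration file overrides the built-in defaults.
    options = dict(DEFAULTS)
    config = configparser.ConfigParser()
    if config.read(config_path):
        section = config["DEFAULT"]
        options["port"] = section.getint("port", fallback=options["port"])
        options["compress"] = section.getboolean("compress",
                                                 fallback=options["compress"])

    # 2. Command-line arguments override everything else.
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int)
    parser.add_argument("--no-compress", dest="compress",
                        action="store_false", default=None)
    args = parser.parse_args(argv)
    if args.port is not None:
        options["port"] = args.port
    if args.compress is not None:
        options["compress"] = args.compress

    return options

if __name__ == "__main__":
    print(load_options())
```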

Incorporating multiprocessing into the transfers was the next step. This is expected to help the user transfer source data to multiple destinations simultaneously. Testing revealed that multiprocessing improves performance in an environment with higher network bandwidth and good disk I/O. A proof-of-concept was implemented and tested; it is still a work in progress and currently fails when transferring a higher workload on the test machines.
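The proof of concept follows roughly this shape; `transfer_to_host` is a hypothetical placeholder, not the actual transferpy code:

```python
from multiprocessing import Pool

def transfer_to_host(destination):
    """Hypothetical placeholder: run one full source -> destination transfer."""
    # ... open firewall, start netcat listener, stream the data ...
    return f"{destination}: done"

def transfer_to_all(destinations):
    # One worker process per destination, so transfers run concurrently;
    # the actual speed-up depends on network bandwidth and disk I/O.
    with Pool(processes=len(destinations)) as pool:
        results = pool.map(transfer_to_host, destinations)
    return results

if __name__ == "__main__":
    print(transfer_to_all(["db1001.example.org", "db1002.example.org"]))
```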

I followed best practices, added the required comments, and wrote documentation and tests for all the features described above. The final product is more usable and better positioned for future development.

I am delighted that the code has reached production and that WMF machines are currently using it. The Debian package (version 1.0) is the medium used to distribute the final working product to production.

The proposal: Click here

All the merged PRs: Click here

My Phabricator dashboard: Click here

Transferpy documentation: Click here

Transferpy Debian package: Click here

Challenges

Some unanticipated challenges emerged during the course of the work. The concurrency issue that arises when multiple processes try to read or write the same shared resource was a major one among them, and it came up multiple times.

The first occurrence was during the automatic port detection task: two instances of transferpy running simultaneously could detect the same port number as free. This was solved by implementing a locking mechanism. The issue came up again while working on the parallel checksum, where the intermediate files used to store the calculated checksums caused concurrency issues. These intermediate files were necessary because calculating the checksum of a directory with a large number of files resulted in a deadlock due to the limited pipe buffer size in Python. The issue was fixed by naming the files with the destination hostname and port, which form a unique key.

Another challenge was related to the tox development environment. I was not very familiar with it, even though running everything via tox is easier than using direct commands, so I learned it, applied it, and resolved the related issues while working on the documentation and packaging.

Packaging transferpy for Debian and releasing it was another exciting challenge, as it was new to me. Administering the assigned test machines from scratch was also an interesting challenge.

In all the challenges I faced, the interactions I had with the mentors were incredibly helpful. They gave timely input and encouraged me to come up with effective solutions.

Experience

I learned a lot of new things this summer. First, I learned how Wikimedia projects work using Gerrit and Phabricator, and I communicated with community members using Zulip and IRC.

For the benchmarks, I was assigned a couple of machines, and I took care of administering them from scratch. This gave me an understanding of how Wikimedia servers are monitored using Horizon and configured using Puppet with profile and role configurations. I was excited to configure the machines allotted to me using Puppet, with Cumin (a remote execution tool) and MariaDB with xtrabackup. It was an amazing experience; everything was new and exciting.

This GSoC with Wikimedia was an enjoyable experience. The mentors I was assigned were very friendly, helpful, and supportive. The community as a whole is fabulous and enthusiastic. The GSoC-Outreachy video calls were remarkable: everyone shared their stories in a biweekly report, and that was inspirational.

The coding experience was enlightening. I learned a lot of coding best practices and other GNU/Linux related concepts. The work was always enjoyable and insightful, and the mentors were very generous in explaining why I should do things in a particular way. The overall experience with the Wikimedia community was awesome. The mentors were very knowledgeable; I don't have words to portray how fantastic they were.

Acknowledgment

I am very thankful to my mentors Jaime Crespo and Manuel Aróstegui. Without them, the work never would have been this joyful and rewarding. Jaime Crespo was always available to answer my questions on time. I love the Wikimedia community for its welcoming and enthusiastic nature.

I thank Hashar of Wikimedia for his invaluable help during packaging and documentation. I also thank Srishti Sethi and Pavithra Eswaramoorthy of the Wikimedia GSoC organization for organizing the GSoC-Outreachy meets and taking care of all the students' needs and doubts.

I would also like to thank my friend Jyothis Jagan. He was generous and extremely helpful in proofreading and editing everything from the very start of the proposal to this final report.

Finally, thanks to the GSoC program, without which I wouldn't have been part of this awesome project or gained this memorable experience. I am thankful for the opportunities it creates for students.

Future Work

- Complete multiprocess transfer.

- Improve the kill_job function in CuminExecution: currently, kill_job kills the subprocess on the machine running transferpy. Instead, it should kill the actual process on the remote machine.

- Improve transferpy with better GNU/Linux commands.
