GSoC 2022: Center for Translational Data Science

Sep 12, 2022

Scalable data download functionality into the Gen3 Python SDK & CLI

Introduction:

As GSoC 2022 comes to an end, I realise how grateful I am for this to be my first major professional experience. The programme eases contributors into the world of open source gradually, so that even beginners like me don’t end up feeling overwhelmed, but instead see the limitless opportunities present before us.

I would also like to thank my mentor, Michael Lukowski, for patiently guiding me throughout the project and for being available all week, despite a busy schedule, to resolve even the smallest of doubts.

This blog is about my GSoC experience working with the CTDS organisation. My final project is available here.

About the organisation:

The Center for Translational Data Science at the University of Chicago develops and operates large-scale data platforms to support research in topics of societal interest.

The organisation operates a data ecosystem with their partners, comprising over a dozen data commons that make over 10 PB of data available to the research community. These are all based on the open-source Gen3 data platform, which includes the Gen3 data commons, Gen3 Framework Services, and Gen3 Workspaces.

Gen3 SDK for Python

The Gen3 Software Development Kit (SDK) for Python provides classes and functions for handling common tasks when interacting with a Gen3 commons. It also exposes a Command Line Interface (CLI).

The API for a commons can be overwhelming, so this SDK/CLI aims to simplify communication with various microservices.

About the project:

The project implements data download client functionality in the Gen3 Python SDK & CLI, to provide a single tool for interacting with Gen3.
The aim was to speed up the download process by using the asynchronous libraries available in Python, such as asyncio.
The project is written in Python and relies on concepts of concurrency.

Why I chose this project:

I wanted to explore the topics of concurrency and parallelism. Asynchronous programming in Python was very new to me, and after observing it in other languages, I wanted to explore how the Python modules do the same.
I had previously learnt Python and worked in Golang, which were the criteria listed in the project description, so I felt I would not have to start from the basics and could begin working on the problem statement directly.

Deliverables of the project:

Asynchronous download functionality is available
User-friendly CLI implemented to use the download functionality
100% coverage of new code by unit tests

To-do:

PR is still under review and has to be merged with a few minor changes
Further testing on the asynchronous part of the code
Adding CLI for the download-single functionality too (downloads single file when provided with object_ID)

The Download functionality:

The aim of my project was to use newer Python libraries such as asyncio to make downloading petabytes of data much faster for researchers.

Multithreading vs Multiprocessing vs Asyncio: which is better?

Downloading files is I/O-bound rather than computationally intensive, so multiprocessing is not required. Multithreading avoids the overhead of spinning up a separate interpreter for each process, which multiprocessing incurs.

In threading, the operating system knows about each thread and can interrupt it at any time to start running a different thread (pre-emptive multitasking). Asyncio, however, uses cooperative multitasking. The tasks must cooperate by announcing when they are ready to be switched out.

The benefit of doing this extra work upfront is that you always know where your task can be swapped out. It will not be swapped out in the middle of a Python statement unless that statement is marked with await.
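To make the idea of known switch points concrete, here is a minimal sketch using only asyncio. The worker function and the log list are made up for illustration and are not part of the Gen3 SDK.

```python
import asyncio


async def worker(name, log):
    # There is no await between these two appends, so no other task
    # can run in between them.
    log.append(f"{name}: step 1")
    log.append(f"{name}: step 2")
    # This await is the only point where the task can be switched out.
    await asyncio.sleep(0)
    log.append(f"{name}: step 3")


async def main(log):
    # gather wraps both coroutines in tasks and runs them on the event loop.
    await asyncio.gather(worker("A", log), worker("B", log))


def run_demo():
    log = []
    asyncio.run(main(log))
    return log


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

Steps 1 and 2 of each worker always appear together, because the event loop only gets a chance to run the other worker at the await. The resulting order is A: step 1, A: step 2, B: step 1, B: step 2, A: step 3, B: step 3.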

The project currently uses asyncio to create tasks that run coroutines, aiohttp to make HTTP requests and aiofiles to write to files. These Python libraries provide functions that execute asynchronously. A semaphore also limits the number of simultaneous requests, so that the server does not get overwhelmed.
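The role of the semaphore can be sketched with asyncio alone. In the snippet below, fake_download and download_all are hypothetical names, and asyncio.sleep stands in for the real HTTP request that the project makes with aiohttp.

```python
import asyncio


async def fake_download(file_id, semaphore, stats):
    # Hypothetical stand-in for one download; only max_concurrent
    # tasks can be inside this block at the same time.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulates waiting on the network
        stats["active"] -= 1
    return file_id


async def download_all(file_ids, max_concurrent):
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    # gather returns the results in the same order as the inputs.
    results = await asyncio.gather(
        *(fake_download(file_id, semaphore, stats) for file_id in file_ids)
    )
    return results, stats["peak"]


if __name__ == "__main__":
    ids = [f"file-{i}" for i in range(10)]
    results, peak = asyncio.run(download_all(ids, max_concurrent=3))
    print(results)
    print("peak concurrent downloads:", peak)
```

All ten tasks are created at once, but the peak number of downloads in flight never exceeds the semaphore value. This is how a setting like the CLI’s semaphores option can protect the server.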

Event Loop, Tasks, Coroutines and await

Event loops run asynchronous tasks and callbacks, perform network IO operations, and run subprocesses.

For each call to the download function, a task is created. Tasks are objects that schedule and run coroutines, which are functions that can be entered and exited multiple times, suspended and resumed each time. When a particular call is waiting for a response or a return from any I/O operation, it can be suspended so that other work gets done, and resumed later at the same point.

The await keyword marks the points where a coroutine waits for an operation to finish. At each of these points the coroutine hands control back to the event loop, which can run other tasks in the meantime; the coroutine continues with its next statement only once the awaited operation has completed (thus bringing in cooperative multitasking).
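A small sketch of these pieces working together, using only the standard library. The fetch coroutine is hypothetical, and asyncio.sleep stands in for waiting on an I/O response.

```python
import asyncio
import time


async def fetch(name, delay):
    # A coroutine: it is suspended at the await and resumed at the
    # same point once the wait is over.
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # create_task schedules each coroutine on the running event loop.
    tasks = [asyncio.create_task(fetch(f"file-{i}", 0.1)) for i in range(5)]
    # Awaiting a task suspends main() until that task has finished.
    return [await task for task in tasks]


def run():
    start = time.perf_counter()
    results = asyncio.run(main())  # asyncio.run creates and closes the event loop
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run()
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

The five waits overlap, so the total time is close to a single 0.1 second wait rather than the 0.5 seconds that five sequential waits would take.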

The Command Line Interface:

Across the organisation, CLIs are written using the click library. I’ve added my functionality as:

gen3 file download-manifest <manifest_file> --cred=<credentials.json> --path=<download_path> --semaphores=<number_of_semaphores>

After executing the above command, the terminal shows one progress bar for each file’s download and another for the overall progress through the manifest.

As future work, I also want to add functionality to download a single file by providing its object-ID:

gen3 file download-single <object-id> --cred=<credentials.json> --path=<download_path>

Testing

The code is covered by unit tests written with Python’s pytest library.

It covers various edge cases, such as testing download requests when the user has no/incorrect authorisation, or requests where the manifest provided by the user is in the wrong format.
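The two kinds of edge case mentioned above can be sketched as follows. This is not the project’s test suite: parse_manifest and download_entry are hypothetical helpers, and the manifest layout (a JSON list of objects with an object_id) is an assumption. The tests are plain functions with assert statements, which pytest can discover, and they only use the standard library so that they also run on their own.

```python
import asyncio
import json
from unittest.mock import AsyncMock


def parse_manifest(text):
    """Hypothetical helper: parse manifest JSON into a list of entries."""
    try:
        entries = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError("manifest is not valid JSON") from exc
    if not isinstance(entries, list) or not all(
        isinstance(entry, dict) and "object_id" in entry for entry in entries
    ):
        raise ValueError("manifest must be a list of objects with an 'object_id'")
    return entries


async def download_entry(entry, get_url):
    """Hypothetical helper: True on success, False if access is denied."""
    try:
        await get_url(entry["object_id"])
    except PermissionError:
        return False
    return True


def test_valid_manifest_is_parsed():
    assert parse_manifest('[{"object_id": "a"}]') == [{"object_id": "a"}]


def test_manifest_in_wrong_format_is_rejected():
    for bad in ("not json", '{"object_id": "a"}', '[{"file_name": "x"}]'):
        try:
            parse_manifest(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")


def test_download_without_authorisation_fails():
    # The mocked URL lookup behaves like a server that denies access.
    get_url = AsyncMock(side_effect=PermissionError("403"))
    assert asyncio.run(download_entry({"object_id": "a"}, get_url)) is False
    get_url.assert_awaited_once_with("a")
```

With pytest, the try/except loop would more commonly be written with pytest.raises(ValueError).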

The process

The user provides, in the command line interface, the manifest they want to download, along with their credentials (authorisation) and the path where they want to store the downloaded files.
The manifest is processed and stored as Python objects.
For each entry in the manifest, a task is created and appended to a list of tasks. Later, asyncio.gather(*tasks) runs all the instances of the coroutine concurrently.
For each task, the function gets the download URL by using pre-existing gen3 functions and streams the response content to write to a file.
The content is streamed so that iter_content on the Response object can iterate over the data from an open connection. iter_content still works without stream, but the whole response is then downloaded into memory first (probably doubling the execution time to achieve the same result).
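The steps above can be sketched end to end with the standard library. This is a simplified illustration under stated assumptions, not the project’s code: fetch_chunks is a hypothetical stand-in for a streamed HTTP response, the manifest fields object_id and file_name are assumed, and the file is written with the built-in open where the project uses aiofiles.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def fetch_chunks(object_id, chunk_size=4):
    """Hypothetical stand-in for a streamed HTTP response body."""
    payload = f"contents of {object_id}".encode()
    for start in range(0, len(payload), chunk_size):
        await asyncio.sleep(0)  # yield to the event loop, as a network read would
        yield payload[start:start + chunk_size]


async def download_one(entry, download_dir):
    target = Path(download_dir) / entry["file_name"]
    # Write chunk by chunk so the whole file is never held in memory.
    with open(target, "wb") as handle:
        async for chunk in fetch_chunks(entry["object_id"]):
            handle.write(chunk)
    return target


async def download_manifest(manifest_text, download_dir):
    entries = json.loads(manifest_text)  # manifest stored as Python objects
    # One task per manifest entry, collected in a list.
    tasks = [
        asyncio.create_task(download_one(entry, download_dir)) for entry in entries
    ]
    return await asyncio.gather(*tasks)


def run_demo():
    manifest = json.dumps([
        {"object_id": "id-1", "file_name": "a.txt"},
        {"object_id": "id-2", "file_name": "b.txt"},
    ])
    with tempfile.TemporaryDirectory() as tmp:
        paths = asyncio.run(download_manifest(manifest, tmp))
        return {path.name: path.read_bytes() for path in paths}


if __name__ == "__main__":
    print(run_demo())
```

Because each chunk is written as soon as it arrives, memory use stays bounded by the chunk size, which is the point of streaming the response.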

Future Scope:

Download multiple files in the background and return:

which files were duplicates
which were renamed
which encountered an error

Challenges I faced

Familiarizing myself with the repository took longer than I expected, but my mentor helped me out by pointing out the information relevant to my work
How to write code asynchronously; it took me a while to figure out how to use the asyncio library as efficiently as possible to speed up the download
How to write unit tests; mocking all the responses from the requests library was something I had never done before, so it took a little more research and a few attempts

What I learnt in GSoC

How to use GitHub
How the transparency of open source code pushed me to do better; even if I couldn’t complete the project, I knew if I left a good code base for someone new to come along and contribute, it would still be a good contribution to the organisation
How the asynchronous Python libraries work
How to add to an already existing code base without disrupting what was already working before my project
How important unit testing is for merging new code with the code base

My GSOC journey (May 2022 — September 2022)

May 20 — June 12 (Community Bonding period):

The organisation added me to their Slack and I introduced myself
Had my first one-on-one with my mentor, who welcomed me into the project; together we planned out the course of events for the next three months

June 12 — July 29 (Coding Period phase 1):

Familiarised myself with the Gen3 SDK Python repository
Started working on the download functionality
Started working on the command line interface for the asynchronous download functionality
Midterm review

July 29 — September 5 (Coding Period phase 2):

Made changes to the asynchronous download code to make it more object-oriented
Combined the CLI and the download code
Performed unit tests on the code

Overall Experience

GSoC was a great experience, one which is rarely available in a fast-paced college life. I’ve learnt a lot in the course of the last 3 months about how to research, good coding habits and how to make your work presentable to the world. I would recommend GSoC to everyone I know, and would like to apply again next year, maybe for a larger project!

Thank you for reading the blog! Please leave your feedback in the comment section.

Sanskriti Mathuria

GSoC Contributor, 2022