Easily parallelize large-scale data copies into Google Cloud Storage
(or deletions, moves, and rsyncs)
Note: This guide is geared towards Google Cloud tools and products, but the basic principles translate just fine to AWS or even NFS-to-NFS transfers. I will give some examples using the tool rclone later on for multi-cloud operations. This guide also focuses on keeping things simple; more advanced ways to accomplish the same logic, along with some tools I recommend, are covered at the end.
Google Cloud is a wonderful platform for dealing with the ever-growing problem of having too much data.
My day job at a genomic research institution frequently has me running commands across comically large sets of data, often in the multi-million (sometimes billion!) object range. Before joining this company I doubt I'd ever cracked 1,000 files in a single copy command, but soon enough I was casually tasked with moving petabytes into Google on a semi-regular basis.
For anyone who isn't used to dealing with this scale, such large operations may seem daunting. But thankfully, if you have a few hosts (or one host with plenty of memory, since gsutil can be a memory hog) that you can commandeer for the task, we can chug through such a request in no time.
- Starting out, make sure you have gsutil or the gcloud SDK installed. Both can be downloaded from here: https://cloud.google.com/sdk/install
- Make sure you authenticate the client to an account that can perform the operations you plan to do. Usually this is done by issuing
gcloud init
or, for a service account,
gcloud auth activate-service-account
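For example, activating a service account with a downloaded JSON key typically looks something like this (the key file path is just a placeholder):
gcloud auth activate-service-account --key-file=/path/to/service-account-key.json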
- We need to start with a list of the files/objects we plan to work with. If you already have this, then great! If not, we can easily generate it by cd'ing into our chosen directory and running
find $PWD -type f > list-of-files.txt
Note the use of $PWD, which prepends the absolute path to the printed filenames. If you are dealing with millions or billions of files, this can take some time to run.
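If your source is already a Google Cloud bucket (for example, when preparing a large deletion), a recursive listing can serve the same purpose; the bucket name here is just a placeholder:
gsutil ls gs://my-bucket/** > list-of-objects.txt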
- With that file finished and saved somewhere, we can now chunk it out into a number of source files. One quick and easy way to do this is to leverage the built-in Linux tool
split
First, decide how many copies of gsutil you plan to kick off at once. This can be anywhere from 3 to 500+ depending on how many hosts you can leverage; let's refer to this number as X.
With that number X we can issue a command to split the source file into X chunk files, each containing close to the same number of lines to work with. Make sure you are currently in the same directory as the long list of files, or put the path to it in the command below.
split --number=r/X -e --numeric-suffixes list-of-files.txt file-chunk.
This command runs pretty quickly; 4 million lines takes about 3 seconds.
The --number=r/X option means we are round-robining the file names into each of our chunk files, so we get an even spread across the files.
If you plan to produce over 100 chunks, you will likely need to add the option --suffix-length=3, otherwise you will get a suffix exhaustion error. For 1,000 chunks make it 4, and so on.
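As a concrete sketch, if you settled on 100 parallel workers (X=100), the full command might look something like this (the chunk count and suffix length are just example values):
split --number=r/100 -e --numeric-suffixes --suffix-length=3 list-of-files.txt file-chunk.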
- Now we have our input chunk files generated inside the current directory, all of them titled
file-chunk.00
and so on, with the suffixes increasing up to the chosen number. These will be the files we cat into gsutil.
So from there we can decide how we want to balance these out. If you are using multiple hosts, you could write a quick bash script to ssh each gsutil command out to a different host (see the sketch in the Scripting the Process section). Or, if you work for an institution with a job scheduler like we have, you can launch a task array of gsutil commands, each using one of the source files. That is explained in more detail in the Using a Scheduler section below.
- The general format of the command we will run on each of these chunk files looks like
cat file-chunk.XX | gsutil -m COMMAND -I
The key here is the -I flag used with gsutil. This tells gsutil to expect input from STDIN, which we provide by cat'ing the input file.
So, for example, to copy the source data listed in the first chunk file file-chunk.00 to your Google bucket called gs://my-bucket, you would run the following
cat file-chunk.00 | gsutil -m cp -I gs://my-bucket > log-file.txt
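The same pattern works for the other operations mentioned at the top. For instance, a parallel delete would look something like this, assuming the chunk files contain gs:// object URLs (for example, generated from a gsutil ls listing) rather than local paths:
cat file-chunk.00 | gsutil -m rm -I > log-file.txt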
It's important to log the output from these commands to a file instead of printing it to the terminal. When working with millions of files, it is much easier to search for things like errors if we can parse a log file using tools like grep.
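For example, a quick scan of a log for failures might look something like this (the exact error strings vary by operation, so treat the pattern as a starting point):
grep -iE 'error|exception' log-file.txt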
Also important is the -m flag, which launches gsutil in multi-threaded mode, running more than one operation in parallel. If you are feeling adventurous you can even tune this behavior through a .boto file using the parallel_thread_count and parallel_process_count variables. More on that here: https://cloud.google.com/storage/docs/gsutil/commands/cp
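If you do go down that road, those settings live in the [GSUtil] section of your .boto file; the values below are only illustrative and worth benchmarking against your own hosts:
[GSUtil]
# processes and threads gsutil -m may use on this host
parallel_process_count = 8
parallel_thread_count = 10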
- Once we know the jobs have all finished, check the logs for errors. If you do see failures and believe them to be transient, rerunning is as simple as re-running the same command with the same chunk file. You may want to output to a new log file so you can double-check that it has no errors the second time around. If everything comes out with no errors, then you are done!
Results:
Recently we needed to delete 7 billion objects in a Google Cloud bucket. Using this method, we launched a task array on our scheduler with 100 tasks running at a time, and we deleted all the objects in less than an hour. Fantastic. This ended up saving us $12k of spend every day.
I've also had the pleasure of moving multi-petabyte sets of data into Google, and I can often max out our 40 Gbps network pipe and chug through the whole transfer in just a few days.
Scripting the Process:
What's nice about this whole process is that it's easily scripted or automated. We ran through everything rather manually the first time around so that you can learn what is happening behind the scenes and tweak it if necessary. Tools like gsutil have plenty of options and configurations you can play with to customize things to suit your workflow.
If you have the ability to write a quick script in something like bash or Python, then automating these steps is fairly trivial.
The major steps you would likely want to automate are: generating the source list of files (or objects) to work with, running the chunk-file generation command, and then assembling a gsutil (or other) command that takes the newly generated chunk files as input.
If you use a language like Python, for example, modules like Paramiko will allow you to ssh to a list of hosts you control and run the generated commands on each remote host. This allows the process to be scaled up easily; you are only limited by the number of hosts you can control and, for data transfers, the bandwidth available to you.
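If you would rather stay in bash, a plain ssh loop accomplishes the same fan-out. This is only a rough sketch: the hostnames, paths, and bucket are placeholders, and it assumes the chunk files live on shared storage visible to every host.
#!/bin/bash
# Launch one gsutil copy per chunk file, each on its own host, in the background.
HOSTS=(host01 host02 host03)
i=0
for chunk in /shared/path/file-chunk.*; do
    host=${HOSTS[$((i % ${#HOSTS[@]}))]}
    ssh "$host" "cat $chunk | gsutil -m cp -I gs://my-bucket > $chunk.log 2>&1" &
    i=$((i + 1))
done
wait   # block until every remote copy has finished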
Using a Scheduler:
To take SGE as an example, we can write a quick script that uses the $SGE_TASK_ID environment variable to iterate through our chunks.
First you would generate a file that contains the list of chunk files:
find . -name "file-chunk.*" > dirfile.txt
Then we will write a quick bash script to submit to the scheduler, which automatically selects the chunk-file input for our chosen command.
As an example, you could build a gsutil cp script like this:
#!/bin/bash
# Pick the chunk file whose line number matches this task's $SGE_TASK_ID
DIRFILE=dirfile.txt
CHUNK_FILE=$(awk "NR==$SGE_TASK_ID" $DIRFILE)
cat $CHUNK_FILE | gsutil -m cp -I gs://my-bucket > gsutil-log.$SGE_TASK_ID
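Submitting that script as a task array might then look something like the command below; the script name is a placeholder, the -t range should match your number of chunk files, and -tc (which caps concurrent tasks) may vary by Grid Engine version:
qsub -t 1-100 -tc 100 copy-chunks.sh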
Notes:
You big-data-transfer folks out there might be shaking your heads, saying this won't help with balancing file size/count discrepancies: some hosts could be stuck chugging through huge files while others are stuck doing millions of tiny ones. Well, this can be addressed fairly easily using a wonderful tool called fpart: https://github.com/martymac/fpart
This tool lets you build the chunk files based on the size of the files you are working with, meaning you get size-balanced chunks so each host is doing a similar amount of work at any given time.
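As a rough sketch, asking fpart for 100 size-balanced partitions written to numbered list files might look like the following; the option names are worth verifying against the fpart man page for your version:
fpart -n 100 -o file-chunk /path/to/source/dir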
Others might be asking, "What about AWS?" This method can easily be used with a fantastic tool called rclone, which can perform a slew of operations across a variety of cloud providers: https://rclone.org/
This tool has a --files-from option that lets you feed it the same source chunk files, which you could leverage to transfer to AWS or Azure. I've personally used this tool to move many petabytes over the years using the same logic outlined here.
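For instance, a chunked copy to an S3-style remote might look something like this; the remote name, bucket, and paths are placeholders, and note that rclone interprets the paths in --files-from relative to the source directory:
rclone copy --files-from file-chunk.00 /path/to/source/dir aws-remote:my-bucket --log-file rclone-log.00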
I even built something that combines the two more advanced tools above behind a Python wrapper to make things easier. It's called dsync and can be found here: https://github.com/daltschu22/dsync
I'd love to hear any feedback from folks. This tool works with rclone and fpart, allowing for very fast large-scale data transfers into the cloud. It can also use regular old rsync, meaning big transfers on local network storage work just as well.
I currently work for a genomic biotech in Boston. I love tackling big data challenges and would love to hear if this article helped you out at all! Please feel free to email me at DavidAltschuler@gmail.com with any questions, concerns, comments, etc.
Thank you for reading!