Without a proper introduction, it can be surprisingly tricky to send large amounts of data from your local machine to a cloud server. Thankfully, open source tools like rsync exist to make this process easier. My goal is to introduce these tools to fellow developers and help others (and remind my future self) reduce the set-up time for remote, cloud-based projects.
I found rsync most useful when transferring nearly 60 GB of image data for an image analysis project I recently completed. With the optional verbosity flags, you get a nice print-out to confirm everything is happening as you expect. My favorite feature of rsync is that it continues wherever a previous sync terminated. This is especially helpful when syncing large directories with many files. If your internet connection drops, you can execute the same command again without needing to keep track of the last synced file.
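Resuming works because rsync skips files that already exist at the destination, and --partial keeps any half-transferred file so its data can be reused. A minimal local sketch (the directory names are throwaway examples):

```shell
# Throwaway directories for demonstration
mkdir -p demo_src demo_dst
echo "some data" > demo_src/big_file.bin

# --partial keeps partially transferred files, so an interrupted copy
# can reuse the data already sent instead of restarting from zero
rsync -av --partial --progress demo_src/ demo_dst/

# After an interruption, re-run the exact same command: completed
# files are skipped, and partial files pick up where they left off
rsync -av --partial --progress demo_src/ demo_dst/
```

The second invocation is deliberately identical to the first; that is the whole point of the resume behavior.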
This post assumes you have some familiarity with AWS EC2 and know how to launch a new instance. If you are unfamiliar with AWS EC2, you can read more information in the Amazon docs.
Step 1: Visit AWS and ramp up your EC2 Instance
Ramp up your EC2 instance by logging into AWS. Go to the EC2 dashboard → Instances → Actions → Instance State → Start.
Once your instance is running, copy the public DNS name, and go to your local machine's command line.
You can SSH into the instance with:
ssh -i /path_to/your_private_key.pem server_root@public-DNS
If prompted, type yes, and you’ll be on the remote server.
Step 2: Check available filesystem volume.
This is a bit of a technical step, but can be important if you want to free up space for the files you hope to upload. If you have an empty server, you can move on to Step 4.
You can check for available space on the server with the df -h command in the command line. This will display output like this:
ubuntu@ip-xxxxx:~/anaconda3$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 15G 0 15G 0% /dev
tmpfs 3.0G 8.9M 3.0G 1% /run
/dev/xvda1 73G 66G 7.6G 90% /
tmpfs 15G 0 15G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 15G 0 15G 0% /sys/fs/cgroup
/dev/loop1 17M 17M 0 100% /snap/amazon-ssm-agent/784
/dev/loop0 18M 18M 0 100% /snap/amazon-ssm-agent/930
/dev/loop2 91M 91M 0 100% /snap/core/6350
/dev/loop3 90M 90M 0 100% /snap/core/6130
/dev/loop4 88M 88M 0 100% /snap/core/5742
tmpfs 3.0G 0 3.0G 0% /run/user/1000
/dev/xvdf1 74G 15G 13G 84% /data
Here I can see that the bulk of my instance’s storage, /dev/xvda1, is at 90% capacity.
However, I have another Elastic Block Store (EBS) volume attached to my instance, as we can see at the bottom of the output: /dev/xvdf1.
Regardless of the architecture of your instance, you want to search for the main block storage and make sure there is available space. All of the tmpfs items refer to temporary storage (erased every time the instance is stopped).
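If you only care about one filesystem, df also accepts a path argument; a quick sketch (the mount point / is just an example):

```shell
# Report usage for the filesystem that holds / only
df -h /

# Pull out just the "Avail" column with awk (row 2 is the data line)
df -h / | awk 'NR==2 {print "Available on /: " $4}'
```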
Step 3: Check volume of specific files and directories, and remove unnecessary data.
If your server has a nearly full filesystem (and you don’t know why), it can be helpful to see which directories or files are taking up space. In a directory, the du -hs * command will tell you how much space each file or subdirectory is taking up.
For example, my server’s /dev/xvda1 filesystem had significant volume taken up by all of the native Anaconda environments that came pre-installed on my server. The du -hs * command was a simple, handy way for me to locate unneeded files and free up space for more important data. I now incorporate this into my cloud-based workflow.
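Piped through sort, du will also rank the offenders for you; a small sketch using throwaway files:

```shell
# Throwaway directory with files of different sizes
mkdir -p du_demo
head -c 1024   /dev/zero > du_demo/small.bin   # 1 KB
head -c 524288 /dev/zero > du_demo/large.bin   # 512 KB

# Human-readable sizes, sorted smallest to largest
cd du_demo && du -hs * | sort -h && cd ..
```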
To remove files:
rm /path_to/file_name
To remove directories:
rm -r /path_to_directory/directory_name
Step 4: Remote Sync!
First, check that rsync is installed on both your local and remote machines. Just type rsync --version. Rsync was pre-installed on both my computer and my remote server, so no troubles there.
If it is not installed, a simple `sudo apt install rsync` will work on your remote server.
With rsync installed, we are ready to go! From your local machine, execute the following command.
rsync -av --progress -e "ssh -i /path_to/your_private_key.pem" /absolute_path_to/local_files/ remote_server_root@public_DNS:/absolute_path/remote_directory_destination
Breakdown of the command:
- rsync: Hello, I want to use rsync!
- -av: Archive mode plus verbose: -a copies recursively while preserving permissions, timestamps, and symlinks; -v prints each file as it is transferred.
- --progress: This shows per-file transfer progress in the print-out.
- -e "ssh -i /path_to/your_private_key.pem": This tells rsync to connect over SSH with your key, which gives you permission to write to your remote instance. Depending on how you have configured your private key, you may not need this. Make sure the quotes are true ASCII characters. I ran into trouble when copy-pasting the command from a notes app that converted the quotes to different characters.
- /absolute_path_to/local_files/: Including the trailing "/" makes a difference. Without the trailing slash, the directory itself will be copied to the destination. With the trailing slash, the directory will not be recreated, and all of its files will be synced directly into the remote directory you specify.
- remote_server_root@public_DNS: This is your remote server.
- :/absolute_path/remote_directory_destination: And finally, the destination for your files on your remote server.
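The trailing-slash rule is easy to verify for yourself, since rsync also syncs between local paths; a throwaway sketch:

```shell
mkdir -p slash_demo/src
touch slash_demo/src/a.txt

# No trailing slash: the src directory itself is created at the destination
rsync -a slash_demo/src slash_demo/dest1    # → slash_demo/dest1/src/a.txt

# Trailing slash: only the contents of src are copied
rsync -a slash_demo/src/ slash_demo/dest2   # → slash_demo/dest2/a.txt
```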
Any small error in the command or paths will result in an error. Please inspect your commands carefully before convincing yourself that rsync won’t work for you! It can be easy to have an error in such a long, path-y command!
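One way to catch mistakes before any data moves is rsync's dry-run flag; a local sketch with throwaway files (the commented remote command reuses the placeholder paths from earlier):

```shell
# Throwaway example: -n (--dry-run) previews a transfer without copying
mkdir -p dry_demo/src
touch dry_demo/src/report.csv

rsync -avn dry_demo/src/ dry_demo/dest   # lists report.csv, copies nothing

# The same flag works unchanged for the remote command:
#   rsync -avn --progress -e "ssh -i /path_to/your_private_key.pem" \
#     /absolute_path_to/local_files/ \
#     remote_server_root@public_DNS:/absolute_path/remote_directory_destination
```

Once the dry run lists exactly the files you expect, drop the -n and run it for real.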
Thanks for reading!
Link to the full rsync documentation.