Rsync

schaffung · Published in Dev Genius · 6 min read · Jul 13, 2020

I have been exploring the geo-replication feature of glusterfs and stumbled upon a tool called rsync. What is rsync, and what does it do?

Photo by Drew Coffman on Unsplash

Imagine your personal library, with books stacked neatly on the shelves. You dust them regularly, refer to them for research or just for fun, and expand the library whenever you buy a new book. This is a happy scenario.

All your books are in one place. Well, that doesn’t sound so bad. But what if the house caught fire, or the library got destroyed for some other reason (get imaginative)? That would be a huge loss for someone who depends on these texts for day-to-day work.

Let’s suppose you start a new library. This time, you want to be sure that you never again go through the pain of losing your books and rebuilding the collection (which, of course, won’t be the same).

So, for every book you buy for your library, you buy a second copy and stash it somewhere away from your daily life. You’re creating an exact replica of your library in, say, your summer house.

Even if you lose every book in your current library, you can still get them all back, because a replica already exists someplace else.

Now, let’s get a little technical. Book → Data. You have data, maybe you’re managing a database, and you’re concerned about what might happen if the current arrangement fails. By current arrangement, I mean all the servers that currently work together to host your database.

A failure there could mean a complete loss of data and, eventually, of business. One might say, let’s just keep a copy of the data we are creating. Sounds sane. So, instead of one single copy, you replicate your data across the servers that work in tandem to host your database. Now, what if, instead of one server failing, the complete infrastructure is brought down, maybe by a natural disaster?

This is where geo-replication comes in handy. The data is not only replicated, it is replicated to a system that is geographically removed from the primary site, so that even if the primary infrastructure fails, the replica is safe from whatever caused the failure in the first place.

There is a lot of software, both open source and proprietary, for backing up the files you’re currently working on. Some of it can also back up your whole file system periodically. Glusterfs is a clustered network file system where the need for geo-replication is clear, and hence glusterfs has this feature.

We’ll look into the overall mechanism of geo-replication in glusterfs in some other story. Today, the focus is on rsync.

NOTE: Everything I explain from here on assumes a *nix-based (Unix-like) system.

Now, the question is: what does rsync do?

If one were to follow the definition given on the project’s site, “rsync is an open source utility that provides fast incremental file transfer”. There is no mention of geo-replication here, but we do have something about fast incremental file transfer.

So using rsync, one can transfer files from system A to system B, and it does so incrementally. That is, suppose we earlier copied the contents of directory xyz on system A to directory mno on system B. If we add a new file to xyz and run rsync again, only the change is transferred, not the complete directory.

Installing rsync is pretty straightforward. On fedora/centos/rhel, one just has to run a simple

# dnf install rsync

Similarly, on debian and ubuntu,

# apt-get install rsync

Now, before jumping into using rsync, a question might come up… and it should. How can I transfer files from system A to system B? Does it ask for my credentials every time? Or is it done automatically, maybe by storing my password or a key?

Well, rsync talks to the remote system over SSH, so to avoid typing a password on every run, one would have to create an SSH key pair (if one doesn’t exist already) using the command,

# ssh-keygen

This prompts for the location and name of the file in which the private key will be stored (the public key goes next to it with a .pub suffix), and for a passphrase. The passphrase is an extra layer of security in case your private key is compromised. For this walkthrough you can leave the passphrase empty, which lets rsync run unattended, but keep in mind that anyone who obtains the private key can then use it.
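If the prompts get in the way, for example in a script, the same key generation can be done non-interactively. This is a minimal sketch: the ed25519 key type, the comment text and the throwaway location under a temporary directory are my assumptions for the demo. A real key would normally live under ~/.ssh.

```shell
#!/bin/sh
# Non-interactive key generation into a throwaway directory,
# so that no existing key under ~/.ssh is touched.
set -eu
keydir=$(mktemp -d)
keyfile="$keydir/id_ed25519_demo"     # hypothetical file name

# -q quiet, -t key type, -N '' empty passphrase,
# -C comment stored in the public key, -f output file
ssh-keygen -q -t ed25519 -N '' -C 'rsync demo key' -f "$keyfile"

# The private key is $keyfile, the public key is $keyfile.pub.
pub_type=$(cut -d' ' -f1 "$keyfile.pub")
echo "public key type: $pub_type"
ssh-keygen -l -f "$keyfile.pub"       # print the key's fingerprint

rm -rf "$keydir"
```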

One point I missed earlier: the key has to be generated on the originating system, i.e. the system from which you’d transfer the files, because it is this system that contacts the backup system and asks it to accept the files. That implies the public key of system A has to be present on system B, so that system B can verify system A when it tries to connect.

So, one has to copy the newly created public key to system B. The ssh tool suite takes care of this with one more command-line utility,

# ssh-copy-id <system_b_user>@<system_b_hostname>

This command logs in to system B (asking for the password one last time), creates the .ssh directory there if it doesn’t already exist, and appends the public key to the authorized_keys file inside it.

So, we have our passwordless SSH setup. One can now jump straight to rsync and start using it.
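To make the effect on system B concrete, here is a rough sketch of the remote-side steps, run against a local directory that stands in for the home directory on system B. The install_key function, the placeholder key string and the stand-in directory are all hypothetical. This is not the actual ssh-copy-id implementation, only an approximation of what it leaves behind.

```shell
#!/bin/sh
# Approximation of what ends up on system B after ssh-copy-id,
# simulated locally so it can be tried without a second machine.
set -eu
remote_home=$(mktemp -d)    # hypothetical stand-in for ~ on system B
pubkey='ssh-ed25519 AAAA_PLACEHOLDER_NOT_A_REAL_KEY root@systemA'

install_key() {
    # $1 = home directory, $2 = public key line
    mkdir -p "$1/.ssh"
    chmod 700 "$1/.ssh"                      # sshd expects tight permissions
    touch "$1/.ssh/authorized_keys"
    chmod 600 "$1/.ssh/authorized_keys"
    # Append the key only if this exact line is not present yet.
    grep -qxF -- "$2" "$1/.ssh/authorized_keys" ||
        printf '%s\n' "$2" >> "$1/.ssh/authorized_keys"
}

install_key "$remote_home" "$pubkey"
install_key "$remote_home" "$pubkey"         # second call adds nothing

key_count=$(grep -c . "$remote_home/.ssh/authorized_keys")
echo "keys in authorized_keys: $key_count"
rm -rf "$remote_home"
```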

Let’s create a directory xyz in system A.

[root@systemA Desktop]# mkdir xyz

and mno in system B.

[root@systemB Desktop]# mkdir mno

Once this is done, let’s create, say, 1K files in xyz using touch.

[root@systemA Desktop]# touch xyz/file{1..1000}

On listing, one can see the 1K files created inside the xyz directory. The next step is to sync the files in directory xyz with directory mno using rsync.

[root@systemA Desktop]# rsync -av xyz/ root@systemB:/home/sysB/mno

Now, on listing the files under mno, one can see all the 1K files.

Let’s look into the options being used here.

  • a : archive mode, a shorthand for -rlptgoD. It is useful for the following reasons,

1. recursively copy files and directories

2. copy symlinks as symlinks

3. preserve permissions

4. preserve group

5. preserve modification time

6. preserve ownership

7. preserve device and special files

  • v : verbose

Apart from these, there are many other options. One should go through the rsync man page to get a hold of all of them.

Now, coming back to the command we used. The generic format can be thought of in this manner,

rsync -<options> <source directory/files> <destination>

The destination can be a remote system or just another directory on the same system.

Now, one thing to notice is that in the command we used to sync files across the systems, we wrote “xyz/” and not “xyz”. The reason is that we wanted to sync only the contents of xyz to the remote system, not the directory xyz itself. Without the trailing slash, rsync would create an xyz directory inside mno.

Suppose I delete a file in the directory xyz. Let’s say we just do the following,

[root@systemA Desktop]# rm xyz/file1
[root@systemA Desktop]# rsync -av xyz/ root@systemB:/home/sysB/mno

Will this sync the files? i.e., delete the file file1 in mno?

Well, no. It doesn’t.

For that, one would have to use the “--delete” flag.

[root@systemA Desktop]# rsync -av --delete xyz/ \
root@systemB:/home/sysB/mno

Now, if one checks the directory mno, one can see that the directories are in sync and file1 is deleted.

We haven’t gone into the depths of rsync, i.e. its internal workings. The reason is that there is already a good amount of material available on the internet about how it works. This story was written so that a user of, or a contributor to, glusterfs can understand the significance of rsync in their work with glusterfs.

One would have noticed that rsync on its own cannot be used for geo-replication. It just does incremental file transfer. So the logic for geo-replication must live in the software built around rsync on a given platform or file system.
