I have been using an rsync script to synchronize data on one host with the data on another host. The data consists of numerous small files that add up to almost 1.2TB.
In order to sync those files, I have been using the rsync command as follows:
rsync -avzm --stats --human-readable --include-from proj.lst /data/projects REMOTEHOST:/data/
The contents of proj.lst are as follows:
+ proj1
+ proj1/*
+ proj1/*/*
+ proj1/*/*/*.tar
+ proj1/*/*/*.pdf
+ proj2
+ proj2/*
+ proj2/*/*
+ proj2/*/*/*.tar
+ proj2/*/*/*.pdf
...
...
...
- *
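Since every project repeats the same five include rules, the list can be generated rather than maintained by hand. A minimal sketch, assuming all projects share that layout; make_include_rules is a hypothetical helper name, not part of rsync:

```shell
# Hypothetical helper: print the include rules for each project name given
# as an argument, followed by the final catch-all exclude.
make_include_rules() {
  for proj in "$@"; do
    # printf repeats the format once per argument, so this emits five lines.
    printf '+ %s\n' \
      "$proj" \
      "$proj/*" \
      "$proj/*/*" \
      "$proj/*/*/*.tar" \
      "$proj/*/*/*.pdf"
  done
  # Anything not matched by an include rule above is excluded.
  printf '%s\n' '- *'
}

# Usage (not executed here):
#   make_include_rules proj1 proj2 > proj.lst
```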
As a test, I picked two of those projects (8.5GB of data) and executed the command above. Being a sequential process, it took 14 minutes 58 seconds to complete. So, for 1.2TB of data it would take several hours.
If I could run multiple rsync processes in parallel (using &, xargs or parallel), it would save time.
I tried the command below with parallel (after cd-ing into the source directory) and it took 12 minutes 37 seconds to execute:
parallel --will-cite -j 5 rsync -avzm --stats --human-readable {} REMOTEHOST:/data/ ::: .
This should have taken 5 times less time, but it didn’t. I think I’m going wrong somewhere.
How can I run multiple rsync processes in order to reduce the execution time?
Answers:
Method 1
I would strongly discourage anybody from using the accepted answer. A better solution is to crawl the top-level directory and launch a proportional number of rsync operations.
I have a large zfs volume and my source was a cifs mount. Both are linked with 10G, and in some benchmarks can saturate the link. Performance was evaluated using zpool iostat 1.
The source drive was mounted like:
mount -t cifs -o username=,password= //static_ip/70tb /mnt/Datahoarder_Mount/ -o vers=3.0
Using a single rsync process:
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/ /StoragePod
the io meter reads:
StoragePod 30.0T 144T 0 1.61K 0 130M
StoragePod 30.0T 144T 0 1.61K 0 130M
StoragePod 30.0T 144T 0 1.62K 0 130M
In synthetic benchmarks (crystal disk), sequential write performance approaches 900 MB/s, which means the link can be saturated. 130 MB/s is not very good; it is the difference between waiting a weekend and two weeks.
So, I built the file list and tried to run the sync again (I have a 64 core machine):
cat /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount.log | parallel --will-cite -j 16 rsync -avzm --relative --stats --safe-links --size-only --human-readable {} /StoragePod/ > /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount_result.log
and it had the same performance!
StoragePod 29.9T 144T 0 1.63K 0 130M
StoragePod 29.9T 144T 0 1.62K 0 130M
StoragePod 29.9T 144T 0 1.56K 0 129M
As an alternative I simply ran rsync on the root folders:
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/Marcello_zinc_bone /StoragePod/Marcello_zinc_bone
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/fibroblast_growth /StoragePod/fibroblast_growth
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/QDIC /StoragePod/QDIC
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/sexy_dps_cell /StoragePod/sexy_dps_cell
This actually boosted performance:
StoragePod 30.1T 144T 13 3.66K 112K 343M
StoragePod 30.1T 144T 24 5.11K 184K 469M
StoragePod 30.1T 144T 25 4.30K 196K 373M
In conclusion, as @Sandip Bhattacharya brought up, write a small script to get the directories and parallel that. Alternatively, pass a file list to rsync. But don’t create new instances for each file.
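The "small script to get the directories" could look roughly like the following. This is a minimal sketch rather than the script actually used: it assumes GNU or BSD find and xargs, and sync_top_level_dirs and the RSYNC_CMD override are illustrative names introduced here.

```shell
# Sketch: start one rsync per top-level directory of SRC, at most JOBS at once.
# sync_top_level_dirs and RSYNC_CMD are illustrative names, not from the answer.
sync_top_level_dirs() {
  local src="$1" dest="$2" jobs="${3:-4}"
  # -mindepth 1 -maxdepth 1 restricts find to the immediate children of SRC;
  # -print0 with xargs -0 keeps directory names containing spaces intact.
  find "$src" -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -P "$jobs" -I{} ${RSYNC_CMD:-rsync} -h -r -t {} "$dest/"
}

# Usage (not executed here):
#   sync_top_level_dirs /mnt/Datahoarder_Mount/Mikhail /StoragePod 4
```

Files that sit directly in the source directory are not covered by this sketch; a final single rsync pass over the whole tree would pick them up.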
Method 2
The following steps did the job for me:
- Run rsync --dry-run first in order to get the list of files that would be affected.
$ rsync -avzm --stats --safe-links --ignore-existing --dry-run \
    --human-readable /data/projects REMOTE-HOST:/data/ > /tmp/transfer.log
- I fed the output of cat transfer.log to parallel in order to run 5 rsyncs in parallel, as follows:
$ cat /tmp/transfer.log | \
    parallel --will-cite -j 5 rsync -avzm --relative \
    --stats --safe-links --ignore-existing \
    --human-readable {} REMOTE-HOST:/data/ > result.log
Here, the --relative option ensured that the directory structure for the affected files remains the same at the source and the destination (inside the /data/ directory), so the command must be run in the source folder (in this example, /data/projects).
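Feeding every path to its own rsync means one process (and, for a remote host, one connection) per file. A variant of the steps above, not what this answer itself ran, is to split the list into a few chunks and hand each chunk to a single rsync via --files-from. A minimal sketch, assuming the list contains only file paths relative to the source directory (a raw -v --stats log also contains header and summary lines that would have to be stripped first); split_round_robin, sync_in_chunks and RSYNC_CMD are hypothetical names:

```shell
# Distribute the lines of LIST over up to N files named OUTDIR/chunk.<k>.
split_round_robin() {
  local list="$1" outdir="$2" n="$3"
  mkdir -p "$outdir"
  awk -v n="$n" -v d="$outdir" '{ f = d "/chunk." (NR % n); print > f }' "$list"
}

# Run one rsync per chunk in the background, then wait for all of them.
sync_in_chunks() {
  local list="$1" src="$2" dest="$3" jobs="${4:-5}" tmp chunk
  tmp=$(mktemp -d) || return 1
  split_round_robin "$list" "$tmp" "$jobs"
  for chunk in "$tmp"/chunk.*; do
    [ -s "$chunk" ] || continue   # skip the unexpanded glob when LIST is empty
    # --files-from reads paths relative to SRC and implies --relative.
    ${RSYNC_CMD:-rsync} -a --files-from="$chunk" "$src" "$dest" &
  done
  wait
  rm -rf "$tmp"
}

# Usage (not executed here; files.lst is a hypothetical cleaned-up list):
#   sync_in_chunks files.lst /data/ REMOTE-HOST:/data/ 5
```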
Method 3
I personally use this simple one:
\ls -1 | parallel rsync -a {} /destination/directory/
This is only useful when you have more than a few non-near-empty directories; otherwise almost every rsync terminates quickly and the last one does all the work alone.
Note the backslash before ls which causes aliases to be skipped. Thus, ensuring that the output is as desired.
Method 4
A tested way to do the parallelized rsync is: http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallelizing-rsync
rsync is a great tool, but sometimes it will not fill up the available bandwidth. This is often a problem when copying several big files over high speed connections.
The following will start one rsync per big file in src-dir to dest-dir on the server fooserver:
cd src-dir; find . -type f -size +100000 | \
  parallel -v ssh fooserver mkdir -p /dest-dir/{//}\; \
    rsync -s -Havessh {} fooserver:/dest-dir/{}
The directories created may end up with wrong permissions and smaller files are not being transferred. To fix those run rsync a final time:
rsync -Havessh src-dir/ fooserver:/dest-dir/
If you are unable to push data, but need to pull them and the files are called digits.png (e.g. 000000.png) you might be able to do:
seq -w 0 99 | parallel rsync -Havessh fooserver:src/*{}.png destdir/
Method 5
For multi destination syncs, I am using
parallel rsync -avi /path/to/source ::: host1: host2: host3:
Hint: All ssh connections are established with public keys in ~/.ssh/authorized_keys
Method 6
I always google for parallel rsync as I always forget the full command, but no solution worked for me as I wanted – either it includes multiple steps or needs to install parallel. I ended up using this one-liner to sync multiple folders:
find dir/ -type d|xargs -P 5 -I % sh -c 'rsync -a --delete --bwlimit=50000 $(echo dir/%/ host:/dir/%/)'
-P 5 is the number of processes you want to spawn; use 0 for unlimited (obviously not recommended).
--bwlimit avoids using all the bandwidth.
-I % sets % as the placeholder for each argument provided by find (a directory found in dir/).
$(echo dir/%/ host:/dir/%/) prints the source and destination directories, which are read by rsync as arguments. % is replaced by xargs with the directory name found by find.
Let’s assume I have two directories in /home: dir1 and dir2. I run find /home -type d|xargs -P 5 -I % sh -c 'rsync -a --delete --bwlimit=50000 $(echo /home/%/ host:/home/%/)'. So the rsync command will run as two processes (two processes because /home has two directories) with the following arguments:
rsync -a --delete --bwlimit=50000 /home/dir1/ host:/home/dir1/
rsync -a --delete --bwlimit=50000 /home/dir2/ host:/home/dir2/
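For the substitution to produce source and destination pairs of that shape, each input item has to be a bare directory name, whereas find prints full paths (including the starting directory itself) and descends recursively by default. A minimal sketch of one way to get there, assuming a find that supports -mindepth/-maxdepth and directory names without whitespace; sync_pairs is a hypothetical helper, not part of the original one-liner:

```shell
# Hypothetical helper: for each immediate subdirectory of BASE, print one line
# "BASE/name/ REMOTE/name/". Assumes directory names contain no whitespace.
sync_pairs() {
  local base="${1%/}" remote="$2" dir name
  find "$base" -mindepth 1 -maxdepth 1 -type d | sort |
    while IFS= read -r dir; do
      name="${dir##*/}"   # drop the leading path, keep only the directory name
      printf '%s/ %s/%s/\n' "$dir" "$remote" "$name"
    done
}

# Usage (not executed here): xargs -n 2 passes one source/destination pair
# to each rsync, and -P 5 runs up to five of them at a time.
#   sync_pairs /home host:/home | xargs -P 5 -n 2 rsync -a --delete --bwlimit=50000
```

Printing the pairs first also makes it possible to inspect them before running anything with --delete.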
Method 7
A more recent option to consider is using Fpsync, which wraps rsync but should be more efficient than launching an rsync process-per-file because it operates on “partitions” — batches of files. It also starts copying immediately — while still walking the directory tree — instead of waiting for the crawl to finish.
Here’s an example:
fpsync -vvv -o '-avm --numeric-ids --safe-links' -n 10 <SOURCE PATH> <DEST PATH>
I wish it provided nicer output… you actually get no output without at least specifying -v, but it does seem to maximize throughput for transfers.
Here’s the full manpage: http://manpages.ubuntu.com/manpages/bionic/man1/fpsync.1.html
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.