Parsyncfp2
MultiHost parallel rsync wrapper
parsyncfp2
A MultiHost parallel rsync wrapper writ in Perl, by Harry Mangalam (hjmangalam@gmail.com). Released under GPL v3.
(Changes moved to the bottom of this file)
Background
NB: If you don't want to transfer at least 10s of GB across a network, this is probably not the tool you want. Use rsync alone if you need or will need a sync operation, or scp if the data needs to be encrypted.
parsyncfp2 (aka pfp2) is the next generation of the family that started with parsync, which with Ganael LaPlanche's fpart, begat parsyncfp (aka pfp), which has further mutated into the MultiHost multi-send, multi-receive organism unimaginatively called parsyncfp2.
Like parsyncfp, pfp2 uses fpart to aggregate files into chunks (or partitions) that are allocated to individual rsyncs. The main difference between them is that pfp2 can spread the send and receive functions among multiple hosts (a shared filesystem is required on the sending side). As with pfp, it collects files by aggregate size into chunkfiles, which are fed to rsync chunk by chunk. This allows transfers to begin before the recursive descent of the source dir is complete, which can save many hours of prep time on very large dir trees. In addition, pfp2 can re-use the chunkfiles it has generated, so after an interruption you can skip regenerating the chunkfile list (which is pretty fast, but for a PB filesystem can still take a long time and generate a lot of competing IO).
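The size-based chunking described above can be sketched in a few lines of shell. This is a simplified illustration, not fpart's actual algorithm: the function name `chunk_by_size` and the `PREFIX.N` naming of the chunk lists are assumptions made for the example, and the greedy packing rule is only the most obvious one.

```shell
#!/bin/sh
# chunk_by_size DIR MAXBYTES PREFIX
# Hypothetical helper (not part of pfp2 or fpart): writes file lists
# PREFIX.1, PREFIX.2, ... so that each list holds at most MAXBYTES of data,
# except that a single file larger than MAXBYTES gets a list to itself.
# PREFIX should point outside DIR; file names containing newlines are
# not handled.
chunk_by_size() {
    dir=$1 max=$2 prefix=$3
    find "$dir" -type f | {
        n=1 total=0
        : > "$prefix.$n"                      # start the first chunk list
        while IFS= read -r f; do
            size=$(wc -c < "$f")
            # Start a new chunk if this file would overflow the current one.
            if [ "$total" -gt 0 ] && [ $((total + size)) -gt "$max" ]; then
                n=$((n + 1)) total=0
                : > "$prefix.$n"
            fi
            printf '%s\n' "$f" >> "$prefix.$n"
            total=$((total + size))
        done
    }
}
```

Each list could then be handed to its own rsync, for example `rsync -a --files-from=PREFIX.1 / TARGET:/dest` when the list holds absolute paths (`TARGET:/dest` is a placeholder). pfp2 builds and schedules the real rsync commands itself, so this only shows the idea.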
NB: recently fpart changed from starting its chunk files from 0 to 1, and this version of pfp2 is the first github release that tracks that change. Using fpart 1.5.1 works fine, as do the last couple of releases.
If your use involves transit over IB networks, parsyncfp requires 'perfquery' and 'ibstat', Infiniband utilities written by Hal Rosenstock < hal.rosenstock [at] gmail.com >
pfp2 is tested on Linux. The MacOSX port is in hibernation.
pfp2 needs to be installed only on the SOURCE end of the transfer and only works in local SOURCE -> remote TARGET mode (it won't allow local SOURCE <- remote TARGET, emitting an error and exiting if attempted). It requires that ssh shared keys be set up prior to operation (see here). If it detects that ssh keys are NOT set up correctly, it will ask for permission to try to remedy that situation. Check your local and remote ssh keys to make sure that it has done so correctly. Typically, they're in your ~/.ssh dir.
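Before a long transfer it can be worth confirming by hand that key-based login to the TARGET works. The sketch below is one way to do that; `key_login_ok` is a hypothetical helper name, and this is not necessarily how pfp2 performs its own check.

```shell
#!/bin/sh
# key_login_ok HOST
# Hypothetical helper: returns 0 if HOST accepts a non-interactive (key-based)
# ssh login, 1 otherwise. BatchMode=yes makes ssh fail instead of prompting
# for a password; ConnectTimeout bounds the wait on an unreachable host.
key_login_ok() {
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1; then
        return 0
    fi
    return 1
}
```

If the check fails, the usual manual remedy is `ssh-keygen -t ed25519` (if you have no key yet) followed by `ssh-copy-id user@TARGET`, then running the check again.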
It uses whatever rsync is available on the TARGET. It uses a number of Linux-specific utilities so if you're transferring between Linux and a FreeBSD host, install pfp2 on the Linux side.
Installation
Installation of 'parsyncfp2' is fairly simple. There's not yet a deb or rpm package, but the bits to make it work that are not part of a fairly standard Linux distro are the Perl scripts parsyncfp2, scut (like cut but a bit more flexible), and stats (spits out descriptive statistics of whatever is fed to it).
The rest of the dependencies are listed here:
- Debian/Ubuntu-like:
sudo apt install ethtool iproute2 fpart iw libstatistics-descriptive-perl infiniband-diags
git clone https://github.com/hjmangalam/parsyncfp2
cd parsyncfp2; cp parsyncfp2 scut stats ~/bin
- RHEL/CentOS/Rocky-like:
sudo yum install iw fpart ethtool iproute perl-Env.noarch perl-Statistics-Descriptive wireless-tools infiniband-diags
git clone https://github.com/hjmangalam/parsyncfp2
cd parsyncfp2; cp parsyncfp2 scut stats ~/bin
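The `cp ... ~/bin` step assumes that `~/bin` exists and is on your PATH. A minimal sketch for checking that follows; `path_has_dir` is a hypothetical helper name, not something shipped with pfp2.

```shell
#!/bin/sh
# path_has_dir DIR [PATHSTRING]
# Returns 0 if DIR is one of the colon-separated entries of PATHSTRING
# (default: the current $PATH), 1 otherwise.
path_has_dir() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

Typical use: run `mkdir -p "$HOME/bin"` before the copy, then `path_has_dir "$HOME/bin" || echo "add $HOME/bin to PATH"`. If the directory is missing from PATH, add it in your shell's startup file.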
Required utilities and packages
Should the above commands not fulfill the requirements or be missing from your set of repositories, the utilities are listed below.
- ethtool - query or control network driver and hardware settings. Install via repository.
- ip - show / manipulate routing, network devices, interfaces and tunnels. Install via repository.
- fpart - Sort and pack files into partitions. Now in many distro repositories, or install from the fpart github.
- scut - a more intelligent cut. Included in the parsyncfp2 github
- stats - calculate descriptive stats from STDIN. Included in the parsyncfp2 github
- Statistics::Descriptive (Perl module) - basic descriptive statistical functions, but pfp will work without it.
Recommended Utilities
- iwconfig - configure a wireless network interface. Needed only for WiFi. Install via repository.
- perfquery - query InfiniBand port counters. Needed only for InfiniBand. Install via repository.
- udr (experimental) - utility to send packets via UDP using the UDT library. Required for the --udr option. Install via github
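A quick way to see which of the utilities listed above are missing from a host is to probe for each with `command -v`. The helper name `missing_tools` is an assumption for this sketch.

```shell
#!/bin/sh
# missing_tools NAME...
# Prints, one per line, each NAME that is not found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools ethtool ip fpart scut stats` covers the required set, and `missing_tools iwconfig perfquery udr` covers the recommended ones. No output means everything was found. For the Perl module, `perl -MStatistics::Descriptive -e 1` exits nonzero when the module is not installed.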
Changes
2.59A
- fixed stupid, small, but fatal, hack introduced while fiddling with multi-host POD target which caused single-host commands to skip parse_rsync_target()
stats
- no change to pfp2, but added sample size estimation to stats.
2.59
- found and corrected the lagging print routines after launching the last rsyncs, via:
- rearranging the data print routine as a sub, tighten up the timing and calculations; moved some vars to our(vars); points to better ways of rearranging a lot of vars for this and other libs.
- added a few more fpart/rsync overrun/collision detection stuff
- increase the number of BW checks before the WARN to prevent spamming the screen on startup, when there will usually be a delay before fpart produces enough usable chunks.
2.571
- fixed the zombie --ro problem (again) that refuses to die
- added surprising stats from GPFS to GPFS rsyncs. (~6x speedup over single rsync with --NP=12)
2.57
- verified that pfp2 now/again works across mounted filesystems, tho (probably) much slower than across networks. Still, it increases speed of rsyncs across (parallel) FSs significantly.
- added some more debug lines to estimate location of non-pfp2 error messages
- fixed the zombie --ro problem (again) that refuses to die
- added a printout of actual rsync commands under VERBOSE=3 for additional debugging. Maybe limit them to only 5 and then stop? Not yet..
2.56
- minor changes, add usability.
- in Multihost mode, added code for allowing t/csh shells (as well as previous sh-like shells) on SEND hosts (so setting remote RPATHS should work)
- some help edits to clarify how things work.
- more fixes for the rsync options. Now DON'T need double quoting - pfp2 now takes care of that internally. And a quick followup to fix that fix (if didn't supply an --ro, would die.)
- fixed erroneous 'fix' of # of rsyncs going (should almost always be at NP)
2.55
- some major fixes, but not major behavioral changes.
- figured out why rsync options (--ro) were failing sometimes - they need to be '"double quoted"' (and then re-double quoted) to make it thru getopt & sending to the SEND hosts
- thanks to GabeT for the bug report and fix for creating the partial pfp2 command that goes to SEND hosts.
- And alerting me to some other issues with that process. Prob more to come.
- fixed the case where SEND hosts (and remote servers) have users other than the originating account
- more interface cleanups, added a sub clearline to blank overwritten info lines.
- removed a bunch of zombie vars.
2.51
- major changes in this release.
- this is the first version that cooperates with the new version of fpart that starts numbering chunks at 1 rather than 0. So the current version of fpart 1.5.1 should work fine. This change in numbering allows special handling of files larger than chunk size, now in process. If you dig in the code, you'll see special handling for zillions of tiny files as well, also in progress, but also not well-debugged. Stay away from those options.
- the scrolling output now is maintained until all rsyncs spawned by the SEND hosts have ended. Before, pfp2 ended when all of the rsyncs were started.
- fixed a major, if largely invisible bug in the way fpart was launched.
- slowly reducing functionally duplicated variables
- reduced # of tests that access filesystem via some primitive logic.
- the checkhost validation file (now "$HOME/.pfp-hostchecked") has been moved outside the .pfp dir tree so it survives the default deletion and renewal of ~/.pfp2
- there are some files that pfp2 is missing: a file named '2.1.1'$'\n' including the enclosing 's and the literal '$'\n' was skipped.
- have started to doc internal functional chunks with the prefix "##:". If you actually want to get a sense of where the bits are that might address a bug you found, try "grep -n '##:' parsyncfp2"
- trying to find the balance between helpful and annoying verbosity by merging, removing frequent or large text emissions.
- many, many bugs fixed. Many, many remain, but seem to be lower level.
2.44
- added some regex to prevent the kill script from killing the remote rsync daemon in case you're using it as root. Very much NOT recommended, but some ppl require this, apparently.
- fixed the bug that caused 2 dirs for each send host to be created if you specified them with something other than the correct short host name ('hostname -s'). ie: if you spec'ed a send host as '128.200.43.11' it would create a dir named that, and later, the 'hostname -s' of the same machine. Cost ~.5s per host, but fixes it simply.
2.43
- removed the option of calling it '--rsyncopts', now just --ro to prevent some regex filtering problems in the past.
2.42
- add --skipto - esp useful with --reuse to skip to that chunk without wading thru all the intervening fpart chunks and rsyncs. Could save minutes to 10s of min in restarting a huge transfer, especially if the network has significant delay and the --slowdown delay is significant. so --reuse and skipto are both required and just s
