Let’s say I have a 4 GB file abc on my local computer. I uploaded it to a distant server via SFTP, which took a few hours.
Now I have slightly modified the file (probably 50 MB maximum, but not consecutive bytes in this file) locally, and saved it into abc2. I also kept the original file abc on my local computer.
How to compute a binary diff of abc and abc2?
Applications:
- I could send only a patch file (probably max 100 MB) to the distant server, instead of reuploading the whole abc2 file (it would take a few hours again!), and recreate abc2 on the distant server from abc and patch only.
- Locally, instead of wasting 8 GB to back up both abc and abc2, I could save only abc + patch, so it would take < 4100 MB only.
How to do this?
PS: for text, I know diff, but here I’m looking for something that could work for any raw binary format, it could be zip files or executables or even other types of file.
PS2: If possible, I don’t want to use rsync; I know it can replicate changes between 2 computers in an efficient way (not resending data that has not changed), but here I really want a patch file, so that abc2 can be reproduced later if I have both abc and patch.
Answers:
Method 1
For the second application/issue, I would use a deduplicating backup program like restic or borgbackup, rather than trying to manually keep track of “patches” or diffs. The restic backup program allows you to back up directories from multiple machines to the same backup repository, deduplicating the backup data both among fragments of files from an individual machine and between machines. (I have no user experience with borgbackup, so I can’t say anything about that program.)
Calculating and storing a diff of the abc and abc2 files can be done with rsync.
This is an example with abc and abc2 being 153 MB. The file abc2 has been modified by overwriting the first 2.3 MB of the file with some other data:
$ ls -lh
total 626208
-rw-r--r-- 1 kk wheel 153M Feb 3 16:55 abc
-rw-r--r-- 1 kk wheel 153M Feb 3 17:02 abc2
We create our patch for transforming abc into abc2 and call it abc-diff:
$ rsync --only-write-batch=abc-diff abc2 abc
$ ls -lh
total 631026
-rw-r--r-- 1 kk wheel 153M Feb 3 16:55 abc
-rw------- 1 kk wheel 2.3M Feb 3 17:03 abc-diff
-rwx------ 1 kk wheel 38B Feb 3 17:03 abc-diff.sh
-rw-r--r-- 1 kk wheel 153M Feb 3 17:02 abc2
The generated file abc-diff is the actual diff (your “patch file”), while abc-diff.sh is a short shell script that rsync creates for you:
$ cat abc-diff.sh
rsync --read-batch=abc-diff ${1:-abc}
This script modifies abc so that it becomes identical to abc2, given the file abc-diff:
$ md5sum abc abc2
be00efe0a7a7d3b793e70e466cbc53c6 abc
3decbde2d3a87f3d954ccee9d60f249b abc2
$ sh abc-diff.sh
$ md5sum abc abc2
3decbde2d3a87f3d954ccee9d60f249b abc
3decbde2d3a87f3d954ccee9d60f249b abc2
The file abc-diff could now be transferred to wherever else you have abc. With the command rsync --read-batch=abc-diff abc, you would apply the patch to the file abc, transforming its contents to be the same as the abc2 file on the system where you created the diff.
Re-applying the patch a second time seems safe. There are no error messages, and the file’s contents do not change (the MD5 checksum stays the same).
Note that unless you create an explicit “reverse patch”, there is no way to easily undo the application of the patch.
I also tested writing the 2.3 MB modification to some other place in the abc2 data, a bit further in (at about 50 MB), as well as at the start. The generated “patch” was 4.6 MB in size (two modified regions of 2.3 MB each), suggesting that only the modified parts were stored in the patch.
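The “reverse patch” mentioned above can be produced the same way, by swapping the two file names while both versions are still available. This is only a minimal sketch, assuming rsync is installed; the helper name make_batches and the -undo suffix are made up for illustration and are not something rsync provides:

```shell
# Sketch: write a forward and a reverse rsync batch for one pair of files.
# Assumes rsync is installed. make_batches and the "-undo" suffix are
# illustrative names, not part of rsync.
make_batches() {
    old=$1
    new=$2
    # Forward batch: turns "old" into "new" (same command as in the answer).
    rsync --only-write-batch="${old}-diff" "$new" "$old"
    # Reverse batch: turns "new" back into "old".
    rsync --only-write-batch="${old}-undo" "$old" "$new"
}
```

After make_batches abc abc2, running rsync --read-batch=abc-diff abc should apply the change, and rsync --read-batch=abc-undo abc should take abc back to its original contents. Both batches have to be written before the forward patch is applied, because the reverse batch needs the original abc as its source.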
Method 2
How to compute a binary diff of abc and abc2?
Using bsdiff/bspatch or xdelta and others.
$ bsdiff older newer patch.bin # patch.bin is created
[...]
$ bspatch older newer patch.bin # newer is created
However, these admonishments from the man pages are to be noted:
bsdiff uses memory equal to 17 times the size of oldfile, and requires an absolute minimum working set size of 8 times the size of oldfile.
bspatch uses memory equal to the size of oldfile plus the size of newfile, but can tolerate a very small working set without a dramatic loss of performance.
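To see what those factors mean for the files in the question, here is a rough back-of-the-envelope calculation. It simply multiplies the man page factors by the file size; 4096 MB is my stand-in for “4 GB”, and the variable names are illustrative:

```shell
# Rough memory estimate for bsdiff/bspatch from the man page factors above.
# Sizes are in MB; 4096 MB stands in for the 4 GB files in the question.
old_mb=4096
new_mb=4096

bsdiff_mem=$((17 * old_mb))        # memory bsdiff uses
bsdiff_min=$((8 * old_mb))         # absolute minimum working set for bsdiff
bspatch_mem=$((old_mb + new_mb))   # memory bspatch uses

echo "bsdiff:  ${bsdiff_mem} MB (working set at least ${bsdiff_min} MB)"
echo "bspatch: ${bspatch_mem} MB"
```

That is about 68 GB for bsdiff on a 4 GB file, so on most machines one of the other tools is probably a better fit for files of this size.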
Method 3
Have you tried just forcing diff to treat the files as text:
diff -ua abc abc2
As explained here.
-u output NUM (default 3) lines of unified context
-a treat all files as text
This should get you a patch. The downside of this is the ‘lines’ could be quite long and that could bloat the patch.
Method 4
Use xdelta; it was created exactly for this kind of use. Recent versions are based on VCDIFF (RFC 3284).
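For reference, a minimal sketch of the encode/decode round trip, assuming the xdelta3 command-line tool is installed. The two wrapper function names are made up for illustration; only the xdelta3 calls inside them are the tool's own:

```shell
# Sketch, assuming the xdelta3 command-line tool is installed.
# xdelta_make and xdelta_apply are illustrative wrapper names.

# Usage: xdelta_make OLD NEW PATCH    (writes PATCH)
xdelta_make() {
    xdelta3 -e -s "$1" "$2" "$3"      # -e: encode, -s: source (old) file
}

# Usage: xdelta_apply OLD PATCH OUT   (writes OUT, a copy of NEW)
xdelta_apply() {
    xdelta3 -d -s "$1" "$2" "$3"      # -d: decode, using the same source file
}
```

For the files in the question that would be xdelta_make abc abc2 abc.vcdiff locally, then xdelta_apply abc abc.vcdiff abc2 on the server after uploading only abc.vcdiff (abc.vcdiff being a name I picked).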
Method 5
Complements to other answers according to my tests:
With diff
I created two very similar 256 MB files abc and abc2. Then let’s create the diff file:
diff -ua abc abc2 > abc-abc2.diff
Now let’s try to recover abc2 thanks to the original abc file and abc-abc2.diff:
cp abc abc3
patch abc3 < abc-abc2.diff
or
cp abc abc3
patch abc3 -i abc-abc2.diff
or
patch abc -i abc-abc2.diff -o abc3
It works on Linux. I also tried on Windows (patch.exe and diff.exe are available too), but for an unknown reason it failed: the produced abc3 file is only 1KB instead of 256MB (I’ll update this answer later here).
With rsync
As detailed in the accepted answer, this works:
rsync --only-write-batch=abc-abc2-diff abc2 abc
cp abc abc3
rsync --read-batch=abc-abc2-diff abc3
With rdiff
As detailed in this answer, this is a solution too:
rdiff signature abc abc-signature
rdiff delta abc-signature abc2 abc-abc2-delta
rdiff patch abc abc-abc2-delta abc3
Tested also on Windows with rdiff.exe from here and it works.
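If you want to repeat tests like these, here is a small sketch for building a pair of similar files and checking that a rebuilt file matches. It assumes a Unix-like shell with head, dd and cmp; the 1 MiB size, the offsets and the helper name same are arbitrary choices of mine, not what I used above:

```shell
# Sketch: create two similar binary test files in a scratch directory and
# define a helper for comparing a rebuilt file with the expected one.
cd "$(mktemp -d)" || exit 1

size=$((1024 * 1024))                  # 1 MiB here; scale up as needed
head -c "$size" /dev/urandom > abc     # original file
cp abc abc2                            # modified copy

# Overwrite 4 KiB at offset 400 KiB (100 blocks of 4096 bytes) in abc2;
# conv=notrunc keeps the rest of the file in place.
head -c 4096 /dev/urandom | dd of=abc2 bs=4096 seek=100 conv=notrunc 2>/dev/null

# same A B: report whether two files are byte-for-byte identical.
same() {
    if cmp -s "$1" "$2"; then echo "identical"; else echo "different"; fi
}
```

After recreating abc3 with any of the methods above, same abc2 abc3 tells you whether the patch round trip was exact.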
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.