I have two identical folders, with the same structure and contents, like this:
folder_1
    hello.txt
    subfolder
        byebye.txt
folder_2
    hello.txt
    subfolder
        byebye.txt
If I compress them as tar.xz archives, I get two different archives with two different file sizes (the difference is just a few bytes, but they’re not identical).
$ cd folder_1 && tar -Jcf archive.tar.xz *
$ cd folder_2 && tar -Jcf archive.tar.xz *
I get:
folder_1/archive.tar.xz != folder_2/archive.tar.xz
and of course, if I run md5sum or sha1sum on them, I get two different hashes.
And that’s my problem… I need to check if a provided archive is identical to the one I have in my storage. I cannot use hashing nor just check file sizes.
Using zip instead of tar.xz works, as zip always produces identical archives from identical files.
Why is this happening? Is there a way to prevent it?
Answers:
Method 1
Ok, the explanation given by ddnomad is correct. It’s about the timestamp.
Here is the solution:
Add --mtime='1970-01-01 00:00:00' to the tar command:
tar --mtime='1970-01-01 00:00:00' -Jcf archive.tar.xz *
This forces the timestamp of every archived entry to a fixed value, which results in identical archives.
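To confirm that the fixed timestamp really gives byte-identical archives, the two results can be compared with cmp. The following is only a sketch: pack_fixed_mtime is a hypothetical helper name, and it uses GNU tar's -a (--auto-compress) in place of -J so that the suffix of the output name selects the compressor. That substitution is my own choice, not part of the answer above.

```shell
# Hypothetical helper: pack the contents of directory $1 into archive $2.
# $2 should be an absolute path outside $1, because the function changes
# directory before running tar and the * glob would otherwise pick it up.
pack_fixed_mtime() {
    # -a picks the compressor from the suffix of $2 (e.g. .tar.xz -> xz).
    # --mtime stores every entry with the same fixed timestamp.
    (cd "$1" && tar --mtime='1970-01-01 00:00:00' -acf "$2" *)
}
```

With both folders packed this way, for example to "$PWD/one.tar.xz" and "$PWD/two.tar.xz", running cmp one.tar.xz two.tar.xz should print nothing and exit with status 0 when the archives match. This assumes both folders are packed on the same machine by the same user, since ownership is not normalized here.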
Method 2
There are a number of reasons why two tarballs of the same directory tree might differ. The main ones are:
- Metadata such as ownership, timestamps, etc. may differ. To get a reproducible tar archive, you need to have the same ownership, permissions and timestamps. Make sure that you copied all the metadata (if you have identical file contents with differing metadata, cp -a --attributes-only may help). With GNU tar, there are a few options you can use to ignore certain attributes: --numeric-owner only stores numerical user and group IDs, not names. --owner and --group force files to be recorded under a certain user and group respectively (e.g. --owner=0 --group=0 to record all files as belonging to root). --mtime allows you to store all files with a particular timestamp instead of the real one.
- The order in which the files are stored may differ. Most filesystems don’t give any particular guarantee as to the order in which files are listed in a directory, and tar lists them as they come. (You can see the order with ls -U.) GNU tar 1.28 has a new option --sort=name. With older versions or other implementations, you can get a reproducible file order by building a sorted list of file names and passing it to tar:
  find . -print0 | LC_ALL=C sort -z | tar --no-recursion --null -Jcf ../archive.tar.xz -T -
You may be interested in the Debian wiki page on reproducible builds.
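Putting the options from both points together, a normalized archive could be produced roughly as follows. This is a sketch under assumptions, not a definitive recipe: repro_tar is a hypothetical helper name, it needs GNU tar 1.28 or newer for --sort=name, the trailing Z (UTC) on the timestamp and the pinned --format=gnu are my additions, and -a (--auto-compress) stands in for -J so the output suffix selects the compressor. File permissions are not normalized, and the compressed bytes may still vary between xz versions or settings.

```shell
# Hypothetical helper: write a normalized archive of directory $1 to file $2.
repro_tar() {
    # --sort=name      : store entries in a stable, sorted order
    # --mtime          : record one fixed timestamp (UTC) for every entry
    # --owner/--group  : record every entry as belonging to uid 0 / gid 0
    # --numeric-owner  : store numeric IDs only, no user or group names
    # --format=gnu     : pin the header format regardless of the tar build
    # -a               : choose the compressor from the suffix of $2
    tar --sort=name \
        --mtime='1970-01-01 00:00:00Z' \
        --owner=0 --group=0 --numeric-owner \
        --format=gnu \
        -acf "$2" -C "$1" .
}
```

For example, repro_tar folder_1 one.tar.xz followed by repro_tar folder_2 two.tar.xz should give two files that cmp reports as identical, as long as the file contents and permissions match.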
Method 3
Every file (a folder is also a file) has an embedded timestamp.
I presume you can’t create these two folder structures at exactly the same time, so the timestamps of their files differ.
As a result, archiving them and then hashing the archives gives you different outcomes, because the timestamp is part of what tar stores for each file.
So that’s the difference between seemingly identical file structures.
UPDATE: to check whether they have the same contents, I guess you actually have to compare the contents of the files themselves.
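One way to compare contents rather than archives is to compute a single digest over the sorted list of file hashes in each tree. This is a minimal sketch, assuming GNU find, sort, xargs and sha256sum are available; tree_digest is a hypothetical helper name. It covers relative file names and file contents only, so timestamps, ownership, permissions and empty directories are ignored.

```shell
# Hypothetical helper: print one digest summarizing the files under directory $1.
tree_digest() {
    # List regular files NUL-separated, sort them byte-wise for a stable
    # order, hash each file, then hash the resulting "hash  path" listing.
    (cd "$1" &&
        find . -type f -print0 |
        LC_ALL=C sort -z |
        xargs -0 -r sha256sum |
        sha256sum |
        cut -d' ' -f1)
}
```

To check a provided archive against a stored one, each could be extracted into its own empty directory (for example with tar -xf provided.tar.xz -C provided) and the two outputs of tree_digest compared as strings.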
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0