What’s the quickest way to find duplicated files?

I found this command for finding duplicated files, but it is quite long and confuses me.

For example, if I remove -printf "%s\n", nothing comes out. Why is that? Besides, why did they use xargs -I{} -n1?

Is there any easier way to find duplicated files?

[4a-o07-d1:root/798]# find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
0bee89b07a248e27c83fc3d5951213c1  ./test1.txt
0bee89b07a248e27c83fc3d5951213c1  ./test2.txt

Answers:


Method 1

You can make it shorter:

find . ! -empty -type f -exec md5sum {} + | sort | uniq -w32 -dD

This runs md5sum on the found files through find's -exec action, sorts the output, and then uses uniq to print only the files that share the same MD5 checksum, one per line.
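As a minimal sketch, the same pipeline can be wrapped in a function that takes the directory as an argument. The function name find_dupes is made up for illustration, and the sketch assumes GNU uniq (for -w and -D) and md5sum are available.

```shell
# Hypothetical wrapper around the Method 1 pipeline; the name find_dupes
# is not part of the original answer. Assumes GNU coreutils.
find_dupes() {
    dir=${1:-.}                        # default to the current directory
    find "$dir" ! -empty -type f -exec md5sum {} + |
        sort |                         # bring identical checksums together
        uniq -w32 -dD                  # keep every line whose first 32 chars repeat
}
```

Calling find_dupes /path/to/dup/directory prints one line per duplicated file, with the checksum first and the path second.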

Method 2

You can use fdupes. From man fdupes:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

You can call it like fdupes -r /path/to/dup/directory and it will print out a list of dupes.

Update

You can also give fslint a try. After setting up fslint, run cd /usr/share/fslint/fslint && ./fslint /path/to/directory

Method 3

In case you want to understand the original command, let’s go through it step by step.

find -not -empty -type f

Find all non-empty files in the current directory or any of its subdirectories.

   -printf "%s\n"

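Print the size of each file in bytes, one per line. If you drop these arguments, find prints paths instead, which breaks the subsequent steps: every path is unique, so uniq -d finds no repeated lines and outputs nothing.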

 | sort -rn

Sort numerically (-n), in reverse order (-r). Sorting in ascending order and comparing as strings not numbers should work just as well, though, so you may drop the -rn flags.

 | uniq -d

Look for duplicate consecutive rows and keep only those.
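To see why removing -printf makes the output disappear, here is a small sketch of the first three stages with and without it. The helper names dup_sizes and dup_paths are invented for this illustration, and GNU find is assumed (for -printf).

```shell
# Hypothetical helpers; the names are not from the original command.
# Assumes GNU find, which provides -printf.

# Sizes (in bytes) that are shared by at least two files under $1.
dup_sizes() {
    find "$1" -not -empty -type f -printf "%s\n" | sort -rn | uniq -d
}

# Same pipeline without -printf: the lines are now paths, and since
# no path occurs twice, uniq -d has nothing to report.
dup_paths() {
    find "$1" -not -empty -type f | sort -rn | uniq -d
}
```

With two 6-byte files in a directory, dup_sizes prints 6 for it, while dup_paths prints nothing at all.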

 | xargs -I{} -n1

For each line of input (i.e. each size that occurs more than once), execute the following command, but replace {} by the size. Execute the command once for each line of input, as opposed to passing multiple inputs to a single invocation. Since -I already makes xargs run the command once per input line, the -n1 is redundant here.

   find -type f -size {}c -print0

This is the command to run for each size: Find files in the current directory or any of its subdirectories which match that size, given in characters (c) or more precisely bytes. Print all the matching file names, separated by null bytes instead of newlines, so that file names which contain newlines are treated correctly.
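A quick way to see what xargs -I{} does is to put echo in front of the command, so the generated command lines are printed instead of executed. This is only a sketch: the function name show_per_size_commands and the sample sizes 6 and 1024 are made up.

```shell
# Hypothetical helper: print the find command that xargs would run for
# each size read from standard input, instead of running it.
show_per_size_commands() {
    xargs -I{} echo find -type f -size {}c -print0
}

# Sample input: two made-up sizes, one per line.
printf '%s\n' 6 1024 | show_per_size_commands
# prints:
#   find -type f -size 6c -print0
#   find -type f -size 1024c -print0
```

Each input line produces its own command line, which is why the original pipeline traverses the directory tree once per duplicated size.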

 | xargs -0 md5sum

For each of these null-separated names, compute the MD5 checksum of said file. This time we allow passing multiple files to a single invocation of md5sum.

 | sort

Sort by checksums, since uniq only considers consecutive lines.

 | uniq -w32 --all-repeated=separate

Find lines which agree in their first 32 bytes (the checksum; after that comes the file name). Print all members of such runs of duplicates, with distinct runs separated by blank lines.
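The last two stages can be tried in isolation on hand-written input. In this sketch the function name group_by_checksum is invented, the checksum lines are sample data rather than real output, and GNU uniq is assumed.

```shell
# Hypothetical helper covering the final "sort | uniq" stages.
# Assumes GNU uniq (for -w and --all-repeated).
group_by_checksum() {
    sort | uniq -w32 --all-repeated=separate
}

# Made-up md5sum-style lines: two groups of duplicates and one unique file.
printf '%s\n' \
    '1f3870be274f6c49b3e31a0c6728957f  ./b/copy.txt' \
    '0bee89b07a248e27c83fc3d5951213c1  ./test2.txt' \
    'd41d8cd98f00b204e9800998ecf8427e  ./unique.txt' \
    '1f3870be274f6c49b3e31a0c6728957f  ./a/orig.txt' \
    '0bee89b07a248e27c83fc3d5951213c1  ./test1.txt' |
    group_by_checksum
```

The unique file is dropped, and the two remaining groups come out sorted by checksum with one blank line between them.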

Compared to the simpler command from Method 1 (suggested by heemayl), this has the benefit that it will only checksum files which have another file of the same size. It pays for that with repeated find invocations, thus traversing the directory tree multiple times. For those reasons, this command is particularly well-suited for directories with few but big files, since in those cases avoiding a checksum call may be more important than avoiding repeated tree traversal.
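The trade-off can be softened by collecting sizes and paths in a single traversal and filtering on size before checksumming. The following is a rough sketch of that idea, not something from the answers above: the function name find_dupes_one_pass is made up, it assumes GNU find, xargs and uniq, and unlike the original command it assumes that no file name contains a newline.

```shell
# Hypothetical single-traversal variant (not from the original answers).
# Assumptions: GNU find/xargs/uniq, and no newlines in file names.
find_dupes_one_pass() {
    dir=${1:-.}
    # One traversal: emit "size<TAB>path" for every non-empty file.
    find "$dir" -not -empty -type f -printf '%s\t%p\n' |
        sort -n |
        # Keep only paths whose size is shared with at least one other file.
        awk -F'\t' '
            {
                path = substr($0, length($1) + 2)   # text after the first tab
                if ($1 == prev_size) {
                    if (held != "") { print held; held = "" }
                    print path
                } else {
                    prev_size = $1
                    held = path                     # first file of a new size
                }
            }' |
        xargs -r -d '\n' md5sum |                   # -r: skip md5sum if no input
        sort |
        uniq -w32 --all-repeated=separate
}
```

Files with a unique size never reach md5sum, and the directory tree is walked only once, at the cost of the weaker file name handling noted above.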


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.
