Find recursively all archive files of diverse archive formats and search them for file name patterns

Ideally, I would like to have a call like this:

$searchtool /path/to/search/ -contained-file-name "*vacation*jpg"

… so that this tool

  • does a recursive scan of the given path
  • picks up all files in supported archive formats, which should include at least the “most common” ones such as zip, rar, 7z, tar.bz, tar.gz …
  • scans the file list of each archive for the name pattern in question (here *vacation*jpg)

I’m aware of how to use find, tar, unzip and the like. I could combine these in a shell script, but I’m looking for a simple solution, such as a shell one-liner or a dedicated tool (hints about GUI tools are welcome, but my solution must be command-line based).

Answers:

Method 1

If you want something simpler than the AVFS solution, I wrote a Python script called arkfind to do it. You can just run

$ arkfind /path/to/search/ -g "*vacation*jpg"

It’ll do this recursively, so you can look at archives inside archives to an arbitrary depth.

Method 2

(Adapted from How do I recursively grep through compressed archives?)

Install AVFS, a filesystem that provides transparent access inside archives. First run this command once to set up a view of your machine’s filesystem in which you can access archives as if they were directories:

mountavfs

After this, if /path/to/archive.zip is a recognized archive, then ~/.avfs/path/to/archive.zip# is a directory that appears to contain the contents of the archive.

find ~/.avfs"$PWD" \( -name '*.7z' -o -name '*.zip' -o -name '*.tar.gz' -o -name '*.tgz' \) \
     -exec sh -c '
                  find "$0#" -name "$1"
                 ' {} '*vacation*.jpg' \;

Explanations:

  • Mount the AVFS filesystem.
  • Look for archive files in ~/.avfs$PWD, which is the AVFS view of the current directory.
  • For each archive, execute the specified shell snippet (with $0 = archive name and $1 = pattern to search).
  • $0# is the directory view of the archive $0.
  • If you add an -exec to the inner find, write {\} rather than {} there, in case the outer find substitutes {} inside -exec ; arguments (some do it, some don’t).

Or in zsh ≥4.3:

mountavfs
ls -l ~/.avfs$PWD/**/*.(7z|tgz|tar.gz|zip)(e\''
     reply=($REPLY\#/**/*vacation*.jpg(.N))
'\')

Explanations:

  • ~/.avfs$PWD/**/*.(7z|tgz|tar.gz|zip) matches archives in the AVFS view of the current directory and its subdirectories.
  • PATTERN(e\''CODE'\') applies CODE to each match of PATTERN. The name of the matched file is in $REPLY. Setting the reply array turns the match into a list of names.
  • $REPLY\# is the directory view of the archive.
  • $REPLY#/**/*vacation*.jpg matches *vacation*.jpg files in the archive.
  • The N glob qualifier makes the pattern expand to an empty list if there is no match.

Method 3

My usual solution:

find -iname '*.zip' -exec unzip -l {} \; 2>/dev/null | grep '\.zip\|DESIRED_FILE_TO_SEARCH'

Example:

find -iname '*.zip' -exec unzip -l {} \; 2>/dev/null | grep '\.zip\|characterize.txt'

Results look like:

foozip1.zip:
foozip2.zip:
foozip3.zip:
    DESIRED_FILE_TO_SEARCH
foozip4.zip:
...

If you want only the zip file with hits on it:

find -iname '*.zip' -exec unzip -l {} \; 2>/dev/null | grep '\.zip\|FILENAME' | grep -B1 'FILENAME'

FILENAME here is used twice, so you can use a variable.

You can also give find a starting directory (find PATH/TO/SEARCH -iname …) instead of searching from the current one.
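Using a variable for the name could look like the following. This is only a minimal sketch: zip_find is a hypothetical function name, it assumes unzip and a grep that supports -B (GNU or BSD), and it uses grep -F so that the name is taken literally rather than as a regular expression.

```shell
# zip_find DIR NAME: list the zip archives under DIR whose file listing
# mentions NAME, each followed by the matching listing line(s).
zip_find() {
  dir=$1 name=$2
  # Keep only the "Archive: ..." lines and the lines mentioning NAME,
  # then print each hit together with the line just before it.
  find "$dir" -iname '*.zip' -exec unzip -l {} \; 2>/dev/null |
    grep -F -e '.zip' -e "$name" |
    grep -F -B1 -e "$name"
}
```

Called as zip_find PATH/TO/SEARCH characterize.txt, it prints the archive line followed by the listing line for each hit.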

Method 4

Another solution that works is zgrep

zgrep -r filename *.zip

Method 5

IMHO user-friendliness should be a thing in bash as well:

 while IFS= read -r zip_file ; do echo "$zip_file" ; unzip -l "$zip_file" |
 grep -i --color=always "$to_srch";
 done < <(find . \( -name '*.7z' -o -name '*.zip' \)) |
 less -R

and for tar ( this one is untested … )

 while IFS= read -r tar_file ; do echo "$tar_file" ; tar -tf  "$tar_file" |
 grep -i --color=always "$to_srch";
 done < <(find . \( -name '*.tar.gz' -o -name '*.tar' \)) |
 less -R
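Since the tar variant is untested, here is a small sketch of the same idea that only prints archives with hits and prefixes each hit with the archive path. tar_find is a hypothetical function name; the sketch assumes a tar that auto-detects compression when listing (GNU tar and bsdtar do) and matches the search string literally and case-insensitively.

```shell
# tar_find DIR STRING: print "archive: member" for every member of a tar
# archive under DIR whose path contains STRING (case-insensitive).
tar_find() {
  find "$1" -type f \( -name '*.tar' -o -name '*.tar.gz' -o -name '*.tgz' \) -exec sh -c '
    pattern=$1; shift
    for archive do
      # List the members, keep the matching ones, prefix the archive path.
      tar -tf "$archive" 2>/dev/null | grep -i -F -e "$pattern" |
        while IFS= read -r member; do
          printf "%s: %s\n" "$archive" "$member"
        done
    done' sh "$2" {} +
}
```

For example, tar_find . vacation lists every matching member under the current directory, one per line.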

Method 6

libarchive’s bsdtar can handle most of those file formats, so you could do:

find . \( -name '*.zip' -o     \
          -name '*.tar' -o     \
          -name '*.tar.gz' -o  \
          -name '*.tar.bz2' -o \
          -name '*.tar.xz' -o  \
          -name '*.tgz' -o     \
          -name '*.tbz2' -o    \
          -name '*.7z' -o      \
          -name '*.iso' -o     \
          -name '*.cpio' -o    \
          -name '*.a' -o       \
          -name '*.ar' \)      \
       -type f                 \
       -exec bsdtar tf {} '*vacation*jpg' \; 2> /dev/null

Which you can simplify (and make case-insensitive) using GNU find:

find . -regextype egrep \
       -iregex '.*\.(zip|7z|iso|cpio|ar?|tar(|\.[gx]z|\.bz2)|tgz|tbz2)' \
       -type f \
       -exec bsdtar tf {} '*vacation*jpg' \; 2> /dev/null

That doesn’t print the path of the archive where those *vacation*jpg files are found though. To print that name you could replace the last line with:

-exec sh -ac '
   for ARCHIVE do
     bsdtar tf "$ARCHIVE" "*vacation*jpg" |
       awk '\''{print ENVIRON["ARCHIVE"] ": " $0}'\''
   done' sh {} + 2> /dev/null

which gives an output like:

./a.zip: foo/blah_vacation.jpg
./a.zip: bar/blih_vacation.jpg
./a.tar.gz: foo/blah_vacation.jpg
./a.tar.gz: bar/blih_vacation.jpg

Or with zsh:

setopt extendedglob # best in ~/.zshrc
for archive (**/*.(#i)(zip|7z|iso|cpio|a|ar|tar(|.gz|.xz|.bz2)|tgz|tbz2)(.ND)) {
  matches=("${(f@)$(bsdtar tf $archive '*vacation*jpg' 2> /dev/null)}")
  (($#matches)) && printf '%s\n' "$archive: "$^matches
}

Note that there are a number of other file formats that are just zip or tgz files in disguise, like .jar or .docx files. You can add those to your find/zsh search pattern; bsdtar doesn’t care about the extension (that is, it doesn’t rely on the extension to determine the type of the file).

Note that *vacation*.jpg above is matched on the full archive member path, not just the file name, so it would match on vacation.jpg but also on vacation/2014/file.jpg.
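If matching on the full member path is too broad, one simple alternative is to list every member and filter on the last path component in the shell. The following is a minimal sketch, assuming member names contain no newlines; match_basename is a hypothetical helper name, not something bsdtar provides.

```shell
# match_basename PATTERN: read archive member paths on stdin, one per line,
# and print only those whose last path component matches the glob PATTERN.
match_basename() {
  pattern=$1
  while IFS= read -r path; do
    # ${path##*/} removes everything up to the last slash (the basename).
    # The unquoted $pattern in the case arm is treated as a glob.
    case ${path##*/} in
      $pattern) printf '%s\n' "$path" ;;
    esac
  done
}
```

It could be used as bsdtar tf "$archive" | match_basename '*vacation*jpg', which keeps foo/blah_vacation.jpg but drops vacation/2014/file.jpg.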

To match on the file name only, one trick is to use extract mode with -s (substitution), which takes regexps and accepts a p flag to print the names of the matching files, while making sure that no file is actually extracted, like:

bsdtar -'s|.*vacation[^/]*$||p' -'s|.*||' -xf "$archive"

Note that it writes the list to stderr and appends >> to every line. In any case, bsdtar, like most tar implementations, may mangle file names on display if they contain characters such as newline or backslash (rendered as \n or \\).


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
