Is it possible to use the find command to find all the “non-binary” files in a directory? Here’s the problem I’m trying to solve.
I’ve received an archive of files from a Windows user. This archive contains source code and image files. Our build system doesn’t play nice with files that have Windows line endings. I have a command line program (flip -u) that will flip line endings between *nix and Windows. So, I’d like to do something like this:
find . -type f | xargs flip -u
However, if this command is run against an image file, or other binary media file, it will corrupt the file. I realize I could build a list of file extensions and filter with that, but I’d rather have something that’s not reliant on me keeping that list up to date.
So, is there a way to find all the non-binary files in a directory tree? Or is there an alternate solution I should consider?
Answers:
Method 1
I’d use file and pipe the output into grep or awk to find text files, then extract just the filename portion of file’s output and pipe that into xargs.
something like:
file * | awk -F: '/ASCII text/ {print $1}' | xargs -d'\n' -r flip -u
Note that the awk pattern matches ‘ASCII text’ rather than just any ‘text’, because you probably don’t want to mess with Rich Text documents or Unicode text files etc.
You can also use find (or whatever) to generate a list of files to examine with file:
find /path/to/files -type f -exec file {} + |
awk -F: '/ASCII text/ {print $1}' | xargs -d'\n' -r flip -u
The -d'\n' argument to xargs makes xargs treat each input line as a separate argument, thus catering for filenames with spaces and other problematic characters. That is, it’s an alternative to xargs -0 for when the input source doesn’t or can’t generate NUL-separated output (such as that produced by find’s -print0 option). According to the changelog, xargs got the -d/--delimiter option in Sep 2005, so it should be in any non-ancient Linux distro (I wasn’t sure, which is why I checked; I just vaguely remembered it was a “recent” addition).
Note that a linefeed is a valid character in filenames, so this will break if any filenames have linefeeds in them. For typical unix users, this is pathologically insane, but isn’t unheard of if the files originated on Mac or Windows machines.
Also note that file is not perfect. It’s very good at detecting the type of data in a file but can occasionally get confused.
I have used numerous variations of this method many times in the past with success.
Method 2
The accepted answer didn’t find all of them for me. Here is an example using grep’s -I to ignore binaries, and ignoring all hidden files…
find . -type f -not -path '*/.*' -exec grep -Il '.' {} \; | xargs -L 1 echo
Here it is in use in a practical application: dos2unix
https://unix.stackexchange.com/a/365679/112190
Method 3
No. There is nothing special about a binary or non-binary file. You can use heuristics like ‘contains only characters in 0x01–0x7F’, but that’ll call text files with non-ASCII characters binary files, and unlucky binary files text files.
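To make the weakness of that heuristic concrete, here is a minimal sketch of the ‘only bytes in 0x01–0x7F’ test. The function name is_ascii_only is made up for this illustration and is not part of any tool mentioned here.

```shell
# Hypothetical helper: succeeds if every byte of "$1" is in 0x01-0x7F.
# Note that an empty file also passes, since nothing is left over.
is_ascii_only() {
    # Delete all bytes in 0x01-0x7F; whatever remains is NUL or >= 0x80.
    leftover=$(LC_ALL=C tr -d '\001-\177' < "$1" | wc -c)
    [ "$leftover" -eq 0 ]
}
```

A UTF-8 text file containing a single accented letter fails this test, which is exactly the false “binary” verdict described above.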
Now, once you’ve ignored that…
zip files
If it’s coming from your Windows user as a zip file, the zip format supports marking files as either binary or text in the archive itself. You can use unzip’s -a option to pay attention to this and convert. Of course, see the first paragraph for why this may not be a good idea (the zip program may have guessed wrong when it made the archive).
zipinfo will tell you which files are binary (b) or text (t) in its zipfile listing.
other files
The file command will look at a file and try to identify it. In particular, you’ll probably find its -i (output MIME type) option useful; only convert files with type text/*
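A rough sketch of that MIME-type filter follows. It uses the long option --mime-type (which prints only the type, without the charset that -i adds); the function names is_text_mime and list_text_files are invented for this example.

```shell
# Hypothetical helper: succeeds if file(1) reports a text/* MIME type for "$1".
is_text_mime() {
    case "$(file -b --mime-type -- "$1")" in
        text/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Print the arguments that are regular files with a text/* type, one per line.
list_text_files() {
    for f do
        if [ -f "$f" ] && is_text_mime "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_text_files ./* first and inspecting the list is a cheap way to check what would be converted before handing anything to flip -u. The output is newline-separated, so it is only meant for eyeballing, not for piping names that may contain newlines.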
Method 4
A general solution to only process non-binary files in bash using file -b --mime-encoding:
while IFS= read -d '' -r file; do
    [[ "$(file -b --mime-encoding "$file")" = binary ]] &&
        { echo "Skipping $file."; continue; }
    echo "Processing $file."
    # ...
done < <(find . -type f -print0)
I contacted the author of the file utility and he added a nifty -00 parameter in version 5.26 (released 2016-04-16; it is in e.g. current Arch and Ubuntu 16.10) which prints file\0result\0 for multiple files fed to it at once. This way you can do e.g.:
find . -type f -exec file -00 --mime-encoding {} + |
    awk 'BEGIN{ORS=RS="\0"}{if(NR%2)f=$0;else if(!/binary/)print f}' | …
(The awk part keeps only the files that are not binary: odd-numbered records are filenames, even-numbered records are results. RS is the input record separator and ORS is the output record separator; both are set to NUL.)
Can be also used in a loop of course:
while IFS= read -d '' -r file; do
    echo "Processing $file."
    # ...
done < <(find . -type f -exec file -00 --mime-encoding {} + |
    awk 'BEGIN{ORS=RS="\0"}{if(NR%2)f=$0;else if(!/binary/)print f}')
Based on this and the previous method, I created a little bash script for filtering out binary files. It uses the -00 parameter of file on newer versions and falls back to the previous method on older versions:
#!/bin/bash
# Expects files as arguments and returns the ones that do
# not appear to be binary files as a zero-separated list.
#
# USAGE:
# filter_binary_files.sh [FILES...]
#
# EXAMPLE:
# find . -type f -mtime +5 -exec ./filter_binary_files.sh {} + | xargs -0 ...
#
[[ $# -eq 0 ]] && exit
if [[ "$(file -v)" =~ file-([1-9][0-9]|[6-9]|5\.([3-9][0-9]|2[6-9])) ]]; then
    file -00 --mime-encoding -- "$@" |
        awk 'BEGIN{ORS=RS="\0"}{if(NR%2)f=$0;else if(!/binary/)print f}'
else
    for f do
        [[ "$(file -b --mime-encoding -- "$f")" != binary ]] &&
            printf '%s\0' "$f"
    done
fi
Or here a more POSIX-y one, but it requires support for sort -V:
#!/bin/sh
# Expects files as arguments and returns the ones that do
# not appear to be binary files as a zero-separated list.
#
# USAGE:
# filter_binary_files.sh [FILES...]
#
# EXAMPLE:
# find . -type f -mtime +5 -exec ./filter_binary_files.sh {} + | xargs -0 ...
#
[ $# -eq 0 ] && exit
# The installed version is >= 5.26 if file-5.26 sorts first.
if [ "$(printf '%s\n' 'file-5.26' "$(file -v | head -n 1)" | sort -V | head -n 1)" = \
    'file-5.26' ]; then
    file -00 --mime-encoding -- "$@" |
        awk 'BEGIN{ORS=RS="\0"}{if(NR%2)f=$0;else if(!/binary/)print f}'
else
    for f do
        [ "$(file -b --mime-encoding -- "$f")" != binary ] &&
            printf '%s\0' "$f"
    done
fi
Method 5
find . -type f -exec grep -I -q . {} \; -print
This will find all regular files (-type f) in the current directory (or below) that grep thinks are non-empty and non-binary.
It uses grep -I to distinguish between binary and non-binary files. The -I flag makes grep treat a binary file as if it contained no match, so grep exits with a non-zero status for it. A “binary” file is, according to GNU grep, a file that contains NUL bytes or bytes that are not valid in the current locale’s encoding.
The -q option to grep will cause it to quit with a zero exit status if the given pattern is found, without emitting any data. The pattern that we use is a single dot, which will match any character.
If the file is found to be non-binary and if it contains at least one character, the name of the file is printed.
If you feel brave, you can plug your flip -u into it as well:
find . -type f -exec grep -I -q . {} \; -print -exec flip -u {} \;
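The exit-status behaviour that find relies on here can be checked on a single file. This is a small sketch assuming GNU grep; the function name is_nonempty_text is made up for the illustration.

```shell
# Hypothetical helper wrapping the same test that find runs per file.
# Exit status 0: grep found at least one character and did not judge
# the file to be binary. Non-zero: the file is empty or binary.
is_nonempty_text() {
    grep -I -q . -- "$1"
}
```

For example, is_nonempty_text README && echo "would be converted" shows the decision for one file without touching it.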
Method 6
Cas’s answer is good, but it assumes sane filenames; in particular it is assumed that filenames will not contain newlines.
There’s no good reason to make this assumption here, since it is quite simple (and actually cleaner in my opinion) to handle that case correctly as well:
find . -type f -exec sh -c 'file "$1" | grep -q "ASCII text"' sh {} \; -exec flip -u {} \;
The find command only makes use of POSIX-specified features. Using -exec to run arbitrary commands as boolean tests is simple, robust (handles odd filenames correctly), and more portable than -print0.
In fact, all parts of the command are specified by POSIX except for flip.
Note that file doesn’t guarantee accuracy of the results it returns. However, in practice grepping for “ASCII text” in its output is quite reliable.
(It might miss some text files perhaps, but is very very unlikely to incorrectly identify a binary file as “ASCII text” and mangle it—so we are erring on the side of caution.)
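Since the -exec test is just a boolean, the same expression can be used as a dry run by swapping the second -exec for -print. The sketch below does that; the function name list_ascii_text is hypothetical, and file -b is used (instead of plain file) so that a filename containing “ASCII text” cannot cause a false match.

```shell
# Hypothetical dry-run helper: list the files under directory "$1" that
# file(1) describes as "ASCII text", without modifying anything.
list_ascii_text() {
    find "$1" -type f \
        -exec sh -c 'file -b -- "$1" | grep -q "ASCII text"' sh {} \; \
        -print
}
```

Calling list_ascii_text . before the real run shows which files flip -u would be applied to.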
Method 7
Try this :
find . -type f -print0 | xargs -0 -r grep -Z -L -U '[^ -~]' | xargs -0 -r flip -u
Where the argument of grep '[^ -~]' is '[^<tab><space>-~]'.
If you type it on a shell command line, type Ctrl+V before Tab.
In an editor, there should be no problem.
'[^<tab><space>-~]' will match any character which is not ASCII text (carriage returns are ignored by grep).
-L will print only the names of files which do not match.
-Z will output filenames separated with a null character (for xargs -0).
Method 8
Alternate solution:
The dos2unix command will convert line endings from Windows CRLF to Unix LF, and automatically skip binary files. I apply it recursively using:
find . -type f -exec dos2unix {} \;
Method 9
sudo find / \( -type f -and -path '*/git/*' -iname 'README' \) -exec grep -liI '100644\|100755' {} \; -exec flip -u {} \;
i. \( -type f -and -path '*/git/*' -iname 'README' \): searches for files named README within a path containing a directory named git. This is useful if you know a specific folder and filename to search for.
ii. -exec runs a command on the file name generated by find.
iii. \; indicates the end of the command.
iv. {} is replaced by the file or folder name found by the preceding find search.
v. Multiple commands can be run in sequence by appending -exec "command" \; such as -exec flip -u {} \;
vi. grep
1. -l lists the name of the file
2. -I searches only non-binary files
3. -q quiet output
4. '100644\|100755' searches for either 100644 or 100755 within the file found. If found, find then runs flip -u. \| is the or operator for grep.
you can clone this test directory and try it out: https://github.com/alphaCTzo7G/stackexchange/tree/master/linux/findSolution204092017
more detailed answer here: https://github.com/alphaCTzo7G/stackexchange/blob/master/linux/findSolution204092017/README.md
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.