Standard Unix utilities like grep and diff use some heuristic to classify files as “text” or “binary”. (E.g. grep’s output may include lines like Binary file frobozz matches.)
Is there a convenient test one can apply in a zsh script to perform a similar “text/binary” classification? (Other than something like grep '' somefile | grep -q Binary.)
(I realize that any such test would necessarily be heuristic, and therefore imperfect.)
Answers:
Method 1
If you ask file for just the MIME type, you’ll get many different ones, such as text/x-shellscript and application/x-executable, but I imagine that if you check only the part before the slash for “text” you should get good results. E.g. (-b suppresses the filename in the output):
file -b --mime-type filename | sed 's|/.*||'
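A minimal sketch of how that check could be wrapped in a function for use in a script. The name is_text_mime is made up for illustration, and the result is only as good as file's own guess:

```shell
# is_text_mime FILE: succeed if file(1) reports a text/* MIME type.
# The function name is hypothetical; only "file -b --mime-type" comes
# from the answer above.
is_text_mime() {
    case $(file -b --mime-type -- "$1") in
        text/*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

It could then be used as: if is_text_mime somefile; then echo text; else echo binary; fi. Note that some textual formats are reported under other top-level types (for example application/json or image/svg+xml), so this test would treat them as non-text.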
Method 2
Another approach would be to use isutf8 from the moreutils collection.
It exits with 0 if the file is valid UTF-8 or ASCII, or short circuits, prints an error message (silence with -q) and exits with 1 otherwise.
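As a rough illustration, assuming isutf8 from moreutils is installed, its exit status can drive a loop directly. The function name list_non_utf8 is hypothetical:

```shell
# list_non_utf8 FILE...: print the names of the files that isutf8 rejects.
# Assumes isutf8 (moreutils) is on PATH; the function name is made up.
list_non_utf8() {
    for f in "$@"; do
        # -q silences the diagnostic; only the exit status is used.
        isutf8 -q "$f" || printf '%s\n' "$f"
    done
}
```

Keep in mind that this classifies valid text in other encodings (Latin-1, UTF-16, and so on) as non-text.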
Method 3
If you like the heuristic used by GNU grep, you could use it:
isbinary() {
LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" | grep -q '^Binary'
}
It searches for NUL bytes in the first buffer read from the file (a few kilobytes for a regular file, but could be a lot less for a pipe or socket or some devices like /dev/random). In UTF-8 locales, it also flags byte sequences that don’t form valid UTF-8 characters. Because LC_ALL overrides LC_MESSAGES, it assumes LC_ALL is not set to a locale whose language is not English.
The ${1-$REPLY} form allows you to use it as a zsh glob qualifier:
ls -ld -- *(.+isbinary)
would list the binary files.
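The NUL-byte half of that heuristic can also be approximated without grep. This is only a sketch: the name has_nul is hypothetical, the 32768-byte limit is an arbitrary stand-in for grep's buffer size, and the invalid-UTF-8 check is not reproduced:

```shell
# has_nul FILE: succeed if a NUL byte appears in the first 32 KiB of FILE.
# 32768 is an assumed value, not grep's actual buffer size.
has_nul() {
    # tr -cd '\000' deletes everything except NUL bytes; wc -c counts them.
    [ "$(head -c 32768 < "$1" | LC_ALL=C tr -cd '\000' | wc -c)" -gt 0 ]
}
```

Since it does not depend on grep's message text, it is not affected by the locale.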
Method 4
You could try determining whether iconv can read the file. This is slower than file (which reads only a few bytes from the beginning), but will give you more reliable results:
ENCODING=utf-8
if iconv --from-code="$ENCODING" --to-code="$ENCODING" your_file.ext > /dev/null 2>&1; then
echo text
else
echo binary
fi
This makes iconv basically a no-op, but if it encounters invalid data (invalid UTF-8 in this example), it will barf and exit.
Method 5
You can write a script that calls file, and use a case-statement to check for the cases you are interested in.
For example
#!/bin/sh
case $(file "$1") in
(*script*|* text|* text *)
echo text
;;
(*)
echo binary
;;
esac
though of course there may be many special cases which are of interest. Just checking strings on a copy of libmagic, I see about 200 cases, e.g.,
Konqueror cookie text
Korn shell script text executable
LaTeX 2e document text
LaTeX document text
Linux Software Map entry text
Linux Software Map entry text (new format)
Linux kernel symbol map text
Lisp/Scheme program text
Lua script text executable
LyX document text
M3U playlist text
M4 macro processor script text
Some use the string “text” as part of a different type, e.g.,
SoftQuad troff Context intermediate
SoftQuad troff Context intermediate for AT&T 495 laser printer
SoftQuad troff Context intermediate for HP LaserJet
Likewise, “script” could be part of a word, but I see no problems in this case. Still, a script should check for "text" as a word, not a substring.
As a reminder, file output does not use a precise description which would always have “script” or “text”. Special cases are something to consider. A followup commented that the --mime-type works while this approach would not, for .svg files. However, in a test I see these results for svg-files:
$ ls -l *.svg
-r--r--r-- 1 tom users 6679 Jul 26 2012 pumpkin_48x48.svg
-r--r--r-- 1 tom users 17372 Jul 30 2012 sink_48x48.svg
-r--r--r-- 1 tom users 5929 Jul 25 2012 vile_48x48.svg
-r--r--r-- 1 tom users 3553 Jul 28 2012 vile-mini.svg
$ file *.svg
pumpkin_48x48.svg: SVG Scalable Vector Graphics image
sink_48x48.svg: SVG Scalable Vector Graphics image
vile-mini.svg: SVG Scalable Vector Graphics image
vile_48x48.svg: SVG Scalable Vector Graphics image
$ file --mime-type *.svg
pumpkin_48x48.svg: image/svg+xml
sink_48x48.svg: image/svg+xml
vile-mini.svg: image/svg+xml
vile_48x48.svg: image/svg+xml
which I selected after seeing a thousand files show only 6 with “text” in the mime-type output. Arguably, matching the “xml” on the end of the mime-type output could be more useful, say, than matching “SVG”, but using a script to do that takes you back to the suggestion made here.
The output of file requires some tuning in either scenario, and is not 100% reliable (it is confused by several of my Perl scripts, calling them “data”).
There is more than one implementation of file. The one most commonly used does its work in libmagic, which can be used from different programs (perhaps not directly from zsh, though python can).
According to the “File test comparison table for shell, Perl, Ruby, and Python”, Perl has a -T file test operator that provides this information, but the table lists no comparable feature for zsh.
Method 6
file has an option --mime-encoding that attempts to detect the encoding of a file.
$ file --mime-encoding Documents/poster2.pdf
Documents/poster2.pdf: binary
$ file --mime-encoding projects/linux/history-torvalds/Makefile
projects/linux/history-torvalds/Makefile: us-ascii
$ file --mime-encoding graphe.tex
graphe.tex: us-ascii
$ file --mime-encoding software.tex
software.tex: utf-8
You can use file --mime-encoding | grep binary to detect if a file is a binary file. It works reliably although it can get confused by a single invalid character in a long text file.
For example, I alias cat to the following shell script to avoid ruining my terminal by inadvertently opening a binary file:
#! /bin/sh -
[ ! -t 1 ] && exec /bin/cat "$@"
for i
do
if file --mime-encoding -- "$i" | grep -q binary
then
hexdump -C -- "$i"
else
/bin/cat -- "$i"
fi
done
Method 7
Categories are arbitrary. Before answering how to make a classification, you need a (strict) definition. In order to have a definition, you need a purpose.
So, what do you want to do with that classification?
- If you want to select ASCII or binary mode in FTP, it’s important not to transfer a binary file as ASCII (or it will be corrupted). So you should test whether the file is plain text, HTML, RTF, or similar. When in doubt, select binary. You may also want to test that the file contains only a subset of bytes, such as 0x0A, 0x0D, and 0x20-0x7F.
- If you want to transfer the file over some protocol (POP3, SMTP), you need to test whether to encode it in base64 or send it as plain text. In this case, you should test whether it contains unsupported characters.
- Any other case… may have any other definition.
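The byte-subset test mentioned in the FTP case can be sketched with tr. The function name only_plain_bytes is hypothetical, and the allowed set is exactly the one named above (0x0A, 0x0D, 0x20-0x7F), so a tab would be rejected; adjust the set to whatever your purpose requires:

```shell
# only_plain_bytes FILE: succeed if FILE contains nothing but LF (0x0A),
# CR (0x0D) and bytes in the range 0x20-0x7F. Hypothetical helper name.
only_plain_bytes() {
    # Delete every allowed byte (given in octal) and count what is left.
    [ "$(LC_ALL=C tr -d '\012\015\040-\177' < "$1" | wc -c)" -eq 0 ]
}
```

Unlike the file-based methods, this reads the whole file, so the answer does not depend on where the first odd byte appears.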
Method 8
perl -e'chomp(my$f=<>);print "binary$/" if -B $f;print "text$/" if -T _'
will do it. See documentation for -B and -T (search in that page for the string The -T and -B switches work as follows).
Method 9
I contributed to https://github.com/audreyr/binaryornot
It does not have a command line wrapper (yet) but this is a simple Python library easy enough to call even from the CLI.
It uses a fairly efficient heuristic to determine if a file is text or binary.
Method 10
I know this answer is a bit old, but I think my friend taught me a great “hack” to do this.
You use the diff command and check your file against a test text file:
$ diff filetocheck testfile.txt
Now if filetocheck is a binary file, the output would be:
Binary files filetocheck and testfile.txt differ
This way you could leverage the diff command and e.g. write a function which does the check in a script.
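One possible shape for such a function, as a sketch only: isbinary_diff is a made-up name, the reference text file is created on the fly with mktemp, and the match relies on diff printing its “Binary files … differ” message in English:

```shell
# isbinary_diff FILE: succeed if diff reports FILE as binary.
# Hypothetical helper; assumes mktemp is available.
isbinary_diff() {
    ref=$(mktemp) || return 2
    printf 'reference text\n' > "$ref"
    # LC_ALL=C keeps the message in English so the pattern can match.
    # Ordinary diff output lines start with "<", ">" or a digit, so they
    # cannot be mistaken for the "Binary files" message.
    LC_ALL=C diff -- "$1" "$ref" | grep -q '^Binary files'
    rc=$?
    rm -f -- "$ref"
    return "$rc"
}
```

The variable is called rc rather than status because status is a read-only special parameter in zsh.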
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.