Is there a convenient way to classify files as "binary" or "text"?

Standard Unix utilities like grep and diff use some heuristic to classify files as “text” or “binary”. (E.g. grep's output may include lines like Binary file frobozz matches.)

Is there a convenient test one can apply in a zsh script to perform a similar “text/binary” classification? (Other than something like grep '' somefile | grep -q Binary.)

(I realize that any such test would necessarily be heuristic, and therefore imperfect.)

Answers:

Method 1

If you ask file for just the MIME type, you’ll get many different ones, like text/x-shellscript and application/x-executable, but I imagine that checking only the “text” part should give good results. E.g. (-b suppresses the filename in the output):

file -b --mime-type filename | sed 's|/.*||'
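As a minimal sketch, the same check can be wrapped in a function that returns an exit status instead of printing the type. The name is_text_mime is made up for illustration, and the sketch assumes a file implementation that supports -b and --mime-type. Formats that file reports under another top-level type (such as image/svg+xml) are treated as non-text here.

```shell
# is_text_mime is a hypothetical helper name, not part of file(1).
# Succeeds when file(1) reports a MIME type that starts with "text/".
is_text_mime() {
    case $(file -b --mime-type -- "$1") in
        text/*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

It could then be used as: is_text_mime somefile && echo text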

Method 2

Another approach would be to use isutf8 from the moreutils collection.

It exits with 0 if the file is valid UTF-8 or ASCII, or short circuits, prints an error message (silence with -q) and exits with 1 otherwise.

Method 3

If you like the heuristic used by GNU grep, you could use it:

isbinary() {
  LC_MESSAGES=C grep -Hm1 '^' < "${1-$REPLY}" | grep -q '^Binary'
}

It searches for NUL bytes in the first buffer read from the file (a few kilobytes for a regular file, but possibly a lot less for a pipe, a socket, or some devices like /dev/random). In UTF-8 locales, it also flags byte sequences that don’t form valid UTF-8 characters. Because LC_ALL overrides LC_MESSAGES, this assumes LC_ALL is not set to a locale whose language is not English.

The ${1-$REPLY} form allows you to use it as a zsh glob qualifier:

ls -ld -- *(.+isbinary)

would list the binary files.
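If parsing grep's localized message is a concern, the NUL-byte part of that heuristic can be approximated directly. This is only a sketch: has_nul is a made-up name, the 32768-byte sample size is an arbitrary assumption (not the buffer size grep uses), head -c is assumed to be available, and the invalid-UTF-8 check is not reproduced.

```shell
# has_nul is a hypothetical helper name.
# Succeeds when the first 32768 bytes of the file contain a NUL byte.
# The sample size is an arbitrary choice, not grep's real buffer size.
has_nul() {
    [ -r "$1" ] || return 2
    # Count the sampled bytes, then count them again with NULs removed.
    total=$(( $(head -c 32768 -- "$1" | wc -c) ))
    kept=$(( $(head -c 32768 -- "$1" | LC_ALL=C tr -d '\000' | wc -c) ))
    [ "$total" -ne "$kept" ]
}
```

Because it also falls back to $REPLY only if you add that yourself, this version takes the filename as its first argument only.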

Method 4

You could try determining whether iconv can read the file. This is slower than file (which only reads a small part at the beginning of the file), but it will give you more reliable results:

ENCODING=utf-8
if iconv --from-code="$ENCODING" --to-code="$ENCODING" your_file.ext > /dev/null 2>&1; then
    echo text
else
    echo binary
fi

This makes iconv basically a no-op, but if it encounters invalid data (invalid UTF-8 in this example), it will barf and exit.
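A reusable form of the same idea might look like the sketch below. The name is_valid_encoding is hypothetical, and the short options -f and -t are used in place of --from-code and --to-code. Note that a NUL byte is valid UTF-8, so a file that contains NULs but is otherwise well-formed would still pass this test.

```shell
# is_valid_encoding is a hypothetical helper name.
# $1 = file to check, $2 = encoding name known to iconv (default UTF-8).
# Succeeds when iconv can decode the whole file in that encoding.
is_valid_encoding() {
    iconv -f "${2:-UTF-8}" -t "${2:-UTF-8}" < "$1" > /dev/null 2>&1
}
```

For example: is_valid_encoding your_file.ext && echo text || echo binary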

Method 5

You can write a script that calls file, and use a case-statement to check for the cases you are interested in.

For example

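#!/bin/sh
# -b drops the "filename:" prefix, so a filename that happens to
# contain "text" or "script" cannot influence the match.
case $(file -b -- "$1") in
(*script*|text|text\ *|*\ text|*\ text\ *)
    echo text
    ;;
(*)
    echo binary
    ;;
esac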

though of course there may be many special cases which are of interest. Just checking strings on a copy of libmagic, I see about 200 cases, e.g.,

Konqueror cookie text
Korn shell script text executable
LaTeX 2e document text
LaTeX document text
Linux Software Map entry text
Linux Software Map entry text (new format)
Linux kernel symbol map text
Lisp/Scheme program text
Lua script text executable
LyX document text
M3U playlist text
M4 macro processor script text

Some use the string “text” as part of a different type, e.g.,

SoftQuad troff Context intermediate   
SoftQuad troff Context intermediate for AT&T 495 laser printer
SoftQuad troff Context intermediate for HP LaserJet

Likewise, “script” could be part of a word, but I see no problems in this case. A script should, however, check for “text” as a word, not as a substring.
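One way to get whole-word matching is grep -w, as in this sketch. The name file_says_text is made up, and grep -w and file -b are assumed to be available. With -w, “Context” does not count as a match for “text”.

```shell
# file_says_text is a hypothetical helper name.
# Succeeds when the description printed by file(1) contains "text" or
# "script" as a whole word; -b keeps the filename out of the match.
file_says_text() {
    LC_ALL=C file -b -- "$1" | grep -qw -e text -e script
}
```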

As a reminder, file output does not use a precise description which would always have “script” or “text”. Special cases are something to consider. A followup commented that the --mime-type works while this approach would not, for .svg files. However, in a test I see these results for svg-files:

$ ls -l *.svg
-r--r--r-- 1 tom users  6679 Jul 26  2012 pumpkin_48x48.svg
-r--r--r-- 1 tom users 17372 Jul 30  2012 sink_48x48.svg
-r--r--r-- 1 tom users  5929 Jul 25  2012 vile_48x48.svg
-r--r--r-- 1 tom users  3553 Jul 28  2012 vile-mini.svg
$ file *.svg
pumpkin_48x48.svg: SVG Scalable Vector Graphics image
sink_48x48.svg:    SVG Scalable Vector Graphics image
vile-mini.svg:     SVG Scalable Vector Graphics image
vile_48x48.svg:    SVG Scalable Vector Graphics image
$ file --mime-type *.svg
pumpkin_48x48.svg: image/svg+xml
sink_48x48.svg:    image/svg+xml
vile-mini.svg:     image/svg+xml
vile_48x48.svg:    image/svg+xml

which I selected after seeing a thousand files show only 6 with “text” in the mime-type output. Arguably, matching the “xml” on the end of the mime-type output could be more useful, say, than matching “SVG”, but using a script to do that takes you back to the suggestion made here.

The output of file requires some tuning in either scenario, and is not 100% reliable (it is confused by several of my Perl scripts, calling them “data”).

There is more than one implementation of file. The one most commonly used does its work in libmagic, which can be used from different programs (perhaps not directly from zsh, though python can).

According to “File test comparison table for shell, Perl, Ruby, and Python”, Perl has a -T option which it can use to provide this information. But it lists no comparable feature for zsh.

Method 6

file has an option --mime-encoding that attempts to detect the encoding of a file.

$ file --mime-encoding Documents/poster2.pdf
Documents/poster2.pdf: binary
$ file --mime-encoding projects/linux/history-torvalds/Makefile
projects/linux/history-torvalds/Makefile: us-ascii
$ file --mime-encoding graphe.tex
graphe.tex: us-ascii
$ file --mime-encoding software.tex
software.tex: utf-8

You can use file --mime-encoding | grep binary to detect if a file is a binary file. It works reliably although it can get confused by a single invalid character in a long text file.

For example, I alias cat to the following shell script to avoid ruining my terminal by inadvertently opening a binary file:

#! /bin/sh -

[ ! -t 1 ] && exec /bin/cat "$@"
for i
do
    if file --mime-encoding -- "$i" | grep -q binary
    then
        hexdump -C -- "$i"
    else
        /bin/cat -- "$i"
    fi
done

Method 7

Categories are arbitrary. Before answering how to make a classification, you need a (strict) definition. In order to have a definition, you need a purpose.

So, what do you want to do with that classification?

If you want to choose between ASCII and binary mode in FTP, it is important not to transfer a binary file as ASCII (or it will be corrupted). So you should test whether the file is plain text, HTML, RTF, or one of a few others, and when in doubt select binary. You may also want to test that the file contains only a subset of bytes such as 0x0A, 0x0D, and 0x20-0x7F.
If you want to send the file over some protocol (POP3, SMTP), you need a test to decide whether to encode it in base64 or send it as is. In this case, you should test whether it contains unsupported characters.
Any other case may call for a different definition.
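The byte-subset test mentioned for the FTP case can be sketched with tr: delete every allowed byte and see whether anything is left. The name only_plain_ascii is made up for illustration, and the allowed set is exactly the one given above (LF, CR, and 0x20-0x7F), so a file containing a tab would be rejected.

```shell
# only_plain_ascii is a hypothetical helper name.
# Succeeds when the file holds nothing but LF (012), CR (015) and the
# bytes 0x20-0x7F (octal 040-177); returns 2 if the file is unreadable.
only_plain_ascii() {
    [ -r "$1" ] || return 2
    # Remove all allowed bytes and count whatever remains.
    rest=$(( $(LC_ALL=C tr -d '\012\015\040-\177' < "$1" | wc -c) ))
    [ "$rest" -eq 0 ]
}
```

A different purpose would simply mean a different set of bytes passed to tr.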

Method 8

perl -e'chomp(my$f=<>);print "binary$/" if -B $f;print "text$/" if -T _'

will do it. See documentation for -B and -T (search in that page for the string The -T and -B switches work as follows).

Method 9

I contributed to https://github.com/audreyr/binaryornot
It does not have a command line wrapper (yet) but this is a simple Python library easy enough to call even from the CLI.
It uses a fairly efficient heuristic to determine if a file is text or binary.

Method 10

I know this question is a bit old, but I think my friend taught me a great “hack” to do this.

You use the diff command and check your file against a test text file:

$ diff filetocheck testfile.txt

Now if filetocheck is a binary file, the output would be:

Binary files filetocheck and testfile.txt differ

This way you could leverage the diff command and e.g. write a function which does the check in a script.
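Such a function might look like the following sketch. The name is_binary_by_diff is made up, /dev/null stands in for the test text file, and the match relies on the English “Binary files ... differ” wording, so LC_ALL=C is forced. Other diff implementations may phrase the message differently.

```shell
# is_binary_by_diff is a hypothetical helper name.
# Compares the file against /dev/null and succeeds when diff reports
# "Binary files ... differ". Lines of a text diff start with a line
# range or "> ", so they cannot match the anchored pattern.
is_binary_by_diff() {
    LC_ALL=C diff -- /dev/null "$1" 2>/dev/null | grep -q '^Binary files '
}
```

An empty file produces no diff output at all, so this sketch classifies it as not binary.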

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.