How to gather byte occurrence statistics in a binary file?

I’d like to know the equivalent of

cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c

presented in https://stackoverflow.com/questions/4174113/how-to-gather-characters-usage-statistics-in-text-file-using-unix-commands for producing character usage statistics in text files, but for binary files, counting single bytes instead of characters, i.e. output should be in the form of

It doesn’t matter if the command takes as long as the referenced one for characters.

If I apply the command for characters to binary files, the output contains statistics for arbitrarily long sequences of unprintable characters (I don’t seek an explanation for that).

Answers:

Method 1

With GNU od:

od -vtu1 -An -w1 my.file | sort -n | uniq -c

Or more efficiently with perl (also outputs a count (0) for bytes that don’t occur):

perl -ne 'BEGIN{$/ = \4096};
          $c[$_]++ for unpack("C*");
          END{for ($i=0;$i<256;$i++) {
              printf "%3d: %d\n", $i, $c[$i]}}' my.file
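
For comparison, the same chunked counting can be sketched in Python. This is only an illustration, not part of the original answer: the helper name `byte_counts` is hypothetical, and the 4096-byte chunk size is simply borrowed from the perl version. Like the perl one-liner, it yields a count (possibly 0) for every byte value.

```python
def byte_counts(path, chunk_size=4096):
    """Return a list of 256 counts, one per byte value (hypothetical helper)."""
    counts = [0] * 256
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large file is never held in memory at once.
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for b in chunk:  # iterating over a bytes object yields ints 0..255
                counts[b] += 1
    return counts
```

To print in the same layout as the perl output, loop over `enumerate(byte_counts("my.file"))` and format each pair with `f"{value:3d}: {count}"`.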

Method 2

For large files using sort will be slow. I wrote a short C program to solve the equivalent problem (see this gist for Makefile with tests):

#include <stdio.h>

#define BUFFERLEN 4096

int main(){
    // This program reads standard input, calculates the frequency of each
    // byte value and prints the frequency of every byte value upon exit.
    //
    // Example:
    //
    //     $ echo "Hello world" | ./a.out
    //
    // Copyright (c) 2015 Björn Dahlgren
    // Open source: MIT License

    long long tot = 0; // long long is at least 64 bits, i.e. enough for 8 EiB of input
    long long n[256]; // One byte == 8 bits => 256 unique bytes

    const int bufferlen = BUFFERLEN;
    char buffer[BUFFERLEN];
    int i;
    size_t nread;

    for (i=0; i<256; ++i)
        n[i] = 0;

    do {
        nread = fread(buffer, 1, bufferlen, stdin);
        for (i = 0; i < nread; ++i)
            ++n[(unsigned char)buffer[i]];
        tot += nread;
    } while (nread == bufferlen);
    // here you may want to inspect ferror or feof

    for (i=0; i<256; ++i){
        printf("%d ", i);
        printf("%f\n", n[i]/(float)tot);
    }
    return 0;
}

usage:

gcc main.c
cat my.file | ./a.out

Method 3

As mean, sigma and CV are often important when judging statistical data about the content of binary files, I’ve created a command-line program that graphs all this data as an ASCII circle of byte deviations from sigma.
http://wp.me/p2FmmK-96
It can be used with grep, xargs and other tools to extract statistics.
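
To make the three figures concrete, here is a minimal sketch of one way to compute them. It is not the linked program's method: it assumes the statistics are taken over the 256 per-value byte counts, uses the population standard deviation, and the name `count_stats` is hypothetical.

```python
import math
from collections import Counter

def count_stats(data):
    """Mean, population sigma and CV of the 256 per-value counts (assumed definition)."""
    seen = Counter(data)  # maps byte value -> number of occurrences
    counts = [seen.get(v, 0) for v in range(256)]
    mean = sum(counts) / 256
    sigma = math.sqrt(sum((c - mean) ** 2 for c in counts) / 256)
    # CV is sigma relative to the mean; it is undefined for empty input.
    cv = sigma / mean if mean else float("nan")
    return mean, sigma, cv
```

Under this definition, a file in which every byte value occurs equally often gives sigma and CV of 0, while a file made of a single repeated byte gives a large CV.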

Method 4

The recode program can do this quickly even for large files, producing frequency statistics either for bytes or for the characters of various character sets. E.g. to count byte frequencies:

$ echo hello there > /tmp/q
$ recode latin1/..count-characters < /tmp/q
1  000A LF   1  0020 SP   3  0065 e    2  0068 h    2  006C l    1  006F o
1  0072 r    1  0074 t

Caution – specify your file to recode as standard input, otherwise it will silently replace it with the character frequencies!

Use recode utf-8/..count-characters < file to treat the input file as utf-8. Many other character sets are available, and recode will fail if the file contains any illegal characters.

Method 5

This is similar to Stephane’s od answer but it shows the ASCII value of the byte. It is also sorted by frequency / number of occurrences.

xxd -c1 my.file|cut -c10-|sort|uniq -c|sort -nr

I don’t think this is efficient since many processes are started but it’s good for single files, particularly small files.
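
If starting that many processes is a concern, the same frequency-sorted listing can be approximated in a single process. This is a sketch only: `by_frequency` and `format_row` are hypothetical names, the column layout is an assumption and not xxd's exact output, and the whole input is assumed to fit in memory.

```python
from collections import Counter

def by_frequency(data):
    """Return (count, value) pairs, most frequent first; ties ordered by byte value."""
    counts = Counter(data)  # maps byte value -> number of occurrences
    return sorted(((n, v) for v, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))

def format_row(count, value):
    # Printable ASCII is shown as-is; everything else becomes '.', as xxd does.
    char = chr(value) if 0x20 <= value <= 0x7E else "."
    return f"{count:7d} {value:02x}  {char}"
```

For a file, something like `by_frequency(open("my.file", "rb").read())` followed by `format_row(*pair)` for each pair gives one line per byte value that occurs.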

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
