Simultaneously calculate multiple digests (md5, sha256)?

Under the assumption that disk I/O and free RAM are the bottleneck (while CPU time is not), does a tool exist that can calculate multiple message digests at once?

I am particularly interested in calculating the MD5 and SHA-256 digests of large files (gigabytes in size), preferably in parallel. I have tried openssl dgst -sha256 -md5, but it only calculates the hash using one algorithm.

Pseudo-code for the expected behavior:

for each block:
    for each algorithm:
        hash_state[algorithm].update(block)
for each algorithm:
    print algorithm, hash_state[algorithm].final_hash()

Answers:


Method 1

Check out pee ("tee standard input to pipes") from moreutils. This is basically equivalent to the tee command in Method 2, but a little simpler to type.

$ echo foo | pee md5sum sha256sum
d3b07384d113edec49eaa6238ad5ff00  -
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c  -

$ pee md5sum sha256sum <foo.iso
f109ffd6612e36e0fc1597eda65e9cf0  -
469a38cb785f8d47a0f85f968feff0be1d6f9398e353496ff7aa9055725bc63e  -

Method 2

You can use a for loop to loop over the individual files and then use tee
combined with process substitution (works in Bash and Zsh among others) to
pipe to different checksummers.

Example:

for file in *.mkv; do
  tee < "$file" >(sha256sum) | md5sum
done

You can also use more than two checksummers:
for file in *.mkv; do
  tee < "$file" >(sha256sum) >(sha384sum) | md5sum
done

This has the disadvantage that the checksummers don’t know the file name,
because it is passed as standard input. If that’s not acceptable, you have to
emit the file names manually. Complete example:
for file in *.mkv; do
  echo "$file"
  tee < "$file" >(sha256sum) >(sha384sum) | md5sum
  echo
done > hashfilelist
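If you want each digest in its own file instead of interleaved on standard output, the redirection can go inside the process substitution. A minimal runnable sketch (assuming Bash and GNU coreutils; the sandbox directory, sample file, and .md5/.sha256 suffixes are just for illustration):

```shell
#!/bin/bash
# Each checksummer writes to its own file; the input is still read only once.
dir=$(mktemp -d)
printf 'foo\n' > "$dir/sample.bin"   # stand-in for a multi-gigabyte file

for file in "$dir"/*.bin; do
    tee < "$file" >(sha256sum > "$file.sha256") | md5sum > "$file.md5"
    # The command inside >(...) runs asynchronously, so wait until
    # sha256sum has actually written its output file.
    until [ -s "$file.sha256" ]; do sleep 0.1; done
done

cat "$dir/sample.bin.md5" "$dir/sample.bin.sha256"
```

The until loop is there because Bash does not wait for process substitutions before continuing; without it, a later step could read an empty .sha256 file.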

Method 3

It’s a pity that the openssl utility doesn’t accept multiple digest commands; I guess performing the same command on multiple files is a more common use pattern. FWIW, the version of the openssl utility on my system (Mepis 11) only has commands for sha and sha1, not any of the other sha variants. But I do have a program called sha256sum, as well as md5sum.

Here’s a simple Python program, dual_hash.py, that does what you want. A block size of 64k appears to be optimal for my machine (Intel Pentium 4 2.00GHz with 2G of RAM), YMMV. For small files, its speed is roughly the same as running md5sum and sha256sum in succession, but for larger files it is significantly faster. E.g., on a 1967063040-byte file (a disk image of an SD card full of mp3 files), md5sum + sha256sum takes around 1m44.9s, while dual_hash.py takes 1m0.312s.

dual_hash.py

#! /usr/bin/env python

''' Calculate MD5 and SHA-256 digests of a file simultaneously

    Written by PM 2Ring 2014.10.23
'''

import sys
import hashlib

def digests(fname, blocksize):
    md5 = hashlib.md5()
    sha = hashlib.sha256()
    with open(fname, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            md5.update(block)
            sha.update(block)

    print("md5: %s" % md5.hexdigest())
    print("sha256: %s" % sha.hexdigest())

def main(*argv):
    blocksize = 1<<16 # 64kB
    if len(argv) < 2:
        print("No filename given!\n")
        print("Calculate md5 and sha-256 message digests of a file.")
        print("Usage:\npython %s filename [blocksize]\n" % argv[0])
        print("Default blocksize=%d" % blocksize)
        return 1

    fname = argv[1]

    if len(argv) > 2:
        blocksize = int(argv[2])

    print("Calculating MD5 and SHA-256 digests of %r using a blocksize of %d" % (fname, blocksize))
    digests(fname, blocksize)

if __name__ == '__main__':
    sys.exit(main(*sys.argv))

I suppose a C/C++ version of this program would be a little faster, but not much, since most of the work is being done by the hashlib module, which is itself written in C. And as you noted above, the bottleneck for large files is I/O speed.
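The same single-pass idea generalizes to any set of algorithms hashlib knows about. A sketch (the multi_digest name and its defaults are mine, not part of dual_hash.py):

```python
import hashlib

def multi_digest(fname, algorithms=("md5", "sha256"), blocksize=1 << 16):
    """Read the file once, feeding every hash object each block."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(fname, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            for h in hashers.values():
                h.update(block)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Calling multi_digest("foo.iso", ("md5", "sha256", "sha512")) returns a dict mapping each algorithm name to its hex digest, still with only one pass over the file.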

Method 4

You could always use something like GNU parallel:

parallel {} /path/to/file ::: md5sum sha256sum

Alternatively, just run one of the two in the background:
md5sum /path/to/file & sha256sum /path/to/file

Or, save the output to different files and run multiple jobs in the background:
for file in *; do
    md5sum "$file" > "$file".md5 &
    sha256sum "$file" > "$file".sha &
done

That will launch as many md5sum and sha256sum instances as you have files and they will all run in parallel, saving their output to the corresponding file names. Careful though, this can get heavy if you have many files.
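One caveat: the loop above returns as soon as the jobs are launched, not when they finish. If anything later in the script consumes the digest files, add a wait. A runnable sketch (assuming Bash and GNU coreutils; the temporary directory and sample files are just for illustration):

```shell
#!/bin/bash
# Hash every file with two algorithms, all in the background, then block
# until the last job finishes so the digest files are complete.
dir=$(mktemp -d)
printf 'foo\n' > "$dir/a.txt"
printf 'bar\n' > "$dir/b.txt"

for file in "$dir"/*.txt; do
    md5sum "$file" > "$file.md5" &
    sha256sum "$file" > "$file.sha" &
done
wait   # without this, the script could move on before any checksum is written

cat "$dir"/*.md5
```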

Method 5

Out of curiosity whether a multi-threaded Python script would reduce the running time, I created this digest.py script, which uses threading.Thread, threading.Queue and hashlib to calculate the hashes for multiple files.

The multi-threaded Python implementation is indeed slightly faster than using pee with coreutils. Java on the other hand is… meh. The results are available in this commit message:

For comparison, for a file of 2.3 GiB (min/avg/max/sd secs for n=10):

  • pee sha256sum md5sum < file: 16.5/16.9/17.4/.305
  • python3 digest.py -sha256 -md5 < file: 13.7/15.0/18.7/1.77
  • python2 digest.py -sha256 -md5 < file: 13.7/15.9/18.7/1.64
  • jacksum -a sha256+md5 -F '#CHECKSUM{i} #FILENAME': 32.7/37.1/50/6.91

The hash output is compatible with output produced by coreutils. Since the length is dependent on the hashing algorithm, this tool does not print it. Usage (for comparison, pee was also added):

$ ./digest.py -sha256 -md5 digest.py
c217e5aa3c3f9cfaca0d40b1060f6233297a3a0d2728dd19f1de3b28454975f2  digest.py
b575edf6387888a68c93bf89291f611c  digest.py
$ ./digest.py -sha256 -md5 <digest.py
c217e5aa3c3f9cfaca0d40b1060f6233297a3a0d2728dd19f1de3b28454975f2  -
b575edf6387888a68c93bf89291f611c  -
$ pee sha256sum md5sum <digest.py
c217e5aa3c3f9cfaca0d40b1060f6233297a3a0d2728dd19f1de3b28454975f2  -
b575edf6387888a68c93bf89291f611c  -
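The script itself lives in the linked commit; a minimal sketch of the same threading idea (the threaded_digests helper is hypothetical, not the actual digest.py). It can beat the sequential approach because CPython's hashlib releases the GIL while hashing sufficiently large buffers, so the workers genuinely overlap:

```python
import hashlib
import queue
import threading

def threaded_digests(fname, algorithms=("md5", "sha256"), blocksize=1 << 16):
    """One reader feeds a bounded queue per algorithm; workers hash in parallel."""
    results = {}
    queues = {name: queue.Queue(maxsize=8) for name in algorithms}

    def worker(name, q):
        h = hashlib.new(name)
        while True:
            block = q.get()
            if block is None:          # sentinel: no more data
                break
            h.update(block)
        results[name] = h.hexdigest()

    threads = [threading.Thread(target=worker, args=item) for item in queues.items()]
    for t in threads:
        t.start()
    with open(fname, "rb") as f:
        while True:
            block = f.read(blocksize)
            for q in queues.values():
                q.put(block or None)   # empty read becomes the sentinel
            if not block:
                break
    for t in threads:
        t.join()
    return results
```

The bounded queues keep memory use flat: if one hasher falls behind, the reader blocks instead of buffering the whole file.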

Method 6

Try RHash.

There are packages for Cygwin and Debian.

Example

$ echo foo | rhash --md5 --sha1 --bsd -
MD5   ((stdin)) = d3b07384d113edec49eaa6238ad5ff00
SHA1  ((stdin)) = f1d2d2f924e986ac86fdf7b36c94bcdf32beec15

Nice to know: lotsa hashes

If you wanna go crazy: Try the --all option to get ALL supported hashes (and the --bsd formatting option to know what these hashes are):

$ echo foo | rhash --all --bsd - | sort
AICH  ((stdin)) = 6hjnf6je5gdkzbx566zwzff434zl53av
BTIH  ((stdin)) = 22a9c158a3ea04608f0e6ea826e3188c773eb4dd
CRC32 ((stdin)) = 7e3265a8
CRC32C ((stdin)) = 9626347b
ED2K  ((stdin)) = 3ee037f347c64cc372ad18857b0db91f
EDON-R256 ((stdin)) = 747b550af4c4916340680669f885ec391addf22cece025d1cb11df978401793a
EDON-R512 ((stdin)) = 521ec4b41abb75a54969c8070c3558b7f3981833165fd208d3b48de2bc23b64fa2a1d80ea94d87b176ecb99c8495f9ee19307c9ad54c23f37e034579b6ced4d8
GOST  ((stdin)) = eb9382405525bf1cc8403ed621caecfe8339cd7157e383fe9c36782ca0aeab5f
GOST-CRYPTOPRO ((stdin)) = 72e0992f1e7caec2f8406b53d7ed09263fb6df1bae9129731f97a50a9de04115
HAS-160 ((stdin)) = 6bb6e92d882dc41746064f8c2d8e81df02f13f0c
MD4   ((stdin)) = 3ee037f347c64cc372ad18857b0db91f
MD5   ((stdin)) = d3b07384d113edec49eaa6238ad5ff00
RIPEMD-160 ((stdin)) = ec0af898b7b1ab23ccf8c5036cb97e9ab23442ab
SHA1  ((stdin)) = f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
SHA-224 ((stdin)) = e7d5e36e8d470c3e5103fedd2e4f2aa5c30ab27f6629bdc3286f9dd2
SHA-256 ((stdin)) = b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
SHA3-224 ((stdin)) = 5f6b734bdedd9fc2bdf02d18f16ef83bbbb9178aebe5e8f6ae79e9a2
SHA3-256 ((stdin)) = 5218df10c0ebe3b38d74fe0040d13198ac49646a43bad373b91ed887dd734fcf
SHA3-384 ((stdin)) = a4d62fdfee48479a8951de809d9f3604309e8783d754d94c0842c89ddb544ee963bf64063644251e0521ca44aca97350
SHA3-512 ((stdin)) = 6f1b16155d5f87af947270b2202c9432b64ff07880e3bd104a50605bc0f949d4e4bf30cddbb257a7f3a54881429f45efdb43fbe14371f9f7f5cb16789db9175d
SHA-384 ((stdin)) = 8effdabfe14416214a250f935505250bd991f106065d899db6e19bdc8bf648f3ac0f1935c4f65fe8f798289b1a0d1e06
SHA-512 ((stdin)) = 0cf9180a764aba863a67b6d72f0918bc131c6772642cb2dce5a34f0a702f9470ddc2bf125c12198b1995c233c34b4afd346c54a2334c350a948a51b6e8b4e6b6
SNEFRU-128 ((stdin)) = 6bf837fd63236ae6d4a7df110085177c
SNEFRU-256 ((stdin)) = 27f8e3841ee9d88c6a9e5a0b0c02e7d8c3dbffbec3e2d8f22b6419236002aebd
TIGER ((stdin)) = 89c010f8e5ddcf01c7d71c7d8352d5436e40fe5200ca8ce0
TTH   ((stdin)) = a2mppcgs5cpjv6aoap37icdcfv3wyu7pbrec6fy
WHIRLPOOL ((stdin)) = 404818c0ea953193b372a3e72c96b91a53d0d07eb99d8cb8c2aaebf56657e74de2b6a510866283d0501b95aa0ba0ddc3b7669ea5fc9422cc666a953e241d8b9e

Method 7

Jacksum is a free and platform independent utility for computing and verifying checksums, CRCs and hashes (message digests) as well as timestamps of files. (excerpted from jacksum man page)

It is large file aware, it can process filesizes up to 8 Exabytes (= 8,000,000,000 Gigabytes), presupposed your operating system respectively your file system is large file aware, too. (excerpted from http://www.jonelo.de/java/jacksum/)

Usage example:

jacksum -a md5+sha256 -F "#ALGONAME{i} (#FILENAME) = #CHECKSUM{i}" jacksum-testfile

Sample output:
md5 (jacksum-testfile) = d41d8cd98f00b204e9800998ecf8427e
sha256 (jacksum-testfile) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

On Ubuntu, run apt-get install jacksum to get it.

Alternatively, the source code is available from the Jacksum homepage (http://www.jonelo.de/java/jacksum/).


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.
