Why is iterating over a file twice faster than reading it into memory and computing twice?

I’m comparing the following

tail -n 1000000 stdout.log | grep -c '"success": true'
tail -n 1000000 stdout.log | grep -c '"success": false'

with the following

log=$(tail -n 1000000 stdout.log)
echo "$log" | grep -c '"success": true'
echo "$log" | grep -c '"success": false'

and surprisingly the second takes almost 3 times longer than the first. It should be faster, shouldn’t it?

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Method 5

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

I’ve done the following test and on my system the resulting difference is about 100 times longer for the second script.

My file is a strace output called bigfile

$ wc -l bigfile.log 
1617000 bigfile.log

Scripts

<a href="https://getridbug.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="027a766b636c42616e636477686b77">[email protected]</a>:~/tmp$ cat p1.sh
tail -n 1000000 bigfile.log | grep '"success": true' | wc -l
tail -n 1000000 bigfile.log | grep '"success": false' | wc -l

<a href="https://getridbug.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="cbb3bfa2aaa58ba8a7aaadbea1a2be">[email protected]</a>:~/tmp$ cat p2.sh
log=$(tail -n 1000000 bigfile.log)
echo "$log" | grep '"success": true' | wc -l
echo "$log" | grep '"success": true' | wc -l

I don’t actually have any matches for the grep so nothing is written to the last pipe through to wc -l

Here are the timings:

<a href="https://getridbug.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="98e0ecf1f9f6d8fbf4f9feedf2f1ed">[email protected]</a>:~/tmp$ time bash p1.sh
0
0

real    0m0.381s
user    0m0.248s
sys 0m0.280s
<a href="https://getridbug.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="a0d8d4c9c1cee0c3ccc1c6d5cac9d5">[email protected]</a>:~/tmp$ time bash p2.sh
0
0

real    0m46.060s
user    0m43.903s
sys 0m2.176s

So I ran the two scripts again via the strace command

strace -cfo p1.strace bash p1.sh
strace -cfo p2.strace bash p2.sh

Here are the results from the traces:

$ cat p1.strace 
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.24    0.508109       63514         8         2 waitpid
  1.61    0.008388           0     84569           read
  1.08    0.005659           0     42448           write
  0.06    0.000328           0     21233           _llseek
  0.00    0.000024           0       204       146 stat64
  0.00    0.000017           0       137           fstat64
  0.00    0.000000           0       283       149 open
  0.00    0.000000           0       180         8 close
...
  0.00    0.000000           0       162           mmap2
  0.00    0.000000           0        29           getuid32
  0.00    0.000000           0        29           getgid32
  0.00    0.000000           0        29           geteuid32
  0.00    0.000000           0        29           getegid32
  0.00    0.000000           0         3         1 fcntl64
  0.00    0.000000           0         7           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.522525                149618       332 total

And p2.strace

$ cat p2.strace 
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 75.27    1.336886      133689        10         3 waitpid
 13.36    0.237266          11     21231           write
  4.65    0.082527        1115        74           brk
  2.48    0.044000        7333         6           execve
  2.31    0.040998        5857         7           clone
  1.91    0.033965           0    705681           read
  0.02    0.000376           0     10619           _llseek
  0.00    0.000000           0       248       132 open
...
  0.00    0.000000           0       141           mmap2
  0.00    0.000000           0       176       126 stat64
  0.00    0.000000           0       118           fstat64
  0.00    0.000000           0        25           getuid32
  0.00    0.000000           0        25           getgid32
  0.00    0.000000           0        25           geteuid32
  0.00    0.000000           0        25           getegid32
  0.00    0.000000           0         3         1 fcntl64
  0.00    0.000000           0         6           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    1.776018                738827       293 total

Analysis

Not surprisingly, in both cases most of the time is spent waiting for a process to complete, but p2 waits 2.63 times longer than p1, and as others have mentioned, you are starting late in p2.sh.

So now forget about the waitpid, ignore the % column and look at the seconds column on both traces.

Largest time p1 spends most of its time in read probably understandably, because there’s a large file to read, but p2 spends 28.82 times longer in read than p1 does. – bash is not expecting to read such a large file into a variable and is probably reading buffer at a time, splitting into lines and and then getting another.

read count p2 is 705k vs 84k for p1, each read requiring a context switch into kernel space and out again. Nearly 10 times the number of reads and context switches.

Time in write p2 spends 41.93 times longer in write than p1

write count p1 does more writes than p2, 42k vs 21k, however they are much faster.

Probably because of the echo of lines into grep as opposed to tail writing buffers.

Further more, p2 spends more time in write than it does in read, p1 is the other way round!

Other factor Look at the number of brk system calls : p2 spends 2.42 times longer breaking than it does reading! In p1 (it doesn’t even register). brk is when the program needs to expand its address space because enough wasn’t allocated initially, this is probably due to bash having to read that file into the variable, and not expecting it to be that large, and as @scai mentioned, if the file gets too large, even that would not work.

tail is probably quite an efficient file reader, because this is what it was designed to do, it probably memmaps the file and scans for line breaks, thus allowing the kernel to optimise the i/o. bash doesn’t isn’t as good both on time spent reading and writing.

p2 spends 44ms and 41ms in clone and execv it’s not a measurable amount for p1. Probably bash reading and creating the variable from tail.

Finally the Totals p1 executes ~ 150k system calls vs p2 740k (4.93 times greater).

Eliminating waitpid, p1 spends 0.014416 seconds executing system calls, p2 0.439132 seconds (30 times longer).

So it appears p2 spends most of the time in user space doing nothing except waiting for the system calls to complete and the kernel to regorganise memory, p1 performs more writes, but is more efficient and causes significantly less system load, and is therefore faster.

Conclusion

I would never try to worry about coding through memory when writing a bash script, that does not mean to say that you do not try to be efficient.

tail is designed to do what it does, it probably memory maps the file so that it is efficient to read and allows the kernel to optimise the i/o.

A better way to optimise your problem might be to first grep for ‘”success”: ‘ lines and then count the trues and falses, grep has a count option which again avoids the wc -l, or even better still, pipe the tail through to awk and count trues and falses concurrently. p2 not only takes long but adds load to the system whilst memory is being shuffled about with brks.

Method 2

On the one hand, the first method calls tail twice, so it has to do more work than the second method which only does this once. On the other hand, the second method has to copy the data into the shell and then back out, so it has to do more work than the first version where tail is directly piped into grep. The first method has an extra advantage on a multi-processor machine: grep can work in parallel with tail, whereas the second method is strictly serialized, first tail, then grep.

So there’s no obvious reason why one should be faster than the other.

If you want to see what’s going on, look at what system calls the shell makes. Try with different shells, too.

strace -t -f -o 1.strace sh -c '
  tail -n 1000000 stdout.log | grep ""success": true" | wc -l;
  tail -n 1000000 stdout.log | grep ""success": false" | wc -l'

strace -t -f -o 2-bash.strace bash -c '
  log=$(tail -n 1000000 stdout.log);
  echo "$log" | grep ""success": true" | wc -l;
  echo "$log" | grep ""success": true" | wc -l'

strace -t -f -o 2-zsh.strace zsh -c '
  log=$(tail -n 1000000 stdout.log);
  echo "$log" | grep ""success": true" | wc -l;
  echo "$log" | grep ""success": true" | wc -l'

With method 1, the main stages are:

tail reads and seek to find its starting point.
tail writes 4096-byte chunks which grep reads as fast as they’re produced.
Repeat the previous step for the second search string.

With method 2, the main stages are:

tail reads and seek to find its starting point.
tail writes 4096-byte chunks which bash reads 128 bytes at a time, and zsh reads 4096 bytes at a time.
Bash or zsh writes 4096-byte chunks which grep reads as fast as they’re produced.
Repeat the previous step for the second search string.

Bash’s 128-byte chunks when reading the output of the command substitution slows it down significantly; zsh comes out about as fast as method 1 for me. Your mileage may vary depending on the CPU type and number, scheduler configuration, versions of the tools involved, and size of the data.

Method 3

Actually the first solution reads the file into memory, too! This is called caching and is automatically done by the operating system.

And as already correctly explained by mikeserv the first solution exectutes grep while the file is being read whereas the second solution executes it after the file has been read by tail.

So the first solution is faster because of various optimizations. But this doesn’t always have to be true. For really large files which the OS decides not to cache the second solution could become faster. But note that for even larger files which won’t fit into your memory the second solution won’t work at all.

Method 4

I think the main difference is very simply that echo is slow. Consider this:

$ time (tail -n 1000000 foo | grep 'true' | wc -l; 
        tail -n 1000000 foo | grep 'false' | wc -l;)
666666
333333

real    0m0.999s
user    0m1.056s
sys     0m0.136s

$ time (log=$(tail -n 1000000 foo); echo "$log" | grep 'true' | wc -l; 
                                    echo "$log" | grep 'false' | wc -l)
666666
333333

real    0m4.132s
user    0m3.876s
sys     0m0.468s

$ time (tail -n 1000000 foo > bb;  grep 'true' bb | wc -l; 
                                   grep 'false' bb | wc -l)
666666
333333

real    0m0.568s
user    0m0.512s
sys     0m0.092s

As you can see above, the time consuming step is printing the data. If you simply redirect to a new file and grep through that it is much faster when only reading the file once.

And as requested, with a here string:

 $ time (log=$(tail -n 1000000 foo); grep 'true' <<< $log | wc -l; 
                                     grep 'false' <<< $log | wc -l  )
1
1

real    0m7.574s
user    0m7.092s
sys     0m0.516s

This one is even slower, presumably because the here string is concatenating all data onto one long line and that will slow down the grep:

$ tail -n 1000000 foo | (time grep -c 'true')
666666

real    0m0.500s
user    0m0.472s
sys     0m0.000s

$ tail -n 1000000 foo | perl -pe 's/n/ /' | (time grep -c 'true')
1

real    0m1.053s
user    0m0.048s
sys     0m0.068s

If the variable is quoted so that no splitting occurs, things are a little faster:

 $ time (log=$(tail -n 1000000 foo); grep 'true' <<< "$log" | wc -l; 
                                     grep 'false' <<< "$log" | wc -l  )
666666
333333

real    0m6.545s
user    0m6.060s
sys     0m0.548s

But still slow because the rate limiting step is printing the data.

Method 5

I had a go at this as well… First, I built the file:

printf '"success": "true"
        "success": "true"
        "success": "false"
        %.0b' `seq 1 500000` >|/tmp/log

If you run the above yourself, you should come up with 1.5million lines in /tmp/log with 2:1 ratio of "success": "true" lines to "success": "false" lines.

The next thing I did was run some tests. I ran all of the tests through a proxy sh so time would only have to watch a single process – and could therefore show a single result for the entire job.

This seems to be the fastest, even though it adds a second file descriptor and tee, though I think I can explain why:

    time sh <<-CMD
        . <<HD /dev/stdin | grep '"success": "true"' | wc -l
            tail -n 1000000 /tmp/log | { tee /dev/fd/3 |
                grep '"success": "false"' |
                    wc -l 1>&2 & } 3>&1 &
        HD
    CMD
666666
333334
sh <<<''  0.11s user 0.08s system 84% cpu 0.224 total

Here’s your first:

    time sh <<CMD
        tail -n 1000000 /tmp/log | grep '"success": "true"' | wc -l
        tail -n 1000000 /tmp/log | grep '"success": "false"' | wc -l
    CMD

666666
333334
sh <<<''  0.31s user 0.17s system 148% cpu 0.323 total

And your second:

    time sh <<CMD
        log=$(tail -n 1000000 /tmp/log)
        echo "$log" | grep '"success": "true"' | wc -l
        echo "$log" | grep '"success": "false"' | wc -l
    CMD
666666
333334
sh <<<''  2.12s user 0.46s system 108% cpu 2.381 total

You can see that in my tests there was more than a 3*difference in speed when reading it into a variable as you did.

I think part of that is that a shell variable has to be split and handled by the shell when it’s being read – it’s not a file.

A here-document on the other hand, for all intents and purposes, is a file – a file descriptor, anyway. And as we all know – Unix works with files.

What’s most interesting to me about here-docs is that you can manipulate their file-descriptors – as a straight |pipe– and execute them. This is very handy as it allows you a little more freedom in pointing your |pipe where you’d like.

I had to tee the tail because the first grep eats the here-doc |pipe and there’s nothing left for the second to read. But since I |piped it into /dev/fd/3 and picked it up again to pass to >&1 stdout, it didn’t much matter. If you use grep -c as many others recommend:

    time sh <<-CMD
        . <<HD /dev/stdin | grep -c '"success": "true"'
            tail -n 1000000 /tmp/log | { tee /dev/fd/3 |
                grep -c '"success": "false"' 1>&2 & } 3>&1 &
        HD
    CMD
666666
333334
sh <<<''  0.07s user 0.04s system 62% cpu 0.175 total

It’s even faster.

But when I run it without . sourcing the heredoc I can’t successfully background the first process to run them fully concurrently. Here it is without fully backgrounding it:

    time sh <<CMD
        tail -n 1000000 /tmp/log | { tee /dev/fd/3 |
            grep -c '"success": "true"' 1>&2 & } 3>&1 |
                grep -c '"success": "false"'
    CMD
666666
333334
sh <<<''  0.10s user 0.08s system 109% cpu 0.165 total

But when I add the &:

    time sh <<CMD
        tail -n 1000000 /tmp/log | { tee /dev/fd/3 |
            grep -c '"success": "true"' 1>&2 & } 3>&1 & |
                grep -c '"success": "false"'
    CMD
sh: line 2: syntax error near unexpected token `|'

Still, the difference appears to be only a few hundredths of a second, at least for me, so take it as you will.

Anyway, the reason it runs faster with tee is because both greps run at the same time with only one invocation of tail. tee duplicates the file for us and splits it off to the second grep process all in-stream – everything runs at once from beginning to end, so they all finish up around the same time, too.

So going back to your first example:

    tail | grep | wc #wait til finished
    tail | grep | wc #now we're done

And your second:

    var=$( tail ) ; #wait til finished
    echo | grep | wc #wait til finished
    echo | grep | wc #now we're done

But when we split our input and run our processes concurrently:

          3>&1  | grep #now we're done
              /        
    tail | tee  #both process together
                
          >&1   | grep #now we're done

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating