If I have a plain-text file containing 8 million lines and I want to print lines 4,000,000 to 4,000,010 to the screen, which would be more efficient: awk or sed?
There is no pattern to the text, and unfortunately, a database isn’t an option. I know this isn’t ideal; I’m just curious about which one would complete the task more quickly.
Or maybe there is even a better alternative to sed or awk?
Answers:
Method 1
Neither, use tail or head instead:
$ time tail -n 4000001 foo | head -n 11
real 0m0.039s
user 0m0.032s
sys 0m0.004s
$ time head -n 4000010 foo | tail -n 11
real 0m0.055s
user 0m0.064s
sys 0m0.036s
tail is in fact consistently faster. I ran both commands 100 times and calculated their average:
tail:
real 0.03962 user 0.02956 sys 0.01456
head:
real 0.06284 user 0.07356 sys 0.07244
I imagine tail is faster because, although it has to seek all the way to line 4,000,000, it does not actually print anything until it gets there, while head prints everything up to line 4,000,010.
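For anyone who wants to repeat the 100-run comparison, here is a minimal sketch of an averaging loop. It is not necessarily how the numbers above were gathered: avg_time is a hypothetical helper name, it assumes GNU date (for the %N nanosecond format), and it reports only mean wall-clock ("real") time, not user or sys.

```shell
# avg_time RUNS CMD [ARGS...]
# Runs CMD the given number of times with stdout discarded and prints the
# mean wall-clock time per run, in seconds.
# Assumptions: GNU date (for %N); avg_time is a hypothetical helper name.
# Note: the variables below are not local, so they overwrite same-named globals.
avg_time() {
    runs=$1
    shift
    i=0
    start=$(date +%s.%N)
    while [ "$i" -lt "$runs" ]; do
        "$@" > /dev/null
        i=$((i + 1))
    done
    end=$(date +%s.%N)
    # Do the floating-point division in awk, since shell arithmetic is integer-only.
    awk -v s="$start" -v e="$end" -v n="$runs" \
        'BEGIN { printf "%.5f\n", (e - s) / n }'
}
```

A pipeline has to be wrapped in a shell so it can be passed as one command, for example `avg_time 100 sh -c 'tail -n 4000001 foo | head -n 11'`. That adds the cost of starting sh to every run, so the result is only useful for comparing commands measured the same way.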
Compare that with some other methods, all of which were slower here:
sed:
$ time sed -n '4000000,4000010p;4000010q' foo
real 0m0.312s
user 0m0.236s
sys 0m0.072s
Perl:
$ time perl -ne 'next if $.<4000000; print; exit if $.>=4000010' foo
real 0m1.000s
user 0m0.936s
sys 0m0.064s
awk:
$ time awk '(NR>=4000000 && NR<=4000010){print} (NR==4000010){exit}' foo
real 0m0.955s
user 0m0.868s
sys 0m0.080s
Basically, the rule is: the less you parse, the faster you are. Treating the input as a stream of data that only needs to be printed to the screen (as tail does) will generally be the fastest way.
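In the same spirit, the tail/head idea can be wrapped in a small function so you don't have to work out the counts by hand. This is only a sketch: print_range is a hypothetical name, and it uses `tail -n +START` (start output at a given line) rather than the `tail -n 4000001` form timed above, so the timings above do not necessarily carry over to it.

```shell
# print_range FILE START END
# Prints lines START through END (inclusive, 1-based) of FILE.
# print_range is a hypothetical helper name, not part of any standard tool.
# Note: the variables below are not local, so they overwrite same-named globals.
print_range() {
    file=$1
    start=$2
    end=$3
    # tail -n +START begins output at line START of the file;
    # head then keeps only the first (END - START + 1) of those lines.
    tail -n "+$start" "$file" | head -n "$((end - start + 1))"
}
```

For the file in the question, that would be `print_range foo 4000000 4000010`. Unlike the pipeline above, it does not need to know how many lines the file has in total.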
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.