Efficient way to print lines from a massive file using awk, sed, or something else?

If I had a plain-text file containing 8 million lines and wanted to print lines 4,000,000 to 4,000,010 to the screen, which would be more efficient: awk or sed?

There is no pattern to the text, and unfortunately, a database isn’t an option. I know this isn’t ideal; I’m just curious about which one would complete the task more quickly.

Or maybe there is even a better alternative to sed or awk?

Answers:

Method 1

Neither, use tail or head instead:

$ time tail -n 4000001 foo | head -n 11
real    0m0.039s
user    0m0.032s
sys     0m0.004s

$ time head -n 4000010 foo | tail -n 11
real    0m0.055s
user    0m0.064s
sys     0m0.036s

tail is in fact consistently faster. I ran both commands 100 times and calculated their average:

tail:

real    0.03962
user    0.02956
sys     0.01456

head:

real    0.06284
user    0.07356
sys     0.07244

I imagine tail is faster because, though it has to seek all the way to line 4e6, it does not actually print anything until it gets there, while head will print everything up to line 4e6 + 10.
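
The loop used for the 100-run averages is not shown. A minimal sketch of how such an average could be taken is below. The helper name avg_real is made up for illustration, it assumes a date that supports %N (GNU coreutils does), and it only averages wall-clock time, not the user and sys figures listed above:

```shell
# avg_real RUNS COMMAND: hypothetical helper, not from the original answer.
# Prints the mean wall-clock seconds per run of COMMAND.
# Assumes a date(1) that supports %N (nanoseconds), e.g. GNU coreutils.
avg_real() {
    runs=$1
    cmd=$2
    i=0
    start=$(date +%s%N)
    while [ "$i" -lt "$runs" ]; do
        eval "$cmd" > /dev/null   # discard output so the terminal is not measured
        i=$((i + 1))
    done
    end=$(date +%s%N)
    # The shell only does integer math, so let awk do the division.
    awk -v ns="$((end - start))" -v n="$runs" \
        'BEGIN { printf "%.5f\n", ns / n / 1e9 }'
}
```

It could be called as avg_real 100 'tail -n 4000001 foo | head -n 11'. The loop overhead is included in the result, so the numbers are only useful for comparing commands against each other.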

Compare that with some other methods, roughly in order of time:

sed:

$ time sed -n '4000000,4000010p;4000010q' foo
real    0m0.312s
user    0m0.236s
sys     0m0.072s

Perl:

$ time perl -ne 'next if $.<4000000; print; exit if $.>=4000010' foo 
real    0m1.000s
user    0m0.936s
sys     0m0.064s

awk:

$ time awk '(NR>=4000000 && NR<=4000010){print} (NR==4000010){exit}' foo 
real    0m0.955s
user    0m0.868s
sys     0m0.080s

Basically, the rule is: the less you parse, the faster you are. Treating the input as a stream of data which only needs to be printed to the screen (as tail does) will generally be the fastest way.
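
To reuse the head and tail pipelines for other ranges, the line arithmetic can be wrapped in two small functions. This is only a sketch: the function names are made up, and the tail variant assumes you already know the total number of lines (8,000,000 here), since counting them first would cost a full read of the file:

```shell
# Hypothetical helpers, not from the original answer.
# Both assume 1 <= FIRST <= LAST <= number of lines in FILE.

# print_range_head FILE FIRST LAST
# Reads the file from the top, like the head pipeline in the answer.
print_range_head() {
    head -n "$3" "$1" | tail -n "$(($3 - $2 + 1))"
}

# print_range_tail FILE FIRST LAST TOTAL
# Works backwards from the end, like the tail pipeline in the answer.
# TOTAL is the file's line count, which must be known in advance.
print_range_tail() {
    tail -n "$(($4 - $2 + 1))" "$1" | head -n "$(($3 - $2 + 1))"
}
```

With these, print_range_tail foo 4000000 4000010 8000000 expands to tail -n 4000001 foo | head -n 11, and print_range_head foo 4000000 4000010 expands to head -n 4000010 foo | tail -n 11.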

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
