Filter file by line number

Given a file L with one non-negative integer per line and text file F, what would be a fast way to keep only those lines in F, whose line number appears in file L?

Example:

$ cat L.txt
1
3

$ cat F.txt
Hello World
Hallo Welt
Hola mundo

$ command-in-question -x L.txt F.txt
Hello World
Hola mundo

I’m looking for a command that can handle a file L with 500 million or more entries; file L is sorted numerically.

Note: I’m halfway through an implementation for a command-in-question but I just wondered, whether one might be able to use some Unix tools here as well.


Update: Thank for all the answers, I learned a lot today! I would like to accept more one answer, but that’s not possible.

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

grep -n | sort | sed | cut

(   export LC_ALL=C
    grep -n ''   | sort -t:  -nmk1,1 ./L - |
    sed /:/d;n  | cut  -sd: -f2-
)   <./F

That should work pretty quickly (some timed tests are included below) with input of any size. Some notes on how:

  • export LC_ALL=C
    • Because the point of the following operation is to get the entire file of ./F stacked up inline with its ./L lineno’s file, the only characters we’ll really need to worry about are ASCII [0-9]digits and the :colon.
    • For that reason it is more simple to worry about finding those 11 characters in a set of 128 possibles than it is if UTF-8 is otherwise involved.
  • grep -n ''
    • This inserts the string LINENO: into the head of every line in stdin – or <./F.
  • sort -t: -nmk1,1 ./L -
    • sort neglects to sort its input files at all, and instead (correctly) presumes they are presorted and -merges them in -numerically sorted order, ignoring basically anything beyond any possible -k1,1st occurring -t:colon character anyway.
    • While this may require some temp space to do (depending on how far apart some sequences may occur), it will not require much as compared to a proper sort, and it will be very fast because it involves zero backtracking.
    • sort will output a single stream where any lineno’s in ./L will immediately precede the corresponding lines in ./F. ./L‘s lines always come first because they are shorter.
  • sed /:/d;n
    • If the current line matches a /:/colon delete it from output. Else, auto-print the current and next line.
    • And so sed prunes sort‘s output to only sequential line pairs which do not match a colon and the following line – or, to only a line from ./L and then the next.
  • cut -sd: -f2-
    • cut -suppresses from output those of its input lines which do not contain at least one of its -d:elimiter strings – and so ./L‘s lines are pruned completely.
    • For those lines which do, their first : colon-delimited -field is cut away – and so goes all of grep‘s inserted lineno’s.

small input test

seq 5 | sed -ne'2,3!w /tmp/L
        s/.*/a-z && 0-9/p' >/tmp/F

…generates 5 lines of sample input. Then…

(   export LC_ALL=C; </tmp/F 
    grep -n ''   | sort -t:  -nmk1,1 ./L - |
    sed /:/d;n  | cut  -sd: -f2-
)|  head - /tmp[FL]

…prints…

==> standard input <==
a-z 1& 0-9
a-z 4& 0-9
a-z 5& 0-9

==> /tmp/F <==
a-z 1& 0-9
a-z 2& 0-9
a-z 3& 0-9
a-z 4& 0-9
a-z 5& 0-9

==> /tmp/L <==
1
4
5

bigger timed tests

I created a couple of pretty large files:

seq 5000000 | tee /tmp/F |
sort -R | head -n1500000 |
sort -n >/tmp/L

…which put 5mil lines in /tmp/F and 1.5mil randomly selected lines of that into /tmp/L. I then did:

time 
(   export LC_ALL=C
    grep -n ''   | sort -t:  -nmk1,1 ./L - |
    sed /:/d;n  | cut  -sd: -f2-
)   <./F |wc - l

It printed:

1500000
grep -n '' 
    0.82s user 0.05s system 73% cpu 1.185 total
sort -t: -nmk1,1 /tmp/L - 
    0.92s user 0.11s system 86% cpu 1.185 total
sed /:/d;n 
    1.02s user 0.14s system 98% cpu 1.185 total
cut -sd: -f2- 
    0.79s user 0.17s system 80% cpu 1.184 total
wc -l 
    0.05s user 0.07s system 10% cpu 1.183 total

(I added the backslashes there)

Among the solutions currently offered here, this is the fastest of all of them but one when pitted against the dataset generated above on my machine. Of the others only one came close to contending for second-place, and that is meuh’s perl here.

This is by no means the original solution offered – it has dropped a third of its execution time thanks to advice/inspiration offered by others. See the post history for slower solutions (but why?).

Also, it is worth noting that some other answers might very well contend better if it were not for the multi-cpu architecture of my system and the concurrent execution of each of the processes in that pipeline. They all work at the same time – each on its own processor core – passing around the data and doing their small part of the whole. It’s pretty cool.

but the fastest solution is…

But it is not the fastest solution. The fastest solution offered here, hands-down, is the C program. I called it cselect. After copying it to my X clipboard, I compiled it like:

xsel -bo | cc -xc - -o cselect

I then did:

time 
    ./cselect /tmp/L /tmp/F |
wc -l

…and the results were…

1500000
./cselect /tmp/L /tmp/F  
    0.50s user 0.05s system 99% cpu 0.551 total
wc -l 
    0.05s user 0.05s system 19% cpu 0.551 total

Method 2

I’d use awk, but not store the whole content of L.txt in memory and do unnecessary hash look ups ;-).

list=L.txt file=F.txt
LIST="$list" awk '
  function nextline() {
    if ((getline n < list) <=0) exit
  }
  BEGIN{
    list = ENVIRON["LIST"]
    nextline()
  }
  NR == n {
    print
    nextline()
  }' < "$file"

Method 3

I’d use awk:

awk 'NR==FNR {a[$1]; next}; FNR in a' L.txt F.txt

Update: I’ve done performance measures; it seems this version scales even better with very large data sets (as is the case with the stated requirements), since the comparison is very fast and overcompensates the effort necessary to build up the hash table.

Method 4

With C omitting meaningful error messages:

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {

    FILE *L;
    FILE *F;

    unsigned int to_print;
    unsigned int current = 0;
    char *line = NULL;
    size_t len = 0;

    if ((L = fopen(argv[1], "r")) == NULL) {
        return 1;
    } else if ((F = fopen(argv[2], "r")) == NULL) {
        fclose(L);
        return 1;
    } else {

        while (fscanf(L, "%u", &to_print) > 0) {
            while (getline(&line, &len, F) != -1 && ++current != to_print);
            if (current == to_print) {
                printf("%s", line);
            }
        }

        free(line);
        fclose(L);
        fclose(F);
        return 0;
    }
}

Method 5

Just for completeness: we can merge the excellent awk script in the answer by Stéphane Chazelas, and the perl script in the answer by kos but without keeping the entire list in memory, in the hope that perl might be faster than awk. (I’ve changed the order of args to match the original question).

#!/usr/bin/env perl
use strict;

die "Usage: $0 l fn" if $#ARGV+1 != 2;
open(L,$ARGV[0]) or die "$ARGV[0]: $!";
open(F,$ARGV[1]) or die "$ARGV[1]: $!";

while(my $number = <L>){
    #chop $number;
    while (<F>) {
        if($. == $number){
            print;
            last;
        }
    }
}

Method 6

I wrote a simple Perl script to do that:

Usage: script.pl inputfile_f inputfile_f

#!/usr/bin/env perl

$number_arguments = $#ARGV + 1;
if ($number_arguments != 2) {
    die "Usage: script.pl inputfile_f inputfile_ln";
}

open($f, '<', $ARGV[0])
    or die "$ARGV[0]: Not foundn";
open($l, '<', $ARGV[1])
    or die "$ARGV[1]: Not foundn";

@line_numbers = <$l>;

while ($line = <$f>) {
    $count_f ++;
    if ($count_f == @line_numbers[$count_l]) {
        print $line;
        $count_l ++;
    }
}
  • Loads F.txt
  • Loads L.txt
  • Stores each line of L.txt into an array
  • Reads F.txt line by line, tracking its current line number and the current array index; increases the F.txt current line number; if the F.txt current line number matches the content of the array at the current array index, it prints the current line and increases the index

Cost and complexity considerations:

Considering the cost to make the assignments, the cost to make the comparisons and the cost to print the lines, given N1 as the number of lines in F.txt and N2 as the number of lines in L.txt, the while loop runs at most N1 times, leading to 2N1 + N2 assignments (obviously assuming N1 > N2), to 2N1 comparisons and to N2 prints; given as equal the cost of each operation, the total cost to run the while loop is 4N1+2N2, which leads to a complexity of the script of O(N).

Test on a 10-million-lines input file:

Using a 10-million-lines F.txt file containing random 50-characters-long lines and a 10-million-lines L.txt file containing numbers from 1 to 10000000 (worst case scenario):

~/tmp$ for ((i=0; i<3; i++)); do time ./script.pl F.txt L.txt > output; done

real    0m15.628s
user    0m13.396s
sys 0m2.180s

real    0m16.001s
user    0m13.376s
sys 0m2.436s

real    0m16.153s
user    0m13.564s
sys 0m2.304s

Method 7

This perl solution is faster than the other awk or perl solutions by 20% or so, but oviously not as fast as the solution in C.

perl -e '
  open L, shift or die $!;
  open F, shift or die $!;
  exit if ! ($n = <L>);
  while (1) {
    $_ = <F>;
    next if $. != $n;
    print;
    exit if ! ($n = <L>);
  }
' -- L F

Method 8

cat <<! >L.txt
1
3
!

cat <<! >F.txt
Hello World
Hallo Welt
Hola mundo
!

cmd(){
 L=$1 F=$2
 cat -n $F |
 join $L - |
 sed 's/[^ ]* //'
}

cmd L.txt F.txt
Hello World
Hola mundo

Since L.txt is sorted you can use join. Just number each line in F.txt,
join the two files, then remove the line number. No large intermediate files
are needed.

Actually, the above will mangle your data lines by replacing all white space by a single space. To keep the line intact you need to choose as a delimiter some character that does not appear in your data, eg “|”. The cmd is then

cmd(){
 L=$1 F=$2
 cat -n $F |
 sed 's/^ *//;s/t/|/' |
 join -t'|' $L - |
 sed 's/[^|]*|//'
}

The first sed removes leading spaces from the “cat -n” output and replaces the tab. The second sed removes the line number and “|”.

Method 9

For completeness, another attempt at the join solution:

sed -r 's/^/00000000000000/;s/[0-9]*([0-9]{15})/1/' /tmp/L | join <( nl -w15 -nrz /tmp/F ) - | cut -d' ' -f2-

This works by formatting the the line-number column that join works on as fixed length with leading zeros, so that the numbers are always 15 digits long. This circumvents the problem of join not liking normal numerical sort order, as the column has effectively now been forced to be dictionary sort. nl is used to add line numbers in this format to F.txt. Unfortunately sed needs to be used to reformat the numbering in L.txt.

This approach seems to work OK on the test data generated using @mikeserv’s method. But it is still very slow – the c solution is 60x faster on my machine. about 2/3 of the time is spent in sed and 1/3 in join. Perhaps there is a better sed expression…

Method 10

Since the accepted answer is in C, I figured it’s OK to throw a python solution up here:

# Read mask
with open('L.txt', 'r') as f:
    mask = [int(line_num) for line_num in f.read().splitlines()]

# Filter input file
filtered_lines = []
with open('F.txt', 'r') as f:
    for i, line in enumerate(f.read().splitlines()):
        if (i+1) in mask:
            filtered_lines.append(line)

# Write newly filtered file
with open('F_filtered.txt', 'w') as f:
    for line in filtered_lines:
        f.write('%sn' % line)

If using an external library like numpy, a solution would look even more elegent:

import numpy as np

with open('L.txt', 'r') as f:
    mask = np.array([int(line_num)-1 for line_num in f.read().splitlines()])

with open('F.txt', 'r') as f:
    lines = np.array(f.read().splitlines())
filtered_lines = lines[mask]

with open('F_filtered.txt', 'w') as f:
    for line in filtered_lines:
        f.write('%sn' % line)


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x