Select lines from text file which have ids listed in another file

I use grep, awk, and sort a lot in my Unix shell to work with medium-sized (around 10M-100M lines) tab-separated column text files. In this respect the Unix shell is my spreadsheet.

But I have one huge problem, that is selecting records given a list of IDs.

Given a table.csv file with the format id\tfoo\tbar... and an ids.csv file with a list of ids, select only the records from table.csv whose id is present in ids.csv.

kind of https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids but with shell, not perl.

grep -F obviously produces false positives if ids are variable width.
join is a utility I could never figure out. First of all, it requires alphabetic sorting (my files are usually numerically sorted), but even then I can't get it to work without it complaining about incorrect order and skipping some records. So I don't like it.
grep -f against a file of ^id\t patterns is very slow when the number of ids is large.
awk is cumbersome.

Are there any good solutions for this? Any specific tools for tab-separated files? Extra functionality will be most welcome too.

UPD: Corrected sort -> join

Answers:

Method 1

I guess you meant grep -f not grep -F but you actually need a combination of both and -w:

grep -Fwf ids.csv table.csv

The reason you were getting false positives is (I guess; you did not explain) that if one id is contained in another, both will be printed. -w removes this problem, and -F makes sure your patterns are treated as strings, not regular expressions. From man grep:

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)
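
To make the effect of -w concrete, here is a minimal sketch. The sample files and their contents are made up for illustration and are written to a temporary directory; only the grep options come from the answer above.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '1\tfoo\n11\tbar\n21\tbaz\n' > "$tmp/table.csv"
printf '1\n' > "$tmp/ids.csv"

# -F alone matches "1" as a substring, so all three lines are selected.
loose=$(grep -Ff "$tmp/ids.csv" "$tmp/table.csv")

# Adding -w requires the match to be a whole word, so only id 1 remains.
strict=$(grep -Fwf "$tmp/ids.csv" "$tmp/table.csv")

printf 'without -w:\n%s\nwith -w:\n%s\n' "$loose" "$strict"
```

Note that -w still matches the id anywhere on the line, not only in the first column.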

If your false positives are because an ID can be present in a non-ID field, loop through your file instead:

while read pat; do grep -w "^$pat" table.csv; done < ids.csv

or, faster:

xargs -I {} grep -w "^{}" table.csv < ids.csv

Personally, I would do this in perl though:

perl -lane 'BEGIN{open(A,"ids.csv"); while(<A>){chomp; $k{$_}++}} 
            print $_ if defined($k{$F[0]}); ' table.csv

Method 2

The join utility is what you want. It does require the input files to be lexically sorted.

Assuming your shell is bash or ksh:

join -t $'\t' <(sort ids.csv) <(sort table.csv)

Without needing to sort, the usual awk solution is

awk -F '\t' 'NR==FNR {id[$1]; next} $1 in id' ids.csv table.csv
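
For readers who find the one-liner terse, the same idiom can be spelled out with comments. This is only a sketch: the sample data is invented, and the variable name selected is a placeholder.

```shell
# Hypothetical sample data; note that ids also appear in the third column.
tmp=$(mktemp -d)
printf '1\tfoo\t11\n11\tbar\t1\n21\tbaz\t5\n' > "$tmp/table.csv"
printf '11\n5\n' > "$tmp/ids.csv"

selected=$(awk -F '\t' '
    # NR counts lines across all files, FNR restarts for each file,
    # so NR == FNR holds only while the first file (ids.csv) is read.
    NR == FNR { id[$1]; next }   # store each id as an array key
    # For table.csv: print the line if its first field is a stored id.
    $1 in id
' "$tmp/ids.csv" "$tmp/table.csv")

printf '%s\n' "$selected"
```

Because only the first field is tested, ids that show up in other columns do not cause false positives, and neither file has to be sorted.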

Method 3

The answers to this SO question helped me get around the niggles with join. Essentially, when you sort the file in preparation to send it to join, you need to make sure you’re sorting based on the column you’re joining on. So if that’s the first one, you need to tell it what the separator character is in the file and that you want it to sort on the first field (and only the first field). Otherwise if the first field has variable widths (for example), your separators and possibly other fields may start affecting the sort order.

So, use the -t option of sort to specify your separating character, and use the -k option to specify the field (remembering that you need a start and end field, even if it's the same, or it'll sort from that field to the end of the line).

So for a tab-separated file like in this question, the following should work (with thanks to glenn’s answer for structure):

join -t$'\t' <(sort -d ids.csv) <(sort -d -t$'\t' -k1,1 table.csv) > output.csv

(For reference, the -d flag means dictionary sort. You might also want to use the -b flag to ignore leading whitespace, see man sort and man join).
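
As a rough illustration of why sorting on the join field matters, the sketch below uses invented numeric ids and plain temporary files instead of process substitution, so it should also run in a POSIX sh. Forcing LC_ALL=C for both sort and join is an assumption made here so that the two tools agree on ordering; it is not part of the answer above.

```shell
# Hypothetical sample data with numerically ordered, variable-width ids.
tmp=$(mktemp -d)
printf '1\tfoo\n2\tbar\n10\tbaz\n' > "$tmp/table.csv"
printf '10\n2\n' > "$tmp/ids.csv"
tab=$(printf '\t')

# Sort both inputs on the join field only, in the locale join will use.
LC_ALL=C sort "$tmp/ids.csv" > "$tmp/ids.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/table.csv" > "$tmp/table.sorted"

joined=$(LC_ALL=C join -t "$tab" "$tmp/ids.sorted" "$tmp/table.sorted")
printf '%s\n' "$joined"
```

The output comes back in lexical order (10 before 2), so a final numeric sort is needed if the original ordering matters.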

As a more general example, suppose you’re joining two comma-separated files – input1.csv on the third column and input2.csv on the fourth. You could use

join -t, -1 3 -2 4 <(sort -d -t, -k3,3 input1.csv) <(sort -d -t, -k4,4 input2.csv) > output.csv

Here the -1 and -2 options specify which fields to join on in the first and second input files respectively.
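
A small worked sketch of joining on different columns follows. The file contents are hypothetical, and LC_ALL=C is again an assumption used to keep sort and join consistent.

```shell
# Hypothetical data: the key is column 3 of input1.csv and column 4 of input2.csv.
tmp=$(mktemp -d)
printf 'a,b,k2\nc,d,k1\n' > "$tmp/input1.csv"
printf 'w,x,y,k1\np,q,r,k3\n' > "$tmp/input2.csv"

# Each file is sorted on its own join column.
LC_ALL=C sort -t, -k3,3 "$tmp/input1.csv" > "$tmp/input1.sorted"
LC_ALL=C sort -t, -k4,4 "$tmp/input2.csv" > "$tmp/input2.sorted"

joined=$(LC_ALL=C join -t, -1 3 -2 4 "$tmp/input1.sorted" "$tmp/input2.sorted")
printf '%s\n' "$joined"
```

By default join prints the join field first, followed by the remaining fields of the first file and then those of the second, so the column order of the output differs from both inputs.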

Method 4

You can also use ruby to do something similar:

ruby -ne 'BEGIN { $ids = {}; File.foreach("ids.csv") { |i| $ids[i.chomp] = true } }; print if $ids[$_.chomp.split("\t", 2).first]' table.csv

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
