Remove all lines in file A which contain the strings in file B

I have a CSV file users.csv with a list of userNames, userIDs, and other data:

username, userid, sidebar_side, sidebar_colour
"John Lennon", 90123412, "left", "blue"
"Paul McCartny", 30923833, "left", "black"
"Ringo Starr", 77392318, "right", "blue"
"George Harrison", 72349482, "left", "green"

In another file toremove.txt I have a list of userIDs:

30923833
77392318

Is there a clever, efficient way to remove all the rows from the users.csv file which contain the IDs in toremove.txt? I have written a simple Python app to parse the two files and write to a new file only those lines that are not found in toremove.txt, but it is extraordinarily slow. Perhaps some sed or awk magic can help here?

This is the desired result, considering the examples above:

username, userid, sidebar_side, sidebar_colour
"John Lennon", 90123412, "left", "blue"
"George Harrison", 72349482, "left", "green"

Answers:

Method 1

With grep, you can do:

$ grep -vwF -f toremove.txt users.csv
username, userid, sidebar_side, sidebar_colour
"John Lennon", 90123412, "left", "blue"
"George Harrison", 72349482, "left", "green"

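With awk (this splits on both spaces and commas, so the userid lands in field 4 only when the name consists of exactly two words):

$ awk -F'[ ,]' 'FNR==NR{a[$1];next} !($4 in a)' toremove.txt users.csv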
username, userid, sidebar_side, sidebar_colour
"John Lennon", 90123412, "left", "blue"
"George Harrison", 72349482, "left", "green"

Method 2

Here’s Gnouc’s awk answer, modified to be space-blind:

awk -F, 'FNR==NR{a[$1];next} !(gensub("^ *","",1,$2) in a)' toremove.txt users.csv

Since it uses only commas (and not spaces) as delimiters, $1 is "John Lennon", $2 is " 90123412" (with a leading space), and so on. So we use gensub (a GNU awk extension) to remove any number of leading spaces from $2 before we check whether it (the userid) is in the toremove.txt file.
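
Since the question mentions a slow Python script, here is a minimal sketch of the same idea in Python, under a few assumptions: the function name remove_users is made up for illustration, the userid is always the second comma-separated field, and no name contains a comma. The set plays the role of the awk array, so each row costs one constant-time lookup.

```python
def remove_users(csv_text, ids_text):
    """Return csv_text without the rows whose second field is a listed ID."""
    # A set gives constant-time membership tests, like the awk array `a`.
    ids_to_remove = set(ids_text.split())
    kept = []
    for line in csv_text.splitlines():
        fields = line.split(",")
        # Strip surrounding spaces from the userid field, as gensub does in awk.
        userid = fields[1].strip() if len(fields) > 1 else ""
        if userid not in ids_to_remove:
            kept.append(line)
    return "\n".join(kept)
```

The header row is kept because its second field, userid, is not in the set. If names may contain commas, parsing with the csv module (with skipinitialspace=True) would probably be safer than a plain split.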

Method 3

OK, a Ruby way: if you have a list of strings in one file and want to remove every line of another file that contains any of those strings (in this case, removing the strings in "file2" from "file1"), you could use this Ruby script:

b = File.read("file2").split # subtract this one out
# Regexp.union escapes each string, so the entries are matched literally
remove_regex = Regexp.union(b)
File.foreach("file1") do |line|
  puts line unless line =~ remove_regex
end

Unfortunately, with a large "to remove" file this seems to degrade to roughly O(N^2) (my assumption is that the regexp has a lot of work to do on every line), but it might still be useful to someone out there (for instance, if you want to do more than remove full lines). It might be faster in certain cases.
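
For comparison, the alternation approach might be sketched in Python as follows. The function names are hypothetical, and the word-boundary anchors are an addition that assumes the strings are plain alphanumeric IDs, so that an ID does not match inside a longer number. The cost concern is the same: the combined pattern grows with the number of strings to remove.

```python
import re

def build_remove_regex(strings):
    """Compile one pattern that matches any of the given strings as a whole word."""
    # re.escape keeps metacharacters in the strings from being read as regex syntax.
    alternation = "|".join(re.escape(s) for s in strings)
    return re.compile(r"\b(?:" + alternation + r")\b")

def filter_lines(lines, strings):
    """Return the lines that contain none of the strings."""
    if not strings:
        # An empty alternation would match at every word boundary, so skip filtering.
        return list(lines)
    pattern = build_remove_regex(strings)
    return [line for line in lines if not pattern.search(line)]
```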

Another option, if you're going for speed, is to use the same hash-checking mechanism, but to carefully "parse" each line for strings that might match and then compare them against your hash.

In Ruby, it might look like this:

b=File.read("file2").split # subtract this one out
hash={}
for line in b
  hash[line] = 1
end

ARGF.each_line do |line|
  ok = true
  # \d+ matches any run of digits (the sample IDs are 8 digits long)
  for number in line.scan(/\d+/)
    if hash.key? number
      ok=false
    end
  end
  if (ok)
    puts line
  end
end
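
A rough Python equivalent of this scan-then-look-up idea, for anyone starting from a Python script. The function name filter_by_tokens is hypothetical, and it assumes that the IDs are the only runs of digits worth checking on each line.

```python
import re

def filter_by_tokens(lines, ids_to_remove):
    """Keep only the lines in which no run of digits is a listed ID."""
    # Build the set once; each later lookup is constant time on average.
    ids = set(ids_to_remove)
    kept = []
    for line in lines:
        # Pull out every run of digits and check each one against the set.
        tokens = re.findall(r"\d+", line)
        if not any(token in ids for token in tokens):
            kept.append(line)
    return kept
```

Each line is scanned once and each candidate token costs a single set lookup, so the total work grows with the size of the input rather than with the number of IDs times the number of lines.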

See also Scott's answer; it's similar to the awk answers proposed here, and avoids O(N^2) complexity (phew).


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
