I have a CSV file users.csv with a list of userNames, userIDs, and other data:
username, userid, sidebar_side, sidebar_colour
"John Lennon", 90123412, "left", "blue"
"Paul McCartny", 30923833, "left", "black"
"Ringo Starr", 77392318, "right", "blue"
"George Harrison", 72349482, "left", "green"
In another file toremove.txt I have a list of userIDs:
30923833
77392318
Is there a clever, efficient way to remove all the rows from the users.csv file which contain the IDs in toremove.txt? I have written a simple Python app to parse the two files and write to a new file only those lines that are not found in toremove.txt, but it is extraordinarily slow. Perhaps some sed or awk magic can help here?
This is the desired result, considering the examples above:
username, userid, sidebar_side, sidebar_colour
"John Lennon", 90123412, "left", "blue"
"George Harrison", 72349482, "left", "green"
Answers:
Method 1
With grep, you can do:
$ grep -vwF -f toremove.txt users.csv
username, userid, sidebar_side, sidebar_colour
"John Lennon", 90123412, "left", "blue"
"George Harrison", 72349482, "left", "green"
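The -w flag in the grep command above restricts matches to whole words, and -F treats each pattern as a fixed string rather than a regular expression. The following Python sketch illustrates why the whole-word restriction matters; the helper name contains_id is hypothetical, and the lookaround pattern is only an approximation of what grep -w does. Note that grep still searches the entire line, so an ID that happens to appear in another column would also remove that row.

```python
import re

def contains_id(line, user_id):
    """Return True if user_id occurs in line as a whole word (roughly like grep -wF)."""
    # re.escape makes the ID a literal string; the lookarounds require that
    # no letter, digit, or underscore touches the match on either side.
    pattern = r"(?<!\w)" + re.escape(user_id) + r"(?!\w)"
    return re.search(pattern, line) is not None

line = '"John Lennon", 90123412, "left", "blue"'

print("9012341" in line)               # True: a plain substring test matches a partial ID
print(contains_id(line, "9012341"))    # False: the whole-word test rejects the partial ID
print(contains_id(line, "90123412"))   # True: the full ID matches
```

Without the whole-word restriction, an entry such as 9012341 in toremove.txt would also remove the row for user 90123412.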
With awk:
$ awk -F'[ ,]' 'FNR==NR{a[$1];next} !($4 in a)' toremove.txt users.csv
username, userid, sidebar_side, sidebar_colour
"John Lennon", 90123412, "left", "blue"
"George Harrison", 72349482, "left", "green"
Method 2
Here’s Gnouc’s awk answer, modified to be space-blind:
awk -F, 'FNR==NR{a[$1];next} !(gensub("^ *","",1,$2) in a)' toremove.txt users.csv
Since it uses only commas (and not spaces) as delimiters,
$1 is "John Lennon", $2 is 90123412 (with a leading space), etc.
So we use gensub (a GNU awk extension) to remove any number of leading spaces from $2
before we check whether it (the userid) is in the toremove.txt file.
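For readers who would rather stay in Python, as in the question, the same idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the asker's original program: the function name filter_users is hypothetical, and like the awk command it assumes that no field contains a comma. The IDs go into a set, so each lookup takes roughly constant time instead of scanning a list for every row.

```python
def filter_users(lines, ids_to_remove):
    """Yield the lines whose second comma-separated field is not a listed ID."""
    # A set gives average O(1) membership tests, like the awk array `a`.
    remove = {s.strip() for s in ids_to_remove if s.strip()}
    for line in lines:
        fields = line.split(",")
        # Strip surrounding whitespace, mirroring the gensub call above.
        userid = fields[1].strip() if len(fields) > 1 else None
        if userid not in remove:
            yield line

# Small in-memory demonstration using the data from the question.
users = [
    'username, userid, sidebar_side, sidebar_colour\n',
    '"John Lennon", 90123412, "left", "blue"\n',
    '"Paul McCartny", 30923833, "left", "black"\n',
    '"Ringo Starr", 77392318, "right", "blue"\n',
    '"George Harrison", 72349482, "left", "green"\n',
]
for kept in filter_users(users, ["30923833", "77392318"]):
    print(kept, end="")
```

To run it on the real files, pass the open users.csv file object as lines and the result of open("toremove.txt").read().split() as ids_to_remove, then write the yielded lines to a new file.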
Method 3
OK, a Ruby way: if you have a list of strings in one file and you want to remove every line of another file that contains any of those strings (in this case removing the contents of "file2" from "file1"), you can use this Ruby script:
b = File.read("file2").split # subtract this one out
# Regexp.union escapes each string and joins them all with "|"
remove_regex = Regexp.union(b)
File.open("file1", "r").each_line do |line|
  # Print only the lines that match none of the strings
  puts line if line !~ remove_regex
end
Unfortunately, with a large "to remove" file this seems to degrade to O(N^2) complexity (my assumption is that the regexp has a lot of work to do), but it might still be useful to someone out there (if you want to do more than remove full lines). It might be faster in certain cases.
Another option, if you're going for speed, is to use the same hash-checking mechanism but to carefully "parse" each line for strings that might match, then compare them with your hash.
In Ruby, it might look like this:
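b = File.read("file2").split # subtract this one out
hash = {}
b.each { |id| hash[id] = 1 }
ARGF.each_line do |line|
  # Extract every run of digits from the line and look each one up in the hash.
  # (\d+ rather than a fixed width, since the sample IDs are 8 digits long.)
  ok = line.scan(/\d+/).none? { |number| hash.key?(number) }
  puts line if ok
end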
See also Scott's answer; it's similar to the awk answers proposed here, and avoids O(N^2) complexity (phew).
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.