Keeping unique rows based on information from 2 of three columns

Suppose you have a file like this:

    NW_006521251.1  428 84134
    NW_006521251.1  511 84135
    NW_006521038.1  202 84155
    NW_006521038.1  1743 84153
    NW_006521038.1  1743 84154
    NW_006520495.1  198 84159
    NW_006520086.1  473 84178
    NW_006520086.1  511 84180

I want to keep the unique rows based on columns 1 and 2 (i.e. not just column two as this number may repeat under a different label in column one).

The output should look like this (the second NW_006521038.1 1743 row is removed from the list):

    NW_006521251.1  428 84134
    NW_006521251.1  511 84135
    NW_006521038.1  202 84155
    NW_006521038.1  1743 84153
    NW_006520495.1  198 84159
    NW_006520086.1  473 84178
    NW_006520086.1  511 84180

Is there a way to do this with awk?
Using uniq file doesn’t work.

Answers:


Method 1

There is a “famous” awk idiom for exactly this. You want to do:

    awk '!seen[$1,$2]++' file

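That creates an associative array named “seen”, keyed on the first two columns. The post-increment operator returns the old value, so the first time a key is encountered the value is zero. The negation operator then turns that zero into a “true” result, and awk’s default action for a true pattern is to print the line. On every later occurrence of the same key the value is non-zero, the negation is false, and the line is skipped.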
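To make the mechanics visible, the same logic can be spelled out in long form. This is only a sketch: the function name dedup_first_two is made up for illustration, and it assumes whitespace-separated columns as in the question.

```shell
# Hypothetical helper: long form of  awk '!seen[$1,$2]++'
dedup_first_two() {
    awk '{
        key = $1 SUBSEP $2      # the same key that seen[$1,$2] builds
        if (!(key in seen))     # first time this pair of columns appears
            print               # keep the row
        seen[key] = 1           # remember the pair for later rows
    }' "$@"
}

# Usage: pass a file name, or feed rows on standard input
printf '%s\n' \
    'NW_006521038.1  1743 84153' \
    'NW_006521038.1  1743 84154' | dedup_first_two
```

Only the first of the two rows is printed, and rows keep their original order, which is the main reason to prefer this over a sort-based approach.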

Method 2

If you don’t mind that the output is sorted:

    sort -u -k1,2 file
  • -u – unique
  • -k1,2 – use fields 1 and 2 together as the key
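One thing worth seeing before relying on this: the comparison is textual, so 1743 sorts before 202. A small sketch follows, using three rows from the question. The helper name sort_unique_first_two is made up for illustration, and LC_ALL=C is an assumption added here so that the ordering is predictable across systems.

```shell
# Hypothetical helper: unique rows keyed on fields 1-2, in sorted order.
# LC_ALL=C is an assumption made here so the byte-wise ordering is predictable.
sort_unique_first_two() {
    LC_ALL=C sort -u -k1,2 "$@"
}

# Usage: the 1743 row comes out before the 202 row, and only one 1743 row remains
printf '%s\n' \
    'NW_006521038.1  202 84155' \
    'NW_006521038.1  1743 84153' \
    'NW_006521038.1  1743 84154' | sort_unique_first_two
```

Which of the duplicate rows survives is left to the sort implementation, so the awk version is the safer choice if the first occurrence must be the one that is kept.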


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
