Remove entire row in a file if first column is repeated

I have a file containing two columns and 10 million rows. The first column contains many repeated values, but there is a distinct value in column 2. I want to remove the repeated rows and want to keep only one using awk. Note: the file is sorted with values in column 1. For example:

1.123 -4.0
2.234 -3.5
2.234 -3.1
2.234 -2.0
4.432 0.0
5.123 +0.2
8.654 +0.5
8.654 +0.8
8.654 +0.9
.
.
.
.

Expected output

1.123 -4.0
2.234 -3.5
4.432 0.0
5.123 +0.2
8.654 +0.5
.
.
.
.

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

A few ways:

  1. awk
    awk '!a[$1]++' file

    This is a very condensed way of writing this:

    awk '{if(! a[$1]){print; a[$1]++}}' file

    So, if the current first field ($1) is not in the a array, print the line and add the 1st field to a. Next time we see that field, it will be in the array and so will not be printed.

  2. Perl
    perl -ane '$k{$F[0]}++ or print' file
    

    or

    perl -ane 'print if !$k{$F[0]}++' file

    This is basically the same as the awk one. The -n causes perl to read the input file line by line and apply the script provided by -e to each line. The -a will automatically split each line on whitespace and save the resulting fields in the @F array. Finally, the first field is added to the %k hash and if it is not already there, the line is printed. The same thing could be written as

    perl -e 'while(<>){
                @F=split(/s+/); 
                print unless defined($k{$F[0]}); 
                $k{$F[0]}++;
             }' file
    
  3. Coreutils
    rev file | uniq -f 1 | rev

    This method works by first reversing the lines in file so that if a line is 12 345 it’ll now be 543 21. We then use uniq -f 1 to ignore the first field, that is to say, the column that 543 is in. There are fields within file. Using uniq here has the effect of filtering out any duplicate lines, keeping only 1 of each. Lastly we put the lines back into their original order with another reverse.

  4. GNU sort (as suggested by @StéphaneChazelas)
    sort -buk1,1

    The -b flag ignores leading whitespace and the -u means print only unique fields. The clever bit is the -k1,1. The -k flag sets the field to sort on. It takes the general format of -k POS1[,POS2] which means only look at fields POS1 through POS2 when sorting. So, -k1,1 means only look at the 1st field.
    Depending on your data, you might want to also add one of these options:

     -g, --general-numeric-sort
          compare according to general numerical value
     -n, --numeric-sort
          compare according to string numerical value

Method 2

If the first column is always 5 characters long, you can simply use uniq:

uniq -w 5 file

If not, use awk:

awk '$1!=a{print $0; a=$1}' file

The first one would definitely be faster with a huge file.


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x