Delete duplicate lines pairwise?

I encountered this use case today. It seems simple at first glance, but fiddling around with sort, uniq, sed and awk revealed that it’s nontrivial.

How can I delete all pairs of duplicate lines? In other words, if there is an even number of duplicates of a given line, delete all of them; if there is an odd number of duplicate lines, delete all but one. (Sorted input can be assumed.)

A clean elegant solution is preferable.

Example input:

a
a
a
b
b
c
c
c
c
d
d
d
d
d
e

Example output:

a
d
e

Answers:

Method 1

I worked out the sed answer not long after I posted this question; no one else has used sed so far so here it is:

sed '$!N;/^\(.*\)\n\1$/d;P;D'

A little playing around with the more general problem (what about deleting lines in sets of three? Or four, or five?) provided the following extensible solution:

sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp

Extended to remove triples of lines:

sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp

Or to remove quads of lines:

sed -e ':top' -e '$!{/\n.*\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1\n\1$/d;P;D' temp

sed has an additional advantage over most other options, which is its ability to truly operate in a stream, with no more memory storage needed than the actual number of lines to be checked for duplicates.

As cuonglm pointed out in the comments, setting the locale to C is necessary to avoid failures to properly remove lines containing multi-byte characters. So the commands above become:

LC_ALL=C sed '$!N;/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp
# Etc.
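
For comparison, here is a minimal Python sketch of the same generalization (delete lines in sets of n). It is not a translation of the sed script: it assumes sorted input, so that each run of identical adjacent lines can be counted at once, and the function name drop_groups is made up for this illustration. A run of k identical lines should leave k mod n copies behind.

```python
from itertools import groupby

def drop_groups(lines, n=2):
    """Yield what is left after deleting every full set of n identical adjacent lines."""
    for key, run in groupby(lines):
        # A run of length k keeps k mod n copies once all full sets of n are gone.
        leftover = sum(1 for _ in run) % n
        for _ in range(leftover):
            yield key

sample = list("aaabbccccddddde")  # the question's example, one letter per line
print(list(drop_groups(sample, 2)))  # ['a', 'd', 'e']
print(list(drop_groups(sample, 3)))  # ['b', 'b', 'c', 'd', 'd', 'e']
```

Unlike sed, groupby holds a whole run of duplicates at a time only as an iterator, so memory use stays small here too.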

Method 2

It’s not very elegant, but it’s as simple as I can come up with:

uniq -c input | awk '{if ($1 % 2 == 1) { print substr($0, 9) }}'

The substr() just trims off the count that uniq -c prepends. That’ll work until you have more than 9,999,999 duplicates of a line (in which case the count may outgrow its padded field, so the line text no longer starts at character 9).
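
If the fixed offset is a worry, the count can be split off by position of the first space instead. The following is only a sketch in Python, assuming each record of uniq -c output is optional space padding, the count, one space, then the original line; the function name odd_lines and the sample string are invented for the example.

```python
def odd_lines(uniq_c_output):
    """Yield each line whose `uniq -c` count is odd."""
    for record in uniq_c_output.splitlines():
        # Drop the padding before the count, then split once on the single
        # space between the count and the line's own text.
        count, _, text = record.lstrip(" ").partition(" ")
        if int(count) % 2 == 1:
            yield text

sample = "      3 a\n      2 b\n      4 c\n      5 d\n      1 e\n"
print(list(odd_lines(sample)))  # ['a', 'd', 'e']
```

Because only the first space after the count is consumed, leading spaces that belong to the line itself are kept.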

Method 3

Try the awk script below:

#!/usr/bin/awk -f
{
  if ((NR!=1) && (previous!=$0) && (count%2==1)) {
    print previous;
    count=0;
  }
  previous=$0;
  count++;
}
END {
  if (count%2==1) {
    print previous;
  }
}

It is assumed that the lines.txt file is sorted.

The test:

$ chmod +x script.awk
$ ./script.awk lines.txt
a
d
e

Method 4

With pcregrep for a given sample:

pcregrep -Mv '(.)\n\1$' file

or in a more general way:

pcregrep -Mv '(^.*)\n\1$' file

Method 5

If input is sorted:

perl -0pe  'while(s/^(.*)\n\1\n//m){}'

Method 6

I like Python for this, for example with Python 2.7+:

import sys
from itertools import groupby

with open('input') as f:
    for k, g in groupby(f):
        # k still ends with its newline, so write it out unchanged
        if len(list(g)) % 2:
            sys.stdout.write(k)

Method 7

As I understood the question, I opted for awk, using a hash keyed on each record. I’m assuming RS=\n here, but that can be changed for other record layouts, and the script could take a parameter (or a small dialog) to select even repetition counts instead of odd ones. Each line is used as a hash key and its count is incremented; at the end of the file the array is scanned and every record with an odd count is printed. I’m including the count as a check; dropping a[x] from the print statement gives just the lines.

HTH

countlines code

#!/usr/bin/nawk -f
{a[$0]++}
END{for (x in a) if (a[x]%2!=0) print x,a[x] }

Sample Data:

a
One Sunny Day
a
a
b
my best friend
my best friend
b
c
c
c
One Sunny Day
c
d
my best friend
my best friend
d
d
d
One Sunny Day
d
e
x
k
j
my best friend
my best friend

Sample Run:

countlines feed.txt
j 1
k 1
x 1
a 3
One Sunny Day 3
d 5
e 1
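
The same counting idea can be sketched in Python with collections.Counter. This is only an illustration under the same assumption as the awk script (the input need not be sorted, and all distinct lines fit in memory); the function name odd_count_lines and the shortened sample are made up here.

```python
from collections import Counter

def odd_count_lines(lines):
    """Return (line, count) pairs for every line seen an odd number of times."""
    counts = Counter(lines)  # one counter per distinct line
    return [(line, n) for line, n in counts.items() if n % 2 == 1]

sample = ["a", "One Sunny Day", "a", "b", "a", "b",
          "One Sunny Day", "One Sunny Day"]
for line, n in odd_count_lines(sample):
    print(line, n)
# a 3
# One Sunny Day 3
```

On Python 3.7+ a Counter keeps keys in first-seen order, whereas the order of awk's for (x in a) loop is unspecified, which is why the sample run above is not in input order.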

Method 8

If input is sorted what about this awk:

awk '{ x[$0]++; if (prev != $0 && x[prev] % 2 == 1) { print prev; } prev = $0; } END { if (x[prev] % 2 == 1) print prev; }' sorted

Method 9

with perl:

uniq -c file | perl -lne 'if (m(^\s*(\d+) (.*)$)) {print $2 if $1 % 2 == 1}'

Method 10

Using shell constructs,

uniq -c file | while read -r a b; do if (( a & 1 )); then printf '%s\n' "$b"; fi; done

Method 11

Fun puzzle!

In Perl:

#! /usr/bin/env perl

use strict;
use warnings;

my $prev;
while (<>) {
  $prev = $_, next unless defined $prev;  # prime the pump

  if ($prev ne $_) {
    print $prev;
    $prev = $_;                           # first half of a new pair
  }
  else {
    undef $prev;                          # discard and unprime the pump
  }
}

print $prev if defined $prev;             # possible trailing odd line

Verbosely in Haskell:

main :: IO ()
main = interact removePairs
  where removePairs = unlines . go . lines
        go [] = []
        go [a] = [a]
        go (a:b:rest)
          | a == b = go rest
          | otherwise = a : go (b:rest)

Tersely in Haskell:

import Data.List (group)
main = interact $ unlines . map head . filter (odd . length) . group . lines

Method 12

A version that uses “delimiters” to simplify the inner loop: it assumes the first line is not __unlikely_beginning__ and that the text does not end with the line __unlikely_ending__, and it appends that special delimiter line after the input lines. The algorithm can therefore rely on both assumptions:

{ cat INPUTFILE_or_just_-  ; echo "__unlikely_ending__" ; } | awk '
  BEGIN {mem="__unlikely_beginning__"; occured=0; }  

    ($0 == mem)            { occured++ ; next } 

    ( occured%2 )           { print mem ;} 
                            { mem=$0; occured=1; }
'

So:

We remember the pattern we are currently looking at, increasing its count by one every time it recurs (and when it does recur, we skip the next two actions, which handle the case where the pattern changes).
When the pattern CHANGES:
- if the count is not a multiple of 2, we print one occurrence of the memorized pattern
- and in every case where the pattern has changed, the new memorized pattern is the current line, and we have seen it once.
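
The steps above can be sketched in Python as follows. This is a loose illustration rather than a port of the awk script: instead of "unlikely" marker strings it uses two object() sentinels (START and END, names invented here), which cannot compare equal to any real line, and it assumes sorted input.

```python
from itertools import chain

START = object()  # stands in for __unlikely_beginning__
END = object()    # stands in for __unlikely_ending__

def odd_runs(lines):
    """Yield one copy of each line whose run of adjacent duplicates has odd length."""
    mem, occurred = START, 0
    for line in chain(lines, [END]):  # the trailing sentinel flushes the last run
        if line == mem:
            occurred += 1             # same pattern again: just count it
            continue
        if occurred % 2:              # pattern changed: emit the odd one out
            yield mem
        mem, occurred = line, 1       # start counting the new pattern

print(list(odd_runs(list("aaabbccccddddde"))))  # ['a', 'd', 'e']
```

Starting with a count of 0 means the START sentinel is never emitted, and the END sentinel is only ever stored, never yielded.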

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
