How is uniq not unique enough that there is also uniq --unique?

Here are commands on a random file from pastebin:

wget -qO - http://pastebin.com/0cSPs9LR | wc -l
350
wget -qO - http://pastebin.com/0cSPs9LR | sort -u | wc -l
287
wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq | wc -l
287
wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq -u | wc -l
258

The man pages are not clear on what the -u flag is doing. Any advice?

Answers:

Method 1

uniq with -u skips any lines that have duplicates. Thus:

$ printf "%s\n" 1 1 2 3 | uniq
1
2
3
$ printf "%s\n" 1 1 2 3 | uniq -u
2
3

Assuming sorted input, uniq by default prints each distinct line exactly once. With -u, it prints only the lines that are truly unique: those that occur once and have no duplicates.
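
One way to check this is to count the three groups separately. The sketch below uses a made-up sample (the `sample` function and its data are hypothetical, not taken from the pastebin file) and relies on the fact that, on sorted input, every distinct line is either printed by -u (it occurs once) or by -d (it occurs more than once):

```shell
# Hypothetical sample: "b" and "d" occur once; "a" and "c" are repeated.
sample() { printf '%s\n' a a b c c c d; }

# tr -d ' ' strips the padding that some wc implementations add.
distinct=$(sample | sort | uniq    | wc -l | tr -d ' ')  # one copy of every line
singles=$(sample  | sort | uniq -u | wc -l | tr -d ' ')  # lines occurring exactly once
repeated=$(sample | sort | uniq -d | wc -l | tr -d ' ')  # lines occurring more than once

echo "distinct=$distinct singles=$singles repeated=$repeated"
# prints: distinct=4 singles=2 repeated=2
```

If the same reasoning holds for the file in the question, the gap between 287 and 258 would mean that 29 of its distinct lines occur more than once.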

Method 2

Short version:

  • uniq, without -u, makes every line of the output unique.
  • uniq -u prints only the lines that are unique in the input.

Slightly longer version:

uniq is for dealing with files that have lines duplicated, and only when those lines appear successively in the input. So, for its purposes, a unique line is one that is not duplicated immediately.

(uniq has a very limited short-term memory; it will never remember whether a line appeared earlier in the input, unless it was the immediately previous line — this is why uniq is very often paired with sort.)

When it encounters a run of duplicate lines, uniq, without the -u arg, prints one copy of that line. (It makes every line of the output unique).

With the -u argument, it prints zero copies of that line — runs of duplicates just get omitted from the output.
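
The run-based behaviour can be seen on a small unsorted input. This is only an illustrative sketch; the `runs` function and its data are made up for the example:

```shell
# Hypothetical unsorted input; its runs are: x x | y | x | z z
runs() { printf '%s\n' x x y x z z; }

# paste -s joins the output lines with spaces so they fit on one line.
plain=$(runs  | uniq    | paste -sd ' ' -)   # one line per run
only_u=$(runs | uniq -u | paste -sd ' ' -)   # only the runs of length 1

echo "uniq:    $plain"     # prints: uniq:    x y x z
echo "uniq -u: $only_u"    # prints: uniq -u: y x
```

Note that the third "x" survives -u even though "x" appeared earlier, because its duplicates are not adjacent to it. Sorting first would put all three together and -u would then drop them.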

Method 3

The POSIX specification for uniq describes it clearly:

-u
    Suppress the writing of lines that are repeated in the input.

In other words, the -u option makes uniq omit every line that is repeated.

Most uniq implementations compare lines byte by byte, while GNU uniq uses the locale's collation order to detect duplicate lines. It can therefore produce wrong results in some locales, for example in the en_US.UTF-8 locale:

$ printf '%b\n' '\U2460' '\U2461' | uniq
①

and -u gives you no lines:

$ printf '%b\n' '\U2460' '\U2461' | uniq -u
<blank>

So you should set the locale to C to get byte comparison:

$ printf '%b\n' '\U2460' '\U2461' | LC_ALL=C uniq
①
②
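
The same check can be run for -u. The following is a minimal sketch, assuming a POSIX printf that accepts octal escapes in the format string; the `circled` helper is hypothetical and only exists to produce the two characters without relying on \u or \U support:

```shell
# UTF-8 bytes written as octal escapes:
# ① (U+2460) is E2 91 A0 and ② (U+2461) is E2 91 A1.
circled() { printf '\342\221\240\n\342\221\241\n'; }

# With byte comparison the two lines differ, so -u keeps both.
kept=$(circled | LC_ALL=C uniq -u | wc -l | tr -d ' ')
echo "lines kept with LC_ALL=C: $kept"   # prints: lines kept with LC_ALL=C: 2

# For comparison, the same pipeline under whatever locale is currently set.
# The result depends on the locale and the C library, so no output is shown.
circled | uniq -u | wc -l
```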

Method 4

normal:

echo "a b a b c c c" | tr ' ' '\n'
a
b
a
b
c
c
c

uniq: no two adjacent lines are identical

echo "a b a b c c c" | tr ' ' '\n' | uniq
a
b
a
b
c

sorted

echo "a b a b c c c" | tr ' ' '\n' | sort
a
a
b
b
c
c
c

sort -u : no two repeating lines

echo "a b a b c c c" | tr ' ' '\n' | sort -u
a
b
c

sort / uniq: all distinct

echo "a b a b c c c" | tr ' ' '\n' | sort | uniq
a
b
c

uniq -c: count the occurrences of each distinct line

echo "a b a b c c c" | tr ' ' '\n' | sort | uniq -c
2 a
2 b
3 c

only lines which are not repeated (not sorted first)

echo "a b a b c c c" | tr ' ' '\n' | uniq -u
a
b
a
b

only lines which are not repeated (after sorting)

echo "a b a b c c c Z" | tr ' ' '\n' | sort | uniq -u
Z

uniq -d : only print duplicate lines, one for each group

echo "a b a b c c c" | tr ' ' '\n' | uniq -d
c

.. counted

echo "a b a b c c c" | tr ' ' '\n' | uniq -dc
3 c


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.