Here are some commands run on a random file from pastebin:
$ wget -qO - http://pastebin.com/0cSPs9LR | wc -l
350
$ wget -qO - http://pastebin.com/0cSPs9LR | sort -u | wc -l
287
$ wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq | wc -l
287
$ wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq -u | wc -l
258
The man pages are not clear on what the -u flag is doing. Any advice?
Answers:
Method 1
uniq with -u skips any lines that have duplicates. Thus:
$ printf "%s\n" 1 1 2 3 | uniq
1
2
3
$ printf "%s\n" 1 1 2 3 | uniq -u
2
3
Normally, uniq prints each distinct line once (assuming sorted input). With -u, it prints only the lines that are truly unique, that is, lines that are not repeated at all.
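As a rough sketch of how this relates to the counts in the question: after sorting, every distinct line is either unique (reported by uniq -u) or the head of a run of duplicates (reported by uniq -d), so those two counts should add up to the sort -u count. The sample input below is made up for illustration and is not the pastebin file.

```shell
# Hypothetical sample input: "a" and "c" are repeated, "b" and "d" occur once.
input=$(printf '%s\n' a b a c c c d)

distinct=$(printf '%s\n' "$input" | sort -u | wc -l)         # a b c d
once=$(printf '%s\n' "$input" | sort | uniq -u | wc -l)      # b d
repeated=$(printf '%s\n' "$input" | sort | uniq -d | wc -l)  # a c

# Every distinct line is counted by exactly one of -u and -d.
echo "distinct=$((distinct)) once=$((once)) repeated=$((repeated))"
```

Applied to the numbers in the question, this would suggest that 287 - 258 = 29 distinct lines appear more than once in the file.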
Method 2
Short version:
uniq, without -u, makes every line of the output unique. uniq -u prints only the lines that are unique in the input.
Slightly longer version:
uniq is for dealing with files that have lines duplicated, and only when those lines appear successively in the input. So, for its purposes, a unique line is one that is not duplicated immediately.
(uniq has a very limited short-term memory; it will never remember whether a line appeared earlier in the input, unless it was the immediately previous line — this is why uniq is very often paired with sort.)
When it encounters a run of duplicate lines, uniq, without the -u arg, prints one copy of that line. (It makes every line of the output unique).
With the -u argument, it prints zero copies of that line — runs of duplicates just get omitted from the output.
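A minimal sketch of the short-term-memory point, using a made-up three-line input: the two "a" lines are not adjacent, so uniq -u sees no run of duplicates until the input is sorted.

```shell
# Hypothetical input: "a" appears twice, but never on adjacent lines.
unsorted=$(printf '%s\n' a b a | uniq -u)        # no run is longer than one line
sorted=$(printf '%s\n' a b a | sort | uniq -u)   # sorting makes the two "a" lines adjacent

printf 'unsorted:\n%s\n' "$unsorted"   # a, b, a
printf 'sorted:\n%s\n' "$sorted"       # b
```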
Method 3
The POSIX specification for uniq describes it clearly:
-u
Suppress the writing of lines that are repeated in the input.
In other words, the -u option makes uniq omit every line that is repeated.
Most uniq implementations compare bytes, while GNU uniq uses the locale's collation order to detect duplicate lines. It can therefore produce wrong results in some locales, for example in the en_US.UTF-8 locale:
$ printf '%b\n' '\U2460' '\U2461' | uniq
①
and -u gives you no lines:
$ printf '%b\n' '\U2460' '\U2461' | uniq -u
<blank>
So you should set the locale to C to get byte comparison:
$ printf '%b\n' '\U2460' '\U2461' | LC_ALL=C uniq
①
②
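One way to apply this in a pipeline, as a sketch rather than a definitive recipe: set LC_ALL=C for both sort and uniq so that the two tools agree on ordering and equality. The input is made up, and it is written with octal escapes so the example does not depend on printf supporting \U.

```shell
# The octal escapes are the UTF-8 bytes of U+2460 and U+2461 (the circled 1 and 2).
# Hypothetical input: circled 1, circled 2, circled 1.
circled=$(printf '\342\221\240\n\342\221\241\n\342\221\240\n')

# Same locale for sort and uniq, so both compare lines byte by byte.
count=$(printf '%s\n' "$circled" | LC_ALL=C sort | LC_ALL=C uniq -u | wc -l)

echo "lines occurring once: $((count))"   # only the circled 2 occurs once
```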
Method 4
normal:
$ echo "a b a b c c c" | tr ' ' '\n'
a
b
a
b
c
c
c
uniq: no two adjacent lines are identical
$ echo "a b a b c c c" | tr ' ' '\n' | uniq
a
b
a
b
c
sorted
$ echo "a b a b c c c" | tr ' ' '\n' | sort
a
a
b
b
c
c
c
sort -u : no two repeating lines
$ echo "a b a b c c c" | tr ' ' '\n' | sort -u
a
b
c
sort / uniq: all distinct
$ echo "a b a b c c c" | tr ' ' '\n' | sort | uniq
a
b
c
sort | uniq -c: counts the occurrences of each distinct line
$ echo "a b a b c c c" | tr ' ' '\n' | sort | uniq -c
      2 a
      2 b
      3 c
only lines which are not repeated (not sorted first)
$ echo "a b a b c c c" | tr ' ' '\n' | uniq -u
a
b
a
b
only lines which are not repeated (after sorting)
$ echo "a b a b c c c Z" | tr ' ' '\n' | sort | uniq -u
Z
uniq -d : only print duplicate lines, one for each group
$ echo "a b a b c c c" | tr ' ' '\n' | uniq -d
c
uniq -dc: the same, with counts
$ echo "a b a b c c c" | tr ' ' '\n' | uniq -dc
      3 c
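To tie the -c and -u variants together, here is a small sketch showing that keeping only the lines that uniq -c counts once should give the same result as uniq -u. The awk filter is an illustration added here, not part of the original answer, and it assumes each line is a single word without whitespace.

```shell
# Hypothetical input; assumes each line is a single word without whitespace.
words=$(printf '%s\n' a b a b c c c Z)

# Lines that occur exactly once, via uniq -u.
via_u=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -u)

# The same lines, via uniq -c: keep entries whose count field is 1.
via_c=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -c | awk '$1 == 1 { print $2 }')

echo "uniq -u: $via_u"
echo "uniq -c: $via_c"
```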
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.