Unexpected sort order in en_US.UTF-8 locale

While trying to answer this question about SQL sorting, I noticed a sort order I did not expect:

$ export LC_ALL=en_US.UTF-8
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700A Grouped
T-700A Halved
T-700 Whole
$

Why is "700 A" sorted above "700A", while "700A" is sorted above "700 W"? I would expect a space to sort before A consistently, regardless of the characters that follow it.

It works fine if you use the C locale:

$ export LC_ALL=C
$ echo "T-700A Grouped" > sort.txt
$ echo "T-700 AGrouped" >> sort.txt
$ echo "T-700A Halved" >> sort.txt
$ echo "T-700 Whole" >> sort.txt
$ cat sort.txt | sort
T-700 AGrouped
T-700 Whole
T-700A Grouped
T-700A Halved
$
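
For comparison, the C locale orders strings by raw byte value, so a space (0x20) always comes before A (0x41). The following is a minimal Python sketch of that comparison, not how sort is implemented; the variable names are only illustrative.

```python
# The four lines from sort.txt.
lines = ["T-700A Grouped", "T-700 AGrouped", "T-700A Halved", "T-700 Whole"]

# Comparing the encoded bytes mimics a byte-by-byte comparison such as
# the one the C locale uses.
by_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))

for line in by_bytes:
    print(line)
```

With this comparison both "T-700 " lines come first, because the space is compared as an ordinary character in its own position.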

Answers:

Method 1

Sorting is done in multiple passes. Each character has three (or sometimes more) weights assigned to it, one per pass. Let’s say for this example the weights are:

         wt#1 wt#2 wt#3
space = [0000.0020.0002]
A     = [1BC2.0020.0008]

To create the sort key, the nonzero weights of the characters of a string are concatenated, one weight level at a time. That is, if a weight is zero, no corresponding weight is added (as can be seen at the beginning for " A"). So

       wt#1   -- wt#2 ---   -- wt#3 ---
" A" = 1BC2   0020   0020   0002   0008
       A      sp     A      sp     A

       wt#1   wt#2   wt#3
"A"  = 1BC2   0020   0008
       A      A      A

       wt#1   -- wt#2 ---   -- wt#3 ---
"A " = 1BC2   0020   0020   0008   0002
       A      A      sp     A      sp

If you sort these arrays you get the order you see:

       1BC2   0020   0008               => "A"
       1BC2   0020   0020   0002   0008 => " A"
       1BC2   0020   0020   0008   0002 => "A "
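
The key construction above can be sketched in a few lines of Python. This only follows the simplified model described here, using the two example weights; the names WEIGHTS and sort_key are made up for the illustration and do not correspond to any library API.

```python
# Example weights from above: (level 1, level 2, level 3) per character.
WEIGHTS = {
    " ": (0x0000, 0x0020, 0x0002),
    "A": (0x1BC2, 0x0020, 0x0008),
}

def sort_key(text):
    """Concatenate the nonzero weights, one level at a time (simplified)."""
    key = []
    for level in range(3):
        for ch in text:
            weight = WEIGHTS[ch][level]
            if weight != 0:  # a zero weight contributes nothing to the key
                key.append(weight)
    return key

strings = [" A", "A ", "A"]
ordered = sorted(strings, key=sort_key)

for s in ordered:
    print(repr(s), " ".join(f"{w:04X}" for w in sort_key(s)))
```

Python compares the lists element by element, which is the same as comparing the arrays shown above from left to right.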

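Applied to the question: because the space has no first-level weight, the first pass compares the lines as if the spaces were not there. "T-700A Halved" and "T-700 Whole" then differ at A versus W, so the line with A comes first. "T-700 AGrouped" and "T-700A Grouped" are identical at the first and second levels and are only separated at the third level, where the space (0002) sorts before A (0008).

This is a simplification of what actually happens; see the Unicode Collation Algorithm for more details. The above example weights are actually from the standard table, with some details omitted.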
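
To check the same model against the four lines from the question, here is a rough Python sketch. The weights function is entirely hypothetical: it assumes letters and digits get a case-insensitive first-level weight, that space and hyphen get a zero first-level weight, and that uppercase letters get a larger third-level weight than everything else. Real locale tables are more involved.

```python
def weights(ch):
    """Hypothetical three-level weights, invented for this illustration."""
    if ch.isalnum():
        primary = 0x1000 + ord(ch.lower())  # 'a' and 'A' share level 1
        tertiary = 0x0008 if ch.isupper() else 0x0002
    else:
        primary = 0x0000                    # space, hyphen: no level-1 weight
        tertiary = 0x0002
    return (primary, 0x0020, tertiary)

def sort_key(text):
    """Concatenate the nonzero weights, one level at a time (simplified)."""
    key = []
    for level in range(3):
        for ch in text:
            weight = weights(ch)[level]
            if weight != 0:
                key.append(weight)
    return key

lines = ["T-700A Grouped", "T-700 AGrouped", "T-700A Halved", "T-700 Whole"]
ordered = sorted(lines, key=sort_key)

for line in ordered:
    print(line)
```

Under these assumed weights the result matches the en_US.UTF-8 output in the question: the two "Grouped" lines tie on the first two levels and are separated only by the third, while "Halved" sorts before "Whole" on the first level.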


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
