I am trying to sort some simple pipe-delimited data. However, sort isn’t actually sorting. It moves my header row to the bottom, but my two rows starting with 241 are being split by a row starting with 24.
$ cat sort_fail.csv
column_a|column_b|column_c
241|212|20810378
24|121|2810172
241|213|20810376
$ sort sort_fail.csv
241|212|20810378
24|121|2810172
241|213|20810376
column_a|column_b|column_c
The column headers are being moved to the bottom of the file, so sort is clearly processing it. But, the actual values aren’t being sorted like I’d expect.
In this case I worked around it with
sort sort_fail.csv --field-separator='|' -k1,1
But, I feel like that shouldn’t be necessary. Why is sort not sorting?
Answers:
Method 1
sort is locale-aware, so depending on your LC_COLLATE setting (which falls back to LANG when unset, and is overridden by LC_ALL) you may get different results:
$ LANG=C sort sort_fail.csv
241|212|20810378
241|213|20810376
24|121|2810172
column_a|column_b|column_c
$ LANG=en_US sort sort_fail.csv
241|212|20810378
24|121|2810172
241|213|20810376
column_a|column_b|column_c
This can cause problems in scripts, because you may not be aware of what the calling locale is set to, and so may get different results.
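To see which locale a script has actually inherited, the three variables can be checked in POSIX precedence order (LC_ALL, then LC_COLLATE, then LANG). The sketch below is only an illustration: `effective_collate` is a hypothetical helper name, and falling back to `C` when nothing is set is an assumption about the usual implementation default.

```shell
# Hypothetical helper: print the locale name that governs collation,
# following the precedence LC_ALL > LC_COLLATE > LANG.
# Empty values are treated like unset ones; "C" is the assumed default.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

Where it is installed, the `locale` command prints the same information for every category.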
It’s not uncommon for scripts to force the setting they need, e.g.:
$ grep 'LC.*sort' /bin/precat
LC_COLLATE=C sort -u | prezip-bin -z "$cmd: $2"
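The same idea can be applied to a whole script rather than a single command. This is a minimal sketch, assuming byte-order collation is what the script wants everywhere; it sets LC_ALL rather than LC_COLLATE because LC_ALL overrides the other locale variables, and it reuses the data rows from the question.

```shell
# Pin the locale once so every later command (sort, grep, ...) uses byte order,
# whatever locale the caller had.
LC_ALL=C
export LC_ALL

# Data rows from the question, without the header.
sorted=$(printf '%s\n' '241|212|20810378' '24|121|2810172' '241|213|20810376' | sort)
printf '%s\n' "$sorted"
```

In byte order `1` sorts before `|`, so both 241 rows come out ahead of the 24 row.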
What’s interesting here is that the handling of the | character looks odd.
But that’s because the default rule for en_US, which derives from ISO, says
$ grep 007C /usr/share/i18n/locales/iso14651_t1_common
<U007C> IGNORE;IGNORE;IGNORE;<j> # 142 |
This means the | character is ignored, and the sort order is as if the character didn’t exist.
$ tr -d '|' < sort_fail.csv | LANG=C sort
24121220810378
241212810172
24121320810376
column_acolumn_bcolumn_c
And that matches the “unexpected” sorting you are seeing.
The workarounds are to use -n (to force a numeric sort), to use the field separator (as you did), or to use the C locale.
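The three workarounds can be compared side by side. This is a rough sketch rather than a recommendation: the temporary file and the variable names are made up for the illustration, and the data is the sample from the question.

```shell
# Recreate the sample data from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
    'column_a|column_b|column_c' \
    '241|212|20810378' \
    '24|121|2810172' \
    '241|213|20810376' > "$tmp"

# 1. Numeric sort: lines are compared by their leading number.
#    The header has no leading number, so it counts as 0 and comes first.
by_number=$(sort -n "$tmp")

# 2. Explicit field separator: only the first field is used as the key.
by_field=$(sort -t '|' -k1,1 "$tmp")

# 3. C locale: plain byte-order comparison, in which '|' is not ignored.
by_bytes=$(LC_ALL=C sort "$tmp")

printf '%s\n' '-- sort -n --' "$by_number"
printf '%s\n' '-- sort -t "|" -k1,1 --' "$by_field"
printf '%s\n' '-- LC_ALL=C sort --' "$by_bytes"
rm -f "$tmp"
```

The results are not identical: -n and the field-separator form put 24 before 241, while the C locale puts both 241 rows first because `1` sorts before `|` in byte order. Which one is "right" depends on whether the first column is meant as a number or as text.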
Method 2
What irritates me is that the 24 doesn’t move from its place between the two 241. The second field starts with a 1. Trying the sort with a leading 4 in the second field, the 24 is moved down, so I suspect sort just ignores the | unless told otherwise.
Try sort -n…
Method 3
-n, --numeric-sort
compare according to string numerical value
210 23
Without -n, 210 sorts ahead of 23 as text, because the comparison goes character by character.
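A quick way to see the difference, as a small sketch using the two values above (LC_ALL=C is set here only to keep the text comparison independent of the caller's locale):

```shell
# Text comparison: "210" vs "23" is decided at the second character, '1' < '3'.
text_order=$(printf '%s\n' 23 210 | LC_ALL=C sort)

# Numeric comparison: 23 < 210.
numeric_order=$(printf '%s\n' 23 210 | LC_ALL=C sort -n)

printf 'text:    %s\n' "$(printf '%s' "$text_order" | tr '\n' ' ')"
printf 'numeric: %s\n' "$(printf '%s' "$numeric_order" | tr '\n' ' ')"
```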
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.