join : “File 2 not in sorted order”
I’ve got two files _jeter3.txt and _jeter1.txt
I’ve checked they are both sorted on the 20th column using sort -c
sort -t ' ' -c -k20,20 _jeter3.txt
sort -t ' ' -c -k20,20 _jeter1.txt
#no errors
but there is an error when I try to join both files; it says that the second file is not sorted:
join -t ' ' -1 20 -2 20 _jeter1.txt _jeter3.txt > /dev/null
join: File 2 is not in sorted order
I don’t understand why.
cat /etc/*-release #FYI
openSUSE 11.0 (i586)
VERSION = 11.0
UPDATE: using sort -f and join -i (both case-insensitive) fixes the problem. But it doesn’t explain my initial problem.
UPDATE: versions of sort & join:
> join --version
join (GNU coreutils) 6.11
Copyright (C) 2008 Free Software Foundation, Inc.
(...)
> sort --version
sort (GNU coreutils) 6.11
Copyright (C) 2008 Free Software Foundation, Inc.
(...)
Answers:
Method 1
I got the same error with Ubuntu 11.04, with sort and join both at version (GNU coreutils) 8.5.
They are clearly incompatible. In fact the sort command seems bugged: there is no difference with or without the -f (--ignore-case) option. When sorting, aaB is always before aBa. Non-alphanumeric characters also seem to be always ignored (abc is before ab-x).
Join seems to expect the opposite… But I have a solution.
In fact, this is linked to the collation sequence: using
LANG=en_EN sort -k 1,1 <myfile> ...
then
LANG=en_EN join ...
eliminates the message.
Internationalisation is the root of evil… (nobody documents it clearly).
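The collation effect can be sketched with the sample strings from this answer. This is only an illustration: the variable name sorted_c is made up, and LC_ALL=C is used here because it forces plain byte-order comparison, in which uppercase letters and "-" sort before lowercase letters. A natural-language locale typically produces the other order (aaB before aBa, abc before ab-x), which is what the answer observed.

```shell
# Sort the example keys by raw byte value (C locale).
sorted_c=$(printf '%s\n' aaB aBa abc ab-x | LC_ALL=C sort)
printf '%s\n' "$sorted_c"
# In the C locale this prints: aBa, aaB, ab-x, abc (one per line)
```

Whichever locale you pick, the point is that sort and join must run under the same one.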
Method 2
sort by default uses the entire line as the key, whereas join uses only the specified field as the key.
You must correct this incompatibility by restricting sort to use only the key you want to join on.
The join man page states:
Important: FILE1 and FILE2 must be sorted on the join fields. E.g., use ‘sort -k 1b,1’ if ‘join’ has no options. Note, comparisons honor the rules specified by ‘LC_COLLATE’. If the input is not sorted and some lines cannot be joined, a warning message will be given.
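As a minimal sketch of sorting each file on exactly the field it will be joined on: the file names and sample rows below are hypothetical, the left file is keyed on field 2 and the right file on field 1, and LC_ALL=C is set only so that sort and join agree on collation.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'x1 b' 'y1 a' > "$workdir/left.txt"    # join key is field 2
printf '%s\n' 'b 20' 'a 10' > "$workdir/right.txt"   # join key is field 1

# Restrict each sort key to the join field; "b" ignores leading blanks.
LC_ALL=C sort -k 2b,2 "$workdir/left.txt"  > "$workdir/left.sorted"
LC_ALL=C sort -k 1b,1 "$workdir/right.txt" > "$workdir/right.sorted"

# join prints the key first, then the remaining fields of each file.
joined=$(LC_ALL=C join -1 2 -2 1 "$workdir/left.sorted" "$workdir/right.sorted")
printf '%s\n' "$joined"
```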
Method 3
If you are sure you properly sorted your input files and their lines can be paired, you can avoid the above error by running join --nocheck-order file1.txt file2.txt
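A word of caution, sketched with hypothetical two-line files: --nocheck-order (a GNU join option) only silences the check, it does not make join cope with unsorted input. In the example below the second file is deliberately out of order, and the row for key "a" is dropped without any message.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' > "$workdir/file1.txt"   # sorted on field 1
printf '%s\n' 'b x' 'a y' > "$workdir/file2.txt"   # deliberately NOT sorted

# No warning is printed, but only the "b" rows are paired.
joined=$(LC_ALL=C join --nocheck-order "$workdir/file1.txt" "$workdir/file2.txt")
printf '%s\n' "$joined"
```

So this option is only safe when the inputs really are sorted the way join expects.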
Method 4
Were you sorting with numbers? I found that zero-padding the column that I was joining on solved this issue for me.
awk -F" " '{ $20 = sprintf("%06d", $20); print $0 }' file.txt | sort -t ' ' -k20,20 > readytojoin.txt
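Why padding helps can be shown on a tiny made-up input (two columns, with the number in column 2 rather than column 20). join compares keys as text, so 10 and 100 sort before 9 unless every key has the same width. The variable names here are illustrative only.

```shell
# Text order of unpadded numbers is not numeric order.
unpadded=$(printf '%s\n' 9 100 10 | LC_ALL=C sort)
printf '%s\n' "$unpadded"

# Zero-pad column 2 to six digits, then sort on that column as text.
padded=$(printf '%s\n' 'x 9' 'y 100' 'z 10' \
  | awk '{ $2 = sprintf("%06d", $2); print }' \
  | LC_ALL=C sort -k2,2)
printf '%s\n' "$padded"
```

After padding, text order and numeric order agree, so the output is acceptable to join.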
Method 5
Note that if you see this error after you have already sorted on a specific column (e.g. sort -k4,4) and are beating your head against the wall, you may also need to set the separator for the sort command.
Apparently OP already did this with -t ' ', but for normal tab-separated text I’d recommend
sort -t $'\t' ...
By default, sort splits fields on blanks (spaces as well as tabs), even in something that looks like a tab-separated file. This matters especially if there are spaces inside the column you are sorting on.
Then if you pass that sorted data to join, and you have
join -t $'\t' ...
this ends up causing the error message about the input being unsorted. As noted above, join may not accept -t ' ' though.
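A small sketch of using the same tab separator for both commands, with made-up file names and rows whose key column (column 2) contains spaces. The tab variable holds the same character that bash writes as $'\t'; it is built with printf only so the snippet also runs in plain sh.

```shell
tab=$(printf '\t')
workdir=$(mktemp -d)
printf '%s\t%s\n' 1 'san jose' 2 'san diego'      > "$workdir/ids.tsv"
printf '%s\t%s\n' 900 'san jose' 1400 'san diego' > "$workdir/counts.tsv"

# Sort both files on column 2, splitting on tabs only.
LC_ALL=C sort -t "$tab" -k2,2 "$workdir/ids.tsv"    > "$workdir/ids.sorted"
LC_ALL=C sort -t "$tab" -k2,2 "$workdir/counts.tsv" > "$workdir/counts.sorted"

# Join on column 2 with the same separator.
joined=$(LC_ALL=C join -t "$tab" -1 2 -2 2 "$workdir/ids.sorted" "$workdir/counts.sorted")
printf '%s\n' "$joined"
```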
Method 6
LC_ALL=C sort ...
LC_ALL=C join ...
This will solve your problem. The issue, as pointed out by @Michael, is the collation sequence, which depends on your locale settings (LC_ALL, LC_COLLATE, LANG).
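A complete round trip might look like the following sketch. The file names and rows are hypothetical, with mixed-case keys chosen because they are the kind that sort differently across locales. LC_ALL is the environment variable that overrides every locale category, including LC_COLLATE, for the command it prefixes.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'aaB 1' 'aBa 2' > "$workdir/f1.txt"
printf '%s\n' 'aaB x' 'aBa y' > "$workdir/f2.txt"

# Sort and join under the same locale so both agree on key order.
LC_ALL=C sort -k1,1 "$workdir/f1.txt" > "$workdir/f1.sorted"
LC_ALL=C sort -k1,1 "$workdir/f2.txt" > "$workdir/f2.sorted"
joined=$(LC_ALL=C join "$workdir/f1.sorted" "$workdir/f2.sorted")
printf '%s\n' "$joined"
```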
Method 7
For join, the argument after -t is a single character. For sort you can supply a longer sort separator. I think that you may be joining the files on a different field than you want to, and ignoring the case solves the problem by coincidence.
And I agree with Gilles that sample data would be helpful.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.