Scalability of ‘sort -u’ for gigantic files

What is reasonable scalability limit of sort -u?
(in dimensions of “line length”, “amount of lines”, “total file size”)

What is Unix alternative for files exceeding this in dimension of “amount of lines”?

Of course I can easily implement one, but I wondered if there is something that can be done with few standard Linux commands.

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

The sort that you find on Linux comes from the coreutils package and implements an External R-Way merge. It splits up the data into chunks that it can handle in memory, stores them on disc and then merges them. The chunks are done in parallel, if the machine has the processors for that.

So if there was to be a limit, it is the free disc space that sort can use to store the temporary files it has to merge, combined with the result.

Method 2

I cannot speak for vendor specific implementations, but the UNIX sort implementation splits large files into smaller files, sorts these files and then combines the sorted smaller files into an aggregated sorted output.

The only limitation is the disk space for the smaller files created intermediately by sort, but the files can be redirected to an arbitrary directory by setting the environment variable TMPDIR.

Method 3

Based on https://blog.mafr.de/2010/05/23/sorting-large-files/ and https://unix.stackexchange.com/a/88704/9689 :

split -n l/20 input input-
for inpf in input-* ; do
    sort --parallel="$(nproc --all)" "${inpf}" > sorted-"{$inpf}"
done
sort -m sorted-input-* > sorted-input

Update:

From answers above we see that sort already does what mentioned snippet – i.e. External R-Way merge. So after all running just:

sort --parallel="$(nproc --all)" -u input > output

Should be sufficient.

My current assumptions (without checking code) about limits are:

  • maximum line length is limited by amount of physical memory. Sort need to fit at least two into memory
  • amount of lines – I am not aware of
  • file size – of course by filesystem
  • amount of opened files in parallel – depending on Operating System (Thanks Diomidis Spinellis for pointing this out!)

(This answer is marked as community wiki – feel encouraged to improve it! 🙂 )


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments