How to sort big files?

I have a PC with Intel(R) Pentium(R) CPU G640 @ 2.80 GHz and 8 GB of RAM. I am running Scientific Linux 6.5 on it with EXT3 filesystem.

On this setup, what is the fastest way I can do a sort -u on a 200 gigabyte file?

Should I split the file into smaller files (smaller than 8 GB), sort -u them, put them together, then split them again in a different size, sort -u again, etc.? Or are there any sorting scripts, programs that could handle files this big with my limited amount of RAM?

Contents hide

Answers:

Method 1

Method 2

Method 3

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

GNU sort (which is the default on most Linux systems), has a --parallel option. From http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html:

‘–parallel=n’

Set the number of sorts run in parallel to n. By default, n is set to the number of available processors, but limited to 8, as there are diminishing performance gains after that. Note also that using n threads increases the memory usage by a factor of log n. Also see nproc invocation.

Since your cpu has 2 cores, you could do:

sort --parallel=2 -uo list-sorted.txt list.txt

It is better to specify the actual number of cores since there may appear to be more due to the processor having hyper-threading.

You could also experiment with nice to influence the processor scheduling priority and ionice to influence I/O scheduling. You can increase the priority over other processes like this, I don’t think this will give you large savings as they are usually better for making sure a background process doesn’t use too much resources. Never-the-less you can combine them with something like:

nice -n -20 ionice -c2 -n7 sort --parallel=2 -uo list-sorted.txt list.txt

Note also that as Gilles commented, using a single GNU sort command will be faster than any other method of breaking down the sorting as the algorithm is already optimised to handle large files. Anything else will likely just slow things down.

Method 2

Using the sort command will probably be the fastest option.

But you’ll probably want to fix the locale to C.

sort -u doesn’t report unique lines, but one of each set of lines that sort the same. In the C locale, 2 different lines necessarily don’t sort the same, but that’s not the case in most UTF-8 based locales on GNU systems.

Also, using the C locale avoids the overhead of having to parse UTF-8 and processing complex sort orders so would improve performance dramatically.

So:

LC_ALL=C sort -u file

You can also improve performance by using a faster drive (or a different drive from the one where the input and/or output files are) for the temporary files (using -T or $TMPDIR environment variable), or by fiddling with the -S option supported by some sort implementations).

For some type of input or for slow storage, using the --compress-program option of GNU sort (for instance with lzop) might improve performance in addition to storage usage.

Now just a note for those objecting (rightly to some extent) that it will not be the correct order:

I agree that as a human, I’d like to see Stéphane sort in between Stefan and Stephanie, but:
A computer would want Stéphane to sort after since é (at least when expressed as U+00E9) as a character or the bytes of its UTF-8 encoding sorts after (in terms of codepoint or byte value). That’s a sort order that is very simple to implement and is a strict total order and has no surprise.
Your locale’s sort order will likely not be satisfactory in many cases either even to a human. For example on my system with the default en_GB.utf8 locale:
Stéphane and Stéphane (one with U+00E9, the other with eU+0301) don’t sort the same:
$ printf '%bn' 'Steu0301phane' 'Stu00e9phane' | sort -u
Stéphane
Stéphane
but ③, ①, ② all sort the same (obviously a bug in those locale definitions¹):
$ printf '%sn' ③ ① ② | sort -u
③
Here, it’s ③, but it could just as well have been ① or ②
So IMO, chances are you always want sort -u with LC_ALL=C, if you want unique lines. And if you want that resulting list to be sorted in the user’s sort order, pipe it to sort again:
LC_ALL=C sort -u | sort

LC_ALL=C sort | LC_ALL=C uniq -c | sort -k2

^{¹ 2019 edit. the order of ① ② ③ ④ ⑤… has since been fixed in newer versions of the GNU libc, but as of 2.30, over 95% of characters still don’t have a defined order, you can replace ① ② ③ ④ ⑤ with 🧙 🧚 🧛 🧜 🧝 for instance. Hopefully, GNU locales will eventually be fixed completely (they will have to if they want to comply to the next revision of the standard) and the problem will then be limited to user-defined locales}

Method 3

Here is a ready to use bash script for sorting TB scale data on a regular machine with couple of GB ram: http://sgolconda.blogspot.com/2015/11/sort-very-large-dataset.html
It checks number of cores your machine as and uses all cores.
Can sort, numeric or string files.
Can be used to find unique records in TB scale data.

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating