I have a text file that contains more than 10 million lines. The lines look like this:
37024469;196672001;255.0000000000
37024469;196665001;396.0000000000
37024469;196664001;396.0000000000
37024469;196399002;85.0000000000
37024469;160507001;264.0000000000
37024469;160506001;264.0000000000
As you can see, the delimiter is “;”. I would like to sort this text file in Python by the second element. I couldn’t use the split function on the whole file, because it causes a MemoryError. How can I manage this?
Answers:
Method 1
Don’t sort 10 million lines in memory. Split this up in batches instead:
- Run 100 sorts of 100k lines each (use the file as an iterator, combined with islice() or similar, to pick a batch). Write each sorted batch out to a separate file.
- Merge the sorted files. Here is a merge generator that you can pass 100 open files, and it’ll yield lines in sorted order. Write to a new file line by line:
import operator

def mergeiter(*iterables, **kwargs):
    """Given a set of sorted iterables, yield the next value in merged order

    Takes an optional `key` callable to compare values by.
    """
    iterables = [iter(it) for it in iterables]
    # map index -> [current value, index, iterator]
    iterables = {i: [next(it), i, it] for i, it in enumerate(iterables)}
    if 'key' not in kwargs:
        key = operator.itemgetter(0)
    else:
        key = lambda item, key=kwargs['key']: key(item[0])

    # loop until every input is exhausted; a bare `raise` here would turn
    # into a RuntimeError inside a generator on Python 3.7+ (PEP 479)
    while iterables:
        value, i, it = min(iterables.values(), key=key)
        yield value
        try:
            iterables[i][0] = next(it)
        except StopIteration:
            # this input is exhausted; drop it from the merge
            del iterables[i]
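Step 1 above is described only in prose, so here is a minimal sketch of the batching step. The names `second_field` and `write_sorted_batches`, the `tmpdir` parameter, and the use of `tempfile.mkstemp()` for the batch files are assumptions made for illustration; they are not part of the original answer.

```python
import os
import tempfile
from itertools import islice


def second_field(line):
    # Hypothetical sort key: the second ';'-separated field, as an integer.
    return int(line.split(";", 2)[1])


def write_sorted_batches(path, batch_size=100_000, key=second_field, tmpdir=None):
    """Sort `path` in batches of `batch_size` lines; return the batch file names."""
    batch_paths = []
    with open(path) as src:
        while True:
            # islice() pulls at most batch_size lines from the file iterator,
            # so only one batch is held in memory at a time.
            batch = list(islice(src, batch_size))
            if not batch:
                break
            # The last line of the file may lack a newline; add one so that
            # lines do not run together after sorting.
            if not batch[-1].endswith("\n"):
                batch[-1] += "\n"
            batch.sort(key=key)
            fd, name = tempfile.mkstemp(suffix=".txt", dir=tmpdir, text=True)
            with os.fdopen(fd, "w") as out:
                out.writelines(batch)
            batch_paths.append(name)
    return batch_paths
```

The returned file names can then be opened and passed to mergeiter() with the same key, writing the merged lines to the output file one by one. The batch files are not deleted automatically, so remove them once the merge is done.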
Method 2
Based on Sorting a million 32-bit integers in 2MB of RAM using Python:
import sys
from functools import partial
from heapq import merge
from tempfile import TemporaryFile

# define sorting criteria
def second_column(line, default=float("inf")):
    try:
        return int(line.split(";", 2)[1])  # use int() for numeric sort
    except (IndexError, ValueError):
        return default  # a key for non-integer or non-existent 2nd column

# sort lines in small batches, write intermediate results to temporary files
sorted_files = []
nbytes = 1 << 20  # load around nbytes bytes at a time
for lines in iter(partial(sys.stdin.readlines, nbytes), []):
    lines.sort(key=second_column)  # sort current batch
    f = TemporaryFile("w+")
    f.writelines(lines)
    f.seek(0)  # rewind
    sorted_files.append(f)

# merge & write the result
sys.stdout.writelines(merge(*sorted_files, key=second_column))

# clean up
for f in sorted_files:
    f.close()  # temporary file is deleted when it closes
heapq.merge() has accepted a key parameter since Python 3.5. On older Python versions, you could use mergeiter() from Martijn Pieters’ answer (Method 1) instead, or apply a Schwartzian transform:
iters = [((second_column(line), line) for line in file)
         for file in sorted_files]  # note: this makes the sort unstable
sorted_lines = (line for _, line in merge(*iters))
sys.stdout.writelines(sorted_lines)
Usage:
$ python sort-k2-n.py < input.txt > output.txt
Method 3
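You can do it with an os.system() call to the Unix sort utility. Because the fields are separated by “;”, set the delimiter with -t and sort numerically on the second field only:
sort -t';' -k2,2n yourFile.txt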
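As a rough sketch of calling sort from Python, the function below uses subprocess.run() rather than os.system(), which avoids shell quoting of the “;” delimiter. The function name `sort_by_second_field` and the choice to force `LC_ALL=C` are assumptions for illustration. The options used (`-t`, `-k`, `-o`) are standard sort options, and the sketch assumes a `sort` binary is available on the PATH.

```python
import os
import subprocess


def sort_by_second_field(src, dst):
    """Sort `src` numerically on its second ';'-separated field into `dst`."""
    # LC_ALL=C is an assumption: it makes comparisons locale-independent.
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(
        # -t ';'   use ';' as the field separator
        # -k2,2n   sort on field 2 only, numerically
        # -o dst   write the result to dst
        ["sort", "-t", ";", "-k2,2n", "-o", dst, src],
        check=True,  # raise CalledProcessError if sort fails
        env=env,
    )
```

For example, sort_by_second_field("yourFile.txt", "sorted.txt") would write the sorted lines to sorted.txt. The sorting happens in a separate process, so the Python process never loads the file into memory.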
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.