Diffing two big text files

I have two big files (6GB each). They are unsorted, with linefeeds (\n) as separators. How can I diff them? It should take under 24h.

Answers:


Method 1

The most obvious answer is just to use the diff command, and it is probably a good idea to add the --speed-large-files option to it.

diff --speed-large-files a.file b.file

You mention that the files are unsorted, so you may need to sort them first:

sort a.file > a.file.sorted
sort b.file > b.file.sorted
diff --speed-large-files a.file.sorted b.file.sorted

You could avoid creating an extra output file by piping the second sort's output directly into diff:

sort a.file > a.file.sorted
sort b.file | diff --speed-large-files a.file.sorted -

Obviously these will run best on a system with plenty of available memory and you will likely need plenty of free disk space too.
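
As a rough sketch of how the memory and disk-space point can be handled, sort lets you cap its in-memory buffer with -S and choose where its temporary spill files go with -T. The example below assumes GNU sort and diff; the tiny sample files, the 64M buffer, and the scratch directory are placeholders for illustration only, and for 6GB inputs you would pick a much larger buffer and a directory on a disk with plenty of free space. LC_ALL=C is an added assumption that byte-wise ordering is acceptable; it usually makes sorting faster.

```shell
#!/bin/sh
# Sketch: sort both files with an explicit buffer size and temp directory, then diff.
set -eu

# Hypothetical scratch directory; for real data use a disk with enough free space.
workdir=$(mktemp -d)

# Tiny stand-ins for the real a.file and b.file.
printf 'pear\napple\nfig\n'   > "$workdir/a.file"
printf 'fig\nbanana\napple\n' > "$workdir/b.file"

# LC_ALL=C: byte-wise comparison. -S: cap on the in-memory buffer. -T: where spill files go.
LC_ALL=C sort -S 64M -T "$workdir" "$workdir/a.file" > "$workdir/a.file.sorted"
LC_ALL=C sort -S 64M -T "$workdir" "$workdir/b.file" > "$workdir/b.file.sorted"

# diff exits with status 1 when the files differ, so keep set -e from aborting here.
diff_out=$(diff --speed-large-files "$workdir/a.file.sorted" "$workdir/b.file.sorted" || true)
printf '%s\n' "$diff_out"

rm -rf "$workdir"
```

Note that diffing the sorted copies tells you which lines differ between the files, not where they sat in the original, unsorted files.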

It wasn’t clear from your question whether you have tried these before. If so, it would be helpful to know what went wrong (took too long, etc.). I have always found that the stock sort and diff commands tend to do at least as well as custom commands, unless there are some very domain-specific properties of the files that make it possible to do things differently.

Method 2

Sorting the inputs and telling the diff program that its inputs are sorted would provide a massive speed-up. I don’t know of any diff with an option like that, but comm assumes sorted input and will be much quicker, if it does enough for your purposes.
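
A minimal sketch of the comm approach, assuming it is enough to know which lines appear in only one of the two files. The sample files and the scratch directory are made up for illustration. comm expects both inputs sorted in the same collation order, so this sketch runs sort and comm under the same LC_ALL=C setting.

```shell
#!/bin/sh
# Sketch: list the lines unique to each file with comm, which requires sorted input.
set -eu

# Hypothetical scratch directory and tiny stand-ins for the real files.
workdir=$(mktemp -d)
printf 'pear\napple\nfig\n'   > "$workdir/a.file"
printf 'fig\nbanana\napple\n' > "$workdir/b.file"

# Sort both inputs in the same (byte-wise) order that comm will use below.
LC_ALL=C sort "$workdir/a.file" > "$workdir/a.file.sorted"
LC_ALL=C sort "$workdir/b.file" > "$workdir/b.file.sorted"

# -23 hides columns 2 and 3, leaving lines found only in the first file.
only_in_a=$(LC_ALL=C comm -23 "$workdir/a.file.sorted" "$workdir/b.file.sorted")
# -13 hides columns 1 and 3, leaving lines found only in the second file.
only_in_b=$(LC_ALL=C comm -13 "$workdir/a.file.sorted" "$workdir/b.file.sorted")

printf 'only in a.file: %s\n' "$only_in_a"
printf 'only in b.file: %s\n' "$only_in_b"

rm -rf "$workdir"
```

Unlike diff, comm reads both files in a single pass and never needs to hold more than the current lines in memory, which is why it can be much quicker on inputs of this size.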


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
