Fastest way to concatenate files

I’ve got 10k+ files totaling over 20GB that I need to concatenate into one file.

Is there a faster way than

cat input_file* >> out

?

A bash command would be preferred; Python is acceptable too, as long as it is not considerably slower.

Answers:


Method 1

Nope, cat is surely the best way to do this. Why use Python when there is a program already written in C for this purpose? However, you might want to consider using xargs in case the expanded list of file names exceeds ARG_MAX (cat input_file* would then fail with "Argument list too long") and you need more than one cat invocation. Using GNU tools, this is equivalent to what you already have:

find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z |
  xargs -0 cat -- >>out
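
If ARG_MAX is the only worry, a shorter sketch is possible in bash, assuming an xargs that understands -0 (GNU and the BSDs do). printf is a shell builtin, so the expanded glob never goes through execve and the limit does not apply to it. The glob is expanded in the same sorted order as in cat input_file*. The function name concat_glob is made up for this example and is not part of the answer above.

```shell
# Hypothetical helper (illustrative name): append every file whose name
# starts with prefix $1 to output file $2, in glob order.
concat_glob() {
  # printf is a builtin, so the expanded glob is not limited by ARG_MAX;
  # xargs splits the NUL-delimited list into as many cat calls as needed.
  # Caveat: if nothing matches, the unexpanded pattern is passed to cat,
  # which then reports a missing file.
  printf '%s\0' "$1"* | xargs -0 cat -- >>"$2"
}

# Usage (appends to out, like the original command):
#   concat_glob input_file out
```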

Method 2

Allocating the space for the output file first may improve the overall speed as the system won’t have to update the allocation for every write.

For instance, if on Linux:

size=$({ find . -maxdepth 1 -type f -name 'input_file*' -printf '%s+'; echo 0;} | bc)
fallocate -l "$size" out &&
  find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z | xargs -r0 cat 1<> out

Another benefit is that if there’s not enough free space, the copy will not be attempted.
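
The 1<> redirection in that command matters: it opens out for reading and writing without truncating it, and writing starts at offset 0, so the preallocated blocks get filled in. With > the file would be truncated (throwing the allocation away), and with >> the data would land after the preallocated region. A small sketch of that behaviour, using a throwaway file name (demo_out) and placeholder bytes standing in for the space fallocate would reserve:

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)" || exit 1

# Stand-in for a preallocated file: 8 placeholder bytes
# (fallocate would give NUL bytes instead of X).
printf 'XXXXXXXX' >demo_out

# 1<> opens for read+write without truncating; writing starts at offset 0.
printf 'ab' 1<>demo_out
cat demo_out; echo     # abXXXXXX

# >> would instead append after the existing (preallocated) bytes.
printf 'cd' >>demo_out
cat demo_out; echo     # abXXXXXXcd
```

Since the data overwrites the reserved space in place, the size passed to fallocate has to match the total size of the inputs; if the inputs shrink between the two find runs, the tail of out keeps its NUL bytes.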

If on btrfs, you could cp --reflink=always the first file (which implies no data copy and would therefore be almost instantaneous), and append the rest. If there are 10000 files, that probably won’t make much difference though, unless the first file is very big.

There’s an API to generalise that to ref-copy all the files (the BTRFS_IOC_CLONE_RANGE ioctl), but I could not find any utility exposing that API, so you’d have to do it in C (or Python or other languages, provided they can call arbitrary ioctls).

If the source files are sparse or have large sequences of NUL characters, you could make a sparse output file (saving time and disk space) with (on GNU systems):

find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z | xargs -r0 cat | cp --sparse=always /dev/stdin out
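
To see whether the sparse copy actually saved anything, one can compare the logical size of out with the space allocated for it. A rough sketch, assuming GNU find, sort, xargs, cp and stat and a usable /dev/stdin; the function names concat_sparse and size_report are invented for this example:

```shell
# Hypothetical helper: same pipeline as above, with the file name prefix ($1)
# and the output file ($2) as parameters.
concat_sparse() {
  find . -maxdepth 1 -type f -name "$1*" -print0 |
    sort -z | xargs -r0 cat | cp --sparse=always /dev/stdin "$2"
}

# Hypothetical helper: print the logical size in bytes, then the allocation as
# block count * block size (GNU stat). A sparse file shows an allocation
# well below its logical size.
size_report() {
  stat -c '%s logical, %b*%B allocated' "$1"
}

# Usage:
#   concat_sparse input_file out
#   size_report out
```

How much is saved depends on the filesystem supporting holes and on how long the runs of NUL bytes in the inputs are.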


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
