In this test, why does rename() take longer when fsync() is called first?
Environment: btrfs, mechanical HDD, Debian 9 container, running on kernel 5.0.17-200.fc29.x86_64.
Test command: dpkg -r linux-image-4.9.0-9-amd64 >/dev/null 2>&1 && sync && time perf_4.9 trace --no-inherit -s dpkg $FORCE_UNSAFE_IO -i linux-image-4.9.0-9-amd64_4.9.168-1_amd64.deb && time sync
Compare the results of FORCE_UNSAFE_IO="" v.s. FORCE_UNSAFE_IO="--force-unsafe-io".
dpkg (31632), 374488 events, 100.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
fsync 3442 14849.586 0.002 4.314 149.959 4.11%
rename 8463 14573.509 0.003 1.722 358.675 4.80%
wait4 7 8043.762 0.004 1149.109 8028.468 99.78%
read 44025 2151.135 0.000 0.049 3.732 0.57%
open 19301 213.628 0.002 0.011 0.375 0.90%
write 7846 165.460 0.003 0.021 0.149 0.42%
sync_file_range 6834 96.513 0.001 0.014 0.822 2.20%
...
real 0m41.703s
user 0m9.709s
sys 0m6.586s
real 0m0.162s
user 0m0.000s
sys 0m0.003s
dpkg (1919), 334232 events, 100.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
wait4 7 8290.981 0.007 1184.426 8279.676 99.84%
read 44399 2168.096 0.000 0.049 2.146 0.50%
fsync 25 653.530 0.006 26.141 68.754 8.65%
rename 8463 522.282 0.003 0.062 69.620 22.53%
open 12467 163.671 0.002 0.013 0.217 0.97%
write 7846 160.979 0.003 0.021 0.356 0.50%
sync_file_range 3417 89.676 0.010 0.026 0.841 2.05%
...
real 0m13.498s
user 0m9.643s
sys 0m5.517s
real 0m0.146s
user 0m0.000s
sys 0m0.004s
The current strategy of dpkg (e.g. in Debian 9) is more sophisticated than you might be aware. I am not sure it would really affect this case though. If you want more details, there is some background in this question: Could AIO fsync improve dpkg performance?
I do not know if this is relevant, but it occurs to me that on some filesystems fsync() may effectively sync the directory as well. This is to make sure newly created files are visible on-disk before fsync() returns. I read somewhere this does not happen on ext2, but it does happen on ext4. As partial evidence, see ext4: make fsync to sync parent dir in no-journal for real this time
In case you are surprised by the trailing sync timings, I can confirm that patching dpkg to replace the individual fsync() calls with a global sync() call appears to keep the overall time down to about 13s. And I have not found anything inadequate in that on my system. dpkg just stopped using that approach because of other potential side effects.[1][2]
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Based on the commit description, I expect the rename() delays are caused by Btrfs: sync log after logging new name. This was added in kernel v4.19.
Make the logging of the new file name (which happens when creating a hard link or renaming) persist the log.
This approach not only is simpler, […] but also gives us the same behaviour as ext4, xfs and f2fs (possibly other filesystems too).
I do not believe the second sentence is correct!
To be fair I should point out dpkg is forgetting to fsync() the directories containing the files, before it records the package as correctly installed. But this btrfs behaviour is not exactly a perfect match with the rest of Linux.
I do not believe XFS syncs the new directory entry inside rename() (i.e.
deliberately waits for it to be persisted). My presumption against any sync writes inside XFS rename() is partly based
on this thread: https://marc.info/?l=linux-xfs&m=139863577410237&w=2
For ext4, I mentioned evidence that fsync() might sync the new directory entry before it returns. But I do not believe that ext4’s rename() does so.
I linked to recent discussions about AIO fsync() operations, and how they might allow efficient batching of metadata updates. There has not been much discussion about a hypothetical AIO rename(), because the usual assumption is that rename() is not a synchronous operation!
(btrfs feels a bit suspicious to me in general. I.e. I see this data integrity bugfix was in the past few releases, and it was not the only scary-sounding fix in the changelog for these releases).
I think the rename() delays must be triggered by the BTRFS_NEED_LOG_SYNC returned from the last line of btrfs_log_new_name().
The way I found this was by using offcputime. It aggregates wait time by stack trace. The stack traces look like this:
io_schedule_timeout
wait_for_completion_io
write_all_supers
btrfs_sync_log
btrfs_sync_file
do_fsync
__x64_sys_fsync
do_syscall_64
entry_SYSCALL_64_after_hwframe
- dpkg (23528)
9735954
io_schedule_timeout
wait_for_completion_io
write_all_supers
btrfs_sync_log
btrfs_rename2
vfs_rename
do_renameat2
__x64_sys_rename
do_syscall_64
entry_SYSCALL_64_after_hwframe
- dpkg (23528)
9147785
io_schedule
bit_wait_io
__wait_on_bit
out_of_line_wait_on_bit
write_all_supers
btrfs_sync_log
btrfs_sync_file
do_fsync
__x64_sys_fsync
do_syscall_64
entry_SYSCALL_64_after_hwframe
- dpkg (23528)
4478158
io_schedule
bit_wait_io
__wait_on_bit
out_of_line_wait_on_bit
write_all_supers
btrfs_sync_log
btrfs_rename2
vfs_rename
do_renameat2
__x64_sys_rename
do_syscall_64
entry_SYSCALL_64_after_hwframe
- dpkg (23528)
4376109
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0