[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: [RFC/PATCH 0/4] Re: Bug#605009: serious performance regression with ext4



Theodore Tso wrote:

> So Patch #2 wasn't quite what I talked about doing; patch #2 is adding
> SYNC_FILE_RANGE_WAIT_BEFORE for each file immediately after writing the file.
> So it's the equivalent of:
>
>      extract(a)
>      sync_file_range(SYNC_FILE_RANGE_WRITE)
>      sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
>      extract(b)
>      sync_file_range(SYNC_FILE_RANGE_WRITE)
>      sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
>      extract(b)
>      sync_file_range(SYNC_FILE_RANGE_WRITE)
>      sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)

More precisely:

	extract(a)
	sync_file_range(SYNC_FILE_RANGE_WRITE)
	extract(b)
	sync_file_range(SYNC_FILE_RANGE_WRITE)
	extract(c)
	sync_file_range(SYNC_FILE_RANGE_WRITE)

	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
	fsync()
	rename()
	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
	fsync()
	rename()
	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
	fsync()
	rename()

> What I was suggesting was to use a separate for loop in patch #2, like
> patch #3:

Yes.

> As to why the "voodoo", the idea is to make sure all of the delayed
> allocation, for all of the files, is completely resolved before the first
> fsync().

You just explained this on linux-ext4.  If I understand correctly, the
effect would be most visible when there is parallel I/O going on: by
finishing up the delayed allocation, we ensure the block bitmap, inode
table block for the inodes, etc are in their final form before the first
fdatasync, eliminating the need for all but the first barrier.  And
barriers are expensive because they require the cooperation of the layer
underneath in an LVM/LUKS or slow disk scenario and because they cause
unrelated I/O to be sent in small batches when the barriers occur.

> The reason why I suggested doing the WAIT_BEFORE as a separate
> path was to allow for parallelism in the case where /var/cache/apt/archives
> is on a different disk spindle than /usr.
[...]
> Does that help to visualize what I was going for?

Patch 1+2 already allows that parallelism, but it wouldn't turn the
fdatasync after the first into no-ops.

> BTW, if you had opened the file handle in subsequent passes using
> O_RDONLY|O_NOATIME, the use of fdatasync() instead of fsync() might not have
> been necessary.

Very interesting!  Thanks for explaining.
Jonathan


Reply to: