Re: [RFC/PATCH 0/4] Re: Bug#605009: serious performance regression with ext4

To: Jonathan Nieder <jrnieder@gmail.com>
Cc: debian-dpkg@lists.debian.org, debian-kernel@lists.debian.org
Subject: Re: [RFC/PATCH 0/4] Re: Bug#605009: serious performance regression with ext4
From: Theodore Tso <tytso@MIT.EDU>
Date: Mon, 29 Nov 2010 08:01:04 -0500
Message-id: <461105DE-F8D2-421F-92E9-23E556823DD2@mit.edu>
In-reply-to: <20101129064825.GA6750@burratino>
References: <20101126093257.23480.86900.reportbug@pluto.milchstrasse.xx> <20101126145327.GB19399@rivendell.home.ouaza.com> <20101126215254.GJ2767@thunk.org> <20101127075831.GC24433@burratino> <20101127085346.GD14011@rivendell.home.ouaza.com> <20101129041152.GQ2767@thunk.org> <20101129064825.GA6750@burratino>

On Nov 29, 2010, at 1:48 AM, Jonathan Nieder wrote:

> 
> Results (on ext4) suggest that patches 1 and 4 matter and the rest are
> within noise.  Timings are rough; sometimes replicates vary by as much
> as a second.  Numbers are cold cache (i.e., after running sync and
> echo 3>.../drop_caches), best of 3, dpkg --install python2.7 and
> python2.7-minimal.
> 
> before:
> 5.73user 1.62system 0:33.84elapsed 21%CPU (0avgtext+0avgdata 89968maxresident)k
> 0inputs+0outputs (0major+46962minor)pagefaults 0swaps
> 
> patch 1 (use SYNC_FILE_RANGE_WRITE):
> 5.64user 1.69system 0:10.47elapsed 69%CPU (0avgtext+0avgdata 90000maxresident)k
> 0inputs+0outputs (0major+46948minor)pagefaults 0swaps
> 
> patch 1+2 (use SYNC_FILE_RANGE_WAIT_BEFORE):
> 5.48user 1.61system 0:10.43elapsed 70%CPU (0avgtext+0avgdata 90000maxresident)k
> 0inputs+0outputs (0major+46958minor)pagefaults 0swaps

So Patch #2 wasn't quite what I talked about doing; patch #2 is adding SYNC_FILE_RANGE_WAIT_BEFORE for each file immediately after writing the file.   So it's the equivalent of:

     extract(a)
     sync_file_range(SYNC_FILE_RANGE_WRITE)
     sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
     extract(b)
     sync_file_range(SYNC_FILE_RANGE_WRITE)
     sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
     extract(c)
     sync_file_range(SYNC_FILE_RANGE_WRITE)
     sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)

What I was suggesting was to use a separate for loop in patch #2, like patch #3:

     extract(a)
     sync_file_range(SYNC_FILE_RANGE_WRITE)
     extract(b)
     sync_file_range(SYNC_FILE_RANGE_WRITE)
     extract(c)
     sync_file_range(SYNC_FILE_RANGE_WRITE)

     sync_file_range(a, SYNC_FILE_RANGE_WAIT_BEFORE)
     sync_file_range(b, SYNC_FILE_RANGE_WAIT_BEFORE)
     sync_file_range(c, SYNC_FILE_RANGE_WAIT_BEFORE)

As to why the "voodoo", the idea is to make sure all of the delayed allocation, for all of the files, is completely resolved before the first fsync().    The reason why I suggested doing the WAIT_BEFORE as a separate pass was to allow for parallelism in the case where /var/cache/apt/archives is on a different disk spindle than /usr.   By doing this:

     extract(a)
     sync_file_range(SYNC_FILE_RANGE_WRITE)
     sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
     extract(b)
     sync_file_range(SYNC_FILE_RANGE_WRITE)
     sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)

... we make the copying get done in lockstep; that is, we don't start extracting file b until the data blocks for a are done being written.   If /var and /usr were mounted on different floppy disks (yeah, I know) you'd see first one disk light up, and then the other disk light up, back and forth, and it would be slow and horrible.   With the mechanism I suggested, both lights would be on at the same time, since SYNC_FILE_RANGE_WRITE initiates the writeback, but does not block for it to complete.   SYNC_FILE_RANGE_WAIT_BEFORE is what actually blocks.  Does that help to visualize what I was going for?
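
For concreteness, here is a rough C sketch of the two-pass ordering described above. It is not the dpkg patch; extract_all(), MAX_FILES, and the "write a string to a path" stand-in for extraction are all made up for illustration. The sketch treats sync_file_range() purely as a writeback hint and ignores its return value, on the assumption that durability comes from a later fsync()/fdatasync() pass.

```c
#define _GNU_SOURCE           /* for sync_file_range() */
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define MAX_FILES 64          /* arbitrary cap for this sketch */

/*
 * Hypothetical helper: write contents[i] to paths[i] for i < n.
 * Returns 0 on success, -1 if any open() or write() failed.
 */
int extract_all(const char *const paths[], const char *const contents[], size_t n)
{
    int fds[MAX_FILES];
    size_t opened = 0;
    int rc = 0;

    if (n > MAX_FILES)
        return -1;

    /* Pass 1: write each file and start writeback without blocking. */
    for (size_t i = 0; i < n; i++) {
        int fd = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { rc = -1; break; }
        fds[opened++] = fd;

        size_t len = strlen(contents[i]);
        if (write(fd, contents[i], len) != (ssize_t)len) { rc = -1; break; }

        /* offset 0 with nbytes 0 covers the file from start to end. */
        (void)sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    }

    /* Pass 2: only now block until the writeback started above is done. */
    for (size_t i = 0; i < opened; i++) {
        if (rc == 0)
            (void)sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE);
        close(fds[i]);
    }
    return rc;
}
```

Because nothing in pass 1 blocks on the disk, reading from the archive and writing to the target can overlap; the blocking is deferred to pass 2. Keeping every descriptor open between passes is a simplification of this sketch; a real unpacker with many files would more likely reopen them.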

BTW, if you had opened the file handle in subsequent passes using O_RDONLY|O_NOATIME, the use of fdatasync() instead of fsync() might not have been necessary.   And as far as the comments in patch #4 were concerned, it wasn't a matter of delaying the file modification time update that was my concern; it was avoiding an update of the file access time caused by reopening the file which concerned me.   The reason why I did both in my test program was because (a) I was paranoid, and (b) fdatasync() is standard, whereas O_NOATIME is another Linux-specific thing.

Thinking about this some more, though, using O_NOATIME may actually save more time, and may in the end be more important than the use of fdatasync() vs. fsync().  (Although I like doing the least amount of work necessary, and in this case we really don't need to use fsync(); fdatasync() will do.)
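
A minimal sketch of what such a later pass could look like for one file, combining O_RDONLY|O_NOATIME with fdatasync(). The name flush_file_data() is hypothetical, and the fallback on EPERM is an addition of this sketch (Linux rejects O_NOATIME unless the caller owns the file or has CAP_FOWNER).

```c
#define _GNU_SOURCE           /* for O_NOATIME */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush the data of an already-written file.
 * Returns 0 on success, -1 on failure with errno set.
 */
int flush_file_data(const char *path)
{
    /* Reopen read-only without touching the access time. */
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM)
        fd = open(path, O_RDONLY);   /* not owner: retry without O_NOATIME */
    if (fd < 0)
        return -1;

    /* fdatasync() flushes the data plus only the metadata needed to read it. */
    int rc = fdatasync(fd);
    int saved = errno;
    close(fd);
    errno = saved;                   /* report fdatasync()'s error, not close()'s */
    return rc;
}
```

Calling this on each unpacked file after the writeback passes would give the final durability step; whether O_NOATIME or fdatasync() contributes more to the speedup is something the timings would have to show.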

-- Ted
