
Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel BUG at fs/jbd2/commit.c:534" from Postfix on ext4



On Mon 27-06-11 23:21:17, Moffett, Kyle D wrote:
> On Jun 27, 2011, at 12:01, Ted Ts'o wrote:
> > On Mon, Jun 27, 2011 at 05:30:11PM +0200, Lukas Czerner wrote:
> >>> I've found some. So although data=journal users are a minority, there
> >>> are some. That being said, I agree with you that we should do something
> >>> about it - either state that we want to fully support data=journal - and
> >>> then we should really do better with testing it, or deprecate it and
> >>> remove it (which would save us some complications in the code).
> >>> 
> >>> I would be slightly in favor of removing it (code simplicity, fewer
> >>> options to configure for the admin, fewer options to test for us; some
> >>> users I've come across actually were not quite sure why they were using
> >>> it - they just thought it looked safer).
> > 
> > Hmm...  FYI, I hope to be able to bring on line automated testing for
> > ext4 later this summer (there's a testing person at Google who has
> > signed up to work on setting this up as his 20% project).  The test
> > matrix that I have given him includes data=journal, so we will be
> > getting better testing in the near future.
> > 
> > At least historically, data=journalling was the *simpler* case, and
> > was the first thing supported by ext4.  (data=ordered required revoke
> > handling which didn't land for six months or so).  So I'm not really
> > that convinced that removing it really buys us that much code
> > simplification.
> > 
> > That being said, it is true that data=journalled isn't necessarily
> > faster.  For heavy disk-bound workloads, it can be slower.  So I can
> > imagine adding some documentation that warns people not to use
> > data=journal unless they really know what they are doing, but at least
> > personally, I'm a bit reluctant to dispense with a bug report like
> > this by saying, "oh, that feature should be deprecated".
> 
> I suppose I should chime in here, since I'm the one who (potentially
> incorrectly) thinks I should be using data=journalled mode.
> 
> My basic impression is that the use of "data=journalled" can help
> reduce the risk (slightly) of serious corruption to some kinds of
> databases when the application does not provide appropriate syncs
> or journalling on its own (i.e. text-based Wiki database files).
  It depends on the way such programs update the database files. But
generally yes, data=journal provides somewhat stronger guarantees than
other journaling modes - see below.
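
For reference, these modes are selected per filesystem at mount time. A
minimal sketch of picking one from C via mount(2) - the device and mount
point below are just placeholders:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Same effect as: mount -o data=journal /dev/sdb1 /mnt */
        if (mount("/dev/sdb1", "/mnt", "ext4", 0, "data=journal") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }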

> Please correct me if this is horribly horribly wrong:
> 
> no journal:
>   Nothing is journalled
>   + Very fast.
>   + Works well for filesystems that are "mkfs"ed on every boot
>   - Have to fsck after every reboot
Fsck is needed only after a crash / hard powerdown. Otherwise completely
correct. Plus you always have the possibility of exposing uninitialized
(potentially sensitive) data after an fsck.

Actually, a normal desktop might be quite happy with a non-journaled
filesystem when fsck is fast enough.

> data=writeback:
>   Metadata is journalled, data (to allocated extents) may be written
>   before or after the metadata is updated with a new file size.
>   + Fast (not as fast as unjournalled)
>   + No need to "fsck" after a hard power-down
>   - A crash or power failure in the middle of a write could leave
>     old data on disk at the end of a file.  If security labeling
>     such as SELinux is enabled, this could "contaminate" a file with
>     data from a deleted file that was at a higher sensitivity.
>     Log files (including binary database replication logs) may be
>     effectively corrupted as a result.
Correct.

> data=ordered:
>   Data appended to a file will be written before the metadata
>   extending the length of the file is written, and in certain cases
>   the data will be written before file renames (partial ordering),
>   but the data itself is unjournalled, and may be only partially
>   complete for updates.
>   + Does not write data to the media twice
>   + A crash or power failure will not leave old uninitialized data
>     in files.
>   - Data writes to files may only partially complete in the event
>     of a crash.  No problems for logfiles, or self-journalled
>     application databases, but others may experience partial writes
>     in the event of a crash and need recovery.
Correct, one should also note that no one guarantees the order in which data
hits the disk - i.e. when you do write(f,"a"); write(f,"b"); and these are
overwrites, it may happen that "b" is written while "a" is not.
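
To illustrate (just a rough sketch - file name and offsets are made up): an
application that cares about the order of two buffered overwrites under
data=ordered or data=writeback has to fsync() between them itself:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("db.dat", O_WRONLY);   /* placeholder file with existing data */

        if (fd < 0) { perror("open"); return 1; }
        pwrite(fd, "a", 1, 0);               /* first overwrite */
        /* Without this fsync() nothing guarantees "a" reaches the disk
         * before "b" - after a crash you may see "b" but not "a". */
        fsync(fd);
        pwrite(fd, "b", 1, 4096);            /* second overwrite */
        fsync(fd);
        close(fd);
        return 0;
    }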

> data=journalled:
>   Data and metadata are both journalled, meaning that a given data
>   write will either complete or it will never occur, although the
>   precise ordering is not guaranteed.  This also implies all of the
>   data<=>metadata guarantees of data=ordered.
>   + Direct IO data writes are effectively "atomic", resulting in
>     less likelihood of data loss for application databases which do
>     not do their own journalling.  This means that a power failure
>     or system crash will not result in a partially-complete write.
Well, direct IO is atomic in data=journal the same way as in data=ordered.
It can happen that only half of a direct IO write is done when you hit the
power button at the right moment - note this holds for overwrites.  Extending
writes or writes to holes are all-or-nothing for ext4 (again both in
data=journal and data=ordered mode).
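
A rough sketch of the overwrite case (path and sizes are made up; O_DIRECT
also needs an aligned buffer, hence posix_memalign()):

    #define _GNU_SOURCE                 /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd;

        /* O_DIRECT needs a suitably aligned buffer and I/O size;
         * 4096 is a common logical block size but not universal. */
        if (posix_memalign(&buf, 4096, 8192) != 0)
            return 1;
        memset(buf, 'x', 8192);

        fd = open("dbfile", O_WRONLY | O_DIRECT);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        /* If this overwrites already-allocated blocks and power fails
         * mid-write, only part of the 8KB may be on disk afterwards -
         * in data=journal just as in data=ordered.  An extending write
         * or a write into a hole would instead be all-or-nothing. */
        if (pwrite(fd, buf, 8192, 0) != 8192)
            perror("pwrite");

        close(fd);
        free(buf);
        return 0;
    }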

>   - Cached writes are not atomic
>   + For small cached file writes (of only a few filesystem pages)
>     there is a good chance that kernel writeback will queue the
>     entire write as a single I/O and it will be "protected" as a
>     result.  This helps reduce the chance of serious damage to some
>     text-based database files (such as those for some Wikis), but
>     is obviously not a guarantee.
Page-sized and page-aligned writes are atomic (in both data=journal and
data=ordered modes). When a write spans multiple pages, there is a chance
the writes will be merged into a single transaction, but there is no
guarantee, as you correctly write.
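
For example (sketch only - 4096 is just an assumed page size, real code
should use sysconf(_SC_PAGESIZE), and the file name is made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char page[4096];                        /* assumed 4KB page */
        int fd = open("record.dat", O_WRONLY);  /* placeholder file */

        if (fd < 0) { perror("open"); return 1; }
        memset(page, 'r', sizeof(page));

        /* One page at a page-aligned offset: after a crash either the old
         * or the new page contents are seen, never a mix.  A write covering
         * several pages at the same spot could land only partially. */
        if (pwrite(fd, page, sizeof(page), 8192) != (ssize_t)sizeof(page))
            perror("pwrite");

        fsync(fd);
        close(fd);
        return 0;
    }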

>   - This writes all data to the block device twice (once to the FS
>     journal and once to the data blocks).  This may be especially bad
>     for write-limited Flash-backed devices.
Correct.

To sum up, the only additional guarantee data=journal offers over
data=ordered is a total ordering of all IO operations. That is, if you do a
sequence of data and metadata operations, then you are guaranteed that
after a crash you will see the filesystem in a state corresponding exactly
to your sequence terminated at some (arbitrary) point. Data writes are
broken up into a page-sized & page-aligned sequence of writes for the
purposes of this model...
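
As a concrete (hypothetical) example of what that total ordering buys you -
the classic write-new-file-then-rename update, with made-up file names:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("wiki.page.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, "new page text\n", 14) != 14)   /* data operation */
            perror("write");
        close(fd);
        rename("wiki.page.tmp", "wiki.page");          /* metadata operation */

        /* With data=journal the post-crash state matches this sequence cut
         * off at some point: no new data yet, the data without the rename,
         * or both.  With data=ordered the same outcome needs an fsync(fd)
         * before the rename; without it the rename may be seen while the
         * new file's data is still incomplete. */
        return 0;
    }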

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


