
Bug#274988: XFS crash in kernel-image-2.6.8-1-686-smp



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Package: kernel-image-2.6.8-1-686-smp
Version: 2.6.8-3
Severity: important

Hardware layout
- ---------------------
Dual Xeon  + Latest BIOS
1 GB ram
2 x 3ware SATA raid controllers + Latest Firmware

All disks live on the 3ware 9xxx controllers.
The controllers provide 3 x 1.5TB raid-5 arrays,
one of which holds /, swap and /var.

The rest of the free space I've built into a 4.5TB raid-0 stripe
for the backup volume.

This is then carved into.....
- ----------------------------------------------------------------------------------------
backup-srv:~# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1              3937220   2285880   1451336  62% /
/dev/sda3              3937252   2497220   1240024  67% /var
/dev/md0             4677217408 3090281796 1586935612  67% /backups


- ----------------------------------------------------------------------------------------


The /dev/md0 device is ....
- ----------------------------------------------------------------------------------------
backup-srv:~# cat /proc/mdstat
md0 : active raid0 sdc1[2] sdb1[1] sda4[0]
      4677348544 blocks 64k chunks

unused devices: <none>

- ----------------------------------------------------------------------------------------
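
As a quick sanity check on the size reported above, here is a minimal sketch that pulls the block count out of the /proc/mdstat text and converts it to TiB. It assumes the "blocks" figure is in 1 KiB units; the helper names md_sizes and kib_to_tib are made up for illustration and are not part of any tool.

```python
import re

# Sample text copied from the /proc/mdstat output above.
MDSTAT = """\
md0 : active raid0 sdc1[2] sdb1[1] sda4[0]
      4677348544 blocks 64k chunks
"""


def md_sizes(mdstat_text):
    """Return {device: size in 1 KiB blocks} for each md device listed."""
    sizes = {}
    device = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Remember the device name; its size is on a following line.
            device = header.group(1)
            continue
        blocks = re.match(r"^\s+(\d+) blocks", line)
        if blocks and device:
            sizes[device] = int(blocks.group(1))
            device = None
    return sizes


def kib_to_tib(kib):
    """Convert a count of 1 KiB blocks to TiB (assumes 1 KiB units)."""
    return kib / 2**30


if __name__ == "__main__":
    for dev, kib in md_sizes(MDSTAT).items():
        print(f"{dev}: {kib} KiB = {kib_to_tib(kib):.2f} TiB")
```

Under that assumption the array comes out at roughly 4.36 TiB, which is in line with the 1K-blocks figure df reports for /backups.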

I had to use XFS, as it was the only FS that would build that large.
ext3 seems to barf at anything over 2TB.



Problem Description
- -------------------------
This machine is the production backup server for all the *nix machines on
the network. cron runs rsync via ssh to grab the files from each client target.
The bulk of the systems are backed up weekly and a few daily.
The system survives anywhere between a couple of days and no more
than 2 weeks under this sort of heavy IO & network loading before
giving up the ghost. dmesg dumps follow......

This problem was also exhibited by 2.6.7 and 2.6.6.

I'm dropping back to 2.4.27 now & will let you know if the pain persists.



- From Dmesg
- ----------------------------------------------------------------------------------------
Unable to handle kernel paging request at virtual address 20fda90c
 printing eip:
f8b26144
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: af_packet ipv6 piix hw_random uhci_hcd usbcore shpchp 
pciehp pci_hotplug floppy parport_pc parport pcspkr evdev e1000 xfs raid0 md 
dm_mod ide_cd ide_core cdrom rtc ext3 jbd mbcache sd_mod unix 3w_9xxx 
scsi_mod
CPU:    3
EIP:    0060:[<f8b26144>]    Not tainted
EFLAGS: 00010213   (2.6.8.20040927)
EIP is at xfs_ail_insert+0x24/0xd0 [xfs]
eax: 000003e7   ebx: 00000000   ecx: 000003e7   edx: 00000000
esi: 20fda904   edi: f7198c18   ebp: c2005168   esp: f7703dd4
ds: 007b   es: 007b   ss: 0068
Process xfslogd/3 (pid: 604, threadinfo=f7702000 task=f7cb87d0)
Stack: 0002050a 0000052a 549b2041 ed9cd202 c2005168 f7198c18 f7198c00 c0f1d30c
       f8b25e5d f7198c18 c2005168 00000000 c2005168 0002050a 0000052a 00000000
       c2005168 0002050a 0000052a f8b258bc f7198c00 c2005168 0002050a 0000052a
Call Trace:
 [<f8b25e5d>] xfs_trans_update_ail+0x5d/0xf0 [xfs]
 [<f8b258bc>] xfs_trans_chunk_committed+0x17c/0x240 [xfs]
 [<f8b2566a>] xfs_trans_committed+0x4a/0x120 [xfs]
 [<f8b17743>] xlog_state_do_callback+0x2c3/0x3d0 [xfs]
 [<f8b178d0>] xlog_state_done_syncing+0x80/0xc0 [xfs]
 [<f8b15fe5>] xlog_iodone+0x55/0xf0 [xfs]
 [<f8b359bd>] pagebuf_iodone_work+0x4d/0x50 [xfs]
 [<c0131a26>] worker_thread+0x1f6/0x2e0
 [<f8b35970>] pagebuf_iodone_work+0x0/0x50 [xfs]
 [<c011c4f0>] default_wake_function+0x0/0x20
 [<c011c4f0>] default_wake_function+0x0/0x20
 [<c0131830>] worker_thread+0x0/0x2e0
 [<c0135f8a>] kthread+0xba/0xc0
 [<c0135ed0>] kthread+0x0/0xc0
 [<c01042c5>] kernel_thread_helper+0x5/0x10
Code: 8b 46 08 8b 56 0c 89 44 24 08 89 54 24 0c 8b 55 0c 8b 45 08
 <6>note: xfslogd/3[604] exited with preempt_count 1
- ----------------------------------------------------------------------------------------
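
For anyone triaging the oops above, here is a rough sketch that splits the "EIP is at" line into symbol, offset, function size and module, and checks that the faulting address is consistent with the register dump. It assumes the Code: line starts at EIP, so that the first bytes "8b 46 08" (which decode to mov 0x8(%esi),%eax) are the faulting instruction. The helper names parse_eip and fault_matches are hypothetical and not taken from any kernel tool.

```python
import re

# Lines and values copied from the oops above.
EIP_LINE = "EIP is at xfs_ail_insert+0x24/0xd0 [xfs]"
ESI = 0x20FDA904         # esi from the register dump
FAULT_ADDR = 0x20FDA90C  # "virtual address" from the first oops line
DISPLACEMENT = 0x08      # disp8 of "8b 46 08", i.e. mov 0x8(%esi),%eax

EIP_RE = re.compile(
    r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?"
)


def parse_eip(line):
    """Return (symbol, offset, size, module) or None if the line differs."""
    m = EIP_RE.search(line)
    if m is None:
        return None
    symbol, offset, size, module = m.groups()
    return symbol, int(offset, 16), int(size, 16), module


def fault_matches(base, displacement, fault_addr):
    """True if base register plus displacement equals the faulting address."""
    return base + displacement == fault_addr


if __name__ == "__main__":
    print(parse_eip(EIP_LINE))
    print(fault_matches(ESI, DISPLACEMENT, FAULT_ADDR))
```

If that assumption holds, esi + 8 equals the faulting address 20fda90c, which would point at a read through a bad pointer held in esi inside xfs_ail_insert. This is only a reading of the dump, not a confirmed diagnosis.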


The machine locks up a little while after this and, after a kick in the guts,
gives the following on next startup....
- ----------------------------------------------------------------------------------------
backup-srv:~# mount /backups/
Oct  4 12:47:03 ouprci05 kernel: Filesystem "md0": XFS internal error 
xlog_clear_stale_blocks(2) at line 1253 of file fs/xfs/xfs_log_recover.c.  
Caller 0xf8b28876
Oct  4 12:47:03 ouprci01 kernel: Filesystem "md0": XFS internal error 
xlog_clear_stale_blocks(2) at line 1253 of file fs/xfs/xfs_log_recover.c.  
Caller 0xf8b28876
mount: Unknown error 990
- ----------------------------------------------------------------------------------------


So I try.....
- ----------------------------------------------------------------------------------------
backup-srv:~# xfs_repair /dev/md0
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
- ----------------------------------------------------------------------------------------


So I ....
- ----------------------------------------------------------------------------------------
backup-srv:~# xfs_repair -L /dev/md0
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
LEAFN node level is 1 inode 2820138 bno = 8388608

entry contains offset out of order in shortform dir 19126020
corrected entry offsets in directory 19126020
        - agno = 1
        - agno = 2
LEAFN node level is 1 inode 2147942164 bno = 8388608
LEAFN node level is 1 inode 2148480815 bno = 8388608
....
And so on for a few hours, while the check of the rest of the 4.5TB file
system completes :(
....




________________________________
It is by caffeine alone I set my mind in motion,
It is by the beans of Java that thoughts acquire speed,
The hands acquire shaking, the shaking becomes a warning,
It is by caffeine alone I set my mind in motion.
(author unknown)
with thanks and apologies to Frank Herbert
________________________________
Jan Eringa
Unix Admin
Orbian Management Ltd
________________________________
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFBYl6XX4LWCZ7JjaMRAtH0AJwPIxdCA6xO88hHtJa27qo7UBlG/QCgigGI
dhtLCXAxPd1W46KbnFMdMcY=
=nuOo
-----END PGP SIGNATURE-----


