
Bug#625922: SATA devices get reset without real hardware failure



On 26/11/2011, at 07:49, Jonathan Nieder wrote:

> Hi,
> 
> Natalia Portillo wrote:
> 
>> While running stock Debian's sid linux 2.6.38-8-amd64 kernel I'm
>> getting random fails on SATA devices.
>> 
>> I have a RAID5 system with 5 disks and 3 of them showed the same
>> exact failure, one each 48 hours.
>> 
>> On reboot, the devices work perfectly, and badblocks runs through
>> them without a single failure.
>> 
>> Kernel exact failure is:
>> 
>> [255352.928063] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> [255352.928071] ata4.00: failed command: FLUSH CACHE EXT
> [...]
>> Devices are in different SATA ports (first failed ata2, then ata5,
>> then ata4) and are all Seagate ST2000DL003-9VT166.
>> 
>> Same exact hardware has been running on Linux 2.6.32-gentoo for
>> weeks without a single failure.
> 
> Thanks for reporting it, and sorry for the slow response.
> 
> Some questions:
> 
> - what kernel are you using now?

claunia@hades:~$ uname -a
Linux hades 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 x86_64 GNU/Linux

wheezy

> - can you still reproduce this?

I have only been on this kernel for two weeks, and there is a bug, though a different one.

> - can you reproduce it with a squeeze kernel, too?

Yes, with all squeeze kernels, up until two weeks ago.

> - do you know what exact version the working 2.6.32-gentoo kernel
>   was?

r6, I think.

> - please attach a log of the initialization of the kernel, either by
>   saving full "dmesg" output right after booting or by gathering it
>   from /var/log/dmesg*

I will have to dig through the rotated logs; stay tuned.
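
For digging through the rotated logs, a minimal Python sketch along these lines could pull out the libata exception lines. The helper names (find_ata_events, scan_logs) and the glob pattern passed to scan_logs are assumptions made for illustration; the actual log location depends on the syslog configuration.

```python
import glob
import gzip
import re

# Matches libata lines of the kind quoted in this report, e.g.
# "[255352.928063] ata4.00: exception Emask 0x0 ... frozen"
# "[255352.928071] ata4.00: failed command: FLUSH CACHE EXT"
ATA_EVENT = re.compile(r"\b(ata\d+)(?:\.\d+)?: (exception|failed command)\b")


def find_ata_events(lines):
    """Return (port, kind, line) tuples for every libata error line."""
    events = []
    for line in lines:
        match = ATA_EVENT.search(line)
        if match:
            events.append((match.group(1), match.group(2), line.rstrip("\n")))
    return events


def scan_logs(pattern):
    """Scan plain and gzip-rotated logs matching a glob pattern (hypothetical path)."""
    events = []
    for path in sorted(glob.glob(pattern)):
        # Rotated logs are often compressed; pick the opener by file suffix.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            events.extend(find_ata_events(handle))
    return events
```

Something like scan_logs("/var/log/kern.log*") would then list which ports reported exceptions, assuming that is where the kernel messages end up on the machine in question.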

> - any workarounds or other weird symptoms?

Curiously, no workarounds, but other weird symptoms in the same and other kernels.

On both squeeze and wheezy kernels the following happens almost once a day (always during heavy network transfers):

[118801.372070] INFO: task bacula-sd:27996 blocked for more than 120 seconds.
[118801.372091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[118801.372113] bacula-sd       D ffff88009f63a2c0     0 27996      1 0x00000000
[118801.372122]  ffff88009f63a2c0 0000000000000082 0000000000000000 ffff880000000000
[118801.372130]  ffff8800bc3780c0 0000000000012800 ffff88008954dfd8 ffff88008954dfd8
[118801.372138]  0000000000012800 ffff88009f63a2c0 0000000000012800 0000000000012800
[118801.372146] Call Trace:
[118801.372161]  [<ffffffff81335bb1>] ? schedule_timeout+0x2d/0xd7
[118801.372170]  [<ffffffff8119287b>] ? blk_peek_request+0x1a7/0x1bc
[118801.372176]  [<ffffffff8133598b>] ? wait_for_common+0x9d/0x116
[118801.372184]  [<ffffffff8103f0a4>] ? try_to_wake_up+0x199/0x199
[118801.372190]  [<ffffffff81336a8d>] ? _raw_spin_lock_irq+0xd/0x1a
[118801.372218]  [<ffffffffa00fdab9>] ? st_do_scsi.clone.10+0x2d9/0x309 [st]
[118801.372228]  [<ffffffffa00fe384>] ? st_int_ioctl+0x673/0xad5 [st]
[118801.372234]  [<ffffffff8103aec8>] ? mmdrop+0xd/0x1c
[118801.372241]  [<ffffffff8103840a>] ? should_resched+0x5/0x24
[118801.372250]  [<ffffffffa0100689>] ? st_ioctl+0xb5e/0xedf [st]
[118801.372259]  [<ffffffff81062efc>] ? hrtimer_try_to_cancel+0x3c/0x46
[118801.372265]  [<ffffffff81062f12>] ? hrtimer_cancel+0xc/0x16
[118801.372272]  [<ffffffff8110905d>] ? do_vfs_ioctl+0x45b/0x49c
[118801.372278]  [<ffffffff81062b83>] ? update_rmtp+0x62/0x62
[118801.372284]  [<ffffffff81063279>] ? hrtimer_start_expires+0x16/0x1b
[118801.372290]  [<ffffffff811090e9>] ? sys_ioctl+0x4b/0x72
[118801.372297]  [<ffffffff8133bd12>] ? system_call_fastpath+0x16/0x1b

And it repeats many times (the stack trace is always different, always belonging to the process that is doing the transfer, such as bacula-sd or netatalk, or the XFS or MDRAID processes).
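
To see which processes get blocked most often, a rough Python sketch like this could tally the hung-task reports by task name. The function name count_hung_tasks is hypothetical, and the regular expression is only modelled on the "INFO: task ... blocked for more than N seconds" line shown above.

```python
import re
from collections import Counter

# Modelled on the message quoted above:
# "INFO: task bacula-sd:27996 blocked for more than 120 seconds."
HUNG_TASK = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def count_hung_tasks(lines):
    """Count hung-task reports per task name across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of a kernel log gives a per-process count, which makes it easier to check whether the blocked task really is always the one doing the transfer.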

On the squeeze kernel, when this happens nothing works: new processes do not start, and killed processes do not exit. A hard reboot is the only way out.
On wheezy the system continues working.

Curiously, I received an Efika MX Smartbook machine yesterday that exhibits a different but very similar bug.

With kernel Linux 2.6.31.14.26-efikamx the internal SSD suffers a lost interrupt and resets under high CPU usage. Sorry, I have to dig up those logs as well.

> 
> If you can reproduce this reliably with a 3.1.y kernel, we should
> take this upstream (looks like that's linux-ide@vger.kernel.org
> plus linux-kernel@vger.kernel.org; please cc me or this bug log if
> writing there so we can track it).
> 
> Hope that helps,
> Jonathan



