Bug#743458: linux-image-3.13-1-686-pae: linux-image-686-pae / eeepc 701: mce: [Hardware Error]

To: Raphaël Droz <raphael.droz@gmail.com>, 743458@bugs.debian.org
Subject: Bug#743458: linux-image-3.13-1-686-pae: linux-image-686-pae / eeepc 701: mce: [Hardware Error]
From: Ben Hutchings <ben@decadent.org.uk>
Date: Thu, 03 Apr 2014 02:55:45 +0100
Message-id: <1396490145.22689.65.camel@deadeye.wl.decadent.org.uk>
Reply-to: Ben Hutchings <ben@decadent.org.uk>, 743458@bugs.debian.org
In-reply-to: <20140402233316.6538.15917.reportbug@localhost.local>
References: <20140402233316.6538.15917.reportbug@localhost.local>

Control: tag -1 moreinfo

On Thu, 2014-04-03 at 01:33 +0200, Raphaël Droz wrote:
> Package: src:linux
> Version: 3.13.5-1
> Severity: normal
> 
> Dear Maintainer,
> 
> my system regularly hangs (about once a day or once every two days).
> Using netconsole I was able to grab a log of the failure.
> I interpret this as an MCE which in turn triggers a kernel panic.
> I don't attach the full log (which lasted until I manually stopped the machine)
> since later backtraces seem redundant with the first ones:
> 
> 
> 
> Apr  2 13:25:17 192.168.0.4 [ 2984.381126] Suspending console(s) (use no_console_suspend to debug)
> [ had resumed from a 12-hour hibernation just a couple of minutes earlier ]
> [ and here the crash happens, while there was no specific/intensive activity: ]
> 
> Apr  3 00:09:28 192.168.0.4 [ 3038.509437] Disabling lock debugging due to kernel taint
> Apr  3 00:09:28 192.168.0.4 [ 3038.509482] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 1: b200000000000125
> Apr  3 00:09:28 192.168.0.4 [ 3038.509716] mce: [Hardware Error]: TSC 6ecec0fdf 
> 
> Apr  3 00:09:28 192.168.0.4 [ 3038.509849] mce: [Hardware Error]: PROCESSOR 0:6d8 TIME 1396476570 SOCKET 0 APIC 0 microcode 20
> Apr  3 00:09:28 192.168.0.4 [ 3038.510064] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> Apr  3 00:09:28 192.168.0.4 [ 3038.510236] mce: [Hardware Error]: Machine check: Invalid
> Apr  3 00:09:28 192.168.0.4 [ 3038.510374] Kernel panic - not syncing: Fatal machine check on current CPU
[...]
> - As far as I understand this post:
>   http://forum.mepiscommunity.org/viewtopic.php?p=317898
>   the kernel should not crash and, on 32-bit systems, could safely ignore this hardware error.

No, that is nonsense.  MCEs cannot be ignored and are not specific to
x86_64.  In some cases the kernel may be able to recover from them, but
apparently not in this case.

Usually an MCE is due to faulty hardware.

Does this often happen shortly after resuming from hibernation?
Or was that just when it happened on this occasion?

Please install the mcelog package, run 'mcelog --ascii' (as root) and
paste these log lines in the terminal:

[ 3038.509482] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 1: b200000000000125
[ 3038.509716] mce: [Hardware Error]: TSC 6ecec0fdf
[ 3038.509849] mce: [Hardware Error]: PROCESSOR 0:6d8 TIME 1396476570 SOCKET 0 APIC 0 microcode 20

This may be able to identify a memory module that is at fault.
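As a rough cross-check alongside mcelog, the summary line can also be
split into its fields by hand. The sketch below is not part of mcelog and
only covers the architectural flag bits in the upper part of the bank
status word, assuming the IA32_MCi_STATUS layout documented in the Intel
SDM; the function name, the table name and the returned dictionary keys
are hypothetical, chosen for illustration.

```python
import re

# Architectural flag bits of an IA32_MCi_STATUS value
# (assumption: layout as documented in the Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register contains valid information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Matches the kernel's MCE summary line as it appears in the log above.
LINE_RE = re.compile(
    r"CPU (\d+): Machine Check(?: Exception)?: ([0-9a-fA-F]+) "
    r"Bank (\d+): ([0-9a-fA-F]+)"
)


def decode_mce_line(line):
    """Hypothetical helper: split an MCE summary line into its fields."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("not an MCE summary line")
    status = int(match.group(4), 16)
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "cpu": int(match.group(1)),
        "mcgstatus": int(match.group(2), 16),
        "bank": int(match.group(3)),
        "flags": flags,
        # Low 16 bits: architectural MCA error code (left undecoded here).
        "mca_error_code": status & 0xFFFF,
        # Bits 31:16: model-specific error code.
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    sample = ("[ 3038.509482] mce: [Hardware Error]: CPU 0: "
              "Machine Check Exception: 4 Bank 1: b200000000000125")
    print(decode_mce_line(sample))
```

For the status word reported here, the set flags are VAL, UC, EN and PCC.
PCC means the processor context may be corrupt, which is consistent with
the kernel panicking rather than recovering. The meaning of the low 16
bits is left to mcelog.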

Ben.

-- 
Ben Hutchings
The generation of random numbers is too important to be left to chance.
                                                            - Robert Coveyou
