Wheezy: mcelog not getting notified of ECC errors anymore?

To: debian-amd64@lists.debian.org
Subject: Wheezy: mcelog not getting notified of ECC errors anymore?
From: Steffen Grunewald <Steffen.Grunewald@aei.mpg.de>
Date: Mon, 3 Jun 2013 10:11:54 +0200
Message-id: <20130603081154.GY20847@casco.aei.mpg.de>

After installing Wheezy (using FAI, so the setup is essentially unaltered),
one of my machines doesn't report memory errors via mcelog anymore. Error
messages go to syslog instead:

> Jun  3 09:47:07 testbed kernel: [231899.816038] [Hardware Error]: CPU:0	MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun  3 09:47:07 testbed kernel: [231899.816282] [Hardware Error]: 	MC0_ADDR: 0x0000000076d39ec0
> Jun  3 09:47:07 testbed kernel: [231899.816377] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:47:07 testbed kernel: [231899.816534] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:47:07 testbed kernel: [231899.816899] [Hardware Error]: CPU:0	MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun  3 09:47:07 testbed kernel: [231899.817136] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun  3 09:47:07 testbed kernel: [231899.817314] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
> Jun  3 09:47:07 testbed kernel: [231899.817677] [Hardware Error]: CPU:0	MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400200000813
> Jun  3 09:47:07 testbed kernel: [231899.817915] [Hardware Error]: 	MC4_ADDR: 0x000000007fafc410
> Jun  3 09:47:07 testbed kernel: [231899.818009] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Jun  3 09:47:07 testbed kernel: [231899.818189] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x7fafc410
> Jun  3 09:47:07 testbed kernel: [231899.818289] EDAC MC0: CE page 0x7fafc, offset 0x410, grain 0, syndrome 0xce, row 1, channel 0, label "": amd64_edac
> Jun  3 09:47:07 testbed kernel: [231899.818298] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:47:08 testbed kernel: [231900.804029] [Hardware Error]: CPU:1	MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun  3 09:47:08 testbed kernel: [231900.804278] [Hardware Error]: 	MC0_ADDR: 0x000000007a673600
> Jun  3 09:47:08 testbed kernel: [231900.804371] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:47:08 testbed kernel: [231900.804530] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:47:08 testbed kernel: [231900.804894] [Hardware Error]: CPU:1	MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun  3 09:47:08 testbed kernel: [231900.805130] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun  3 09:47:08 testbed kernel: [231900.810632] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.816039] [Hardware Error]: CPU:0	MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun  3 09:52:07 testbed kernel: [232199.816284] [Hardware Error]: 	MC0_ADDR: 0x00000021086ea0c0
> Jun  3 09:52:07 testbed kernel: [232199.816378] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:52:07 testbed kernel: [232199.816536] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.816901] [Hardware Error]: CPU:0	MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd400400000000813
> Jun  3 09:52:07 testbed kernel: [232199.817139] [Hardware Error]: 	MC2_ADDR: 0x0000000077ef0cc0
> Jun  3 09:52:07 testbed kernel: [232199.817232] [Hardware Error]: Bus Unit Error: RD/ECC error in data read from NB: SRC.
> Jun  3 09:52:07 testbed kernel: [232199.817409] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.817771] [Hardware Error]: CPU:0	MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400200000813
> Jun  3 09:52:07 testbed kernel: [232199.818008] [Hardware Error]: 	MC4_ADDR: 0x000000007fafc410
> Jun  3 09:52:07 testbed kernel: [232199.818101] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Jun  3 09:52:07 testbed kernel: [232199.818282] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x7fafc410
> Jun  3 09:52:07 testbed kernel: [232199.818382] EDAC MC0: CE page 0x7fafc, offset 0x410, grain 0, syndrome 0xce, row 1, channel 0, label "": amd64_edac
> Jun  3 09:52:07 testbed kernel: [232199.818391] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:52:08 testbed kernel: [232200.804035] [Hardware Error]: CPU:1	MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun  3 09:52:08 testbed kernel: [232200.804283] [Hardware Error]: 	MC0_ADDR: 0x000000007a673600
> Jun  3 09:52:08 testbed kernel: [232200.804377] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:52:08 testbed kernel: [232200.804534] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:52:08 testbed kernel: [232200.804899] [Hardware Error]: CPU:1	MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun  3 09:52:08 testbed kernel: [232200.805136] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun  3 09:52:08 testbed kernel: [232200.805312] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
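
The bracketed flags in the MCx_STATUS lines above appear to be derived from the raw 64-bit value printed after them. The following is a minimal Python sketch of that decoding, not the kernel's actual code. It assumes the usual x86 MCi_STATUS layout (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57) and the AMD convention that bits 46:45 encode CECC/UECC. The name decode_mc_status is hypothetical.

```python
# Assumed bit positions of the architectural MCi_STATUS flags.
MCI_STATUS_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (clear means corrected, "CE")
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mc_status(status):
    """Return a dict of flag name -> bool, plus an 'ECC' entry.

    'ECC' is None, 'CECC' or 'UECC', taken from bits 46:45
    (assumption: AMD-style encoding, 2 means correctable).
    """
    flags = {
        name: bool((status >> bit) & 1)
        for name, bit in MCI_STATUS_BITS.items()
    }
    ecc = (status >> 45) & 0x3
    if ecc == 0:
        flags["ECC"] = None
    elif ecc == 2:
        flags["ECC"] = "CECC"
    else:
        flags["ECC"] = "UECC"
    return flags


if __name__ == "__main__":
    # Raw value taken from the MC0_STATUS line in the log above.
    for name, value in decode_mc_status(0xD467400000000833).items():
        print(name, value)
```

Under these assumptions, the MC0_STATUS value from the log decodes to OVER and ADDRV set, UC and PCC clear, and a correctable ECC error, which agrees with the printed flags [Over|CE|-|-|AddrV|CECC].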

The mcelog setup hasn't changed; in fact, /etc/mcelog/* is identical to that of a
Squeeze setup that works. The mcelog logfile just stays at zero size.

I find it a bit hard to spot the important information in the syslog records, in
particular whether an ECC error has been corrected or not (and when to take
action, i.e. power off the node).
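
As a stopgap for reading such records, here is a minimal Python sketch that picks the MCx_STATUS lines out of syslog text and counts corrected versus uncorrected events per CPU and bank. It relies only on the line format shown in the log above. It assumes that an uncorrected error would be printed as "UE" or "UECC" in the bracketed flags, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches "CPU:<n> MC<bank>_STATUS[flag|flag|...]: 0x<hex>" as seen in the log.
STATUS_RE = re.compile(
    r"CPU:(?P<cpu>\d+)\s+MC(?P<bank>\d+)_STATUS"
    r"\[(?P<flags>[^\]]*)\]:\s*(?P<raw>0x[0-9a-fA-F]+)"
)


def parse_status_line(line):
    """Return a dict describing one MCx_STATUS line, or None if no match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    flags = match.group("flags").split("|")
    return {
        "cpu": int(match.group("cpu")),
        "bank": int(match.group("bank")),
        # Assumption: uncorrected errors show up as "UE" or "UECC".
        "uncorrected": "UE" in flags or "UECC" in flags,
        "overflow": "Over" in flags,
        "raw": int(match.group("raw"), 16),
    }


def summarize(lines):
    """Count events keyed by (cpu, bank, 'corrected' or 'uncorrected')."""
    counts = Counter()
    for line in lines:
        event = parse_status_line(line)
        if event is None:
            continue  # not an MCx_STATUS line
        kind = "uncorrected" if event["uncorrected"] else "corrected"
        counts[(event["cpu"], event["bank"], kind)] += 1
    return counts


def needs_action(counts):
    """True if any uncorrected event was counted."""
    return any(kind == "uncorrected" for (_cpu, _bank, kind) in counts)
```

Feeding the lines of the syslog file to summarize() yields one counter entry per CPU, bank and error kind; needs_action() then reduces that to a single yes/no. Where the threshold for powering off a node should lie is a policy decision that this sketch does not attempt to answer.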

Obviously I have missed an important change (perhaps related to the edac_* modules?).
How can I get back to mcelog?

Thanks,
 S

Follow-Ups:
- Re: Wheezy: mcelog not getting notified of ECC errors anymore?
  - From: Karl Schmidt <karl@xtronics.com>
