
Wheezy: mcelog not getting notified of ECC errors anymore?



After installing Wheezy (using FAI, so the setup is essentially unaltered),
one of my machines doesn't report memory errors via mcelog anymore. Error
messages go to syslog instead:

> Jun  3 09:47:07 testbed kernel: [231899.816038] [Hardware Error]: CPU:0	MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun  3 09:47:07 testbed kernel: [231899.816282] [Hardware Error]: 	MC0_ADDR: 0x0000000076d39ec0
> Jun  3 09:47:07 testbed kernel: [231899.816377] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:47:07 testbed kernel: [231899.816534] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:47:07 testbed kernel: [231899.816899] [Hardware Error]: CPU:0	MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun  3 09:47:07 testbed kernel: [231899.817136] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun  3 09:47:07 testbed kernel: [231899.817314] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
> Jun  3 09:47:07 testbed kernel: [231899.817677] [Hardware Error]: CPU:0	MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400200000813
> Jun  3 09:47:07 testbed kernel: [231899.817915] [Hardware Error]: 	MC4_ADDR: 0x000000007fafc410
> Jun  3 09:47:07 testbed kernel: [231899.818009] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Jun  3 09:47:07 testbed kernel: [231899.818189] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x7fafc410
> Jun  3 09:47:07 testbed kernel: [231899.818289] EDAC MC0: CE page 0x7fafc, offset 0x410, grain 0, syndrome 0xce, row 1, channel 0, label "": amd64_edac
> Jun  3 09:47:07 testbed kernel: [231899.818298] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:47:08 testbed kernel: [231900.804029] [Hardware Error]: CPU:1	MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun  3 09:47:08 testbed kernel: [231900.804278] [Hardware Error]: 	MC0_ADDR: 0x000000007a673600
> Jun  3 09:47:08 testbed kernel: [231900.804371] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:47:08 testbed kernel: [231900.804530] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:47:08 testbed kernel: [231900.804894] [Hardware Error]: CPU:1	MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun  3 09:47:08 testbed kernel: [231900.805130] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun  3 09:47:08 testbed kernel: [231900.810632] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.816039] [Hardware Error]: CPU:0	MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun  3 09:52:07 testbed kernel: [232199.816284] [Hardware Error]: 	MC0_ADDR: 0x00000021086ea0c0
> Jun  3 09:52:07 testbed kernel: [232199.816378] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:52:07 testbed kernel: [232199.816536] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.816901] [Hardware Error]: CPU:0	MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd400400000000813
> Jun  3 09:52:07 testbed kernel: [232199.817139] [Hardware Error]: 	MC2_ADDR: 0x0000000077ef0cc0
> Jun  3 09:52:07 testbed kernel: [232199.817232] [Hardware Error]: Bus Unit Error: RD/ECC error in data read from NB: SRC.
> Jun  3 09:52:07 testbed kernel: [232199.817409] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.817771] [Hardware Error]: CPU:0	MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400200000813
> Jun  3 09:52:07 testbed kernel: [232199.818008] [Hardware Error]: 	MC4_ADDR: 0x000000007fafc410
> Jun  3 09:52:07 testbed kernel: [232199.818101] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Jun  3 09:52:07 testbed kernel: [232199.818282] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x7fafc410
> Jun  3 09:52:07 testbed kernel: [232199.818382] EDAC MC0: CE page 0x7fafc, offset 0x410, grain 0, syndrome 0xce, row 1, channel 0, label "": amd64_edac
> Jun  3 09:52:07 testbed kernel: [232199.818391] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:52:08 testbed kernel: [232200.804035] [Hardware Error]: CPU:1	MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun  3 09:52:08 testbed kernel: [232200.804283] [Hardware Error]: 	MC0_ADDR: 0x000000007a673600
> Jun  3 09:52:08 testbed kernel: [232200.804377] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:52:08 testbed kernel: [232200.804534] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:52:08 testbed kernel: [232200.804899] [Hardware Error]: CPU:1	MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun  3 09:52:08 testbed kernel: [232200.805136] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun  3 09:52:08 testbed kernel: [232200.805312] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)

The mcelog setup hasn't changed; in fact, /etc/mcelog/* is identical to that of
a Squeeze setup that works. The mcelog logfile just stays at zero size.

I find it a bit hard to spot the important information in the syslog records,
in particular whether an ECC error has been corrected or not, and when to take
action (i.e. power off the node).
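
As a stopgap for reading the records above, here is a minimal sketch (not a
replacement for mcelog) that pulls the raw MCx_STATUS value out of a syslog
line and reports whether the error was corrected. It assumes the line format
shown in the excerpt above and the architectural x86 MCi_STATUS bit layout
(bit 63 Val, bit 62 Over, bit 61 UC, bit 58 AddrV, bit 57 PCC). The names
`decode_status` and `summarize` are made up for this example.

```python
import re

# Matches e.g. "CPU:0<TAB>MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400200000813"
STATUS_RE = re.compile(
    r"CPU:(\d+)\s+MC(\d+)_STATUS\[[^\]]*\]:\s+0x([0-9a-fA-F]{16})"
)


def decode_status(status):
    """Decode the architectural flag bits of a 64-bit MCi_STATUS value."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # Val: register holds a logged error
        "overflow": bit(62),         # Over: a further error hit this bank before it was read
        "uncorrected": bit(61),      # UC: set for uncorrected errors, clear for corrected ones
        "addr_valid": bit(58),       # AddrV: the matching MCx_ADDR is meaningful
        "context_corrupt": bit(57),  # PCC: processor context may be corrupt
    }


def summarize(line):
    """Return (cpu, bank, severity, overflow) for a status line, else None."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    cpu, bank = int(m.group(1)), int(m.group(2))
    flags = decode_status(int(m.group(3), 16))
    severity = "UNCORRECTED" if flags["uncorrected"] else "corrected"
    return cpu, bank, severity, flags["overflow"]
```

Feeding each line of the kernel log through `summarize` and printing only the
non-None results reduces the excerpt above to one entry per status record.
What to do on a corrected versus an uncorrected error remains a policy
decision that this sketch does not make.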

Obviously I have missed an important change (perhaps related to the edac_*
modules?). How can I get the errors reported via mcelog again?
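
To compare the Wheezy and Squeeze machines with respect to the edac_* guess,
one could list which EDAC-related modules are loaded on each. This is only a
sketch: it assumes the usual /proc/modules layout with the module name as the
first field on each line, and `edac_modules` is a hypothetical helper name.

```python
def edac_modules(proc_modules_text):
    """Return the names of loaded modules whose name contains 'edac'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field is the module name.
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names
```

It takes the contents of /proc/modules as a string, so it can be run on the
file read locally or on output copied over from another machine.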

Thanks,
 S

