
Re: Hardware failure?: Now what?



Charles Curley <charlescurley@charlescurley.com> wrote:

> Mar 20 13:58:29 hawk rasdaemon[892]: Calling ras_mc_event_opendb()
> Mar 20 13:58:29 hawk rasdaemon[892]: cpu 03:rasdaemon: mce_record store: 0x55c124c9b148
> Mar 20 13:58:29 hawk kernel: [  300.407406] mce: [Hardware Error]: Machine check events logged
> Mar 20 13:58:29 hawk kernel: [  300.407410] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 0: 90000040000f0005
> Mar 20 13:58:29 hawk kernel: [  300.407411] mce: [Hardware Error]: TSC f442c87fda 
> Mar 20 13:58:29 hawk kernel: [  300.407413] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1616270309 SOCKET 0 APIC 6 microcode 19
> Mar 20 13:58:29 hawk rasdaemon[892]: rasdaemon: register inserted at db

> 1 2021-03-20 13:58:30 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0xf442c87fda, walltime=0x605653e5, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
> 2 2021-03-20 14:07:07 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x274d9e61020, walltime=0x605655ea, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
> 3 2021-03-20 14:07:07 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x27517a5dacb, walltime=0x605655eb, cpu=0x00000003, cpuid=0x000306c3, apicid=0x00000006
> 4 2021-03-20 14:10:34 -0600 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c09, status=0x90000040000f0005, tsc=0x30ea8517bee, walltime=0x605656b9, cpuid=0x000306c3

> If I read that correctly, CPU 3 is seeing and correcting internal parity
> errors.

Correct.
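
For anyone who wants to check the reading by hand, here is a minimal Python sketch that splits the status word from the log into its architectural fields. The bit positions follow the IA32_MCi_STATUS layout in the Intel SDM as I understand it; the function name and the small code table are just for illustration, and the corrected-error count is only meaningful on CPUs that report CMCI support (bit 10 of mcgcap, which is set in 0xc09).

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# decode_mci_status and SIMPLE_ERROR_CODES are illustrative names, not
# part of rasdaemon or any other tool.

SIMPLE_ERROR_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
}


def decode_mci_status(status):
    """Return the main fields of a 64-bit machine-check status word."""
    mca_code = status & 0xFFFF
    return {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
        "mca_error_name": SIMPLE_ERROR_CODES.get(mca_code, "compound or unknown"),
    }


fields = decode_mci_status(0x90000040000F0005)
for name, value in fields.items():
    print(f"{name}: {value}")
```

For 0x90000040000f0005 this reports a valid, enabled record with the uncorrected bit clear and MCA error code 0x0005, which matches the "Corrected_error ... Internal parity error" text in your records.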

> The board is an ASUS H97M-E, bios date 05/15/2015. Processor is
> Intel(R) Core(TM) i7-4790S CPU @ 3.20GHz, with eight processors.

> Now what?

Nothing really.

Check if there is a BIOS/Firmware update available.

Check if the voltages are set correctly in the BIOS/Firmware. (Usually
by loading the defaults and setting everything to "auto".)

Check the temperature of the CPU.
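
If you do not have a sensors tool at hand, a rough sketch like the following reads the temperatures straight from the kernel's hwmon sysfs files. It assumes the usual layout (hwmon*/temp*_input in millidegrees Celsius, optional temp*_label and name files) and that a driver such as coretemp is loaded; read_hwmon_temps is a made-up helper name.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees Celsius} for every readable sensor."""
    readings = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for temp_file in sorted(chip.glob("temp*_input")):
            # temp1_input may have a matching temp1_label with a readable name.
            label_file = temp_file.with_name(temp_file.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else temp_file.name
            try:
                millidegrees = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors exist but cannot be read
            readings[f"{chip_name}/{label}"] = millidegrees / 1000.0
    return readings
```

Calling read_hwmon_temps() a few times while the machine is idle and again under load should show whether the errors coincide with high core temperatures.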

Check if the latest intel-microcode package from Debian is installed
(3.20201118.1~deb10u1 at the moment) or grab the newest one from testing
(3.20210216.1).
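
To see whether a newer microcode actually got loaded after installing the package and rebooting, you can compare the revision before and after. The kernel line above shows "microcode 19" (hexadecimal). A small sketch, assuming the usual "microcode : 0x..." lines in /proc/cpuinfo on x86; the function name is hypothetical:

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the sorted set of microcode revisions found in /proc/cpuinfo text."""
    found = re.findall(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE)
    return sorted({int(rev, 16) for rev in found})


def current_microcode(path="/proc/cpuinfo"):
    """Read the running system's revisions; all logical CPUs should agree."""
    with open(path) as handle:
        return microcode_revisions(handle.read())
```

On a running system, `[hex(r) for r in current_microcode()]` should list a single value; if it is still 0x19 after the update, the package did not bring a newer revision for this CPU or it was not applied.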

Try running mprime (the Linux build of Prime95) in torture-test mode for
some time to see if it complains and if the errors occur more often
under load.
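
To judge "more often", it helps to look at the gaps between the logged events rather than the raw count. The following sketch parses records in the format you quoted (number, timestamp with UTC offset, "error: ...") and returns the seconds between consecutive events. The regular expression is my guess at that format from your four lines, and both function names are made up.

```python
import re
from datetime import datetime

# Matches e.g. "1 2021-03-20 13:58:30 -0600 error: Internal parity error, ..."
# A leading "> " quote marker is tolerated.
RECORD = re.compile(
    r"^\s*(?:>\s*)?\d+ (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) error: ([^,]+),"
)


def event_times(lines):
    """Extract the timestamp of every record line that matches RECORD."""
    times = []
    for line in lines:
        match = RECORD.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S %z"))
    return times


def gaps_in_seconds(times):
    """Seconds between each event and the next one."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
```

Feed it the saved records from an idle period and from a torture-test run and compare the gaps; clearly shorter gaps under load would point at heat or voltage rather than a purely random event.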

Also run memtest86+ for some time to verify the correctness of your RAM.

In the end, if the error is in one of the caches inside the CPU, there
is really nothing you can do.

Regards,
Sven.

-- 
Sigmentation fault. Core dumped.

