
[BUG] machine check Oops on Alpha



Apologies in advance for the "poor" quality of this bug report.  I have
no idea how to proceed, because the issue has historically been
intermittent to non-existent for reasons unknown.

Within 24 hours of booting my Alpha (PWS 433au), I'm pretty much
guaranteed to see a "machine check" Oops, which typically occurs
during a period of high disk activity (for example, during an "apt-get
update / upgrade").  If I want a huge mess to clean up afterward, "git
pull" on the kernel source tree will generally suffice as well :-(.

As long as the "Oops" trace doesn't include evidence of filesystem write
activity (calls to ext3/4 functions), the machine is perfectly stable
afterward for as long as I care to let it run -- days, weeks, whatever
-- no further Oopses will occur, regardless of how hard I flog the
machine.  A "bad" Oops will cause an immediate system lockup if any
process attempts to access the region of disk that was active at the
time the Oops occurred.
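
The "good" versus "bad" distinction above can be checked mechanically
against a saved log.  What follows is only a minimal sketch: the names
classify_oops and trace_symbols are hypothetical, and treating any
symbol that starts with "ext3_" or "ext4_" as filesystem write activity
is an assumption, not something the kernel reports directly.

```python
import re

# Assumption: filesystem activity appears in the trace as symbols with
# these prefixes. Adjust to match the functions actually seen.
FS_PREFIXES = ("ext3_", "ext4_")

# Matches trace entries such as
#   [<fffffc000031df78>] cia_machine_check+0x98/0xb0
TRACE_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_symbols(log_text):
    """Return the symbol names of all trace entries, in log order."""
    return TRACE_RE.findall(log_text)


def classify_oops(log_text):
    """Return ("bad", hits) if any ext3/ext4 symbol is in the trace,
    otherwise ("good", [])."""
    hits = [s for s in trace_symbols(log_text) if s.startswith(FS_PREFIXES)]
    return ("bad", hits) if hits else ("good", hits)
```

Feeding it the text of a syslog excerpt returns "good" when no ext3/4
symbols appear among the trace entries.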

While a "machine check" is normally indicative of an underlying hardware
issue, the fact that this is a one-time-per-boot issue has me thinking
otherwise.  I suspect a code path is being traversed prior to the Oops
that gets bypassed afterward.  As previously mentioned, there have been
months-long intervals in the past where the issue has either been masked
or non-existent.  Currently, the issue has persisted through several 4.X
kernel release candidates and releases.

Attached is an example of precisely what I'm talking about as far as a
"good" Oops.  It occurred within a day of the last reboot, and the
machine has been running fine since.  Been flogging the devil out of it,
too: lots of updates (hundreds of megabytes), kernel builds, etc.

While any and all help tracking this down will be appreciated, please
know that kernel rebuilds (to turn on debugging or for whatever reason)
are an overnight affair on this system.  In other words, turnaround time
on diagnostic iterations involving kernel modifications will be slow.

--Bob
Apr  9 21:40:15 smirkin kernel: Unable to handle kernel paging request at virtual address 0000000000000010
Apr  9 21:40:15 smirkin kernel: dpkg-deb(19404): Oops 0
Apr  9 21:40:15 smirkin kernel: pc = [<fffffc0000316174>]  ra = [<fffffc000031df78>]  ps = 0007    Not tainted
Apr  9 21:40:15 smirkin kernel: pc is at process_mcheck_info+0x54/0x370
Apr  9 21:40:15 smirkin kernel: ra is at cia_machine_check+0x98/0xb0
Apr  9 21:40:15 smirkin kernel: v0 = 0000000000000004  t0 = 0000000000000000  t1 = 0000000000000001
Apr  9 21:40:15 smirkin kernel: t2 = 0000000000000630  t3 = fffffc0000d405f0  t4 = fffffc0000acf166
Apr  9 21:40:15 smirkin kernel: t5 = 00000000001fffff  t6 = 00000000ffffffff  t7 = fffffc005cf38000
Apr  9 21:40:15 smirkin kernel: s0 = 0000000000000000  s1 = fffffc0000c61750  s2 = 0000000000000000
Apr  9 21:40:15 smirkin kernel: s3 = 0000000000000000  s4 = fffffc0000cbcef0  s5 = fffffc0000d405d0
Apr  9 21:40:15 smirkin kernel: s6 = fffffc0000c7ef70
Apr  9 21:40:15 smirkin kernel: a0 = 0000000000000630  a1 = fffffc0000aca965  a2 = 0000000000000630
Apr  9 21:40:15 smirkin kernel: a3 = 0000000000000000  a4 = 0000000000000000  a5 = 0000000000000000
Apr  9 21:40:15 smirkin kernel: t8 = 000000000000001f  t9 = fffffc0000acbb38  t10= fffffc0000d40608
Apr  9 21:40:15 smirkin kernel: t11= 0000000000000000  pv = fffffc0000316120  at = 0000000000800000
Apr  9 21:40:15 smirkin kernel: gp = fffffc0000cabb38  sp = fffffc005cf3b978
Apr  9 21:40:15 smirkin kernel: Disabling lock debugging due to kernel taint
Apr  9 21:40:15 smirkin kernel: Trace:
Apr  9 21:40:15 smirkin kernel: [<fffffc000031df78>] cia_machine_check+0x98/0xb0
Apr  9 21:40:15 smirkin kernel: [<fffffc0000316100>] do_entInt+0x1c0/0x1e0
Apr  9 21:40:15 smirkin kernel: [<fffffc0000311340>] ret_from_sys_call+0x0/0x10
Apr  9 21:40:15 smirkin kernel: [<fffffc0000398ea4>] get_page_from_freelist+0x504/0xa10
Apr  9 21:40:15 smirkin kernel: [<fffffc00005aa410>] clear_page+0x0/0xc4
Apr  9 21:40:15 smirkin kernel: [<fffffc00005aa428>] clear_page+0x18/0xc4
Apr  9 21:40:15 smirkin kernel: [<fffffc000039949c>] __alloc_pages_nodemask+0xec/0xa00
Apr  9 21:40:15 smirkin kernel: [<fffffc00003b70a0>] wp_page_copy.isra.100+0x3c0/0x620
Apr  9 21:40:15 smirkin kernel: [<fffffc00003b6d3c>] wp_page_copy.isra.100+0x5c/0x620
Apr  9 21:40:15 smirkin kernel: [<fffffc00003b8828>] do_wp_page.isra.102+0x128/0x640
Apr  9 21:40:15 smirkin kernel: [<fffffc00003b8758>] do_wp_page.isra.102+0x58/0x640
Apr  9 21:40:15 smirkin kernel: [<fffffc000036377c>] current_fs_time+0x4c/0x70
Apr  9 21:40:15 smirkin kernel: [<fffffc00003bac6c>] handle_mm_fault+0x73c/0x1180
Apr  9 21:40:15 smirkin kernel: [<fffffc00003bb4f8>] handle_mm_fault+0xfc8/0x1180
Apr  9 21:40:15 smirkin kernel: [<fffffc000036bbe0>] timekeeping_update+0x130/0x200
Apr  9 21:40:15 smirkin kernel: [<fffffc0000365790>] hrtimer_run_queues+0x50/0x210
Apr  9 21:40:15 smirkin kernel: [<fffffc000031ec30>] do_page_fault+0x150/0x500
Apr  9 21:40:15 smirkin kernel: [<fffffc00003bde68>] find_vma+0x28/0xc0
Apr  9 21:40:15 smirkin kernel: [<fffffc000031ebb4>] do_page_fault+0xd4/0x500
Apr  9 21:40:15 smirkin kernel: [<fffffc00003734fc>] tick_periodic.constprop.17+0x3c/0xc0
Apr  9 21:40:15 smirkin kernel: [<fffffc000031eb9c>] do_page_fault+0xbc/0x500
Apr  9 21:40:15 smirkin kernel: [<fffffc0000328244>] __do_softirq+0x184/0x310
Apr  9 21:40:15 smirkin kernel: [<fffffc0000310f7c>] entMM+0x9c/0xc0
Apr  9 21:40:15 smirkin kernel: [<fffffc0000315e8c>] handle_irq+0x8c/0xf0
Apr  9 21:40:15 smirkin kernel: [<fffffc0000315f9c>] do_entInt+0x5c/0x1e0
Apr  9 21:40:15 smirkin kernel: 
Apr  9 21:40:15 smirkin kernel: Code: a53e0008  a55e0010  23de0020  6bfa8001  a55de018  47f00412 <a2890010> 261dffe2 
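
The instruction marked with <...> in the Code: line is the one that
faulted.  As a rough aid for reading it, here is a minimal sketch of a
decoder, assuming the word is an Alpha memory-format instruction (6-bit
opcode, two 5-bit register fields, 16-bit signed displacement).  The
helper name decode_mem_insn is hypothetical and the opcode table covers
only a few common loads and stores.

```python
# Alpha integer register names, indexed by register number.
REG_NAMES = (
    ["v0"]
    + [f"t{i}" for i in range(8)]       # $1-$8
    + [f"s{i}" for i in range(7)]       # $9-$15 (s6 is also fp)
    + [f"a{i}" for i in range(6)]       # $16-$21
    + [f"t{i}" for i in range(8, 12)]   # $22-$25
    + ["ra", "pv", "at", "gp", "sp", "zero"]
)

# Partial table of memory-format opcodes (not exhaustive).
MEM_OPCODES = {
    0x08: "lda", 0x09: "ldah",
    0x28: "ldl", 0x29: "ldq",
    0x2C: "stl", 0x2D: "stq",
}


def decode_mem_insn(word):
    """Decode a 32-bit memory-format instruction word into
    (mnemonic, ra, displacement, rb)."""
    opcode = (word >> 26) & 0x3F
    ra = (word >> 21) & 0x1F
    rb = (word >> 16) & 0x1F
    disp = word & 0xFFFF
    if disp & 0x8000:            # sign-extend the 16-bit displacement
        disp -= 0x10000
    if opcode not in MEM_OPCODES:
        raise ValueError(f"opcode {opcode:#x} not in the partial table")
    return MEM_OPCODES[opcode], REG_NAMES[ra], disp, REG_NAMES[rb]


print(decode_mem_insn(0xA2890010))  # ('ldl', 'a4', 16, 's0')
```

Under that reading the faulting instruction is "ldl a4, 16(s0)".  With
s0 = 0 in the register dump, the effective address is 0x10, which agrees
with the reported virtual address 0000000000000010.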
