Re: SGI O2 - Oops
Michael Dosser wrote:
> Hi,
>
> * On 2007-03-09 20:50 <ths@networkno.de> wrote:
>
> > A function at 0xffffffff8006fb3c in the kernel passed a bad pointer
> > (0x000000002abd8498) in the call to find_get_page. This looks like it is
> > a kernel bug. Your System.map file can tell you what function that was,
> > this helps probably a bit further.
>
> Thanks for this clarification. I have neither addresses in
> /boot/System.map-2.6.18:
>
> $ grep 8006fb3c /boot/System.map-2.6.18
> $ grep 2abd8498 /boot/System.map-2.6.18
> $
>
> Or am I searching at the wrong place?
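The ...6fb3c address is a point of execution _inside_ a function, so it
has no entry of its own. The next lower address in System.map is the
start of that function (the address the function's symbol is attached
to). The 2abd8498 value is the bad pointer that was passed, not a code
address, so it will not show up in System.map at all.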
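
A minimal sketch of that lookup in Python, assuming the usual System.map
layout of "address type name" on each line. The helper names
parse_system_map and lookup are hypothetical, chosen only for this
illustration:

```python
import bisect
import sys


def parse_system_map(lines):
    """Return a sorted list of (address, name) pairs from System.map lines."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        try:
            symbols.append((int(parts[0], 16), parts[2]))
        except ValueError:
            continue  # first field was not a hex address
    symbols.sort()
    return symbols


def lookup(symbols, address):
    """Return (name, offset) for the nearest symbol at or below address.

    Returns None if the address lies below every symbol in the map.
    """
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return name, address - start


if __name__ == "__main__":
    # Usage: python lookup.py /boot/System.map-2.6.18 ffffffff8006fb3c
    with open(sys.argv[1]) as f:
        table = parse_system_map(f)
    hit = lookup(table, int(sys.argv[2], 16))
    if hit is None:
        print("address is below all symbols")
    else:
        print("%s+0x%x" % hit)
```

The result only names the function that contains the address. It is
meaningful only if the System.map file matches the kernel that was
actually running.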
> On Friday the machine looped with another Oops (I could only see this on
> the serial console) and did not respond to any network/console logins:
>
> Mem-info:
> DMA per-cpu:
> cpu 0 hot: high 186, batch 31 used:20
> cpu 0 cold: high 62, batch 15 used:56
> DMA32 per-cpu: empty
> Normal per-cpu: empty
> HighMem per-cpu: empty
> Free pages: 2708kB (0kB HighMem)
> Active:98243 inactive:118124 dirty:46 writeback:0 unstable:0 free:677 slab:31417 mapped:7733 pagetables:1319
> DMA free:2708kB min:5792kB low:7240kB high:8688kB active:392972kB inactive:472496kB present:2097152kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> DMA: 405*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 2708kB
> DMA32: empty
> Normal: empty
> HighMem: empty
> Swap cache: add 4212, delete 4212, find 2106/2640, race 0+0
> Free swap = 1020108kB
> Total swap = 1020108kB
> Free swap: 1020108kB
> 524288 pages of RAM
> 0 pages of HIGHMEM
> 271276 reserved pages
> 147134 pages shared
> 0 pages swap cached
> printk: 6 messages suppressed.
> apache_volume: page allocation failure. order:1, mode:0x21
> Call Trace:
> [<ffffffff80076efc>] __alloc_pages+0x21c/0x368
> [<ffffffff80098d74>] cache_alloc_refill+0x3c4/0x7f8
> [<ffffffff800992c4>] __kmalloc+0x11c/0x128
> [<ffffffff802b3efc>] __alloc_skb+0x8c/0x180
> [<ffffffff80237fe8>] meth_interrupt+0x610/0x8c0
> [<ffffffff80237f08>] meth_interrupt+0x530/0x8c0
> [<ffffffff8006c010>] handle_IRQ_event+0x78/0xf0
> [<ffffffff8006c1a0>] __do_IRQ+0x118/0x1c0
> [<ffffffff8000d16c>] timer_interrupt+0x1f4/0x480
> [<ffffffff8000957c>] do_IRQ+0x1c/0x38
> [<ffffffff8000797c>] ret_from_irq+0x0/0x10
> [<ffffffff801f43d0>] fbcon_cursor+0x0/0x400
> [<ffffffff80038800>] panic+0x250/0x2c0
> [<ffffffff80038828>] panic+0x278/0x2c0
> [<ffffffff8003df50>] do_exit+0x928/0xb48
> [<ffffffff8000e8c4>] die+0xec/0xf0
> [<ffffffff8000e8bc>] die+0xe4/0xf0
> [<ffffffff8000f098>] do_tr+0x0/0x120
> [<ffffffff800086b8>] handle_bp_int+0x20/0x28
> [<ffffffff800bef80>] d_callback+0x28/0x58
> [<ffffffff8009808c>] kfree+0x12c/0x138
> [<ffffffff800bef80>] d_callback+0x28/0x58
> [<ffffffff800524a0>] __rcu_process_callbacks+0xa8/0x3b0
> [<ffffffff800527e8>] rcu_process_callbacks+0x40/0x80
> [<ffffffff80041590>] tasklet_action+0xe8/0x1a8
> [<ffffffff80041590>] tasklet_action+0xe8/0x1a8
> [<ffffffff80040e6c>] __do_softirq+0xb4/0x188
> [<ffffffff80040fe0>] do_softirq+0xa0/0xa8
> [<ffffffff8000797c>] ret_from_irq+0x0/0x10
>
> I power cycled the machine and cross compiled a new kernel based on the
> linux-2.6-2.6.18.dfsg.1 sources. Kernel booted fine and this morning I get
> signal 11 errors from userland:
>
> $ uptime
> Segmentation fault
> $ strace uptime
> [...]
> open("/proc/uptime", O_RDONLY) = 3
> lseek(3, 0, SEEK_SET) = 0
> read(3, "186905.12 136515.89\n", 1023) = 20
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV +++
> Process 3453 detached
>
> Are you sure this is not a memory/hardware problem? If it is not a hardware
> problem, what would you suggest to do? Shall I provide more information?
> Is this a known problem? If yes, is there a fix somewhere?
Now this looks more like broken RAM. Try re-seating the RAM modules
first; if that doesn't help, try to find and remove the faulty
memory module.
Thiemo