Re: SGI O2 - Oops



Michael Dosser wrote:
> Hi,
> 
> * On 2007-03-09 20:50 <ths@networkno.de> wrote:
> 
> > A function at 0xffffffff8006fb3c in the kernel passed a bad pointer
> > (0x000000002abd8498) in the call to find_get_page. This looks like it is
> > a kernel bug. Your System.map file can tell you what function that was,
> > this helps probably a bit further.
> 
> Thanks for this clarification. I have neither addresses in
> /boot/System.map-2.6.18:
> 
> $ grep 8006fb3c /boot/System.map-2.6.18
> $ grep 2abd8498 /boot/System.map-2.6.18
> $
> 
> Or am I searching at the wrong place?

The ..6fb3c is an address _inside_ a function, not its start, so grep
won't find it literally. The next lower address listed in System.map is
the start of that function (where the function's symbol is attached).
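
The lookup can be automated. What follows is only a minimal sketch, assuming
the usual "address type name" layout of System.map lines; the helper names
load_system_map and resolve are made up for illustration and are not part of
any kernel tool.

```python
import bisect
import sys


def load_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) of the nearest symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    # Index of the last symbol whose start address is <= address.
    i = bisect.bisect_right(addresses, address) - 1
    if i < 0:
        return None  # address lies below every known symbol
    start, name = symbols[i]
    return name, address - start


if __name__ == "__main__":
    # Usage: python resolve.py /boot/System.map-2.6.18 ffffffff8006fb3c
    with open(sys.argv[1]) as f:
        syms = load_system_map(f)
    hit = resolve(syms, int(sys.argv[2], 16))
    if hit is None:
        print("no symbol at or below that address")
    else:
        print("%s+0x%x" % hit)
```

The result has the same name+offset form the kernel uses in its call traces.
Note that it only picks the closest preceding symbol; it cannot tell whether
the address really belongs to that function or lies past its end.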

> On Friday the machine looped with another Oops (I could only see this on
> the serial console) and did not respond to any network/console logins:
> 
> Mem-info:
> DMA per-cpu:
> cpu 0 hot: high 186, batch 31 used:20
> cpu 0 cold: high 62, batch 15 used:56
> DMA32 per-cpu: empty
> Normal per-cpu: empty
> HighMem per-cpu: empty
> Free pages:        2708kB (0kB HighMem)
> Active:98243 inactive:118124 dirty:46 writeback:0 unstable:0 free:677 slab:31417 mapped:7733 pagetables:1319
> DMA free:2708kB min:5792kB low:7240kB high:8688kB active:392972kB inactive:472496kB present:2097152kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> DMA: 405*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 2708kB
> DMA32: empty
> Normal: empty
> HighMem: empty
> Swap cache: add 4212, delete 4212, find 2106/2640, race 0+0
> Free swap  = 1020108kB
> Total swap = 1020108kB
> Free swap:       1020108kB
> 524288 pages of RAM
> 0 pages of HIGHMEM
> 271276 reserved pages
> 147134 pages shared
> 0 pages swap cached
> printk: 6 messages suppressed.
> apache_volume: page allocation failure. order:1, mode:0x21
> Call Trace:
>  [<ffffffff80076efc>] __alloc_pages+0x21c/0x368
>  [<ffffffff80098d74>] cache_alloc_refill+0x3c4/0x7f8
>  [<ffffffff800992c4>] __kmalloc+0x11c/0x128
>  [<ffffffff802b3efc>] __alloc_skb+0x8c/0x180
>  [<ffffffff80237fe8>] meth_interrupt+0x610/0x8c0
>  [<ffffffff80237f08>] meth_interrupt+0x530/0x8c0
>  [<ffffffff8006c010>] handle_IRQ_event+0x78/0xf0
>  [<ffffffff8006c1a0>] __do_IRQ+0x118/0x1c0
>  [<ffffffff8000d16c>] timer_interrupt+0x1f4/0x480
>  [<ffffffff8000957c>] do_IRQ+0x1c/0x38
>  [<ffffffff8000797c>] ret_from_irq+0x0/0x10
>  [<ffffffff801f43d0>] fbcon_cursor+0x0/0x400
>  [<ffffffff80038800>] panic+0x250/0x2c0
>  [<ffffffff80038828>] panic+0x278/0x2c0
>  [<ffffffff8003df50>] do_exit+0x928/0xb48
>  [<ffffffff8000e8c4>] die+0xec/0xf0
>  [<ffffffff8000e8bc>] die+0xe4/0xf0
>  [<ffffffff8000f098>] do_tr+0x0/0x120
>  [<ffffffff800086b8>] handle_bp_int+0x20/0x28
>  [<ffffffff800bef80>] d_callback+0x28/0x58
>  [<ffffffff8009808c>] kfree+0x12c/0x138
>  [<ffffffff800bef80>] d_callback+0x28/0x58
>  [<ffffffff800524a0>] __rcu_process_callbacks+0xa8/0x3b0
>  [<ffffffff800527e8>] rcu_process_callbacks+0x40/0x80
>  [<ffffffff80041590>] tasklet_action+0xe8/0x1a8
>  [<ffffffff80041590>] tasklet_action+0xe8/0x1a8
>  [<ffffffff80040e6c>] __do_softirq+0xb4/0x188
>  [<ffffffff80040fe0>] do_softirq+0xa0/0xa8
>  [<ffffffff8000797c>] ret_from_irq+0x0/0x10
> 
> I power cycled the machine and cross compiled a new kernel based on the 
> linux-2.6-2.6.18.dfsg.1 sources. Kernel booted fine and this morning I get
> signal 11 errors from userland:
> 
> $ uptime
> Segmentation fault
> $ strace uptime
> [...]
> open("/proc/uptime", O_RDONLY)          = 3
> lseek(3, 0, SEEK_SET)                   = 0
> read(3, "186905.12 136515.89\n", 1023)  = 20
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV +++
> Process 3453 detached
> 
> Are you sure this is not a memory/hardware problem? If it is not a hardware
> problem, what would you suggest to do? Shall I provide more information?
> Is this a known problem? If yes, is there a fix somewhere?

Now this looks more like broken RAM. You can try re-seating the RAM
modules; if that doesn't help, try to find and remove the faulty
memory module.


Thiemo


