
vaughan.debian.org



Those of you who saw my recent blog post [1] are, no doubt, waiting with
bated breath for the return of our mipsel porting machine.
Unfortunately, problems persist even after addressing the cooling
problems that I initially believed were affecting the machine's
stability.

Vaughan will run for some time, but will eventually start misbehaving.
It stays up longer if it's not under any load, but still does eventually
go down.  Here are some of the kernel dumps that it shows.  These dumps
are from Linux 2.6.23.1, but similar problems occur with other
kernels.

Kernel bug detected[#2]:
Cpu 0
$ 0   : 00000000 b0007c01 00000001 00003fff
$ 4   : 810caa60 7fe9bf0a 80310000 000caa60
$ 8   : 00006553 7fe9bf0a 800f1098 00000000
$12   : 00000000 00000000 85811da0 746f6f72
$16   : 810caa60 8347f56c 0000000e 7fe9bf0a
$20   : 811c11b8 803321e0 856d7e2c 856d7e28
$24   : 99999999 2ac30710
$28   : 856d6000 856d7da8 00000001 80089e2c
Hi    : 00000000
Lo    : 00000000
epc   : 8008ad9c kmap_coherent+0xc/0xe0     Tainted: G      D
ra    : 80089e2c __flush_anon_page+0x4c/0x68
Status: b0007c03    KERNEL EXL IE
Cause : 00808034
PrId  : 000028a0
Process w (pid: 28428, threadinfo=856d6000, task=8116e928)
Stack : 803321e0 8347f56c 0000000e 7fe9bf0a 800db0d0 800dad84 00000001 856d7ea0
        800f18d0 00000000 00000011 00000000 00000030 00000000 803321e0 7fe9bf0a
        866c8000 0000000f 000007ff 803321e0 00000000 856d7e28 856d7e2c 800db2b8
        811c11b8 8116e928 000000d0 00000000 00000000 00000001 856d7e2c 856d7e28
        00000000 810caa60 80332214 00000000 803321e0 00000000 0000000f 866c8000
        ...
Call Trace:
[<8008ad9c>] kmap_coherent+0xc/0xe0
[<80089e2c>] __flush_anon_page+0x4c/0x68
[<800db0d0>] get_user_pages+0x3c4/0x4ac
[<800db2b8>] access_process_vm+0x100/0x21c
[<8012d91c>] proc_pid_cmdline+0xa4/0x14c
[<8012f858>] proc_info_read+0x100/0x140
[<800f0b4c>] vfs_read+0xc0/0x160
[<800f10ec>] sys_read+0x54/0xa0
[<80088d0c>] stack_done+0x20/0x3c


Code: 8c820000  00021242  30420001 <00028036> 8f820014  3c038035  24420001  af820014  8c629240

This is the first sign of trouble.  The symptoms observable from
userland are that just about any program that you try to run dies with a
segfault.  The machine never recovers from this state, and eventually
gets worse:

CPU 0 Unable to handle kernel paging request at virtual address
000000d0, epc == 800ebb34, ra == 800eb68c
Oops[#4]:
Cpu 0
$ 0   : 00000000 90007c00 8035dc08 000000d0
$ 4   : 8111fa80 83fdb990 0000002a 83fdb000
$ 8   : 8035dc00 00000000 00000001 00024000
$12   : 00000001 00080000 fff7ffff 00200200
$16   : 8035e694 00000021 8111fa80 00000000
$20   : 00024000 80350000 00200200 00100100
$24   : 00100100 00000000
$28   : 80378000 80379cd8 0000003c 800eb68c
Hi    : 00000036
Lo    : 000000d8
epc   : 800ebb34 free_block+0xec/0x1b0     Tainted: G      D
ra    : 800eb68c cache_flusharray+0x74/0xfc
Status: 90007c02    KERNEL EXL
Cause : 0080800c
BadVA : 000000d0
PrId  : 000028a0
Process kswapd0 (pid: 72, threadinfo=80378000, task=8116fa08)
Stack : 00808400 800cf650 90007c01 800b4334 0000003c 90007c01 00000000 8035e600
	8035e610 80379da8 00000001 00000000 0000000d 800eb68c 819aae70 0000002a
	87ead070 0000003a 8035e600 90007c01 8695e8c0 80379f48 00000001 800eb938
	80355ca0 810d5a40 80379e74 80379f48 8695e8c0 00000001 80379e74 80116a58
	80379e74 80379f48 00000001 80379da8 80116f30 80116f10 800d4c78 8101e2a0
	...
Call Trace:
[<800ebb34>] free_block+0xec/0x1b0
[<800eb68c>] cache_flusharray+0x74/0xfc
[<800eb938>] kmem_cache_free+0x110/0x118
[<80116a58>] free_buffer_head+0x2c/0x48
[<80116f30>] try_to_free_buffers+0x6c/0xcc
[<800d5330>] shrink_page_list+0x640/0x7fc
[<800d573c>] shrink_zone+0x250/0xbfc
[<800d6700>] kswapd+0x2ac/0x434
[<800b8658>] kthread+0x58/0x94
[<800835a4>] kernel_thread_helper+0x10/0x18

Code: 8ce30004  8ce20000  8c88004c <ac620000> ac430004  acf70000 acf60004  8ce2000c  8e440014

The call trace in this latter case isn't always the same, but free_block
always seems to be at the top of the stack.

It's quite possible that this is a hardware problem.  Do others concur?
Is there any chance that it is software?  If it is hardware, my
inclination would be to suspect RAM.  Does anybody have a decent source
for Cobalt Raq2 memory?

noah

1. http://nlm-morgul.livejournal.com/12188.html
