Those of you who saw my recent blog post [1] are, no doubt, waiting with bated breath for the return of our mipsel porting machine. Unfortunately, problems persist even after addressing the cooling problems that I initially believed were affecting the machine's stability.

Vaughan will run for some time, but will eventually start misbehaving. It stays up longer if it's not under any load, but it still eventually goes down. Here are some of the kernel dumps that it shows. These dumps are from Linux 2.6.23.1, but similar problems occur in other kernels.

Kernel bug detected[#2]:
Cpu 0
$ 0 : 00000000 b0007c01 00000001 00003fff
$ 4 : 810caa60 7fe9bf0a 80310000 000caa60
$ 8 : 00006553 7fe9bf0a 800f1098 00000000
$12 : 00000000 00000000 85811da0 746f6f72
$16 : 810caa60 8347f56c 0000000e 7fe9bf0a
$20 : 811c11b8 803321e0 856d7e2c 856d7e28
$24 : 99999999 2ac30710
$28 : 856d6000 856d7da8 00000001 80089e2c
Hi : 00000000
Lo : 00000000
epc : 8008ad9c kmap_coherent+0xc/0xe0 Tainted: G D
ra : 80089e2c __flush_anon_page+0x4c/0x68
Status: b0007c03 KERNEL EXL IE
Cause : 00808034
PrId : 000028a0
Process w (pid: 28428, threadinfo=856d6000, task=8116e928)
Stack : 803321e0 8347f56c 0000000e 7fe9bf0a 800db0d0 800dad84 00000001 856d7ea0
        800f18d0 00000000 00000011 00000000 00000030 00000000 803321e0 7fe9bf0a
        866c8000 0000000f 000007ff 803321e0 00000000 856d7e28 856d7e2c 800db2b8
        811c11b8 8116e928 000000d0 00000000 00000000 00000001 856d7e2c 856d7e28
        00000000 810caa60 80332214 00000000 803321e0 00000000 0000000f 866c8000
        ...
Call Trace:
[<8008ad9c>] kmap_coherent+0xc/0xe0
[<80089e2c>] __flush_anon_page+0x4c/0x68
[<800db0d0>] get_user_pages+0x3c4/0x4ac
[<800db2b8>] access_process_vm+0x100/0x21c
[<8012d91c>] proc_pid_cmdline+0xa4/0x14c
[<8012f858>] proc_info_read+0x100/0x140
[<800f0b4c>] vfs_read+0xc0/0x160
[<800f10ec>] sys_read+0x54/0xa0
[<80088d0c>] stack_done+0x20/0x3c

Code: 8c820000 00021242 30420001 <00028036> 8f820014 3c038035 24420001 af820014 8c629240

This is the first sign of trouble. The symptoms observable from userland are that just about any program that you try to run dies with a segfault. The machine never recovers from this state, and eventually gets worse:

CPU 0 Unable to handle kernel paging request at virtual address 000000d0, epc == 800ebb34, ra == 800eb68c
Oops[#4]:
Cpu 0
$ 0 : 00000000 90007c00 8035dc08 000000d0
$ 4 : 8111fa80 83fdb990 0000002a 83fdb000
$ 8 : 8035dc00 00000000 00000001 00024000
$12 : 00000001 00080000 fff7ffff 00200200
$16 : 8035e694 00000021 8111fa80 00000000
$20 : 00024000 80350000 00200200 00100100
$24 : 00100100 00000000
$28 : 80378000 80379cd8 0000003c 800eb68c
Hi : 00000036
Lo : 000000d8
epc : 800ebb34 free_block+0xec/0x1b0 Tainted: G D
ra : 800eb68c cache_flusharray+0x74/0xfc
Status: 90007c02 KERNEL EXL
Cause : 0080800c
BadVA : 000000d0
PrId : 000028a0
Process kswapd0 (pid: 72, threadinfo=80378000, task=8116fa08)
Stack : 00808400 800cf650 90007c01 800b4334 0000003c 90007c01 00000000 8035e600
        8035e610 80379da8 00000001 00000000 0000000d 800eb68c 819aae70 0000002a
        87ead070 0000003a 8035e600 90007c01 8695e8c0 80379f48 00000001 800eb938
        80355ca0 810d5a40 80379e74 80379f48 8695e8c0 00000001 80379e74 80116a58
        80379e74 80379f48 00000001 80379da8 80116f30 80116f10 800d4c78 8101e2a0
        ...
Call Trace:
[<800ebb34>] free_block+0xec/0x1b0
[<800eb68c>] cache_flusharray+0x74/0xfc
[<800eb938>] kmem_cache_free+0x110/0x118
[<80116a58>] free_buffer_head+0x2c/0x48
[<80116f30>] try_to_free_buffers+0x6c/0xcc
[<800d5330>] shrink_page_list+0x640/0x7fc
[<800d573c>] shrink_zone+0x250/0xbfc
[<800d6700>] kswapd+0x2ac/0x434
[<800b8658>] kthread+0x58/0x94
[<800835a4>] kernel_thread_helper+0x10/0x18

Code: 8ce30004 8ce20000 8c88004c <ac620000> ac430004 acf70000 acf60004 8ce2000c 8e440014

The call trace in this latter case isn't always the same, but free_block does always seem to be at the top of the stack.

It's quite possible that this is a hardware problem. Do others concur? Is there any chance that it is software? If it is hardware, my inclination would be to suspect RAM. Does anybody have a decent source for Cobalt Raq2 memory?

noah

1. http://nlm-morgul.livejournal.com/12188.html