OT: AMD chips cause kernel errors and hangs?
We have a Linux cluster of 1000 nodes. I wasn't involved in setting it up.
The nodes run Red Hat 6.2 with kernel 2.2.19 on dual 1.2GHz AMD CPUs, with
2GB memory, 2GB swap, and gigabit Ethernet.
Several nodes hang and/or get kernel errors every day. The first causes
that come to mind are bad RAM and running out of virtual memory. I've
pasted some logs below.
The slaves mostly run FORTRAN code compiled with Lahey F95 v6.0 and g77
(0.5.24-19981002).
What else could cause these errors? Are there special kernel config issues
for AMD chips?
I've run Linux for 9 years, always on Intel CPUs, and have used Debian since
before the first official release ("buzz"), but I have never heard of so many
problems.
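
With this many nodes, it may help to check whether the oopses cluster on a few hosts (which would point toward bad hardware) or on the same kernel functions across many hosts (which would point toward the kernel). Here is a minimal sketch, assuming the raw, unwrapped syslog lines from the nodes have been collected in one place. The regular expressions and the `tally_oopses` name are illustrative only and are not part of any existing tool.

```python
import re
from collections import Counter

# Assumed syslog layout, as in the pasted logs: "Mon DD HH:MM:SS host kernel: msg"
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s+(.*)$")
# Symbolized address as printed by klogd, e.g. "[d_lookup+100/224]"
FUNC_RE = re.compile(r"\[(\w+)\+\d+/\d+\]")


def tally_oopses(lines):
    """Count oopses per host and per faulting function (taken from the EIP line)."""
    by_host = Counter()
    by_func = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a kernel message in the assumed format
        host, msg = m.groups()
        if msg.startswith("Unable to handle kernel"):
            by_host[host] += 1
        elif msg.startswith("EIP:"):
            f = FUNC_RE.search(msg)
            if f:
                by_func[f.group(1)] += 1
    return by_host, by_func


if __name__ == "__main__":
    sample = [
        "Aug 21 04:02:01 hou000721cs kernel: Unable to handle kernel paging request at virtual address 11008010",
        "Aug 21 04:02:01 hou000721cs kernel: EIP: 0010:[d_lookup+100/224]",
        "Aug 21 04:02:00 hou000587cs kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000040",
        "Aug 21 04:02:00 hou000587cs kernel: EIP: 0010:[dput+295/328]",
    ]
    hosts, funcs = tally_oopses(sample)
    print(hosts.most_common())
    print(funcs.most_common())
```

Lines that have been re-wrapped by a mail client, as in the excerpts below, would need to be rejoined before being fed to this function.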
ch_binary_handler+67/168] [do_execve+417/516] [sys_execve+75/124]
[system_call+52/56]
Aug 21 06:35:07 hou000752cs kernel: Code: f6 46 24 01 74 52 8b 4c 24 68 39
4e 14 75 49 8b 4c 24 64 31
Aug 21 06:35:07 hou000752cs inetd[458]: pid 11124: exit signal 11
Aug 21 06:35:07 hou000752cs kernel: Unable to handle kernel paging request
at virtual address 00ff0024
Aug 21 06:35:07 hou000752cs kernel: current->tss.cr3 = 1463e000, %cr3 =
1463e000
Aug 21 06:35:07 hou000752cs kernel: *pde = 00000000
Aug 21 06:35:07 hou000752cs kernel: Oops: 0000
Aug 21 06:35:07 hou000752cs kernel: CPU: 0
Aug 21 06:35:07 hou000752cs kernel: EIP:
0010:[locks_remove_posix+44/152]
Aug 21 06:35:07 hou000752cs kernel: EFLAGS: 00010206
Aug 21 06:35:07 hou000752cs kernel: eax: 94629b04 ebx: be6b35a0 ecx:
94629a94 edx: 947f6920
Aug 21 06:35:07 hou000752cs kernel: esi: 00ff0000 edi: 942157c0 ebp:
94629b04 esp: 93a9bc28
Aug 21 06:35:07 hou000752cs kernel: ds: 0018 es: 0018 ss: 0018
Aug 21 06:35:07 hou000752cs kernel: Process in.ftpd (pid: 11125, process
nr: 30, stackpage=93a9b000)
Aug 21 06:35:07 hou000752cs kernel: Stack: 942157c0 bcc13f60 94629b04
94629a94 8012699a 94785f00 93a9a000 94785f00
Aug 21 06:35:07 hou000752cs kernel: fffffff7 00000202 93f45aa0
00013000 93f45a40 2aabf000 93f45adc 80135619
Aug 21 06:35:07 hou000752cs kernel: 80135626 93f45a40 08085fc0
0806b800 00000000 bcc13f60 80126991 be6b35a0
Aug 21 06:35:07 hou000752cs kernel: Call Trace: [filp_close+82/92]
[load_elf_interp+677/708] [load_elf_interp+690/708] [filp
------------------------------------------------------------------------
Aug 21 04:02:00 hou000721cs anacron[5515]: Updated timestamp for job
`cron.daily' to 2001-08-21
Aug 21 04:02:01 hou000721cs kernel: Unable to handle kernel paging request
at virtual address 11008010
Aug 21 04:02:01 hou000721cs kernel: current->tss.cr3 = 145aa000, %cr3 =
145aa000
Aug 21 04:02:01 hou000721cs kernel: *pde = 00000000
Aug 21 04:02:01 hou000721cs kernel: Oops: 0000
Aug 21 04:02:01 hou000721cs kernel: CPU: 0
Aug 21 04:02:01 hou000721cs kernel: EIP: 0010:[d_lookup+100/224]
Aug 21 04:02:01 hou000721cs kernel: EFLAGS: 00010217
Aug 21 04:02:01 hou000721cs kernel: eax: beee9a88 ebx: 11007ff8 ecx:
00000022 edx: bee00000
Aug 21 04:02:01 hou000721cs kernel: esi: 322f6ef6 edi: ac72f00a ebp:
11008010 esp: 8542bf3c
Aug 21 04:02:01 hou000721cs kernel: ds: 0018 es: 0018 ss: 0018
Aug 21 04:02:01 hou000721cs kernel: Process slocate (pid: 5612, process
nr: 18, stackpage=8542b000)
Aug 21 04:02:01 hou000721cs kernel: Stack: ac72f00a 00000000 beee9a88
ac72f000 322f6ef6 0000000a 8012df0c aa7363e0
Aug 21 04:02:01 hou000721cs kernel: 8542bf84 8542bf84 8012e187
aa7363e0 8542bf84 00000000 ac72f000 ac72f000
Aug 21 04:02:01 hou000721cs kernel: 8542a000 7ffffc38 ac72f000
0000000a 322f6ef6 8012e284 ac72f000 aa7363e0
Aug 21 04:02:01 hou000721cs kernel: Call Trace: [cached_lookup+16/84]
[lookup_dentry+275/488] [__namei+40/88] [sys_newlstat+42/140]
[system_call+52/56]
Aug 21 04:02:01 hou000721cs kernel: Code: 8b 6d 00 8b 74 24 18 39 73 48 75
5c 8b 74 24 24 39 73 0c 75
------------------------------------------------------------------------
Aug 19 12:10:00 hou000669cs kernel: Unable to handle kernel paging request
at virtual address d2040200
Aug 19 12:10:00 hou000669cs kernel: current->tss.cr3 = 11c09000, %cr3 =
11c09000
Aug 19 12:10:00 hou000669cs kernel: *pde = 00000000
Aug 19 12:10:00 hou000669cs kernel: Oops: 0000
Aug 19 12:10:00 hou000669cs kernel: CPU: 0
Aug 19 12:10:00 hou000669cs kernel: EIP: 0010:[flush_old_exec+196/552]
Aug 19 12:10:00 hou000669cs kernel: EFLAGS: 00010246
Aug 19 12:10:00 hou000669cs kernel: eax: 00000000 ebx: 9b040000 ecx:
9b041e5c edx: 11c09000
Aug 19 12:10:00 hou000669cs kernel: esi: 00000000 edi: 801e59c3 ebp:
9a5c4000 esp: 9b041ca0
Aug 19 12:10:00 hou000669cs kernel: ds: 0018 es: 0018 ss: 0018
Aug 19 12:10:00 hou000669cs kernel: Process crond (pid: 15182, process nr:
24, stackpage=9b041000)
Aug 19 12:10:00 hou000669cs kernel: Stack: 801e59c3 befddf80 00000000
9b040000 80135d52 9b041e5c 8021e718 fffffff8
Aug 19 12:10:00 hou000669cs kernel: 9b040000 00000000 00000000
00000000 00030003 00000001 00001990 00000034
Aug 19 12:10:00 hou000669cs kernel: 464c457f 00010101 00000000
00000080 9b041d6c befcf400 9b041da4 805427b0
Aug 19 12:10:00 hou000669cs kernel: Call Trace: [cprt+1315/42661]
[load_elf_binary+1546/3480] [update_atime+94/100]
[do_generic_file_read+1524/1536] [cprt+1312/42661]
[search_binary_handler+67/168] [do_execve+417/516]
Aug 19 12:10:00 hou000669cs kernel: [sys_execve+75/124]
[system_call+52/56]
Aug 19 12:10:00 hou000669cs kernel: Code: 66 39 83 00 02 00 00 75 29 8b 7c
24 14 66 8b 87 06 02 00 00
--------------------------------------------------------------------------
Aug 21 04:02:00 hou000587cs kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000040
Aug 21 04:02:00 hou000587cs kernel: current->tss.cr3 = 20c50000, %cr3 =
20c50000
Aug 21 04:02:00 hou000587cs kernel: *pde = 00000000
Aug 21 04:02:00 hou000587cs kernel: Oops: 0000
Aug 21 04:02:00 hou000587cs kernel: CPU: 0
Aug 21 04:02:00 hou000587cs kernel: EIP: 0010:[dput+295/328]
Aug 21 04:02:00 hou000587cs kernel: EFLAGS: 00010286
Aug 21 04:02:00 hou000587cs kernel: eax: 00000000 ebx: 8aa1d680 ecx:
a14faf80 edx: a14fad7c
Aug 21 04:02:00 hou000587cs kernel: esi: ffffffff edi: 00001004 ebp:
00000001 esp: 9cf7be64
Aug 21 04:02:00 hou000587cs kernel: ds: 0018 es: 0018 ss: 0018
Aug 21 04:02:00 hou000587cs kernel: Process slocate (pid: 32162, process
nr: 30, stackpage=9cf7b000)
Aug 21 04:02:00 hou000587cs kernel: Stack: 8aa1d680 80132c0c 8aa1d680
9cf7beb0 9cf7beb0 8021e644 00001004 00001004
Aug 21 04:02:00 hou000587cs kernel: 80133d68 fffff7f6 00000806
00000000 8024a198 8021e644 8024a198 a53672a0
Aug 21 04:02:00 hou000587cs kernel: a53672a0 00000000 98bea3fc
9cf7beb0 9cf7beb0 80133df6 00001004 00000000
Aug 21 04:02:00 hou000587cs kernel: Call Trace: [prune_dcache+288/340]
[try_to_free_inodes+316/396] [grow_inodes+30/440] [get_new_inode+197/312]
[iget4+134/144] [iget+19/24] [ext2_lookup+84/124]
Aug 21 04:02:00 hou000587cs kernel: [real_lookup+80/160]
[lookup_dentry+296/488] [__namei+40/88] [sys_newlstat+42/140]
[system_call+52/56]
Aug 21 04:02:00 hou000587cs kernel: Code: 8b 40 40 50 56 68 e0 55 1e 80 e8
7a 20 fe ff c7 05 00 00 00
...RickM...