Re: apparent crashes persist.
> Once again, when it crashes, I can sometimes still manage to use a ssh
> connection to get in from elsewhere. What information should I collect,
> and how should I analyse it?
Have you tried using completely new RAM from a different vendor or of
a different make (e.g. single-sided instead of double-sided, or vice
versa)? We had a 256-node cluster where we found that the RAM was
plain incompatible, and we had to replace 2048 DIMMs with ones from a
different vendor to get any stability. Even then we still had a 5%
failure rate per DIMM in a week of stress testing.
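
To put that 5% figure in perspective, here is a rough back-of-the-envelope
sketch of what it means per node. It assumes the 2048 DIMMs were spread
evenly over the 256 nodes (8 per node) and that DIMMs fail independently of
each other; both are my assumptions for illustration, not measurements.

```python
# Rough estimate: how a per-DIMM failure rate translates to per-node risk.
# Assumptions (not measured): DIMMs are evenly spread across nodes and
# fail independently of one another.

def node_failure_probability(p_dimm: float, dimms_per_node: int) -> float:
    """Probability that at least one DIMM in a node fails."""
    return 1.0 - (1.0 - p_dimm) ** dimms_per_node

NODES = 256
DIMMS = 2048
P_DIMM = 0.05  # 5% per DIMM over one week of stress testing

dimms_per_node = DIMMS // NODES
p_node = node_failure_probability(P_DIMM, dimms_per_node)

print(f"DIMMs per node:            {dimms_per_node}")
print(f"Expected failing DIMMs:    {DIMMS * P_DIMM:.1f}")
print(f"P(node has a bad DIMM):    {p_node:.1%}")
print(f"Expected nodes affected:   {NODES * p_node:.0f}")
```

Under those assumptions, a small per-DIMM rate still leaves roughly a third
of the nodes with at least one bad DIMM, which is why RAM is worth ruling
out early.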