Thank you for the very detailed instructions. On balance, I have decided to
carry out a fresh install of amd64 with ext3 as the file system. Your (and the
general) strong advice to change to ext3 cannot be ignored.
I am just now downloading the daily-built amd64 net install CD, so that I
can also help with the preparation of the beta3 release of the net install.
This does not mean that I can be sure my hardware is sound, but
your examination of the signals I have produced suggests that it is. I have
postponed the examination of the hardware because the disks are OK and the
memory modules could be changed should they prove faulty. This seems the
opposite of what one normally does (hardware before software), but I am not
sure I can arrive at a conclusive test of my hardware.
I do not have much to install besides the base OS and a few tasks: jwd window
manager, sensors, compilers if needed, your compilation of mpqc, and my
re-compilation of molecular mechanics (to be carried out in any case because
of improvements to the code). That's about all.
I anticipate that mpqc 2.3.1 will prove great.
On Tuesday 18 July 2006 19:44, Lennart Sorensen wrote:
> On Tue, Jul 18, 2006 at 05:33:51PM +0200, Francesco Pietra wrote:
> > Not to insist any further on the relative merits of the various
> > filesystems, but in the general interest of maintaining amd64 (and
> > therefore of examining parameters one at a time, without mixing
> > problems), did you notice my e-mail of today emphasizing that after the
> > crash my data are intact? I wonder whether your suspicion about memory or
> > cpu may be the point. How does one carry out a thorough memory test and
> > identify which slot is defective, if any? Although it is Kingston ECC, one
> > of the eight slots (1GB each) might be defective.
> Well I have certainly seen a number of messages from people with
> opterons having memory problems over the last few months. The opteron
> seems to be very picky about memory quality, which makes some sense
> given how efficiently it uses it. It drives the memory quite hard.
> The simplest way I know of to test memory and cpu is to run a lot of large
> kernel compiles. Often a memory problem will cause that to segfault.
> Anything that uses lots of cpu and lots of memory is usually a good
> test, at least if it fails spectacularly on an error, like gcc tends to.
> To test the memory, remove half of it, and try the test. If it fails,
> replace one stick of memory with one of the other ones, until you can
> run the test without a problem. You could probably even run the test
> with 1 or 2 sticks of memory. A number of people have managed to find
> faulty memory on an opteron this way. Some people have come back going
> "I found a faulty stick of memory" after swearing that memtest86 had
> said all their ram was fine and they were sure their name brand ram
> wasn't faulty. :) memtest86 doesn't catch all errors. Of course with
> ECC memory I would have expected to see a machine check exception (MCE)
> if there were any single-bit errors in the memory. I am still most
> inclined to blame reiserfs or perhaps the cpu. Of course since it was
> multiple errors all coming from reiserfs, with apparently nothing else
> seeing a problem, I really think it may simply be a reiserfs bug. I was
> using XFS before on early 2.6 kernels on i386, and eventually had to
> give up and move to ext3 since it just wasn't reliable on top of LVM on
> top of MD raid. The filesystem had some bad interaction with the LVM
> and MD raid that made it not work. It probably got fixed since, but I
> needed something that worked then, and ext3 worked.
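A minimal sketch of the repeated-compile test described above, assuming it is
started from the top of a kernel source tree that is already configured and
known to build. The function name run_until_failure and its return format are
my own invention for illustration; only "make" itself is real.

```python
import subprocess


def run_until_failure(steps, rounds):
    """Repeat a list of commands; stop at the first non-zero exit.

    steps  -- list of argv lists, run in order once per round
    rounds -- maximum number of rounds
    Returns None if every round succeeded, otherwise a tuple
    (round_number, failing_step, exit_code).
    """
    for round_no in range(1, rounds + 1):
        for step in steps:
            proc = subprocess.run(
                step, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            if proc.returncode != 0:
                # Any failure ends the test, so start from a tree that is
                # known to build cleanly on healthy hardware.
                return round_no, step, proc.returncode
    return None
```

For example, run_until_failure([["make", "clean"], ["make", "-j4"]], rounds=20)
would force a full rebuild each round; the -j4 and the 20 rounds are guesses,
not recommendations. After each change of memory sticks the same loop can be
repeated, and a result other than None points at the sticks currently
installed.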
> > What about checking the cpu? I can only say that I monitored the
> > temperature during the long calculation, with the machine in a strongly
> > ventilated area. Starting from 36C, the temp rose to 44C at maximum. I
> > don't know how this corresponds to the real temp ($sensors) but the
> > difference should tell. For my dual Opteron 265s, AMD indicates a case
> > temperature of 49-67C (is what I measured just case temp?). AMD also
> > indicates temp limits of 10-35, but I guess these should be the ambient
> > temperatures.
> That temperature is fine as far as I can tell.
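Since only the difference between idle and load readings is being relied on,
here is a rough sketch of how two captures of "sensors" output could be
compared. The line format and the label "CPU0 Temp" in the example are
assumptions; the real output depends on the chip and driver, so the pattern
may need adjusting.

```python
import re

# Matches lines such as "CPU0 Temp: +44.0°C (high = +70.0°C)" and captures
# the label and the first temperature on the line. Assumed format only.
_TEMP_LINE = re.compile(
    r"^[ \t]*([^:\n]+):[ \t]*([+-]?\d+(?:\.\d+)?)[ \t]*°?C\b", re.MULTILINE
)


def parse_temps(sensors_output):
    """Return {label: temperature in C} for every temperature line found."""
    return {
        label.strip(): float(value)
        for label, value in _TEMP_LINE.findall(sensors_output)
    }


def temp_rise(idle, load):
    """Return {label: load - idle} for labels present in both readings."""
    return {label: load[label] - idle[label] for label in idle if label in load}
```

With the readings mentioned above (36C at the start, 44C at maximum), temp_rise
would report a rise of 8 degrees for that sensor. Lines that are not
temperatures, such as fan speeds or voltages, are ignored.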
> > Also, how can I thoroughly check the disks?
> Well there is badblocks which allows disk testing. In my experience
> though, modern disks tend to either work or fail. They very rarely have
> small problems.
> Len Sorensen
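
For the badblocks suggestion, a small sketch of how the command line could be
assembled, assuming the e2fsprogs badblocks with its -s (progress), -v
(verbose) and -n (non-destructive read-write) options. The helper name
badblocks_command and the device path /dev/sdX are placeholders of mine. The
destructive -w mode is left out on purpose, since it erases the disk.

```python
import subprocess


def badblocks_command(device, mode="read-only"):
    """Build the argv list for a badblocks scan of `device`.

    mode -- "read-only" (the badblocks default) or "non-destructive"
            (-n, read-write test that restores the original data).
    """
    extra = {"read-only": [], "non-destructive": ["-n"]}
    if mode not in extra:
        raise ValueError("unsupported mode: %r" % (mode,))
    return ["badblocks", "-s", "-v", *extra[mode], device]


def scan(device, mode="read-only"):
    """Run the scan (needs root) and return the exit status of badblocks."""
    return subprocess.run(badblocks_command(device, mode)).returncode
```

A call such as scan("/dev/sdX") would run the read-only pass. The
non-destructive mode should only be used on a disk with no mounted
filesystems.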