
Re: reiserfs/md1/failure/threads



Thank you for the very detailed instructions. On balance, I have decided to 
carry out a fresh install of amd64 with ext3 as the file system. Your (and 
the general) strong advice to change to ext3 cannot be ignored.

I am just now downloading the freshly built daily amd64 net install CD, so 
that I can also help with the preparation of the beta3 release of the net 
install.

This does not mean that I can be sure about the soundness of my hardware, but 
your examination of the signals I have produced suggests that it is sound. I 
have postponed the examination of the hardware because the disks are OK and 
the memory modules could be replaced should they prove faulty. This seems the 
opposite of what one normally does (hardware before software), but I am not 
sure I could arrive at a conclusive test of my hardware.

I do not have much to install besides the base OS and a few tasks: jwd window 
manager, sensors, compilers if needed, your compilation of mpqc, my 
re-compilation of molecular mechanics (to be carried out in any case because 
of improvements to the code). That's about all.

I can anticipate that mpqc 2.3.1 will prove great.

Thanks again

francesco


On Tuesday 18 July 2006 19:44, Lennart Sorensen wrote:
> On Tue, Jul 18, 2006 at 05:33:51PM +0200, Francesco Pietra wrote:
> > Not to insist any further on the relative merits of the various
> > filesystems, but in the general interest of maintaining amd64 (and
> > therefore of examining parameters one at a time, without mixing
> > problems), did you notice my e-mail of today emphasizing that after the
> > crash my data are intact? I wonder whether your suspicion about memory or
> > cpu may be the point. How can I carry out a thorough memory test and
> > identify which slot is defective, if any? Although it is Kingston ECC,
> > one of the eight slots (1GB each) might be defective.
>
> Well I have certainly seen a number of messages from people with
> opterons having memory problems over the last few months.  The opteron
> seems to be very picky about memory quality, which makes some sense
> given how efficiently it uses it.  It drives the memory quite hard.
>
> Simplest way I know of to test memory and cpu, is to run a lot of large
> kernel compiles.  Often a memory problem will cause that to segfault.
> Anything that uses lots of cpu and lots of memory is usually a good
> test, at least if it fails spectacularly on an error, like gcc tends to
> do.
>
> To test the memory, remove half of it, and try the test.  If it fails,
> replace one stick of memory with one of the other ones, until you can
> run the test without a problem.  You could probably even run the test
> with 1 or 2 sticks of memory.  A number of people have managed to find
> faulty memory on an opteron this way.  Some people have come back going
> "I found a faulty stick of memory" after swearing that memtest86 had
> said all their ram was fine and they were sure their name brand ram
> wasn't faulty. :)  memtest86 doesn't catch all errors.  Of course with
> ECC memory I would have expected to see a machine check exception (MCE)
> if there were any single bit errors in the memory.  I am still most
> inclined to blame reiserfs or perhaps the cpu.  Of course since it was
> multiple errors all coming from reiserfs, with apparently nothing else
> seeing a problem, I really think it may simply be a reiserfs bug.  I was
> using XFS before on early 2.6 kernels on i386, and eventually had to
> give up and move to ext3 since it just wasn't reliable on top of LVM on
> top of MD raid.  The filesystem had some bad interaction with the LVM
> and MD raid that made it not work.  It probably got fixed since, but I
> needed something that worked then, and ext3 worked.
>
> > What about checking the cpu? I can simply tell that I monitored the
> > temperature during the long calculation, with the machine in a strongly
> > ventilated area. Starting from 36C, the temp rose to 44C at maximum. I
> > don't know the correspondence with real temp ($sensors) but the
> > difference should tell. AMD for my 265 dual opterons indicates case
> > temperature 49-67C (is what I measured just case temp?). AMD also
> > indicates as temp limits 10-35, but I guess this should be the ambient
> > temperatures.
>
> That temperature is fine as far as I can tell.
>
> > Also, how to check the disks thoroughly?
>
> Well there is badblocks which allows disk testing.  In my experience
> though, modern disks tend to either work or fail.  They very rarely have
> small problems.
>
> --
> Len Sorensen
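
The "run a lot of large kernel compiles" test suggested above could be
automated roughly as follows. This is only a minimal sketch, not anything
from the thread: the function name stress_loop and the build command in
the note below are assumptions chosen for illustration. It runs a command
repeatedly and records every run that exits with a non-zero status. On a
source tree that normally builds cleanly, any such failure (for example,
gcc reporting a segmentation fault) is a hint to look at memory or cpu.

```python
import subprocess


def stress_loop(cmd, runs):
    """Run `cmd` (a list of arguments) `runs` times.

    Returns a list of (run_index, returncode) pairs, one for each run
    that exited with a non-zero status. An empty list means every run
    succeeded.
    """
    failures = []
    for i in range(runs):
        # Discard the build output; only the exit status matters here.
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures.append((i, result.returncode))
    return failures
```

As a hypothetical usage, from inside a configured kernel source tree one
might call stress_loop(["sh", "-c", "make clean && make -j4"], 20); the
"make clean" forces a full rebuild on every pass, and the job count and
number of runs are arbitrary. Failures that appear at random places from
run to run would point to hardware rather than to the source being built.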


