Re: reiserfs/md1/failure/threads



On Tue, Jul 18, 2006 at 05:33:51PM +0200, Francesco Pietra wrote:
> Not to insist any further on the relative merits of the various filesystems,
> but in the general interest of maintaining amd64 (and therefore of
> examining parameters one at a time, without mixing problems), did you notice
> my e-mail of today emphasizing that after the crash my data are intact? I
> wonder whether your suspicion about memory or cpu may be the point. How to
> carry out a thorough memory test and identify which slot is defective, if
> any? Although Kingston ECC, one of the eight slots (1GB each) might be
> defective.

Well I have certainly seen a number of messages from people with
opterons having memory problems over the last few months.  The opteron
seems to be very picky about memory quality, which makes some sense
given how efficiently it uses it.  It drives the memory quite hard.

Simplest way I know of to test memory and cpu is to run a lot of large
kernel compiles.  Often a memory problem will cause that to segfault.
Anything that uses lots of cpu and lots of memory is usually a good
test, at least if it fails spectacularly on an error, like gcc tends to
do.
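
As a rough illustration of the repeated-compile test, here is a minimal
Python sketch.  The name stress_loop, the source directory and the -j
value are only assumptions for the example, not anything from an
existing tool.  It runs a command over and over and records every run
that exits with a non-zero status.

```python
import subprocess
import sys


def stress_loop(cmd, iterations=10, cwd=None):
    """Run cmd repeatedly; return [(iteration, returncode), ...] for failed runs."""
    failures = []
    for i in range(1, iterations + 1):
        # Build output is discarded; only the exit status matters here.
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures.append((i, result.returncode))
            print(f"run {i} failed with status {result.returncode}",
                  file=sys.stderr)
    return failures


# Hypothetical usage (path and job count are assumptions):
#   stress_loop(["sh", "-c", "make clean && make -j4"],
#               iterations=20, cwd="/usr/src/linux")
```

On a healthy machine the same source tree should build cleanly every
time, so an occasional failure in a loop like this points at hardware
rather than at the source.  Rerun the failing step with its output
visible to see whether it was a gcc segfault or internal compiler error.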

To test the memory, remove half of it, and try the test.  If it fails,
replace one stick of memory with one of the other ones, until you can
run the test without a problem.  You could probably even run the test
with 1 or 2 sticks of memory.  A number of people have managed to find
faulty memory on an opteron this way.  Some people have come back going
"I found a faulty stick of memory" after swearing that memtest86 had
said all their ram was fine and they were sure their name brand ram
wasn't faulty. :)  memtest86 doesn't catch all errors.  Of course with
ECC memory I would have expected to see a machine check exception (MCE)
if there were any single-bit errors in the memory.  I am still most
inclined to blame reiserfs or perhaps the cpu.  Of course since it was
multiple errors all coming from reiserfs, with apparently nothing else
seeing a problem, I really think it may simply be a reiserfs bug.  I was
using XFS before on early 2.6 kernels on i386, and eventually had to
give up and move to ext3 since it just wasn't reliable on top of LVM on
top of MD raid.  The filesystem had some bad interaction with the LVM
and MD raid that made it not work.  It probably got fixed since, but I
needed something that worked then, and ext3 worked.

> What about checking the cpu? I can simply tell that I monitored the
> temperature during the long calculation, with the machine in a strongly
> ventilated area. Starting from 36C, the temp rose to 44C at maximum. I
> don't know the correspondence with real temp ($sensors) but the difference
> should tell. AMD for my 265 dual opterons indicates case temperature 49-67C
> (is what I measured just case temp?). AMD also indicates temp limits of 10-35,
> but I guess these should be the ambient temperatures.

That temperature is fine as far as I can tell.

> Also, how to thoroughly check the disks?

Well there is badblocks which allows disk testing.  In my experience
though, modern disks tend to either work or fail.  They very rarely have
small problems.

--
Len Sorensen
