
Re: AMD vs. Intel



First off, thanks for the replies so far. Let me address some points
in turn:

also sprach Hans du Plooy <hansdp@newingtoncs.co.za> [2004.04.07.1436 +0200]:
> This has nothing to do with CPU.  Things to check: 1. Hard
> drive/CD-ROM data cable.  Especially if it is the thin (ATA-66 and
> up) - they damage very easily around the connectors.  I've also
> seen I/O errors and data corruption a lot with those cables if
> there is the slightest bit of insulation damaged, like if the cable
> scratched against a sharp edge in the box.

Sure, this is something I should try. Especially since one of the
boxes runs a software RAID 5 spanning 4 IDE disks. I know, one
should not put IDE disks in a production server, but as it stands,
I don't have another way to provide 200+ GB on a low budget. By
now, this thing is probably costing me more time than a SAN. ;^>
Anyway, what I mean to say: there are four cables. That's 0.25
* Estimated Time to Failure.
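
The 0.25 factor holds if the four cables fail independently at the same
constant rate and the first failure is the one that counts. A minimal
sketch of that arithmetic; the function name and the 40000-hour figure
are made up for illustration, not measured values:

```shell
# series_mttf N MTTF: expected time until the first of N identical,
# independent parts fails, assuming a constant failure rate for each.
series_mttf() {
  awk -v n="$1" -v mttf="$2" 'BEGIN { printf "%.2f\n", mttf / n }'
}

# Hypothetical per-cable figure of 40000 hours, four cables:
series_mttf 4 40000
```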

> 2. Hard drive - although you can verify this easily by putting it
> in another machine.

The last time I checked, they were all without bad blocks. They are
also each cooled individually. Since the behaviour has not changed
since I last checked them, it seems that the problem must be
elsewhere.

> 3. Controller.  What controller is it?  Onboard or PCI?  What
> chipset?  Older VIA chipsets can be a bit buggy if the
> manufacturer of the motherboard didn't take care when implementing
> it.

There are two: the on-board one and a Promise 20269 PCI card. I have
tried a HighPoint controller in place of either, but saw no change.

About system temperature: lm_sensors reports it as constant, between
50 and 60 degrees for both CPU and ambient.

> So, my guess is your problem isn't CPU related.

That's what I thought.



also sprach Christian Schnobrich <schnobs@babylon-kino.de> [2004.04.07.1437 +0200]:
> In your case, I'd start by closely inspecting the system for
> electrical issues like aged capacitors, bare or broken wires,
> half-loose connectors et al.

Being neither an engineer nor particularly good at electronics, I am
not sure this will actually help. I have the budget to simply buy
a new one in the interest of getting the system stable ASAP.

Does anyone have recommendations for motherboards supporting
Athlon/Duron chips? I think the one I have is a 1800+ or so, but
I can't verify right now since just as I left town, the system
decided to hang itself up.

The hanging is actually interesting and summarised here:

  http://marc.theaimsgroup.com/?l=linux-kernel&m=108110943225559&w=2

Note that even though I wrote "surprised", this is actually an old
issue.

> Memtest always is a good idea if you can afford the downtime.

I can't, but I did. No errors. Who would have thought?



also sprach Hans du Plooy <hansdp@newingtoncs.co.za> [2004.04.07.1518 +0200]:
> AMDs are cheap, no matter how you look at it. Therefore they are
> a popular choice for companies selling "budget" machines.

I am sure I don't have an ElCheapo, but it's probably not
a platinum one either. Mid-range VIA 82cxx. Details tomorrow when
the system has rebooted.

also sprach Henrique de Moraes Holschuh <hmh@debian.org> [2004.04.07.1633 +0200]:
> Let me guess: VIA chipset?  I have a A7V motherboard that does the
> same, unpredictably.  The PCI bus just hangs the entire machine.
> After that one,  I tried to learn a thing or two about common
> consumer computer stabilities.

Does your hanging sometimes come with Ooooopses and BUGs? Does it
have the same symptoms as described in the linux-kernel thread?

For convenience:

  http://marc.theaimsgroup.com/?l=linux-kernel&m=108110943225559&w=2

> ECC memory is far more resilient to corruption. It WILL
> experience bit flips as often as common memory, obviously... But
> you need two bit flips *in a certain area* (that must happen
> before the affected area is accessed again), to get memory
> corruption.  That is far more unlikely to happen.

Can memtest86 detect such errors?

> You need a top-notch power supply and good cooling too, of course.
> Most power supplies aren't adequate for non-error operation.  You
> have to handpick them.  And the good ones ain't cheap.

Does it suffice to connect it to a stabilising UPS?

also sprach Roberto Sanchez <rcsanchez97@yahoo.es> [2004.04.07.1539 +0200]:
> Are you running an nForce2 board?  If so, what kernel?  What is
> your .config?

No. 2.6.5 kernel. Config here:

  ftp://ftp.madduck.net/scratch/config-2.6.5-gaia.gz



also sprach CW Harris <charris@rtcmarketing.com> [2004.04.07.1940 +0200]:
> Any common software activity (at the time of crashing) between the
> two machines that might be the cause?  (No one seems to be asking
> about the software ;)

Both run Debian unstable on a 2.6 kernel. One (an SMP machine) is
actually very stable when I use Herbert's kernels (e.g. 2.6.3-1-k7
right now), while a custom kernel will crash the machine. I have
been working with Herbert, but we can't find the problem. If you
want to give it a shot, the two configs are here. Note that they are
for different kernel versions, but I did try two 2.6.3s against each
other. So the configuration for the 2.6.4 kernel is the same as for
2.6.3, plus the options that were added in between. No changes to
existing settings, though.

  ftp://ftp.madduck.net/scratch/config-2.6.3-1-k7-smp.gz
  ftp://ftp.madduck.net/scratch/config-2.6.4-diamond.gz

My current thinking (actually, mostly Herbert's) is that it's an
IRQ-related problem. I tried booting with all permutations of ACPI,
APIC and LAPIC, but no dice with any of them. It wouldn't be the
first time that IRQs bring an x86 machine down.

Both machines have periods of excessive disk I/O and deal with large
files (ZopeDB of 4+ GB). Moreover, I can bring them to their knees by
doing something like:

  cd /home
  while true; do
    rsync -Pva --delete ./zope/ ./dump
    rsync -Pva --delete ./staff/ ./dump
  done

Thus, an IRQ problem is not unlikely, I think.
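
One way to look for evidence while such a loop runs is to compare two
copies of /proc/interrupts taken some time apart and see which lines
grew. A rough sketch, assuming the usual layout (a header row with one
column per CPU, then one row per interrupt); the function name and the
file names in the comments are made up:

```shell
# irq_delta BEFORE AFTER: compare two saved copies of /proc/interrupts and
# print each interrupt whose counters, summed over all CPUs, increased.
irq_delta() {
  awk '
    FNR == 1 { ncpu = NF; next }      # header row: one column per CPU
    {
      irq = $1; sum = 0
      # Add up the per-CPU counters; short rows such as ERR have fewer.
      for (i = 2; i <= ncpu + 1 && i <= NF; i++) sum += $i
      if (FNR == NR) { before[irq] = sum; next }   # still in first file
      d = sum - before[irq]
      if (d > 0) printf "%-6s %d\n", irq, d
    }
  ' "$1" "$2"
}

# Possible use, with the rsync loop running in another shell:
#   cat /proc/interrupts > irq.before
#   sleep 10
#   cat /proc/interrupts > irq.after
#   irq_delta irq.before irq.after
```

A disk controller line that climbs much faster than expected, or one
that stops climbing shortly before a hang, would be worth reporting.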

Again, thanks for your time. Take care,

-- 
Please do not CC me when replying to lists; I read them!
 
 .''`.     martin f. krafft <madduck@debian.org>
: :'  :    proud Debian developer, admin, and user
`. `'`
  `-  Debian - when you have better things to do than fixing a system
 
Invalid/expired PGP subkeys? Use subkeys.pgp.net as keyserver!
