First off, thanks for the replies so far. Let me address some points in
turn:

also sprach Hans du Plooy <firstname.lastname@example.org> [2004.04.07.1436 +0200]:
> This has nothing to do with CPU. Things to check: 1. Hard
> drive/CD-ROM data cable. Especially if it is the thin (ATA-66 and
> up) - they damage very easily around the connectors. I've also
> seen I/O errors and data corruption a lot with those cables if
> there is the slightest bit of insulation damaged, like if the cable
> scratched against a sharp edge in the box.

Sure, this is something I should try. Especially since one of the boxes
runs a software RAID 5 spanning 4 IDE disks. I know, one should not put
IDE disks in a production server, but as it stands, I don't have another
way to provide 200+ GB on a low budget. By now, this thing is probably
costing me more time than a SAN. ;^>

Anyway, what I am meaning to say: there are four cables. That's 0.25
* Estimated Time to Failure.

> 2. Hard drive - although you can verify this easily by putting it
> in another machine.

The last time I checked, they were all without bad blocks. They are also
each cooled individually. Since the behaviour has not changed since
I last checked them, it seems that the problem must be elsewhere.

> 3. Controller. What controller is it? Onboard or PCI? What
> chipset? Older VIA chipsets can be a bit buggy if the
> manufacturer of the motherboard didn't take care when implementing
> it.

There are two: the on-board one in addition to a Promise 20269 PCI.
I have tried a HighPoint controller in place of either, but no change.

About system temperature: lm-sensors reports it to be constant between
50 and 60 degrees for both CPU and ambient.

> So, my guess is your problem isn't CPU related.

That's what I thought.

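As an aside on the four-cable remark above, here is a minimal sketch of
the arithmetic. It assumes the cables fail independently and at
a constant rate, in which case the expected time to the first failure
among N parts is the single-part figure divided by N. The function name
and the 100000-hour figure are made up for illustration; I have no real
failure data for IDE cables.

```shell
#!/bin/sh
# Sketch: expected time to the first failure among N identical parts,
# ASSUMING independent failures at a constant rate (exponential
# lifetimes). Under that assumption the result is simply MTTF / N.
# first_failure_mttf is a hypothetical helper, not an existing tool.
first_failure_mttf() {
    mttf_hours=$1   # hypothetical MTTF of a single cable, in hours
    parts=$2        # number of cables in the array
    awk -v m="$mttf_hours" -v n="$parts" 'BEGIN { printf "%.0f\n", m / n }'
}

# Four cables with a made-up 100000-hour MTTF each
first_failure_mttf 100000 4
```

If the cables share a cause of failure (say, the same sharp edge in the
case), the independence assumption breaks and the factor of 0.25 no
longer applies.
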
also sprach Christian Schnobrich <email@example.com> [2004.04.07.1437 +0200]:
> In your case, I'd start by closely inspecting the system for
> electrical issues like aged capacitors, blank or broken wires,
> half-loose connectors et al.

Not being an engineer, nor particularly good at electronics, I am not
sure if this will actually provide a benefit. I have the budget to
simply buy a new one in the interest of getting the system stable ASAP.
Does anyone have recommendations for motherboards supporting
Athlon/Duron chips? I think the one I have is a 1800+ or so, but I can't
verify right now since just as I left town, the system decided to hang
itself up.

The hanging is actually interesting and summarised here:

  http://marc.theaimsgroup.com/?l=linux-kernel&m=108110943225559&w=2

Note that even though I wrote "surprised", this is actually an old
issue.

> Memtest always is a good idea if you can afford the downtime.

I can't, but I did. No errors. Who would have thought?

also sprach Hans du Plooy <firstname.lastname@example.org> [2004.04.07.1518 +0200]:
> AMDs are cheap, no matter how you look at it. Therefore they are
> a popular choice for companies selling "budget" machines.

I am sure I don't have an ElCheapo, but it's probably not a platinum one
either. Mid-range VIA 82cxx. Details tomorrow when the system has
rebooted.

also sprach Henrique de Moraes Holschuh <email@example.com> [2004.04.07.1633 +0200]:
> Let me guess: VIA chipset? I have a A7V motherboard that does the
> same, unpredictably. The PCI bus just hangs the entire machine.
> After that one, I tried to learn a thing or two about common
> consumer computer stabilities.

Does your hanging sometimes come with Oopses and BUGs? Does it have the
same symptoms as described in the linux-kernel thread? For convenience:

  http://marc.theaimsgroup.com/?l=linux-kernel&m=108110943225559&w=2

> ECC memory is extremely more resilient to corruption. It WILL
> experience bit flips as often as common memory, obviously...
> But you need two bit flips *in a certain area* (that must happen
> before the affected area is accessed again), to get memory
> corruption. That is far more unlikely to happen.

Can memtest86 detect such errors?

> You need a top-notch power supply and good cooling too, of course.
> Most power supplies aren't adequate for non-error operation. You
> have to handpick them. And the good ones ain't cheap.

Does it suffice to connect it to a stabilising UPS?

also sprach Roberto Sanchez <firstname.lastname@example.org> [2004.04.07.1539 +0200]:
> Are you running an nForce2 board? If so, what kernel? What is
> your .config?

No. 2.6.5 kernel. Config here:

  ftp://ftp.madduck.net/scratch/config-2.6.5-gaia.gz

also sprach CW Harris <email@example.com> [2004.04.07.1940 +0200]:
> Any common software activity (at the time of crashing) between the
> two machines that might be the cause? (No one seems to be asking
> about the software ;)

Both run Debian unstable on a 2.6 kernel. One (an SMP machine) is
actually very stable when I use Herbert's kernels (e.g. 2.6.3-1-k7 right
now), while a custom kernel will crash the machine. I have been working
with Herbert, but we can't find the problem. If you want to give it
a shot, the two configs are here. Note that they are for different
kernel versions, but I did try two 2.6.3s against each other. So the
configuration for the 2.6.4 kernel is the same as 2.6.3 with some
additional stuff that came in between. No changes to existing settings,
though.

  ftp://ftp.madduck.net/scratch/config-2.6.3-1-k7-smp.gz
  ftp://ftp.madduck.net/scratch/config-2.6.4-diamond.gz

My current thinking (actually, mostly Herbert's) is that it's an
IRQ-related problem. I tried booting with all combinations of ACPI, APIC
and LAPIC, but no dice with any of them. It wouldn't be the first time
that IRQs bring an x86 machine down.

Both machines have periods of excessive disk I/O and deal with large
files (ZopeDB of 4+ GB).
Moreover, I can get them to their knees by doing something like:

  cd /home
  while true; do
    rsync -Pva --delete ./zope/ ./dump
    rsync -Pva --delete ./staff/ ./dump
  done

Thus, an IRQ problem is not unlikely, I think.

Again, thanks for your time.

Take care,

-- 
Please do not CC me when replying to lists; I read them!

 .''`.     martin f. krafft <firstname.lastname@example.org>
: :'  :    proud Debian developer, admin, and user
`. `'`
  `-  Debian - when you have better things to do than fixing a system

Invalid/expired PGP subkeys? Use subkeys.pgp.net as keyserver!