
Re: continuous reboots in a two nodes cluster with heartbeat and pacemaker.



On 8/17/2012 1:52 AM, Mauro wrote:
> On 14 August 2012 08:24, Mauro <mrsanna1@gmail.com> wrote:
>> On 13 August 2012 22:58, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>
>>> That being the case I'd suspect something other than server hardware.
>>> To be sure, manually remove one node from the cluster and see how long
>>> the remaining node runs without rebooting.  If it doesn't reboot at all,
>>> that eliminates hardware as the fault point.
>>
>> good idea, I do it now.
> 
> I've done what you have suggested.
> It seems that the node reboots without reason.
> It is like it is powered off, in fact in the boot log I see that the
> journal filesystem is recovered.
> It seems very strange to me, perhaps ram bugged?

I'd be thoroughly inspecting the power circuits feeding those servers at
this point.  Do you have the machines set to automatically power back on
after power loss?  If you do, switch that mode so they stay off after AC
power loss.  That should confirm whether the problem is total loss of AC
voltage or a severely deep sag.

If the problem is a less severe sag, however, this test won't isolate
the problem.  For that you must dig into the UPS monitoring interface.
If you don't have a UPS, you'll have to put a tap on the AC circuit and
monitor the voltage.  This will require specialized equipment, as it
must be able to log the sag.  Some of the nicer Fluke meters can log the
lowest voltage, but probably can't tell you the time of day when the sag
occurs.  Thus, you'll need a highly trained electrician with the proper
equipment.
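
If the UPS or a logging meter can export timestamped voltage readings,
picking the sags out of that log is easy to script.  A minimal sketch
follows; the 230 V nominal, the 90% threshold, and the (timestamp,
volts) sample format are assumptions of mine for illustration, not
anything a particular UPS provides, so adjust them to your mains and
your export format.

```python
from datetime import datetime

NOMINAL_VOLTS = 230.0   # assumption: set to your mains nominal (e.g. 120.0)
SAG_FRACTION = 0.90     # assumption: flag readings below 90% of nominal


def find_sags(samples, nominal=NOMINAL_VOLTS, fraction=SAG_FRACTION):
    """Group consecutive low readings into (start, end, lowest_volts) events.

    samples: iterable of (datetime, volts) pairs in time order.
    """
    threshold = nominal * fraction
    events = []
    current = None  # [start, end, lowest] of the sag in progress, if any
    for when, volts in samples:
        if volts < threshold:
            if current is None:
                current = [when, when, volts]
            else:
                current[1] = when
                current[2] = min(current[2], volts)
        elif current is not None:
            # Voltage recovered: close the sag that was in progress.
            events.append(tuple(current))
            current = None
    if current is not None:  # log ended while still in a sag
        events.append(tuple(current))
    return events
```

Comparing the start times of the reported events against the reboot
times in the system logs would show whether the two line up.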

This could also be a thermal issue.  Do you have hardware monitoring
installed and properly configured?  The 'sensors' package?  Over-temperature
conditions will often cause random reboots.  Do the boxes have plenty of
unrestricted cool airflow?  Is the intake air temperature below 25 Celsius?
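
To catch a thermal problem you need the readings from just before the
reboot, so it helps to log them periodically.  Here is a rough sketch
that reads the temperatures the Linux kernel exposes under
/sys/class/hwmon (values are in millidegrees Celsius).  It assumes the
hardware monitoring drivers for your boards are loaded; the function
names and the limit passed to over_limit() are mine, chosen only for
illustration.

```python
from pathlib import Path


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"<hwmonN>:<chip>/<sensor>": degrees_celsius} for readable sensors."""
    temps = {}
    for path in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millideg = int(path.read_text().strip())
            name_file = path.parent / "name"
            chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        except (OSError, ValueError):
            continue  # some sensors exist but fail to read; skip them
        sensor = path.name[:-len("_input")]
        temps[f"{path.parent.name}:{chip}/{sensor}"] = millideg / 1000.0
    return temps


def over_limit(temps, limit_c):
    """Return only the readings above limit_c (degrees Celsius)."""
    return {key: value for key, value in temps.items() if value > limit_c}
```

Run from cron once a minute with the output appended to a file along
with a timestamp, this leaves a record of the last temperatures seen
before each reboot.  Note these are component temperatures, not the
intake air temperature, which needs its own probe.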

The odds of having defective hardware in two HP servers causing random
reboots in both machines are extremely low, though it is possible.  If this
is the case it's a design flaw, not simply two defective parts.

It's also possible you have the wrong memory installed.  Can you provide
the specs on all DIMMs installed in both machines?  Did all of the
memory come preinstalled from HP?  Is it HP memory or aftermarket memory
from Kingston, Crucial, etc.?
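
One way to collect the DIMM details is from the output of
'dmidecode --type memory' (run as root).  The sketch below pulls a few
fields out of each "Memory Device" block and skips empty slots.  The
selection of fields and the function names are my own choices for
illustration, and what the BIOS actually reports varies by machine.

```python
import subprocess

FIELDS = ("Locator", "Size", "Type", "Speed", "Manufacturer", "Part Number")


def parse_dimms(dmidecode_text):
    """Return one dict per populated 'Memory Device' block."""
    dimms = []
    current = None
    for line in dmidecode_text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            dimms.append(current)
        elif not stripped:
            current = None  # a blank line ends the block
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            if key in FIELDS:
                current[key] = value.strip()
    # Empty slots report "No Module Installed" as their size.
    return [d for d in dimms if d.get("Size") not in (None, "No Module Installed")]


def installed_dimms():
    """Run dmidecode (needs root) and parse its output."""
    result = subprocess.run(["dmidecode", "--type", "memory"],
                            capture_output=True, text=True, check=True)
    return parse_dimms(result.stdout)
```

Running installed_dimms() on both nodes and posting the results would
show at a glance whether the modules match each other and the HP specs.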

-- 
Stan

