Re: continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/12/2012 4:27 PM, Mauro wrote:
> On 12 August 2012 20:39, Stan Hoeppner <email@example.com> wrote:
>> On 8/12/2012 4:44 AM, Mauro wrote:
>>> On 11 August 2012 19:23, Stan Hoeppner <firstname.lastname@example.org> wrote:
>>>> On 8/11/2012 8:59 AM, Mauro wrote:
>>>>> Hello, I'm experiencing continuous reboots of my two nodes in a
>>>>> heartbeat+pacemaker cluster.
>>>>> The reboots are random: they happen on some days and not on others.
>>>>> Sometimes a week goes by without one, and sometimes they happen at
>>>>> night.
>>>>> Nodes are connected to a Cisco 3570 switch and a SAN storage system.
>>>>> Perhaps there is a misconfiguration in the interfaces?
>>>>> Here is my interfaces file:
>>>>> Do you think there are some errors?
>>>> To determine that you need to look at your log files, not your config
>>>> files. If the nodes are rebooting due to fencing it will be logged
>>>> somewhere, as should the underlying network errors that cause the fence
>>>> to close.
>>> Yes, I looked at my logs, but the only thing I see is that node 1 fences
>>> node 2, or node 2 fences node 1, because one node doesn't see the other.
>>> I don't understand what the problem is, whether it is my NIC or
>>> something else.
>> Is there more than one set of these in any dmesg files on either host:
>> Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down
>> Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up
>> 100 Mbps Full Duplex
> No, there is no link down in any log file :-(
> I really don't understand why the reboots happen :-(
>> If so it may indicate a flaky NIC or switch port, possibly a bad patch
>> cable. Is there a switch between the hosts or a cross over cable?
> There is a Cisco 3570 switch.
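With a switch in the path, it is worth repeating the link-flap search across
every rotated log rather than by eye. A minimal sketch, assuming the kernel
message format shown in the sample lines quoted above (other NIC drivers word
this message differently, so the pattern is an assumption, and
count_link_down is a hypothetical helper name):

```python
import re
from collections import Counter

# Assumed format, taken from the sample above:
#   "... kernel: e100 0000:00:0d.0: eth0: NIC Link is Down"
LINK_DOWN = re.compile(r"kernel: .*?(\w+): NIC Link is Down")


def count_link_down(lines):
    """Count 'NIC Link is Down' kernel messages per interface name."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feed it the lines of each syslog or kern.log file on both hosts. An empty
result only means this one message format was not found, not that the link
never dropped.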
Are these controlled shutdowns? Or are these hardware crash/reboots
that are occurring?
If the former you should see syslog entries for the shutdown sequence.
If the latter, you won't see anything in the logs. This would suggest
you've got a hardware problem, and one not related to faulty NICs or switches.
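One rough way to tell the two cases apart is to check whether each boot in
syslog is preceded by shutdown messages. The sketch below assumes the marker
strings shown in it; they vary by distribution and syslog daemon, so adjust
them to what a known clean reboot looks like on your hosts. The function name
is made up for illustration.

```python
# Assumed marker strings; check them against a known clean reboot first.
BOOT_MARKER = "Linux version"       # kernel banner logged early at boot
SHUTDOWN_MARKERS = (
    "shutting down for system",     # e.g. written by shutdown(8)
    "exiting on signal 15",         # e.g. syslog daemon stopping
)


def classify_reboots(lines):
    """Label each boot 'clean' or 'unclean'.

    A boot is 'clean' if a shutdown marker was logged since the previous
    boot, and 'unclean' otherwise (crash, power loss, hard reset).
    """
    results = []
    saw_shutdown = False
    for line in lines:
        if BOOT_MARKER in line:
            results.append("clean" if saw_shutdown else "unclean")
            saw_shutdown = False
        elif any(marker in line for marker in SHUTDOWN_MARKERS):
            saw_shutdown = True
    return results
```

Note that the first boot in a log file is reported as unclean if the log
starts after the preceding shutdown, so read the rotated files in order. A
fence that power-cycles a node will also show up as unclean on the fenced
node, so compare the timestamps with the fencing entries on the other node.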
What kind of UPS are these machines powered from? Have you checked the
UPSes and verified they are functioning properly? If you have a power
event and the UPS drops the load, the machines will reboot without a hint
in the logs as to what caused the reboot.
Finally, what servers are these? Dell/HP/IBM or whitebox? A memory
mismatch or simply bad memory can cause inexplicable reboots. If the
machines are of decent quality, their BIOS should log such events.