
Re: continuous reboots in a two nodes cluster with heartbeat and pacemaker.



On 8/12/2012 4:44 AM, Mauro wrote:
> On 11 August 2012 19:23, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 8/11/2012 8:59 AM, Mauro wrote:
>>> Hello, I'm experiencing continuous reboots of my two nodes in a
>>> heartbeat+pacemaker cluster.
>>> Reboots are random: some days they happen, other days they don't.
>>> Sometimes they don't happen for 7 days, sometimes they happen at night.
>>> They happen on random days and at random times.
>>> Nodes are connected to a Cisco 3570 switch and a SAN storage system.
>>> Perhaps there is a misconfiguration in the interfaces?
>>> Here is my interfaces file:
>> ....
>>
>>
>>> Do you think there are some errors?
>>
>> To determine that you need to look at your log files, not your config
>> files.  If the nodes are rebooting due to fencing it will be logged
>> somewhere, as should the underlying network errors that cause the fence
>> to trigger.
> 
> Yes, I looked at my logs, but the only thing I see is that node 1 fences
> node 2 or node 2 fences node 1 because one node doesn't see the other,
> but I don't understand what the problem is, whether it is a problem
> with my NIC or something else.

Is there more than one set of these in the dmesg output or kernel logs
on either host:

Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down
Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up 100 Mbps Full Duplex

If so, it may indicate a flaky NIC or switch port, or possibly a bad patch
cable.  Is there a switch between the hosts, or a crossover cable?

But look at the time interval between the down/up states.  If it's
always shorter than the cluster's failure-detection threshold (deadtime
in heartbeat's ha.cf), then this shouldn't be an issue.  If it's longer
than the threshold, it is likely the cause of the software fence
activating.
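
If the logs are long, the interval check described above can be scripted.
Below is a rough Python sketch, not a tested tool.  It assumes the link
messages look like the two e100 lines quoted above; other drivers word
them differently.  The function name link_outages and the year argument
are my own inventions (syslog timestamps carry no year), and month names
are parsed with %b, so it expects an English locale.

```python
import re
from datetime import datetime

# Matches syslog-style link messages shaped like the e100 lines quoted above.
# Assumption: other NIC drivers may phrase these differently.
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*?"
    r"(?P<iface>\w+): NIC Link is (?P<state>Down|Up)"
)


def link_outages(lines, year=2012):
    """Return (interface, down_time, seconds_down) for each Down/Up pair."""
    down_at = {}   # interface -> time of the last unmatched "Down"
    outages = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        # Syslog timestamps have no year, so the caller supplies one.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        iface = m["iface"]
        if m["state"] == "Down":
            down_at[iface] = ts
        elif iface in down_at:
            start = down_at.pop(iface)
            outages.append((iface, start, (ts - start).total_seconds()))
    return outages


if __name__ == "__main__":
    import sys

    # Placeholder value: substitute the dead time your cluster actually uses.
    threshold_s = 30.0
    for iface, start, secs in link_outages(sys.stdin):
        flag = "EXCEEDS threshold" if secs > threshold_s else "ok"
        print(f"{iface} down at {start} for {secs:.0f}s ({flag})")
```

Feed it the kernel log on standard input and compare the reported
outage lengths against the timestamps of the fencing events.  An outage
that crosses New Year will come out negative, since the year is fixed.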

There are other possible causes.  This is simply the first that comes to
mind.

-- 
Stan

