
Bug#698225: linux-image-2.6.32-5-686-bigmem: split-brain when running "drbdadm primary $DEV" with dual primary setup in Sec/Sec state



Control: tags -1 moreinfo

On Tue, Jan 15, 2013 at 03:18:11PM +0100, bd@bc-bd.org wrote:
> Issue
> 	drbdadm primary $DEV
> On both nodes at the same time (either via cluster resource manager, or mssh) will lead to a split brain:

This does not match the kernel log, as far as I understand it.

> [ 5067.503912] block drbd0: Split-Brain detected, dropping connection!

This is correct according to the log.

> [ 5034.677693] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
> [ 5034.690089] block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
> [ 5034.691786] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
> [ 5034.692927] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
> [ 5034.692931] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) 
> [ 5034.692934] block drbd0: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
> [ 5035.242347] block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> [ 5035.242355] block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) 
> [ 5035.242362] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
> [ 5035.243575] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)

This is a resync from the remote disk. Nothing was out of sync (0 KB), so
it finishes within a second.

> [ 5046.639518] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown ) 
> [ 5046.639942] block drbd0: meta connection shut down by peer.

Remote was shut down.

> [ 5046.639993] block drbd0: asender terminated
> [ 5046.639994] block drbd0: Terminating drbd0_asender
> [ 5046.641094] block drbd0: conn( TearDown -> Disconnecting ) 
> [ 5046.659176] block drbd0: Connection closed
> [ 5046.659182] block drbd0: conn( Disconnecting -> StandAlone ) 
> [ 5046.659217] block drbd0: receiver terminated
> [ 5046.659218] block drbd0: Terminating drbd0_receiver
> [ 5046.659221] block drbd0: disk( UpToDate -> Diskless ) 
> [ 5046.659296] block drbd0: drbd_bm_resize called with capacity == 0
> [ 5046.659305] block drbd0: worker terminated
> [ 5046.659307] block drbd0: Terminating drbd0_worker

Device is gone.

> [ 5067.155466] block drbd0: Starting worker thread (from cqueue [2337])
> [ 5067.155541] block drbd0: disk( Diskless -> Attaching ) 
> [ 5067.207081] block drbd0: conn( Unconnected -> WFConnection ) 

The device is enabled again and is trying to connect.

> [ 5067.208501] block drbd0: role( Secondary -> Primary ) 
> [ 5067.212759] block drbd0: Creating new current UUID

Set to primary.

> [ 5067.503518] block drbd0: Handshake successful: Agreed network protocol version 91
> [ 5067.503525] block drbd0: conn( WFConnection -> WFReportParams ) 

Connection established _after_ it was promoted to primary.
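To make the ordering easier to check in a longer log, here is a minimal
sketch that scans kernel log lines and reports every promotion to Primary
that happened while the device was not in the Connected state. The function
name and the regular expressions are my own and only cover the message
format quoted in this report; they are not part of DRBD.

```python
import re

# Matches "[ 5067.208501] block drbd0: <message>", with or without a "> " quote prefix.
LINE_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\]\s+block\s+(?P<dev>\S+):\s+(?P<msg>.*)")
# Matches state transitions such as "conn( Unconnected -> WFConnection )".
TRANSITION_RE = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def promotions_while_disconnected(lines):
    """Return (timestamp, device, conn_state) for each promotion to Primary
    seen while the last known connection state was not "Connected"."""
    conn_state = {}  # device name -> last seen connection state
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        dev = match.group("dev")
        for field, _old, new in TRANSITION_RE.findall(match.group("msg")):
            if field == "conn":
                conn_state[dev] = new
            elif field == "role" and new == "Primary":
                if conn_state.get(dev) != "Connected":
                    hits.append((float(match.group("ts")), dev, conn_state.get(dev)))
    return hits
```

Fed the two lines at 5067.207081 and 5067.208501 from the log above, it
reports the promotion on drbd0 with the connection state still at
WFConnection.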

> [ 5067.503888] block drbd0: drbd_sync_handshake:
> [ 5067.503894] block drbd0: self D88E7AD12FFEA493:49D971C9C18FC2FE:167E069D45704F1A:F1C0D4200B9792F4 bits:0 flags:0
> [ 5067.503899] block drbd0: peer DD932456670DF62F:49D971C9C18FC2FE:167E069D45704F1A:F1C0D4200B9792F4 bits:0 flags:0

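The first UUID differs between self and peer, while the other three are
identical. So the remote device was also promoted to primary, and created
its own new current UUID, before the connection was established.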
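The comparison of the two UUID lines can be sketched as follows. This
assumes the four colon-separated fields are, in order, the current UUID,
the bitmap UUID and two history UUIDs, and that the lowest bit is a flag
to be ignored. It is a simplified illustration of this one case, not the
full comparison the driver performs; the function and field names are
mine.

```python
import re

# Assumed field order in the "self"/"peer" handshake lines.
FIELDS = ("current", "bitmap", "history_start", "history_end")
UUID_RE = re.compile(r"(self|peer)\s+([0-9A-Fa-f]{16}(?::[0-9A-Fa-f]{16}){3})")


def parse_uuids(line):
    """Return (who, {field: int}) for a drbd_sync_handshake UUID line."""
    match = UUID_RE.search(line)
    if match is None:
        raise ValueError("no UUID set found in line")
    values = [int(part, 16) for part in match.group(2).split(":")]
    return match.group(1), dict(zip(FIELDS, values))


def diverged_from_common_ancestor(self_uuids, peer_uuids):
    """True if both sides have different current UUIDs but share the same
    non-zero bitmap UUID, i.e. both generated new data independently."""
    mask = ~1  # assumption: lowest bit is a flag, not part of the identity
    same_bitmap = (self_uuids["bitmap"] & mask) == (peer_uuids["bitmap"] & mask)
    bitmap_set = (self_uuids["bitmap"] & mask) != 0
    different_current = (self_uuids["current"] & mask) != (peer_uuids["current"] & mask)
    return bool(different_current and same_bitmap and bitmap_set)
```

With the self and peer lines from the log above, the current UUIDs differ
and the bitmap UUIDs match, so the function returns True.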

You have to wait until both machines are connected before promoting them
to primary. The init script does this.
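If you promote from your own tooling instead, the same idea looks roughly
like the sketch below: poll the connection state and only promote once it
reads Connected. It shells out to "drbdadm cstate" and "drbdadm primary";
the function names, the timeout handling and the choice to accept only the
exact state Connected are my assumptions, and this is not what the init
script literally does.

```python
import subprocess
import time


def drbdadm_cstate(resource):
    """Return the connection state reported by `drbdadm cstate <resource>`."""
    result = subprocess.run(
        ["drbdadm", "cstate", resource],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def drbdadm_primary(resource):
    """Promote the resource with `drbdadm primary <resource>`."""
    subprocess.run(["drbdadm", "primary", resource], check=True)


def promote_when_connected(resource, get_cstate=drbdadm_cstate,
                           promote=drbdadm_primary, timeout=60.0,
                           interval=1.0, sleep=time.sleep,
                           clock=time.monotonic):
    """Promote only after the peer connection is up.

    Returns True if the resource was promoted, False if the timeout
    expired first (in which case nothing is promoted).
    """
    deadline = clock() + timeout
    while True:
        if get_cstate(resource) == "Connected":
            promote(resource)
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Called as promote_when_connected("r0") on each node (the resource name
"r0" is just a placeholder), neither side creates a new current UUID
while still in WFConnection.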

Bastian

-- 
Behind every great man, there is a woman -- urging him on.
		-- Harry Mudd, "I, Mudd", stardate 4513.3

