Re: [Nbd] 3.12 BUG() on ext4, kernel crash on nbd-client when nbd server rebooting

To: Wouter Verhelst <wouter@...825...>
Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>, Paul Clements <paul.clements@...856...>, Jan Kara <jack@...1290...>, Wouter Verhelst <w@...112...>
Subject: Re: [Nbd] 3.12 BUG() on ext4, kernel crash on nbd-client when nbd server rebooting
From: Alex Bligh <alex@...872...>
Date: Sun, 17 Nov 2013 17:19:17 +0000
Message-id: <C2E9E8AD-752C-4190-BD4F-45A9482FF400@...872...>
In-reply-to: <52889084.2080700@...825...>
References: <8bf7c5db475eefcf17976a36f892200d@...1427...> <20131112214632.GB31763@...1426...> <7c1b2ca40c3abfe805e9e944f21c7016@...1427...> <20131114075827.GA13554@...1426...> <5285D258.9040808@...112...> <CAECXXi6Vt5gAjv=qkrGzLG3iRjNmjYiYZd7+gCXK860a2tonKg@...18...> <52889084.2080700@...825...>

On 17 Nov 2013, at 09:46, Wouter Verhelst wrote:

>> 
>> In order for nbd to seamlessly handle this situation, we'd have to do a
>> reconnect in-kernel
> 
> This would be fairly complicated, since all the connection and
> negotiation currently happens in userspace. I'm not sure I want to go
> down that route.
> 
>> (or have a callout to userland to reconnect)
> 
> That sounds interesting, too. How would you do that?
> 
>> and
>> then we'd have to retry any I/Os that may have failed in the meantime
>> (or just let them fail, but that probably is not as useful).
> 

Would another option be as follows:

1. When persistency is required, a new persist flag is specified to
   the kernel by the client.

2. On a connection failure, if the persist flag is set, don't
   clear up; instead return with a specific error number. The fd is
   still open (as it is still owned by the process), but (by
   assumption) unusable.

3. In persist mode, the block device only gets torn down when
   the fd closes / userland process terminates (whichever is
   easier, detection method TBD). Until then all writes block.

4. A newer nbd client detects the errno in persist mode, opens another
   fd, and calls the NBD_DO_IT ioctl passing the old fd as an
   additional parameter (or does a new ioctl first to associate
   the new fd with the old fd). A new kernel then detects this,
   closes the old fd, and 'takes over' the existing block device
   with the new fd.
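
To make the intended state transitions concrete, here is a rough sketch
of steps 1-4 as a toy userspace model in Python. It is not kernel code
and not an existing API: the class NbdDevice, its method names, the
old_fd parameter and the choice of ENOTCONN as the "specific error
number" are all assumptions made for illustration only.

```python
import errno
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: the proposal only says "a specific error number";
# ENOTCONN is a placeholder chosen for this sketch.
PERSIST_ERRNO = errno.ENOTCONN


@dataclass
class NbdDevice:
    """Toy model of the kernel-side state of one nbd block device."""
    persist: bool = False            # step 1: persist flag set by the client
    alive: bool = False              # block device currently exists
    connected: bool = False          # socket is usable for I/O
    sock_fd: Optional[int] = None    # fd the device was last started with
    completed: List[bytes] = field(default_factory=list)
    blocked: List[bytes] = field(default_factory=list)

    def do_it(self, fd: int, old_fd: Optional[int] = None) -> None:
        """Models the ioctl in step 4; old_fd is the hypothetical extra argument."""
        if old_fd is None:
            self.alive = True        # ordinary start, as today
        elif not (self.alive and not self.connected and old_fd == self.sock_fd):
            raise OSError(errno.EINVAL, "no stalled device for that old fd")
        # Take over the device with the new fd and release queued writes.
        self.sock_fd = fd
        self.connected = True
        self.completed.extend(self.blocked)
        self.blocked.clear()

    def write(self, data: bytes) -> bool:
        """Return True if the write completed, False if it is held back."""
        if not self.alive:
            raise OSError(errno.EIO, "device torn down")
        if self.connected:
            self.completed.append(data)
            return True
        self.blocked.append(data)    # step 3: writes block until takeover
        return False

    def connection_lost(self) -> Optional[int]:
        """Step 2: keep the device in persist mode, else tear it down."""
        self.connected = False
        if self.persist:
            return PERSIST_ERRNO     # client sees this and reconnects
        self.teardown()
        return None

    def teardown(self) -> None:
        """Step 3: fd closed or process exited; the device goes away."""
        self.alive = False
        self.connected = False
        self.sock_fd = None
        self.blocked.clear()         # held-back writes are dropped here
```

In this model a persist-mode client that gets PERSIST_ERRNO back opens
a new socket and calls do_it(new_fd, old_fd=old_fd), after which the
held-back writes complete in order. A client without the persist flag
sees the device torn down on connection loss, as it is now.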

On an old client, the kernel behaviour is thus unchanged. Similarly
if persist is not required. If a new client in persist mode crashes
after step (2), then the block device will still be torn down when
the process exits.

This avoids moving connection negotiation into the kernel (yuck,
and inflexible), avoids a callout to userland, and allows
(in theory) a reconnect whilst I/O is still active.

-- 
Alex Bligh
