Re: [Libguestfs] [libnbd PATCH v3 06/22] states: Break deadlock if server goofs on extended replies
On 7/21/23 14:50, Eric Blake wrote:
> On Thu, Jun 01, 2023 at 03:42:30PM +0200, Laszlo Ersek wrote:
>> On 5/25/23 15:00, Eric Blake wrote:
>>> One of the benefits of extended replies is that we can do a
>>> fixed-length read for the entire header of every server reply, which
>>> is fewer syscalls than the split-read approach required by structured
>>> replies.
>>
>> (Totally tangential comment: recvmsg() could scatter the incoming
>> traffic into non-contiguous fields of a non-packed structure. But I
>> don't know if TLS has anything similar. The current "linear receive"
>> approach is probably the least demanding of the underlying socket
>> abstractions. At the same time it requires us to use packed structs,
>> and/or multiple syscalls.)
>>
>>> But one of the drawbacks of doing a large read is that if
>>> the server is non-compliant (not a problem for normal servers, but
>>> something I hit rather more than I'd like to admit while developing
>>> extended header support in servers),
>>
>> Haha, so this is where it's coming from! :) I can totally relate.
>>
>>> nbd_pwrite() and friends will
>>> deadlock if the server replies with the wrong header. Add in some
>>> code to catch that failure mode and move the state machine to DEAD
>>> sooner, to make it easier to diagnose the fault in the server.
>>>
>>> Unlike in the case of an unexpected simply reply from a structured
>>
>> (1) s/simply/simple/
>
> Yep.
>
>>
>>> server (where we never over-read, and can therefore
>>
>> (2) At this point my English parser gets thrown off:
>>
>>> commit b31e7bac
>>> can merely fail the command with EPROTO and successfully move on to
>>> the next server reply),
>
> Fixing the parenthetical as follows:
>
> (where, since commit b31e7bac, we can merely fail the command with
> EPROTO and successfully move on to the next server reply, because we
> didn't read too many bytes yet)
>
>>
>> resync here:
>>
>>> in this case we really do have to move to
>>> DEAD: in addition to having already read the 16 or 20 bytes that the
>>> server sent in its (short) reply for this command, we may have already
>>> read the initial bytes of the server's next reply, but we have no way
>>> to push those extra bytes back onto our read stream for parsing on our
>>> next pass through the state machine.
>>>
>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>> ---
>>> generator/states-reply.c | 23 ++++++++++++++++++++++-
>>> 1 file changed, 22 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/generator/states-reply.c b/generator/states-reply.c
>>> index 4e9f2dde..d4710d91 100644
>>> --- a/generator/states-reply.c
>>> +++ b/generator/states-reply.c
>>> @@ -109,7 +109,28 @@ REPLY.START:
>>> REPLY.RECV_REPLY:
>>> switch (recv_into_rbuf (h)) {
>>> case -1: SET_NEXT_STATE (%.DEAD); return 0;
>>> - case 1: SET_NEXT_STATE (%.READY); return 0;
>>> + case 1: SET_NEXT_STATE (%.READY);
>>> + /* Special case: if we have a short read, but got at least far
>>> + * enough to decode the magic number, we can check if the server
>>> + * is matching our expectations. This lets us avoid deadlocking if
>>> + * a buggy server sends only 16 bytes of a simple reply, and is
>>> + * waiting for our next command, while we are blocked waiting for
>>> + * the server to send 32 bytes of an extended reply.
>>> + */
>>
>> (4) Slight inconsistency between commit message and code comment: the
>> former mentions "20 bytes", but the latter doesn't mention "structured
>> reply".
>
> Did I miss (3)? But yes, I can improve this comment.
My mistake, probably due to over-editing! :)
Thanks
Laszlo
>
>>
>>> + if (h->extended_headers &&
>>> + (char *)h->rbuf >= (char *)&h->sbuf.reply.hdr.extended.flags) {
>>
>> (.) I wonder if (address-of-magic + size-of magic) might be more
>> readable / greppable. Just in case.
>>
>> Feel free to ignore.
>
> Actually, I agree that it is nicer (although a bit longer).
>
>>
>>> + uint32_t magic = be32toh (h->sbuf.reply.hdr.extended.magic);
>>> + if (magic != NBD_EXTENDED_REPLY_MAGIC) {
>>> + SET_NEXT_STATE (%.DEAD); /* We've probably lost synchronization. */
>>> + set_error (0, "invalid or unexpected reply magic 0x%" PRIx32, magic);
>>> +#if 0 /* uncomment to see desynchronized data */
>>> + nbd_internal_hexdump (&h->sbuf.reply.hdr.extended.flags,
>>> + sizeof (h->sbuf.reply.hdr.extended.flags),
>>> + stderr);
>>> +#endif
>>> + }
>>> + }
>>> + return 0;
>>
>> (5) This could be factored out to a helper function, to share the
>> "invalid:" label logic with the previous patch.
>>
>> (6) Commencing a dump from "flags" onward looks questionable. At this
>> point, the "flags" field need not to be complete, or even started -- all
>> we know is that the "magic" field *before" "flags" is complete.
>>
>> I think we should dump "h->sbuf.reply.hdr.simple", like in patch#5.
>
> Yep, I noticed that while addressing your point (4), just before
> reading your point (6). Bug dumping the full hdr.simple is also too
> much; really, given the preconditions above, all we can dump is
> hdr.magic. At which point, the hexdump is not providing us any
> further information than what the set_error() call already output.
> I'm just deleting the #if 0 segment, which in turn eliminates the need
> to address point (5), as the two invalid: labels are no longer
> identical.
>
>>
>> (.) Side comment (so no bullet number assigned): because we can take
>> multiple iterations on RECV_REPLY, we may end up checking the "magic"
>> field multiple times. Not very costly, just something to be aware of.
>
> Indeed. In the common case, short reads are rare, and even when they
> do happen, it is even more rare to hit it more than once per packet.
> I wonder if nc or similar has a way to force a server's response to be
> flushed after every byte, to see the performance impact of maximum
> network overhead.
>
> But your observation means you are correctly aware of a larger design
> aspect of the state machine: all code handling short reads (case 1 of
> recv_into_rbuf) must be re-startable (no messing with h->rbuf, for
> example). Stateful changes can only occur when the read is complete
> (case 0) or irretrievably failed (case -1).
>
>>
>> (7) Assume that we have a short read first, but then complete
>> REPLY.RECV_REPLY successfully, and move to
>> REPLY.CHECK_SIMPLE_OR_STRUCTURED_REPLY.
> ...
>>
>> *HOWEVER*! :) If we do *not* see some short-but-together-long-enough
>> reads first, but see a full read at once, then our "pre-check" in
>> RECV_REPLY does not get activated at all! And then all the conditions in
>> CHECK_SIMPLE_OR_STRUCTURED_REPLY remain necessary.
>>
>> So this is my long-winded way to ask that:
>>
>>> case 0: SET_NEXT_STATE (%CHECK_SIMPLE_OR_STRUCTURED_REPLY);
>>> }
>>> return 0;
>>
>> you please consider adding a comment here, saying that, while the short
>> read special case *does* short-circuit a number of checks under
>> CHECK_SIMPLE_OR_STRUCTURED_REPLY, all the checks in
>> CHECK_SIMPLE_OR_STRUCTURED_REPLY remain required if under RECV_REPLY we
>> get a full read immediately.
>
> Good idea. Added for v4.
>
Reply to: