On 3/22/19 2:42 PM, Nir Soffer wrote:
>> Add a protocol flag and corresponding transmission advertisement flag
>> to make it easier for clients to inform the server of their intent. If
>> the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two
>> things: to perform a fallback to write when the client does not
>> request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the
>> lower network overhead); and to fail quickly with ENOTSUP if the
>> client requested the flag but the server cannot write zeroes more
>> efficiently than a normal write (so that the client is not penalized
>> with the time of writing data areas of the disk twice).
>>
>
> I think the issue is not that zero is slow as normal write, but that it is
> not fast enough so it worth the zero entire disk before writing data.

In an image copy, where you don't know whether the destination already
started life all zero, you HAVE to copy zeroes into the image for the
holes; the only question is whether pre-filling the entire image (with
fewer calls) and then overwriting the prefill is faster than writing the
data areas just once. So there is a tradeoff between the overhead of
many small-length WRITE_ZEROES for the holes and the overhead of one
large-length WRITE_ZEROES for the entire image. There is ALSO the factor
of how much of the image is holes vs. data - a pre-fill of only 10% of
the image (mostly sparse) is less wasteful than a pre-fill of 90% of the
image (mostly dense) - but that waste costs nothing if prefill is O(1)
regardless of size, and is painful if it is O(n) in size. There are
definitely heuristics at play, and I don't know that the NBD spec can
give any strong advice on what type of speedups are in play, only on
whether write zeroes is on par with normal writes.
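To make the tradeoff concrete, here is a rough cost model as a sketch -
the function name, the extent encoding, and the rate parameters are all
invented for illustration, not anything from the protocol:

```python
def copy_cost(extents, per_call_overhead, write_rate, zero_rate):
    """Estimate total time for two strategies of copying a sparse image.

    extents: list of (length, is_data) tuples describing the source.
    per_call_overhead: fixed cost (seconds) of issuing one NBD request.
    write_rate: bytes/second achieved by NBD_CMD_WRITE.
    zero_rate: bytes/second achieved by NBD_CMD_WRITE_ZEROES.
    Returns (per_hole_time, prefill_time).
    """
    data = sum(length for length, is_data in extents if is_data)
    holes = sum(length for length, is_data in extents if not is_data)
    data_calls = sum(1 for _, is_data in extents if is_data)
    hole_calls = sum(1 for _, is_data in extents if not is_data)

    # Strategy A: one small WRITE_ZEROES per hole, plus the data writes.
    per_hole = ((data_calls + hole_calls) * per_call_overhead
                + data / write_rate + holes / zero_rate)

    # Strategy B: one whole-image WRITE_ZEROES, then overwrite the
    # prefill with the data areas (so data regions are written twice).
    prefill = ((data_calls + 1) * per_call_overhead
               + data / write_rate + (data + holes) / zero_rate)

    return per_hole, prefill
```

With a near-infinite `zero_rate` (the O(1) prefill case), strategy B
wins on call count for mostly-sparse images; with `zero_rate` equal to
`write_rate` (the slow-fallback case), strategy B wastes the time of
rewriting every data area, which is exactly what FAST_ZERO probing is
meant to let a client avoid.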
And, given the uncertainties on what speedups (or slowdowns) a pre-fill
might cause, it DOES show that knowing whether an image started life all
zero is an even better optimization, because then you don't have to
waste any time on overwriting holes. But having another way to speed
things up does not necessarily render this proposal useless.

>> Note that the Linux fallocate(2) interface may or may not be powerful
>> enough to easily determine if zeroing will be efficient - in
>> particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that
>> insight; for block devices, it is known that ioctl(BLKZEROOUT) does
>> NOT have a way for userspace to probe if it is efficient or slow. But
>> with enough demand, the kernel may add another FALLOC_FL_ flag to use
>> with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed
>> ENOTSUP failures if a fast path cannot be taken. If a server cannot
>> easily determine if write zeroes will be efficient, it is better off
>> not advertising NBD_FLAG_SEND_FAST_ZERO.
>>
>
> I think this can work for file based images. If fallocate() fails, the
> client will get ENOTSUP after the first call quickly.

The negative case is fast, but that doesn't say anything about the
positive case. Unless Linux adds a new FALLOC_FL_ bit, you have no
guarantee that fallocate() reporting success didn't happen because the
kernel fell back to a slow write internally. If fallocate() comes back
quickly, you got lucky; but if it takes the full time of a write(), you
lost your window of opportunity to report ENOTSUP quickly. Hence my hope
that the kernel folks add a new FALLOC_FL_ flag to give us the semantics
we want (a guaranteed way to avoid slow fallbacks).

>
> For block device we don't have any way to know if a fallocate() or
> BLKZEROOUT will be fast, so I guess servers will never advertise
> FAST_ZERO.
>

As I said, you don't know that with BLKZEROOUT, but the kernel might
give us another ioctl that DOES know.
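The asymmetry between the negative and positive probes can be sketched
as server-side decision logic. Everything here is hypothetical: the
function names, the flag's bit position, and especially the
`kernel_guarantees_fast` input, which stands in for the FALLOC_FL_ flag
or ioctl that does not exist yet:

```python
NBD_FLAG_SEND_FAST_ZERO = 1 << 11  # bit position chosen for illustration

def probe_fast_zero(supports_zero_range, kernel_guarantees_fast):
    """Decide whether a server may advertise FAST_ZERO.

    supports_zero_range: fallocate(FALLOC_FL_ZERO_RANGE) succeeds at all.
    kernel_guarantees_fast: a (hypothetical) kernel interface that fails
    with ENOTSUP instead of silently falling back to slow writes.
    """
    if not supports_zero_range:
        # The negative probe is quick: zeroing will always be emulated,
        # so it can never beat a plain write.
        return False
    # Success alone proves nothing - the kernel may have fallen back to
    # a slow write internally - so only advertise when a guaranteed
    # fast path can be probed.
    return kernel_guarantees_fast

def transmission_flags(supports_zero_range, kernel_guarantees_fast):
    flags = 0
    if probe_fast_zero(supports_zero_range, kernel_guarantees_fast):
        flags |= NBD_FLAG_SEND_FAST_ZERO
    return flags
```

Note how the middle case (zeroing works, but no fast-path guarantee)
must decline to advertise, per the "better off not advertising" language
in the proposal.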
> Generally this new flag usefulness is limited. It will only help
> qemu-img to convert faster to file based images.

A limited use case is still a use case. If there are cases where you can
optimize by a simple extension to the protocol, and where either side
lacking the extension is not fatal to the protocol, then it is worth
doing. And so far, that is what this feels like to me.

>
> Do we have performance measurements showing significant speed up when
> zeroing the entire image before copying data, compared with zeroing
> only the unallocated ranges?

Kevin may have more of an idea based on the patches he wrote for
qemu-img, which spurred me into proposing this email; maybe he can share
numbers for his testing on regular files and/or block devices, to at
least get a feel for whether a speedup is likely with a sufficient NBD
server.

>
> For example if the best speedup we can get in real world scenario is
> 2%, is it worth complicating the protocol and using another bit?

Gaining 2% of an hour may still be worth it.

>> +  set. Servers SHOULD NOT set this transmission flag if there is no
>> +  quick way to determine whether a particular write zeroes request
>> +  will be efficient, but the lack of an efficient write zero
>>
>
> I think we should use "fast" instead of "efficient". For example when
> the kernel falls back to manual zeroing it is probably the most
> efficient way it can be done, but it is not fast.

Seems like a simple enough wording change.

>> @@ -2114,6 +2151,7 @@ The following error values are defined:
>>  * `EINVAL` (22), Invalid argument.
>>  * `ENOSPC` (28), No space left on device.
>>  * `EOVERFLOW` (75), Value too large.
>> +* `ENOTSUP` (95), Operation not supported.
>>  * `ESHUTDOWN` (108), Server is in the process of being shut down.
>>
>>  The server SHOULD return `ENOSPC` if it receives a write request
>> @@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block
>> sizes.
>> Finally, it
>> SHOULD return `EPERM` if it receives a write or trim request on a
>> read-only export.
>>
>> +The server SHOULD NOT return `ENOTSUP` except as documented in
>> +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is
>> +supported.
>>
>
> This makes ENOTSUP less useful. I think it should be allowed to return
> ENOTSUP as response for other commands if needed.

Sorry, but we have the problem of back-compat to worry about. Remember,
the error values permitted in the NBD protocol are system-agnostic (they
_happen_ to match Linux errno values, but not all the world uses the
same values for those errors in their libc, so portable implementations
HAVE to map between NBD_EINVAL sent over the wire and the libc EINVAL
used internally, even if the mapping is 1:1 on Linux). Since the NBD
protocol has documented only a finite subset of valid errors, and
portable clients have to implement a mapping, it is very probable that
there exist clients written against the current NBD spec that will choke
hard (and probably hang up the connection) on receiving an unexpected
error number from the server that was not pre-compiled into their
mapping.

ANY server that replies with ENOTSUP at the moment is in violation of
the existing server requirements, whether or not clients have a high
quality of implementation and manage to tolerate the server's
noncompliance. Thus, when we add new errno values as valid returns, we
have to take care that servers SHOULD NOT send the new errno except to
clients that are prepared for the error - a server merely advertising
NBD_FLAG_SEND_FAST_ZERO is _still_ insufficient to give the server the
right to send ENOTSUP (since the server can't know whether the client
recognized the advertisement, at least until the client finally sends a
NBD_CMD_FLAG_FAST_ZERO flag).
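A sketch of the finite wire-to-libc mapping such a portable client
carries (the helper name is invented; the table follows the error list
in the spec excerpt above, plus the other values the spec already
documents; Python spells ENOTSUP as `errno.EOPNOTSUPP`, which happens to
share value 95 on Linux):

```python
import errno

# NBD wire error values. These happen to match Linux errno numbers, but
# a portable client must not assume its own libc agrees, hence the map.
NBD_WIRE_ERRORS = {
    1: errno.EPERM,
    5: errno.EIO,
    12: errno.ENOMEM,
    22: errno.EINVAL,
    28: errno.ENOSPC,
    75: errno.EOVERFLOW,
    95: errno.EOPNOTSUPP,  # new in this proposal; older clients lack it
    108: errno.ESHUTDOWN,
}

def map_wire_error(wire_value):
    """Translate a wire error to a local errno, raising on unknown
    values - modeling the 'choke hard' behavior of a strict client
    whose mapping predates ENOTSUP."""
    try:
        return NBD_WIRE_ERRORS[wire_value]
    except KeyError:
        raise ValueError("unrecognized NBD error %d" % wire_value)
```

A client built from this table before value 95 was documented would hit
the `ValueError` path on the first stray ENOTSUP, which is exactly why
the server needs explicit permission before sending it.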
(Note, I said SHOULD NOT, not MUST NOT - if your server goofs and leaks
ENOTSUP to a client on any other command, most clients will still be
okay, and so you probably won't have people complaining that your server
is broken. The only MUST NOT send ENOTSUP is the case where the server
advertised FAST_ZERO probing and the client did not request FAST_ZERO,
because then the server has to assume the client is relying on the
server to do fallback handling for reduced network traffic.)

>
> I think this makes sense, and should work, but we need more data
> supporting that this is useful in practice.

Fair enough - since Kevin already has patches proposed against qemu to
wire up a qemu flag BDRV_REQ_NO_FALLBACK, which should map in a rather
straightforward manner to my NBD proposal (any qemu request sent with
the BDRV_REQ_NO_FALLBACK bit set turns into an NBD_CMD_WRITE_ZEROES with
NBD_CMD_FLAG_FAST_ZERO set), it should be pretty easy for me to
demonstrate a timing analysis of the proposed reference implementation,
to prove that it either makes a noticeable difference or is in the
noise. But it may be a couple of weeks before I work on a reference
implementation - even if Kevin's patches are qemu 4.0 material to fix a
speed regression, getting a new NBD protocol extension included during
feature freeze is too much of a stretch.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
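P.S. The client flow described in this thread (try the fast path, fall
back to writing the holes only when the server reports ENOTSUP quickly)
can be sketched as follows; the function names and the stub server are
hypothetical, and real request plumbing is elided:

```python
import errno

class StubServer:
    """Stand-in for an NBD connection; `fast` models whether the server
    can actually zero faster than a plain write."""
    def __init__(self, fast):
        self.fast = fast

    def write_zeroes(self, offset, length, fast_only):
        if fast_only and not self.fast:
            # Fail quickly instead of burning write()-sized time.
            return errno.EOPNOTSUPP  # ENOTSUP (95) on the wire
        return 0

def zero_range(server, offset, length, server_has_fast_zero):
    """Client-side strategy for one zero request."""
    if server_has_fast_zero:
        err = server.write_zeroes(offset, length, fast_only=True)
        if err == 0:
            return "fast-zero"          # whole-image prefill is cheap
        if err != errno.EOPNOTSUPP:
            raise OSError(err, "write_zeroes failed")
        # ENOTSUP came back quickly: skip the prefill, write data once.
        return "skip-prefill"
    # No FAST_ZERO advertised: plain write zeroes, where the server may
    # do the slow fallback on our behalf (saving network traffic).
    server.write_zeroes(offset, length, fast_only=False)
    return "plain-zero"
```

This mirrors the intended BDRV_REQ_NO_FALLBACK mapping: the client
spends at most one quick round trip learning that the prefill strategy
is not worth attempting.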