Re: [Nbd] [PATCH/RFC] Add support for NBD_CMD_WRITE_ZEROES
- To: Wouter Verhelst <w@...112...>
- Cc: "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>
- Subject: Re: [Nbd] [PATCH/RFC] Add support for NBD_CMD_WRITE_ZEROES
- From: Alex Bligh <alex@...872...>
- Date: Sat, 16 Apr 2016 18:39:16 +0100
- Message-id: <736ADB91-9F8F-4EF1-B49F-AF2FE688DEAD@...872...>
- In-reply-to: <20160416173035.GC3033@...3...>
- References: <1460721838-44985-1-git-send-email-alex@...872...> <5710F65A.1080801@...696...> <AD69F7C6-1B65-48DA-942E-E858D3E4348D@...872...> <20160416173035.GC3033@...3...>
Wouter,
On 16 Apr 2016, at 18:30, Wouter Verhelst <w@...112...> wrote:
> On Fri, Apr 15, 2016 at 04:50:45PM +0100, Alex Bligh wrote:
>>
>> On 15 Apr 2016, at 15:10, Eric Blake <eblake@...696...> wrote:
>>
>>> You completely ignore NBD_CMD_FLAG_NO_HOLE; you could obey that flag
>>> (when set, do the calloc() here; when cleared, call into the trim code
>>> if you can guarantee that you'll read zero after trims).
>>
>> Sadly you cannot guarantee that you will read zeroes after
>> exptrim is called. Exceptions include:
>>
>> 1. F_COPYONWRITE being set (trim is ignored)
>
> I've been wanting to modify copyonwrite so that when you TRIM, it clears
> out the diff file, but not the original file.
>
> That would still mean it doesn't zero out stuff, obviously.
Yeah I looked at that. I'm not that familiar with the CoW (or
treefiles) stuff so didn't fancy coding it.
>> 2. The initial and final blocks if F_TREEFILES is set if either of
>> the start and end of the read is not aligned to TREEPAGESIZE
>
> We could detect if that's the case, and split the request up into three:
> head and tail with writes, middle part with treefile handling.
>
> Due to the nature of how the treefiles stuff works, it's 100% guaranteed
> that that would work.
Indeed. And I suspect CoW requires similar. And it would also be better
to pass through the fact it is a 'zero' write to the underlying
functions (I was going to use buf=null to mean 'write zeroes' which
would quickly flag up any error).
>> 3. Any underlying filesystems which don't support fallocate
>> with PUNCH_HOLE (we don't check the error return)
>
> Would be easy to add that.
Yep.
>> 4. Probably something on Windows (!)
>
> Gotta admit, I don't know what the semantics of the Windows punch_hole
> equivalent are.
>From a very quick look online, some filing systems work
and some don't.
>>> But if nothing
>>> else, you have a bug for not permitting the NBD_CMD_FLAG_NO_HOLE (even
>>> if you intend to ignore it by ALWAYS operating in no-hole mode, as was
>>> done here).
>>
>> Or I could do the physical write of zeroes and then call trim afterwards
>> if NBD_CMD_FLAG_NO_HOLE is *not* set and trim is supported.
>>
>> As fallocate() and friends are cheap, I think I will do that.
>
> That gets awkward if you remember all the possible backing store modes.
Yeah, it will be slow on some backing store modes. Fortunately
(or otherwise) on some of them trim doesn't actually do anything
so it won't cause an issue!
> I really should get started on that abstraction layer I was talking
> about earlier.
>
> I've been thinking we should also have the TRIM and WRITE_ZEROES
> handling be combined into one set of functions. You'd have a "expclear"
> function or some such which has two booleans:
>
> - must_zero: when WRITE_ZEROES was called, set it to true. Operation
> must then set all affected bytes to zero.
> - may_hole: when either TRIM was called or WRITE_ZEROES doesn't have the
> NO_HOLE flag set. Operation may deallocate the range if possible.
That's interesting. I was thinking about combining it with 'write'
with buf=0, and possibly keeping a page (mmap with MAP_ANON if
mmap is available) of zeroes around in virtual memory, marked as
read only, they could use.
>> Someone should really do this properly (which is non-trivial); I was
>> really getting a baseline code example for the extension branch
>> demonstration.
>
> We should probably keep it at that for now -- possibly add a "todo" item
> there.
Indeed. It was mainly meant to serve as an example of code in an
extension branch, and to give Eric a chance to test interoperability
with his qemu work.
--
Alex Bligh
Reply to: