Re: [Nbd] [RFC] [PATCH] support FUA and FLUSH
- To: Alex Bligh <alex@...872...>
- Cc: nbd-general@lists.sourceforge.net
- Subject: Re: [Nbd] [RFC] [PATCH] support FUA and FLUSH
- From: Wouter Verhelst <w@...112...>
- Date: Mon, 16 May 2011 15:28:32 +0200
- Message-id: <20110516132832.GC2954@...510...>
- In-reply-to: <34D6AC5A48C6021D03F87234@...873...>
- References: <34D6AC5A48C6021D03F87234@...873...>
On Mon, May 16, 2011 at 01:46:09PM +0100, Alex Bligh wrote:
> Enclosed is a patch (COMPILE TESTED ONLY - DO NOT RUN ON LIVE SYSTEMS) which
> supports FUA (force unit access) and FLUSH.
>
> FUA means "force the current write to hit the media service" and FLUSH
> means "empty the current write queue to disk". Broadly they have the same
> semantics as the linux kernel REQ_FLUSH and REQ_FUA. FUA is implemented
> through sync_file_range() (or if that doesn't exist, fdatasync() on the
> file handle concerned), FLUSH through fdatasync() on all files. FUA and
> FLUSH are selected in the config file, and set new flags bits which will
> cause the client to sent FUA and FLUSH requests. The way these are
> implemented is further explained in doc/proto.txt.
>
> The purpose of this is reasonably obvious: without supporting either FUA
> or FLUSH (and it's relatively easy to support both), filesystems on the
> client have no way to ensure the relevant sectors have hit the disk.
> The patch is implement such that the default behaviour is unchanged.
>
> Additionally, it introduces an F_ROTATIONAL flag. This will turn off
> the use of QUEUE_FLAG_NONROT in the client. QUEUE_FLAG_NONROT effectively
> disables the elevator algorithm, making the algorithm merge only. That is
> unhelpful where the server does not have its own elevator algorithm
> or where the client elevator algorithm is neutered (e.g. writing to a
> raw partition with F_SYNC with nbd-server). It's not going to be used
> often where the backing store is a file.
>
> It also incidentally fixes a bug where F_SYNC is ignored if F_COPYONWRITE
> is set (not that this currently has much utility).
>
> I have not written the client side of this yet (see comment re "compile
> tested only"). What I'd like to know (before I spend too much time on it)
> is whether this is a useful course to pursue, and whether I'm doing this
> in a vaguely useful way. I repeat the point about this code not being
> tested at all beyond compilation.
>
> Clean version of patch at
> http://www.alex.org.uk/nbd-flush-fua-01.patch
> in case my mailer mangles newlines
>
> --
> Alex Bligh
>
> Signed-Off-By: Alex Bligh <alex@...872...>
>
> diff --git a/cliserv.h b/cliserv.h
> index 51c8bd1..63c2743 100644
> --- a/cliserv.h
> +++ b/cliserv.h
> @@ -40,7 +40,12 @@ typedef unsigned long long u64;
> #include "nbd.h"
>
> #if NBD_LFS==1
> +/* /usr/include/features.h (included from /usr/include/sys/types.h)
> + defines this when _GNU_SOURCE is defined
> + */
> +#ifndef _LARGEFILE_SOURCE
> #define _LARGEFILE_SOURCE
> +#endif
> #define _FILE_OFFSET_BITS 64
> #endif
>
> @@ -135,6 +140,9 @@ u64 ntohll(u64 a) {
> /* Flags used between the client and server */
> #define NBD_FLAG_HAS_FLAGS (1 << 0) /* Flags are there */
> #define NBD_FLAG_READ_ONLY (1 << 1) /* Device is read-only */
> +#define NBD_FLAG_SEND_FLUSH (1 << 2) /* Send FLUSH */
> +#define NBD_FLAG_SEND_FUA (1 << 3) /* Send FLUSH */
"Send FUA" rather than "Send FLUSH"? Also, please expand what 'FUA'
stands for in this comment.
> +#define NBD_FLAG_ROTATIONAL (1 << 3) /* Use elevator algorithm -
> rotational media */
That's the same value as the _SEND_FUA thing here; should that not be
different?
> #define NBD_DEFAULT_PORT "10809" /* Port on which named exports are
> * served */
> diff --git a/configure.ac b/configure.ac
> index f50193f..c78177b 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -94,6 +94,9 @@ AC_CHECK_SIZEOF(unsigned int)
> AC_CHECK_SIZEOF(unsigned long int)
> AC_CHECK_SIZEOF(unsigned long long int)
> AC_CHECK_FUNCS([llseek alarm gethostbyname inet_ntoa memset socket
> strerror strstr])
> +AC_CHECK_FUNC([sync_file_range],
> + [AC_DEFINE([HAVE_SYNC_FILE_RANGE], [sync_file_range(2) is not supported],
> [sync_file_range(2) is supported])],
> + [])
> AC_FUNC_FORK
> AC_FUNC_SETVBUF_REVERSED
> AC_MSG_CHECKING(whether client should be built)
> diff --git a/doc/proto.txt b/doc/proto.txt
> index fe5e819..d9a269b 100644
> --- a/doc/proto.txt
> +++ b/doc/proto.txt
> @@ -26,8 +26,9 @@ server during the handshake.
> There are two message types in the data pushing phase: the request, and
> the response.
>
> -There are three request types in the data pushing phase: NBD_CMD_READ,
> -NBD_CMD_WRITE, and NBD_CMD_DISC (disconnect).
> +There are four request types in the data pushing phase: NBD_CMD_READ,
> +NBD_CMD_WRITE, NBD_CMD_WRITE_FUA, NBD_CMD_DISC (disconnect), and
> +NBD_CMD_FLUSH.
>
> The request is sent by the client; the response by the server. A request
> header consists a 32 bit magic number (magic), a 32 bit field denoting
> @@ -50,6 +51,15 @@ we change that to asynchronous handling, handling the
> disconnect request
> will probably be postponed until there are no other outstanding
> requests.
>
> +NBD_CMD_WRITE_FUA is the equivalent of NBD_WRITE, with an implied
> +FUA (force unit access). This is implementing using sync_file_range
> +if present, else by fdatasync() of that file (note not all files
> +in a multifile environment). NBD_CMD_WRITE_FUA will not be sent
> +unless NBD_FLAG_SEND_FUA is set.
> +
> +A flush request will not be sent unless NBD_FLAG_SEND_FLUSH is set,
> +and indicates the backing file should be fdatasync()'d to disk.
> +
> There are two versions of the negotiation: the 'old' style (nbd <=
> 2.9.16) and the 'new' style (nbd >= 2.9.17, though due to a bug it does
> not work with anything below 2.9.18). What follows is a description of
> diff --git a/nbd-server.c b/nbd-server.c
> index 4b9a083..88e4615 100644
> --- a/nbd-server.c
> +++ b/nbd-server.c
> @@ -55,6 +55,9 @@
> * Suriya Soutmun <darksolar@...17...>
> */
>
> +/* For sync_file_range */
> +#define _GNU_SOURCE
> +
This should perhaps be inside an #ifdef HAVE_SYNC_FILE_RANGE, too.
> /* Includes LFS defines, which defines behaviours of some of the following
> * headers, so must come before those */
> #include "lfs.h"
> @@ -134,11 +137,13 @@ gboolean do_oldstyle=FALSE;
> #define DEBUG2( a,b ) printf( a,b )
> #define DEBUG3( a,b,c ) printf( a,b,c )
> #define DEBUG4( a,b,c,d ) printf( a,b,c,d )
> +#define DEBUG5( a,b,c,d,e ) printf( a,b,c,d,e )
> #else
> #define DEBUG( a )
> #define DEBUG2( a,b )
> #define DEBUG3( a,b,c )
> #define DEBUG4( a,b,c,d )
> +#define DEBUG5( a,b,c,d,e )
I should probably change these to use C99 varargs macros (we require C99
these days anyway). Oh well, -- some other time, I guess :-)
> #endif
> #ifndef PACKAGE_VERSION
> #define PACKAGE_VERSION ""
> @@ -160,6 +165,9 @@ gboolean do_oldstyle=FALSE;
> #define F_SPARSE 16 /**< flag to tell us copyronwrite should use a
> sparse file */
> #define F_SDP 32 /**< flag to tell us the export should be done using
> the Socket Direct Protocol for RDMA */
> #define F_SYNC 64 /**< Whether to fsync() after a write */
> +#define F_FLUSH 128 /**< Whether server wants FLUSH to be sent by the
> client */
> +#define F_FUA 128 /**< Whether server wants FUA to be sent by the client
These should probably not be the same values?
> */
> +#define F_ROTATIONAL 128 /**< Whether server wants the client to
> implement the elevator algorithm */
> GHashTable *children;
> char pidfname[256]; /**< name of our PID file */
> char pidftemplate[256]; /**< template to be used for the filename of the
> PID file */
> @@ -713,6 +721,9 @@ GArray* parse_cfile(gchar* f, GError** e) {
> { "sparse_cow", FALSE, PARAM_BOOL, NULL, F_SPARSE },
> { "sdp", FALSE, PARAM_BOOL, NULL, F_SDP },
> { "sync", FALSE, PARAM_BOOL, NULL, F_SYNC },
> + { "flush", FALSE, PARAM_BOOL, NULL, F_FLUSH },
> + { "fua", FALSE, PARAM_BOOL, NULL, F_FUA },
> + { "rotational", FALSE, PARAM_BOOL, NULL, F_ROTATIONAL },
> { "listenaddr", FALSE, PARAM_STRING, NULL, 0 },
> { "maxconnections", FALSE, PARAM_INT, NULL, 0 },
> };
> @@ -763,9 +774,10 @@ GArray* parse_cfile(gchar* f, GError** e) {
> lp[6].target=&(s.postrun);
> lp[7].target=lp[8].target=lp[9].target=
> lp[10].target=lp[11].target=
> - lp[12].target=&(s.flags);
> - lp[13].target=&(s.listenaddr);
> - lp[14].target=&(s.max_connections);
> + lp[12].target=lp[13].target=
> + lp[14].target=lp[15].target=&(s.flags);
> + lp[16].target=&(s.listenaddr);
> + lp[17].target=&(s.max_connections);
>
> /* After the [generic] group, start parsing exports */
> if(i==1) {
> @@ -1060,7 +1072,7 @@ void myseek(int handle,off_t a) {
> * @param client The client we're serving for
> * @return The number of bytes actually written, or -1 in case of an error
> **/
> -ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client) {
> +ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client, int
> fua) {
> int fhandle;
> off_t foffset;
> size_t maxbytes;
> @@ -1071,12 +1083,20 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len,
> CLIENT *client) {
> if(maxbytes && len > maxbytes)
> len = maxbytes;
>
> - DEBUG4("(WRITE to fd %d offset %llu len %u), ", fhandle, foffset, len);
> + DEBUG5("(WRITE to fd %d offset %llu len %u fua %d), ", fhandle, foffset,
> len, fua);
>
> myseek(fhandle, foffset);
> retval = write(fhandle, buf, len);
> if(client->server->flags & F_SYNC) {
> fsync(fhandle);
> + } else if (fua) {
> +#ifdef HAVE_SYNC_FILE_RANGE
> + sync_file_range(fhandle, foffset, len,
> + SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
> + SYNC_FILE_RANGE_WAIT_AFTER);
> +#else
> + fdatasync(fhandle);
> +#endif
> }
> return retval;
> }
> @@ -1085,10 +1105,10 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len,
> CLIENT *client) {
> * Call rawexpwrite repeatedly until all data has been written.
> * @return 0 on success, nonzero on failure
> **/
> -int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client) {
> +int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client, int
> fua) {
> ssize_t ret=0;
>
> - while(len > 0 && (ret=rawexpwrite(a, buf, len, client)) > 0 ) {
> + while(len > 0 && (ret=rawexpwrite(a, buf, len, client, fua)) > 0 ) {
> a += ret;
> buf += ret;
> len -= ret;
> @@ -1189,7 +1209,7 @@ int expread(off_t a, char *buf, size_t len, CLIENT
> *client) {
> * @param client The client we're going to write for.
> * @return 0 on success, nonzero on failure
> **/
> -int expwrite(off_t a, char *buf, size_t len, CLIENT *client) {
> +int expwrite(off_t a, char *buf, size_t len, CLIENT *client, int fua) {
> char pagebuf[DIFFPAGESIZE];
> off_t mapcnt,mapl,maph;
> off_t wrlen,rdlen;
> @@ -1197,7 +1217,7 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT
> *client) {
> off_t offset;
>
> if (!(client->server->flags & F_COPYONWRITE))
> - return(rawexpwrite_fully(a, buf, len, client));
> + return(rawexpwrite_fully(a, buf, len, client, fua));
Gratuitous whitespace change?
> DEBUG3("Asked to write %d bytes at %llu.\n", len, (unsigned long long)a);
>
> mapl=a/DIFFPAGESIZE ; maph=(a+len-1)/DIFFPAGESIZE ;
> @@ -1230,6 +1250,33 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT
> *client) {
> }
> len-=wrlen ; a+=wrlen ; buf+=wrlen ;
> }
> + if (client->server->flags & F_SYNC) {
> + fsync(client->difffile);
> + } else if (fua) {
> + /* open question: would it be cheaper to do multiple sync_file_ranges?
> + as we iterate through the above?
> + */
Good question... it might be a good idea to test this.
> + fdatasync(client->difffile);
> + }
> + return 0;
> +}
> +
> +int expflush(CLIENT *client) {
> + int fhandle;
> + off_t foffset;
> + size_t maxbytes;
> + gint i;
> +
> + if (!(client->server->flags & F_COPYONWRITE)) {
> + return fsync(client->difffile);
> + }
> +
> + for (i = 0; i < client->export->len; i++) {
> + FILE_INFO fi = g_array_index(client->export, FILE_INFO, i);
> + if (fsync(fi.fhandle) < 0)
> + return -1;
> + }
> +
> return 0;
> }
>
> @@ -1322,6 +1369,12 @@ CLIENT* negotiate(int net, CLIENT *client, GArray*
> servers) {
> err("Negotiation failed: %m");
> if (client->server->flags & F_READONLY)
> flags |= NBD_FLAG_READ_ONLY;
> + if (client->server->flags & F_FLUSH)
> + flags |= NBD_FLAG_SEND_FLUSH;
> + if (client->server->flags & F_FUA)
> + flags |= NBD_FLAG_SEND_FUA;
> + if (client->server->flags & F_ROTATIONAL)
> + flags |= NBD_FLAG_ROTATIONAL;
> if (!client->modern) {
> /* oldstyle */
> flags = htonl(flags);
> @@ -1419,7 +1472,7 @@ int mainloop(CLIENT *client) {
> continue;
> }
>
> - if (request.type==NBD_CMD_WRITE) {
> + if ((request.type==NBD_CMD_WRITE) || (request.type==NBD_CMD_WRITE_FUA)) {
This makes me think that the _FUA thing should be something that can be
masked away; that way, we can do stuff like
int type = request.type & ~NBD_CMD_FLAG_FUA;
if (type == NBD_CMD_WRITE) {
...
}
or something similar, which seems cleaner. It would also allow it to be
applied to other request types without having to special-case too much;
e.g., there's been talk of a truncate command, which could sparsify the
backend. I guess that could also use this FUA thing.
> DEBUG("wr: net->buf, ");
> while(len > 0) {
> readit(client->net, buf, currlen);
> @@ -1430,11 +1483,12 @@ int mainloop(CLIENT *client) {
> ERROR(client, reply, EPERM);
> continue;
> }
> - if (expwrite(request.from, buf, len, client)) {
> + if (expwrite(request.from, buf, len, client,
> + (request.type==NBD_CMD_WRITE_FUA)?1:0)) {
... and here you could then use
request.type & NBD_CMD_FLAG_FUA
rather than the tristate.
> DEBUG("Write failed: %m" );
> ERROR(client, reply, errno);
> continue;
> - }
> + }
Another gratuitous whitespace change.
> SEND(client->net, reply);
> DEBUG("OK!\n");
> len -= currlen;
> @@ -1442,6 +1496,19 @@ int mainloop(CLIENT *client) {
> }
> continue;
> }
> +
> + if (request.type==NBD_CMD_FLUSH) {
> + DEBUG("fl: ");
> + if (expflush(client)) {
> + DEBUG("Flush failed: %m");
> + ERROR(client, reply, errno);
> + continue;
> + }
> + SEND(client->net, reply);
> + DEBUG("OK!\n");
> + continue;
> + }
> +
> /* READ */
>
> DEBUG("exp->buf, ");
> diff --git a/nbd.h b/nbd.h
> index 451f50c..05488b0 100644
> --- a/nbd.h
> +++ b/nbd.h
> @@ -31,7 +31,9 @@
> enum {
> NBD_CMD_READ = 0,
> NBD_CMD_WRITE = 1,
> - NBD_CMD_DISC = 2
> + NBD_CMD_DISC = 2,
> + NBD_CMD_FLUSH = 3,
> + NBD_CMD_WRITE_FUA = 4
> };
This file is taken from the kernel, so should probably be patched there
rather than here (though of course it would be hard to compile-test
without these bits).
Please also document the new config file options in nbd-server.5.in.sgml.
Other than that, seems fine to me.
--
The volume of a pizza of thickness a and radius z can be described by
the following formula:
pi zz a
Reply to: