[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: [Nbd] [RFC] [PATCH] support FUA and FLUSH



On Mon, May 16, 2011 at 01:46:09PM +0100, Alex Bligh wrote:
> Enclosed is a patch (COMPILE TESTED ONLY - DO NOT RUN ON LIVE SYSTEMS) which
> supports FUA (force unit access) and FLUSH.
> 
> FUA means "force the current write to hit the media service" and FLUSH
> means "empty the current write queue to disk". Broadly they have the same
> semantics as the linux kernel REQ_FLUSH and REQ_FUA. FUA is implemented
> through sync_file_range() (or if that doesn't exist, fdatasync() on the
> file handle concerned), FLUSH through fdatasync() on all files. FUA and
> FLUSH are selected in the config file, and set new flags bits which will
> cause the client to sent FUA and FLUSH requests. The way these are
> implemented is further explained in doc/proto.txt.
> 
> The purpose of this is reasonably obvious: without supporting either FUA
> or FLUSH (and it's relatively easy to support both), filesystems on the
> client have no way to ensure the relevant sectors have hit the disk.
> The patch is implement such that the default behaviour is unchanged.
> 
> Additionally, it introduces an F_ROTATIONAL flag. This will turn off
> the use of QUEUE_FLAG_NONROT in the client. QUEUE_FLAG_NONROT effectively
> disables the elevator algorithm, making the algorithm merge only. That is
> unhelpful where the server does not have its own elevator algorithm
> or where the client elevator algorithm is neutered (e.g. writing to a
> raw partition with F_SYNC with nbd-server). It's not going to be used
> often where the backing store is a file.
> 
> It also incidentally fixes a bug where F_SYNC is ignored if F_COPYONWRITE
> is set (not that this currently has much utility).
> 
> I have not written the client side of this yet (see comment re "compile
> tested only"). What I'd like to know (before I spend too much time on it)
> is whether this is a useful course to pursue, and whether I'm doing this
> in a vaguely useful way. I repeat the point about this code not being
> tested at all beyond compilation.
> 
> Clean version of patch at
>   http://www.alex.org.uk/nbd-flush-fua-01.patch
> in case my mailer mangles newlines
> 
> -- 
> Alex Bligh
> 
> Signed-Off-By: Alex Bligh <alex@...872...>
> 
> diff --git a/cliserv.h b/cliserv.h
> index 51c8bd1..63c2743 100644
> --- a/cliserv.h
> +++ b/cliserv.h
> @@ -40,7 +40,12 @@ typedef unsigned long long u64;
>  #include "nbd.h"
> 
>  #if NBD_LFS==1
> +/* /usr/include/features.h (included from /usr/include/sys/types.h)
> +   defines this when _GNU_SOURCE is defined
> + */
> +#ifndef _LARGEFILE_SOURCE
>  #define _LARGEFILE_SOURCE
> +#endif
>  #define _FILE_OFFSET_BITS 64
>  #endif
> 
> @@ -135,6 +140,9 @@ u64 ntohll(u64 a) {
>  /* Flags used between the client and server */
>  #define NBD_FLAG_HAS_FLAGS	(1 << 0)	/* Flags are there */
>  #define NBD_FLAG_READ_ONLY	(1 << 1)	/* Device is read-only */
> +#define NBD_FLAG_SEND_FLUSH	(1 << 2)	/* Send FLUSH */
> +#define NBD_FLAG_SEND_FUA	(1 << 3)	/* Send FLUSH */

"Send FUA" rather than "Send FLUSH"? Also, please expand what 'FUA'
stands for in this comment.

> +#define NBD_FLAG_ROTATIONAL	(1 << 3)	/* Use elevator algorithm - 
> rotational media */

That's the same value as the _SEND_FUA thing here; should that not be
different?

>  #define NBD_DEFAULT_PORT	"10809"	/* Port on which named exports are
>  					 * served */
> diff --git a/configure.ac b/configure.ac
> index f50193f..c78177b 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -94,6 +94,9 @@ AC_CHECK_SIZEOF(unsigned int)
>  AC_CHECK_SIZEOF(unsigned long int)
>  AC_CHECK_SIZEOF(unsigned long long int)
>  AC_CHECK_FUNCS([llseek alarm gethostbyname inet_ntoa memset socket 
> strerror strstr])
> +AC_CHECK_FUNC([sync_file_range],
> +	[AC_DEFINE([HAVE_SYNC_FILE_RANGE], [sync_file_range(2) is not supported], 
> [sync_file_range(2) is supported])],
> +        [])
>  AC_FUNC_FORK
>  AC_FUNC_SETVBUF_REVERSED
>  AC_MSG_CHECKING(whether client should be built)
> diff --git a/doc/proto.txt b/doc/proto.txt
> index fe5e819..d9a269b 100644
> --- a/doc/proto.txt
> +++ b/doc/proto.txt
> @@ -26,8 +26,9 @@ server during the handshake.
>  There are two message types in the data pushing phase: the request, and
>  the response.
> 
> -There are three request types in the data pushing phase: NBD_CMD_READ,
> -NBD_CMD_WRITE, and NBD_CMD_DISC (disconnect).
> +There are four request types in the data pushing phase: NBD_CMD_READ,
> +NBD_CMD_WRITE, NBD_CMD_WRITE_FUA, NBD_CMD_DISC (disconnect), and
> +NBD_CMD_FLUSH.
> 
>  The request is sent by the client; the response by the server. A request
>  header consists a 32 bit magic number (magic), a 32 bit field denoting
> @@ -50,6 +51,15 @@ we change that to asynchronous handling, handling the 
> disconnect request
>  will probably be postponed until there are no other outstanding
>  requests.
> 
> +NBD_CMD_WRITE_FUA is the equivalent of NBD_WRITE, with an implied
> +FUA (force unit access). This is implementing using sync_file_range
> +if present, else by fdatasync() of that file (note not all files
> +in a multifile environment). NBD_CMD_WRITE_FUA will not be sent
> +unless NBD_FLAG_SEND_FUA is set.
> +
> +A flush request will not be sent unless NBD_FLAG_SEND_FLUSH is set,
> +and indicates the backing file should be fdatasync()'d to disk.
> +
>  There are two versions of the negotiation: the 'old' style (nbd <=
>  2.9.16) and the 'new' style (nbd >= 2.9.17, though due to a bug it does
>  not work with anything below 2.9.18). What follows is a description of
> diff --git a/nbd-server.c b/nbd-server.c
> index 4b9a083..88e4615 100644
> --- a/nbd-server.c
> +++ b/nbd-server.c
> @@ -55,6 +55,9 @@
>   *	Suriya Soutmun <darksolar@...17...>
>   */
> 
> +/* For sync_file_range */
> +#define _GNU_SOURCE
> +

This should perhaps be inside an #ifdef HAVE_SYNC_FILE_RANGE, too.

>  /* Includes LFS defines, which defines behaviours of some of the following
>   * headers, so must come before those */
>  #include "lfs.h"
> @@ -134,11 +137,13 @@ gboolean do_oldstyle=FALSE;
>  #define DEBUG2( a,b ) printf( a,b )
>  #define DEBUG3( a,b,c ) printf( a,b,c )
>  #define DEBUG4( a,b,c,d ) printf( a,b,c,d )
> +#define DEBUG5( a,b,c,d,e ) printf( a,b,c,d,e )
>  #else
>  #define DEBUG( a )
>  #define DEBUG2( a,b )
>  #define DEBUG3( a,b,c )
>  #define DEBUG4( a,b,c,d )
> +#define DEBUG5( a,b,c,d,e )

I should probably change these to use C99 varargs macros (we require C99
these days anyway). Oh well, -- some other time, I guess :-)

>  #endif
>  #ifndef PACKAGE_VERSION
>  #define PACKAGE_VERSION ""
> @@ -160,6 +165,9 @@ gboolean do_oldstyle=FALSE;
>  #define F_SPARSE 16	  /**< flag to tell us copyronwrite should use a 
> sparse file */
>  #define F_SDP 32	  /**< flag to tell us the export should be done using 
> the Socket Direct Protocol for RDMA */
>  #define F_SYNC 64	  /**< Whether to fsync() after a write */
> +#define F_FLUSH 128	  /**< Whether server wants FLUSH to be sent by the 
> client */
> +#define F_FUA 128	  /**< Whether server wants FUA to be sent by the client 
These should probably not be the same values?

> */
> +#define F_ROTATIONAL 128  /**< Whether server wants the client to 
> implement the elevator algorithm */
>  GHashTable *children;
>  char pidfname[256]; /**< name of our PID file */
>  char pidftemplate[256]; /**< template to be used for the filename of the 
> PID file */
> @@ -713,6 +721,9 @@ GArray* parse_cfile(gchar* f, GError** e) {
>  		{ "sparse_cow",	FALSE,	PARAM_BOOL,	NULL, F_SPARSE },
>  		{ "sdp",	FALSE,	PARAM_BOOL,	NULL, F_SDP },
>  		{ "sync",	FALSE,  PARAM_BOOL,	NULL, F_SYNC },
> +		{ "flush",	FALSE,  PARAM_BOOL,	NULL, F_FLUSH },
> +		{ "fua",	FALSE,  PARAM_BOOL,	NULL, F_FUA },
> +		{ "rotational",	FALSE,  PARAM_BOOL,	NULL, F_ROTATIONAL },
>  		{ "listenaddr", FALSE,  PARAM_STRING,   NULL, 0 },
>  		{ "maxconnections", FALSE, PARAM_INT,	NULL, 0 },
>  	};
> @@ -763,9 +774,10 @@ GArray* parse_cfile(gchar* f, GError** e) {
>  		lp[6].target=&(s.postrun);
>  		lp[7].target=lp[8].target=lp[9].target=
>  				lp[10].target=lp[11].target=
> -				lp[12].target=&(s.flags);
> -		lp[13].target=&(s.listenaddr);
> -		lp[14].target=&(s.max_connections);
> +				lp[12].target=lp[13].target=
> +				lp[14].target=lp[15].target=&(s.flags);
> +		lp[16].target=&(s.listenaddr);
> +		lp[17].target=&(s.max_connections);
> 
>  		/* After the [generic] group, start parsing exports */
>  		if(i==1) {
> @@ -1060,7 +1072,7 @@ void myseek(int handle,off_t a) {
>   * @param client The client we're serving for
>   * @return The number of bytes actually written, or -1 in case of an error
>   **/
> -ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client) {
> +ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client, int 
> fua) {
>  	int fhandle;
>  	off_t foffset;
>  	size_t maxbytes;
> @@ -1071,12 +1083,20 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len, 
> CLIENT *client) {
>  	if(maxbytes && len > maxbytes)
>  		len = maxbytes;
> 
> -	DEBUG4("(WRITE to fd %d offset %llu len %u), ", fhandle, foffset, len);
> +	DEBUG5("(WRITE to fd %d offset %llu len %u fua %d), ", fhandle, foffset, 
> len, fua);
> 
>  	myseek(fhandle, foffset);
>  	retval = write(fhandle, buf, len);
>  	if(client->server->flags & F_SYNC) {
>  		fsync(fhandle);
> +	} else if (fua) {
> +#ifdef HAVE_SYNC_FILE_RANGE
> +		sync_file_range(fhandle, foffset, len,
> +				SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
> +				SYNC_FILE_RANGE_WAIT_AFTER);
> +#else
> +		fdatasync(fhandle);
> +#endif
>  	}
>  	return retval;
>  }
> @@ -1085,10 +1105,10 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len, 
> CLIENT *client) {
>   * Call rawexpwrite repeatedly until all data has been written.
>   * @return 0 on success, nonzero on failure
>   **/
> -int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client) {
> +int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client, int 
> fua) {
>  	ssize_t ret=0;
> 
> -	while(len > 0 && (ret=rawexpwrite(a, buf, len, client)) > 0 ) {
> +	while(len > 0 && (ret=rawexpwrite(a, buf, len, client, fua)) > 0 ) {
>  		a += ret;
>  		buf += ret;
>  		len -= ret;
> @@ -1189,7 +1209,7 @@ int expread(off_t a, char *buf, size_t len, CLIENT 
> *client) {
>   * @param client The client we're going to write for.
>   * @return 0 on success, nonzero on failure
>   **/
> -int expwrite(off_t a, char *buf, size_t len, CLIENT *client) {
> +int expwrite(off_t a, char *buf, size_t len, CLIENT *client, int fua) {
>  	char pagebuf[DIFFPAGESIZE];
>  	off_t mapcnt,mapl,maph;
>  	off_t wrlen,rdlen;
> @@ -1197,7 +1217,7 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT 
> *client) {
>  	off_t offset;
> 
>  	if (!(client->server->flags & F_COPYONWRITE))
> -		return(rawexpwrite_fully(a, buf, len, client));
> +	  return(rawexpwrite_fully(a, buf, len, client, fua));

Gratuitous whitespace change?

>  	DEBUG3("Asked to write %d bytes at %llu.\n", len, (unsigned long long)a);
> 
>  	mapl=a/DIFFPAGESIZE ; maph=(a+len-1)/DIFFPAGESIZE ;
> @@ -1230,6 +1250,33 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT 
> *client) {
>  		}						
>  		len-=wrlen ; a+=wrlen ; buf+=wrlen ;
>  	}
> +	if (client->server->flags & F_SYNC) {
> +		fsync(client->difffile);
> +	} else if (fua) {
> +		/* open question: would it be cheaper to do multiple sync_file_ranges?
> +		   as we iterate through the above?
> +		 */

Good question... it might be a good idea to test this.

> +		fdatasync(client->difffile);
> +	}
> +	return 0;
> +}
> +
> +int expflush(CLIENT *client) {
> +	int fhandle;
> +	off_t foffset;
> +	size_t maxbytes;
> +	gint i;
> +
> +        if (!(client->server->flags & F_COPYONWRITE)) {
> +		return fsync(client->difffile);
> +	}
> +	
> +	for (i = 0; i < client->export->len; i++) {
> +		FILE_INFO fi = g_array_index(client->export, FILE_INFO, i);
> +		if (fsync(fi.fhandle) < 0)
> +			return -1;
> +	}
> +	
>  	return 0;
>  }
> 
> @@ -1322,6 +1369,12 @@ CLIENT* negotiate(int net, CLIENT *client, GArray* 
> servers) {
>  		err("Negotiation failed: %m");
>  	if (client->server->flags & F_READONLY)
>  		flags |= NBD_FLAG_READ_ONLY;
> +	if (client->server->flags & F_FLUSH)
> +		flags |= NBD_FLAG_SEND_FLUSH;
> +	if (client->server->flags & F_FUA)
> +		flags |= NBD_FLAG_SEND_FUA;
> +	if (client->server->flags & F_ROTATIONAL)
> +		flags |= NBD_FLAG_ROTATIONAL;
>  	if (!client->modern) {
>  		/* oldstyle */
>  		flags = htonl(flags);
> @@ -1419,7 +1472,7 @@ int mainloop(CLIENT *client) {
>  			continue;
>  		}
> 
> -		if (request.type==NBD_CMD_WRITE) {
> +		if ((request.type==NBD_CMD_WRITE) || (request.type==NBD_CMD_WRITE_FUA)) {

This makes me think that the _FUA thing should be something that can be
masked away; that way, we can do stuff like

int type = request.type & ~NBD_CMD_FLAG_FUA;
if (type == NBD_CMD_WRITE) {
	...
}

or something similar, which seems cleaner. It would also allow it to be
applied to other request types without having to special-case too much;
e.g., there's been talk of a truncate command, which could sparsify the
backend. I guess that could also use this FUA thing.

>  			DEBUG("wr: net->buf, ");
>  			while(len > 0) {
>  				readit(client->net, buf, currlen);
> @@ -1430,11 +1483,12 @@ int mainloop(CLIENT *client) {
>  					ERROR(client, reply, EPERM);
>  					continue;
>  				}
> -				if (expwrite(request.from, buf, len, client)) {
> +				if (expwrite(request.from, buf, len, client,
> +					     (request.type==NBD_CMD_WRITE_FUA)?1:0)) {

... and here you could then use

request.type & NBD_CMD_FLAG_FUA

rather than the tristate.

>  					DEBUG("Write failed: %m" );
>  					ERROR(client, reply, errno);
>  					continue;
> -				}
> +				}	

Another gratuitous whitespace change.

>  				SEND(client->net, reply);
>  				DEBUG("OK!\n");
>  				len -= currlen;
> @@ -1442,6 +1496,19 @@ int mainloop(CLIENT *client) {
>  			}
>  			continue;
>  		}
> +
> +		if (request.type==NBD_CMD_FLUSH) {
> +			DEBUG("fl: ");
> +			if (expflush(client)) {
> +				DEBUG("Flush failed: %m");
> +				ERROR(client, reply, errno);
> +				continue;
> +			}
> +			SEND(client->net, reply);
> +			DEBUG("OK!\n");
> +			continue;
> +		}
> +
>  		/* READ */
> 
>  		DEBUG("exp->buf, ");
> diff --git a/nbd.h b/nbd.h
> index 451f50c..05488b0 100644
> --- a/nbd.h
> +++ b/nbd.h
> @@ -31,7 +31,9 @@
>  enum {
>  	NBD_CMD_READ = 0,
>  	NBD_CMD_WRITE = 1,
> -	NBD_CMD_DISC = 2
> +	NBD_CMD_DISC = 2,
> +	NBD_CMD_FLUSH = 3,
> +	NBD_CMD_WRITE_FUA = 4
>  };

This file is taken from the kernel, so should probably be patched there
rather than here (though of course it would be hard to compile-test
without these bits).

Please also document the new config file options in nbd-server.5.in.sgml.

Other than that, seems fine to me.

-- 
The volume of a pizza of thickness a and radius z can be described by
the following formula:

pi zz a



Reply to: