[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

[Nbd] [RFC] [PATCH] support FUA and FLUSH



Enclosed is a patch (COMPILE TESTED ONLY - DO NOT RUN ON LIVE SYSTEMS) which
supports FUA (force unit access) and FLUSH.

FUA means "force the current write to hit the media service" and FLUSH
means "empty the current write queue to disk". Broadly they have the same
semantics as the linux kernel REQ_FLUSH and REQ_FUA. FUA is implemented
through sync_file_range() (or if that doesn't exist, fdatasync() on the
file handle concerned), FLUSH through fdatasync() on all files. FUA and
FLUSH are selected in the config file, and set new flags bits which will
cause the client to sent FUA and FLUSH requests. The way these are
implemented is further explained in doc/proto.txt.

The purpose of this is reasonably obvious: without supporting either FUA
or FLUSH (and it's relatively easy to support both), filesystems on the
client have no way to ensure the relevant sectors have hit the disk.
The patch is implement such that the default behaviour is unchanged.

Additionally, it introduces an F_ROTATIONAL flag. This will turn off
the use of QUEUE_FLAG_NONROT in the client. QUEUE_FLAG_NONROT effectively
disables the elevator algorithm, making the algorithm merge only. That is
unhelpful where the server does not have its own elevator algorithm
or where the client elevator algorithm is neutered (e.g. writing to a
raw partition with F_SYNC with nbd-server). It's not going to be used
often where the backing store is a file.

It also incidentally fixes a bug where F_SYNC is ignored if F_COPYONWRITE
is set (not that this currently has much utility).

I have not written the client side of this yet (see comment re "compile
tested only"). What I'd like to know (before I spend too much time on it)
is whether this is a useful course to pursue, and whether I'm doing this
in a vaguely useful way. I repeat the point about this code not being
tested at all beyond compilation.

Clean version of patch at
 http://www.alex.org.uk/nbd-flush-fua-01.patch
in case my mailer mangles newlines

--
Alex Bligh

Signed-Off-By: Alex Bligh <alex@...872...>

diff --git a/cliserv.h b/cliserv.h
index 51c8bd1..63c2743 100644
--- a/cliserv.h
+++ b/cliserv.h
@@ -40,7 +40,12 @@ typedef unsigned long long u64;
#include "nbd.h"

#if NBD_LFS==1
+/* /usr/include/features.h (included from /usr/include/sys/types.h)
+   defines this when _GNU_SOURCE is defined
+ */
+#ifndef _LARGEFILE_SOURCE
#define _LARGEFILE_SOURCE
+#endif
#define _FILE_OFFSET_BITS 64
#endif

@@ -135,6 +140,9 @@ u64 ntohll(u64 a) {
/* Flags used between the client and server */
#define NBD_FLAG_HAS_FLAGS	(1 << 0)	/* Flags are there */
#define NBD_FLAG_READ_ONLY	(1 << 1)	/* Device is read-only */
+#define NBD_FLAG_SEND_FLUSH	(1 << 2)	/* Send FLUSH */
+#define NBD_FLAG_SEND_FUA	(1 << 3)	/* Send FLUSH */
+#define NBD_FLAG_ROTATIONAL (1 << 3) /* Use elevator algorithm - rotational media */

#define NBD_DEFAULT_PORT	"10809"	/* Port on which named exports are
					 * served */
diff --git a/configure.ac b/configure.ac
index f50193f..c78177b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -94,6 +94,9 @@ AC_CHECK_SIZEOF(unsigned int)
AC_CHECK_SIZEOF(unsigned long int)
AC_CHECK_SIZEOF(unsigned long long int)
AC_CHECK_FUNCS([llseek alarm gethostbyname inet_ntoa memset socket strerror strstr])
+AC_CHECK_FUNC([sync_file_range],
+ [AC_DEFINE([HAVE_SYNC_FILE_RANGE], [sync_file_range(2) is not supported], [sync_file_range(2) is supported])],
+        [])
AC_FUNC_FORK
AC_FUNC_SETVBUF_REVERSED
AC_MSG_CHECKING(whether client should be built)
diff --git a/doc/proto.txt b/doc/proto.txt
index fe5e819..d9a269b 100644
--- a/doc/proto.txt
+++ b/doc/proto.txt
@@ -26,8 +26,9 @@ server during the handshake.
There are two message types in the data pushing phase: the request, and
the response.

-There are three request types in the data pushing phase: NBD_CMD_READ,
-NBD_CMD_WRITE, and NBD_CMD_DISC (disconnect).
+There are four request types in the data pushing phase: NBD_CMD_READ,
+NBD_CMD_WRITE, NBD_CMD_WRITE_FUA, NBD_CMD_DISC (disconnect), and
+NBD_CMD_FLUSH.

The request is sent by the client; the response by the server. A request
header consists a 32 bit magic number (magic), a 32 bit field denoting
@@ -50,6 +51,15 @@ we change that to asynchronous handling, handling the disconnect request
will probably be postponed until there are no other outstanding
requests.

+NBD_CMD_WRITE_FUA is the equivalent of NBD_WRITE, with an implied
+FUA (force unit access). This is implementing using sync_file_range
+if present, else by fdatasync() of that file (note not all files
+in a multifile environment). NBD_CMD_WRITE_FUA will not be sent
+unless NBD_FLAG_SEND_FUA is set.
+
+A flush request will not be sent unless NBD_FLAG_SEND_FLUSH is set,
+and indicates the backing file should be fdatasync()'d to disk.
+
There are two versions of the negotiation: the 'old' style (nbd <=
2.9.16) and the 'new' style (nbd >= 2.9.17, though due to a bug it does
not work with anything below 2.9.18). What follows is a description of
diff --git a/nbd-server.c b/nbd-server.c
index 4b9a083..88e4615 100644
--- a/nbd-server.c
+++ b/nbd-server.c
@@ -55,6 +55,9 @@
 *	Suriya Soutmun <darksolar@...17...>
 */

+/* For sync_file_range */
+#define _GNU_SOURCE
+
/* Includes LFS defines, which defines behaviours of some of the following
 * headers, so must come before those */
#include "lfs.h"
@@ -134,11 +137,13 @@ gboolean do_oldstyle=FALSE;
#define DEBUG2( a,b ) printf( a,b )
#define DEBUG3( a,b,c ) printf( a,b,c )
#define DEBUG4( a,b,c,d ) printf( a,b,c,d )
+#define DEBUG5( a,b,c,d,e ) printf( a,b,c,d,e )
#else
#define DEBUG( a )
#define DEBUG2( a,b )
#define DEBUG3( a,b,c )
#define DEBUG4( a,b,c,d )
+#define DEBUG5( a,b,c,d,e )
#endif
#ifndef PACKAGE_VERSION
#define PACKAGE_VERSION ""
@@ -160,6 +165,9 @@ gboolean do_oldstyle=FALSE;
#define F_SPARSE 16 /**< flag to tell us copyronwrite should use a sparse file */ #define F_SDP 32 /**< flag to tell us the export should be done using the Socket Direct Protocol for RDMA */
#define F_SYNC 64	  /**< Whether to fsync() after a write */
+#define F_FLUSH 128 /**< Whether server wants FLUSH to be sent by the client */ +#define F_FUA 128 /**< Whether server wants FUA to be sent by the client */ +#define F_ROTATIONAL 128 /**< Whether server wants the client to implement the elevator algorithm */
GHashTable *children;
char pidfname[256]; /**< name of our PID file */
char pidftemplate[256]; /**< template to be used for the filename of the PID file */
@@ -713,6 +721,9 @@ GArray* parse_cfile(gchar* f, GError** e) {
		{ "sparse_cow",	FALSE,	PARAM_BOOL,	NULL, F_SPARSE },
		{ "sdp",	FALSE,	PARAM_BOOL,	NULL, F_SDP },
		{ "sync",	FALSE,  PARAM_BOOL,	NULL, F_SYNC },
+		{ "flush",	FALSE,  PARAM_BOOL,	NULL, F_FLUSH },
+		{ "fua",	FALSE,  PARAM_BOOL,	NULL, F_FUA },
+		{ "rotational",	FALSE,  PARAM_BOOL,	NULL, F_ROTATIONAL },
		{ "listenaddr", FALSE,  PARAM_STRING,   NULL, 0 },
		{ "maxconnections", FALSE, PARAM_INT,	NULL, 0 },
	};
@@ -763,9 +774,10 @@ GArray* parse_cfile(gchar* f, GError** e) {
		lp[6].target=&(s.postrun);
		lp[7].target=lp[8].target=lp[9].target=
				lp[10].target=lp[11].target=
-				lp[12].target=&(s.flags);
-		lp[13].target=&(s.listenaddr);
-		lp[14].target=&(s.max_connections);
+				lp[12].target=lp[13].target=
+				lp[14].target=lp[15].target=&(s.flags);
+		lp[16].target=&(s.listenaddr);
+		lp[17].target=&(s.max_connections);

		/* After the [generic] group, start parsing exports */
		if(i==1) {
@@ -1060,7 +1072,7 @@ void myseek(int handle,off_t a) {
 * @param client The client we're serving for
 * @return The number of bytes actually written, or -1 in case of an error
 **/
-ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client) {
+ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client, int fua) {
	int fhandle;
	off_t foffset;
	size_t maxbytes;
@@ -1071,12 +1083,20 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client) {
	if(maxbytes && len > maxbytes)
		len = maxbytes;

-	DEBUG4("(WRITE to fd %d offset %llu len %u), ", fhandle, foffset, len);
+ DEBUG5("(WRITE to fd %d offset %llu len %u fua %d), ", fhandle, foffset, len, fua);

	myseek(fhandle, foffset);
	retval = write(fhandle, buf, len);
	if(client->server->flags & F_SYNC) {
		fsync(fhandle);
+	} else if (fua) {
+#ifdef HAVE_SYNC_FILE_RANGE
+		sync_file_range(fhandle, foffset, len,
+				SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
+				SYNC_FILE_RANGE_WAIT_AFTER);
+#else
+		fdatasync(fhandle);
+#endif
	}
	return retval;
}
@@ -1085,10 +1105,10 @@ ssize_t rawexpwrite(off_t a, char *buf, size_t len, CLIENT *client) {
 * Call rawexpwrite repeatedly until all data has been written.
 * @return 0 on success, nonzero on failure
 **/
-int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client) {
+int rawexpwrite_fully(off_t a, char *buf, size_t len, CLIENT *client, int fua) {
	ssize_t ret=0;

-	while(len > 0 && (ret=rawexpwrite(a, buf, len, client)) > 0 ) {
+	while(len > 0 && (ret=rawexpwrite(a, buf, len, client, fua)) > 0 ) {
		a += ret;
		buf += ret;
		len -= ret;
@@ -1189,7 +1209,7 @@ int expread(off_t a, char *buf, size_t len, CLIENT *client) {
 * @param client The client we're going to write for.
 * @return 0 on success, nonzero on failure
 **/
-int expwrite(off_t a, char *buf, size_t len, CLIENT *client) {
+int expwrite(off_t a, char *buf, size_t len, CLIENT *client, int fua) {
	char pagebuf[DIFFPAGESIZE];
	off_t mapcnt,mapl,maph;
	off_t wrlen,rdlen;
@@ -1197,7 +1217,7 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT *client) {
	off_t offset;

	if (!(client->server->flags & F_COPYONWRITE))
-		return(rawexpwrite_fully(a, buf, len, client));
+	  return(rawexpwrite_fully(a, buf, len, client, fua));
	DEBUG3("Asked to write %d bytes at %llu.\n", len, (unsigned long long)a);

	mapl=a/DIFFPAGESIZE ; maph=(a+len-1)/DIFFPAGESIZE ;
@@ -1230,6 +1250,33 @@ int expwrite(off_t a, char *buf, size_t len, CLIENT *client) {
		}						
		len-=wrlen ; a+=wrlen ; buf+=wrlen ;
	}
+	if (client->server->flags & F_SYNC) {
+		fsync(client->difffile);
+	} else if (fua) {
+		/* open question: would it be cheaper to do multiple sync_file_ranges?
+		   as we iterate through the above?
+		 */
+		fdatasync(client->difffile);
+	}
+	return 0;
+}
+
+int expflush(CLIENT *client) {
+	int fhandle;
+	off_t foffset;
+	size_t maxbytes;
+	gint i;
+
+        if (!(client->server->flags & F_COPYONWRITE)) {
+		return fsync(client->difffile);
+	}
+	
+	for (i = 0; i < client->export->len; i++) {
+		FILE_INFO fi = g_array_index(client->export, FILE_INFO, i);
+		if (fsync(fi.fhandle) < 0)
+			return -1;
+	}
+	
	return 0;
}

@@ -1322,6 +1369,12 @@ CLIENT* negotiate(int net, CLIENT *client, GArray* servers) {
		err("Negotiation failed: %m");
	if (client->server->flags & F_READONLY)
		flags |= NBD_FLAG_READ_ONLY;
+	if (client->server->flags & F_FLUSH)
+		flags |= NBD_FLAG_SEND_FLUSH;
+	if (client->server->flags & F_FUA)
+		flags |= NBD_FLAG_SEND_FUA;
+	if (client->server->flags & F_ROTATIONAL)
+		flags |= NBD_FLAG_ROTATIONAL;
	if (!client->modern) {
		/* oldstyle */
		flags = htonl(flags);
@@ -1419,7 +1472,7 @@ int mainloop(CLIENT *client) {
			continue;
		}

-		if (request.type==NBD_CMD_WRITE) {
+		if ((request.type==NBD_CMD_WRITE) || (request.type==NBD_CMD_WRITE_FUA)) {
			DEBUG("wr: net->buf, ");
			while(len > 0) {
				readit(client->net, buf, currlen);
@@ -1430,11 +1483,12 @@ int mainloop(CLIENT *client) {
					ERROR(client, reply, EPERM);
					continue;
				}
-				if (expwrite(request.from, buf, len, client)) {
+				if (expwrite(request.from, buf, len, client,
+					     (request.type==NBD_CMD_WRITE_FUA)?1:0)) {
					DEBUG("Write failed: %m" );
					ERROR(client, reply, errno);
					continue;
-				}
+				}	
				SEND(client->net, reply);
				DEBUG("OK!\n");
				len -= currlen;
@@ -1442,6 +1496,19 @@ int mainloop(CLIENT *client) {
			}
			continue;
		}
+
+		if (request.type==NBD_CMD_FLUSH) {
+			DEBUG("fl: ");
+			if (expflush(client)) {
+				DEBUG("Flush failed: %m");
+				ERROR(client, reply, errno);
+				continue;
+			}
+			SEND(client->net, reply);
+			DEBUG("OK!\n");
+			continue;
+		}
+
		/* READ */

		DEBUG("exp->buf, ");
diff --git a/nbd.h b/nbd.h
index 451f50c..05488b0 100644
--- a/nbd.h
+++ b/nbd.h
@@ -31,7 +31,9 @@
enum {
	NBD_CMD_READ = 0,
	NBD_CMD_WRITE = 1,
-	NBD_CMD_DISC = 2
+	NBD_CMD_DISC = 2,
+	NBD_CMD_FLUSH = 3,
+	NBD_CMD_WRITE_FUA = 4
};

#define nbd_cmd(req) ((req)->cmd[0])




Reply to: