
[Nbd] nbd-server working easily in cygwin in XP



For convenience, I made a version that works with Cygwin in XP and put it at
<ftp://ftp.sonic.net/pub/users/qm/nbd/nbd-2.9.11-cygwin.tar.gz>.
Everything I did is described in README.cygwin, which I have appended to this message,
so if you have this message, you don't need the tarball (though it makes things simpler).

I believe the fix is for a simple bug, either in Cygwin or in nbd.  Someone
with knowledge of the area my bug/hack fix touches should certainly fix it properly.
My fix is not generalized, and I haven't yet tested the changed nbd-server
on Linux/Unix to see whether it actually breaks anything there (although, of course,
my nbd-client was running on Linux).

-----

Brad Allen <Ulmo@...206...>

These are the changes I made to make nbd-server work under Cygwin in
XP.  It was quite easy and took very little: all it needed was nbd.h
and fcntl(...&~O_NONBLOCK).

The steps below are separated with the following line:

======================================================

0.  Obviously, every time ./configure says something is missing, go get it!
    I have a healthy Cygwin installation (meaning a lot of packages are
    present).  If you don't know enough to select the right Cygwin
    packages, then you are brave indeed!  To such a brave soul: go and
    install EVERY Cygwin package that might be used, and you'll be
    just fine.  (I don't; I have no room, and no time for all the insane
    conflicts, although I've never actually seen any, so maybe there
    aren't any.)

    Be sure to include glib2-devel, the GLib development package.  They
    say nbd needs it, and indeed it does (I tried without it, since I
    have a no-GNOME policy).

======================================================

1.  "./configure --prefix=/usr/local"  failed with the following:

    checking where to find a working nbd.h... configure: error: Could not find an nbd.h from 2.6 or above.

    So I took nbd.h from a recent kernel, put it in this directory,
    and commented out all of its #include statements (with them left
    in, it failed to compile).  I put an example at nbd.h-cygwin
    (taken from linux-2.6.26/include/linux/nbd.h); rename or
    symlink it to nbd.h.

    Here's a patch for commenting out the includes if you get a fresh copy.

--- /usr/src/linux/include/linux/nbd.h  2008-07-13 14:51:29.000000000 -0700
+++ nbd.h       2008-07-18 12:21:42.187500000 -0700
@@ -15,7 +15,7 @@
 #ifndef LINUX_NBD_H
 #define LINUX_NBD_H

-#include <linux/types.h>
+/*#include <linux/types.h>*/

 #define NBD_SET_SOCK   _IO( 0xab, 0 )
 #define NBD_SET_BLKSIZE        _IO( 0xab, 1 )
@@ -39,8 +39,8 @@
 /* userspace doesn't need the nbd_device structure */
 #ifdef __KERNEL__

-#include <linux/wait.h>
-#include <linux/mutex.h>
+/*#include <linux/wait.h>*/
+/*#include <linux/mutex.h>*/

 /* values for flags field */
 #define NBD_READ_ONLY 0x0001

======================================================

2.  I commented out the parts of "configure" that didn't work without
    the above nbd.h, or that didn't work with it while its #includes
    were still in.  I don't know how to program "configure.in"; as you
    can guess, my ability to fix and enhance programs has been on
    hiatus for the last 15 years.

    But once I got the local nbd.h to work, I didn't need to change
    configure any more.  So you can skip this step as long as your
    nbd.h works as described above.

======================================================

3.  I found out nbd-server disconnected whenever nbd-client connected to it.

    This is irrelevant to the bugs and my solutions, but just so you
    know that it works and is not an issue: during my tests, the file I
    served was at first a normal file on an NTFS filesystem.  Once that
    worked, I gave it its own partition (which was my goal all along,
    thanks to the kindness of "lvm pvmove"), as you can see in the
    example debugging output.  Note that Cygwin maps primary and
    logical/extended partitions alike to /dev/sd[a-l][0-15] or so, and
    they all work fine, except that they won't work with partition type
    8e; I had to change the partition type to 83 to make it work.
    (Originally I thought Cygwin might only look at primary partitions,
    but that's not true; they can definitely be logical.  Only the
    partition TYPE code matters, probably to XP, which then passes its
    problems on to Cygwin.)  Since Linux doesn't pay attention to the
    partition type, this was a trivial decision for me.  (The partition
    actually contains LUKS data, so I don't care at all anyway.)

    Here's a sample with DODBG=1 and NOFORK=1 in config.h:

Waiting for connections... bind, listen, accept, ** Message: connect from 192.168.1.102, assigned file is /dev/sda3
** Message: Authorized client
** Message: Starting to serve
Opening /dev/sda3
looking for fhandle size with fstat
looking for fhandle size with lseek SEEK_END
** Message: Size of exported file/device is 7641252864
Entering request loop!
1: *Error: Read failed: Resource temporarily unavailable

    (Note that the output above was produced WITH the step 5 patch
    applied, which prints the device size correctly.  Without it, the
    file/device size would be reported as 3346285568, exactly 2^32 too
    small, which made me think the error might be size-related.  It
    wasn't, as I thoroughly confirmed: nbd-server on Cygwin in XP
    handles 7.5G file sizes fine.)
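    The "Resource temporarily unavailable" in the log above is the
    message for EAGAIN.  The sketch below is only an illustration, not
    part of nbd-server: it reproduces that failure in isolation.  A
    socketpair() stands in for the accepted client connection (an
    assumption so the example runs on its own), one end is marked
    O_NONBLOCK, and a read() with no data pending fails immediately
    with EAGAIN instead of waiting.  The name nonblocking_read_errno()
    is hypothetical.

```c
/* Illustration only: read() on an empty O_NONBLOCK socket fails with
 * EAGAIN instead of blocking.  A socketpair() stands in for the
 * accepted nbd client connection (an assumption for this sketch). */
#include <assert.h>   /* used by the usage check below */
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno left by read() on an empty
 * non-blocking socket, 0 if read() did not fail, or -1 on setup error. */
int nonblocking_read_errno(void)
{
    int sv[2];
    char byte;
    int result = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    int flags = fcntl(sv[0], F_GETFL, 0);
    if (flags == -1 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        result = -1;
    } else if (read(sv[0], &byte, 1) == -1) {
        /* Nothing was written to sv[1], so no data is available. */
        result = errno;        /* save errno before close() can change it */
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

    For example, assert(nonblocking_read_errno() == EAGAIN) holds on
    Linux, where EWOULDBLOCK has the same value as EAGAIN.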


    The bug shows up at the following two locations.  I tracked it down
    with Google: the error was "Resource temporarily unavailable", and
    a search showed that this is the message for EAGAIN.  Without
    Google, I would have found the #define for EAGAIN and its perror
    message in /usr/include myself, and it would have gone faster,
    which shows that Google is actually less efficient than doing
    things the right way.  Anyway, EAGAIN means the operation would
    have blocked and you ought to try again.  What Google was good for
    was explaining why: the socket had non-blocking I/O (O_NONBLOCK)
    set.  Step 4 shows the proper fix.  You shouldn't need this
    goto-loop patchlet; I include it for your debugging and programming
    enjoyment.  I'm not against gotos; I'm against this unnecessary
    hack, which I used only until I found the real problem, especially
    because of its CPU impact: with the following patch alone,
    nbd-server did work, but it used 100% CPU polling instead of
    waiting properly.  So, don't apply this patch.

diff -Nrup nbd-2.9.11/nbd-server.c nbd-2.9.11.ulmo/nbd-server.c
--- nbd-2.9.11/nbd-server.c     2008-05-01 12:04:44.000000000 -0700
+++ nbd-2.9.11.ulmo/nbd-server.c        2008-07-18 15:04:39.687500000 -0700
@@ -304,8 +304,14 @@ inline void readit(int f, void *buf, siz
        ssize_t res;
        while (len > 0) {
                DEBUG("*");
-               if ((res = read(f, buf, len)) <= 0)
+               loop:
+               if ((res = read(f, buf, len)) <= 0) {
+                       if (errno==EAGAIN) {
+                               DEBUG("[");
+                               goto loop;
+                       }
                        err("Read failed: %m");
+               }
                len -= res;
                buf += res;
        }
@@ -322,8 +328,14 @@ inline void writeit(int f, void *buf, si
        ssize_t res;
        while (len > 0) {
                DEBUG("+");
-               if ((res = write(f, buf, len)) <= 0)
+               loop:
+               if ((res = write(f, buf, len)) <= 0) {
+                       if (errno==EAGAIN) {
+                               DEBUG("]");
+                               goto loop;
+                       }
                        err("Send failed: %m");
+               }
                len -= res;
                buf += res;
        }

As you can see, the patch above adds both debugging output and a fix
based on my theory of the bug.  The theory, as the following debugging
output shows, was correct.  The real output contained many more '['
and ']' characters; I edited most of them out and replaced each deleted
run with a count of the characters removed (and there would have been
many, many more if my sample command sequence hadn't run so quickly).
I left some of the shorter runs intact.  On the client host, I ran this
simple command sequence to produce the output:

    /usr/sbin/nbd-client 192.168.bla.bla 1000 /dev/nbd0
    dd if=/dev/nbd0|hexdump -C|head -30
    dd if=/dev/nbd0 of=/tmp/save count=1
    dd if=/etc/passwd of=/dev/nbd0 count=1 conv=notrunc
    sync
    sleep 1
    dd if=/tmp/save of=/dev/nbd0 count=1 conv=notrunc
    sync
    nbd-client -d /dev/nbd0

Note that the "sleep 1" was necessary to get Linux to actually write the
data out after the sync, which was also necessary.  (The device size
shown is correct: that's my bigger partition.  It works great with the
bug fix in step 4!)

Waiting for connections... bind, listen, Waiting for connections... bind, listen, Waiting for connections... bind, listen, accept, ** Message: connect from 192.168.1.102, assigned file is /dev/sda4
** Message: Authorized client
** Message: Starting to serve
Opening /dev/sda4
looking for fhandle size with fstat
looking for fhandle size with lseek SEEK_END
** Message: Size of exported file/device is 35082608640
Entering request loop!
1: *[[[[[[[[[[[[[[[[[[...1150 '['...][[[[[[[[[[[[[[[READ from 0 (0) len 0, exp->buf, (READ from fd 7 offset 0 len 0), buf->net, +OK!
2: *[[[[[[[...400 '['...[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[READ from 16384 (0) len 32, exp->buf, (READ from fd 7 offset 16384 len 0), buf->net, +OK!
3: *[[[[[[[...604 '['...[[[[[[[[[[[[[[[[[[READ from 114688 (0) len 224, exp->buf, (READ from fd 7 offset 114688 len 0), buf->net, +OK!
4: *READ from 244736 (0) len 478, exp->buf, (READ from fd 7 offset 244736 len 0), buf->net, +]]]]]...419 ']'...]]]]]]]]]]]]]]OK!
5: *[[[[[[[[[[[[...1550 '['...[[[[[[[[[[[[[[[[[[[[[[[[[[WRITE from 0 (0) len 0, wr: net->buf, *buf->exp, (WRITE to fd 7 offset 0 len 0), +OK!
6: *[[[[[[...53290 '['...[[[[[[[[[[[WRITE from 0 (0) len 0, wr: net->buf, *buf->exp, (WRITE to fd 7 offset 0 len 0), +OK!
7: *[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[** Message: Disconnect request received.


As you can see, now it is working better.  The client also worked.

Note that with NOFORK turned on, nbd-server won't handle multiple
clients (I tried, without thinking about it much).  On the other hand,
I'm not sure DODBG is of much use without NOFORK; making it useful
there would take some hacking.
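For comparison with the goto patchlet, which retries read() in a tight
loop and burns 100% CPU, here is a sketch of a read loop that tolerates
a non-blocking descriptor without spinning: on EAGAIN it sleeps in
poll() until the descriptor is readable.  It also treats a 0 return as
end-of-file, because errno is not set in that case (in the patchlet, a
stale EAGAIN left in errno after an EOF could loop forever).  This is a
swapped-in technique, not what nbd-server does; read_fully() and
read_fully_demo() are hypothetical names, and the proper fix remains
making the socket blocking, as in step 4.

```c
/* Sketch, not nbd-server code: read exactly len bytes from fd, waiting in
 * poll() whenever a non-blocking descriptor reports EAGAIN/EWOULDBLOCK. */
#include <assert.h>   /* used by the usage check below */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 once len bytes have been read, -1 on error or early EOF. */
int read_fully(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t res = read(fd, p, len);
        if (res > 0) {
            p += res;
            len -= (size_t)res;
        } else if (res == 0) {
            return -1;          /* peer closed the connection; errno unset */
        } else if (errno == EINTR) {
            continue;           /* interrupted by a signal: retry the read */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Block in the kernel until fd is readable instead of spinning. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            if (poll(&pfd, 1, -1) == -1 && errno != EINTR)
                return -1;
        } else {
            return -1;          /* real I/O error; errno describes it */
        }
    }
    return 0;
}

/* Usage sketch: a non-blocking socketpair stands in for the client socket.
 * Returns 1 if a full read succeeds and a read after close() reports EOF. */
int read_fully_demo(void)
{
    int sv[2];
    char got[5];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 0;
    int flags = fcntl(sv[0], F_GETFL, 0);
    if (flags == -1 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(sv[0]);
        close(sv[1]);
        return 0;
    }

    int ok = write(sv[1], "hello", 5) == 5
          && read_fully(sv[0], got, sizeof got) == 0
          && memcmp(got, "hello", 5) == 0;

    close(sv[1]);                               /* now only EOF remains */
    ok = ok && read_fully(sv[0], got, 1) == -1;
    close(sv[0]);
    return ok;
}
```

Calling assert(read_fully_demo() == 1) exercises both the data path and
the EOF path.  Waiting in poll() lets the process sleep until data
arrives, which is what a blocking read() would have done in the first
place.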

======================================================

4.  This is the main patch to fix the bug.  It's amazingly simple, and
    it meant that the nap I took to figure it out was very worthwhile.
    What a nap!  Many, many dreams.  It took me a while, though (a
    couple of hours).

    So why is this needed?  Does Cygwin's accept() carry O_NONBLOCK over
    from the listening socket to the newly connected one, whereas Linux
    returns a blocking socket?  Either way, this change was necessary on
    Cygwin and not on Linux.

diff -Nrup nbd-2.9.11/nbd-server.c nbd-2.9.11.ulmo/nbd-server.c
--- nbd-2.9.11/nbd-server.c     2008-05-01 12:04:44.000000000 -0700
+++ nbd-2.9.11.ulmo/nbd-server.c        2008-07-18 15:04:39.687500000 -0700
@@ -1059,6 +1071,16 @@ void negotiate(CLIENT *client) {
        char zeros[128];
        u64 size_host;
        u32 flags = NBD_FLAG_HAS_FLAGS;
+       int sock_flags;
+
+       /* make the socket blocking */
+       if ((sock_flags = fcntl(client->net, F_GETFL, 0)) == -1) {
+               err("fcntl F_GETFL");
+       }
+       if (fcntl(client->net, F_SETFL, sock_flags &~O_NONBLOCK) == -1) {
+               err("fcntl F_SETFL ~O_NONBLOCK");
+       }
+

        memset(zeros, '\0', sizeof(zeros));
        if (write(client->net, INIT_PASSWD, 8) < 0)

Debugging output for the above sample run becomes:

Waiting for connections... bind, listen, Waiting for connections... bind, listen, Waiting for connections... bind, listen, accept, ** Message: connect from 192.168.1.102, assigned file is /dev/sda4
** Message: Authorized client
** Message: Starting to serve
Opening /dev/sda4
looking for fhandle size with fstat
looking for fhandle size with lseek SEEK_END
** Message: Size of exported file/device is 35082608640
Entering request loop!
1: *READ from 0 (0) len 0, exp->buf, (READ from fd 7 offset 0 len 0), buf->net, +OK!
2: *READ from 16384 (0) len 32, exp->buf, (READ from fd 7 offset 16384 len 0), buf->net, +OK!
3: *READ from 114688 (0) len 224, exp->buf, (READ from fd 7 offset 114688 len 0), buf->net, +OK!
4: *READ from 244736 (0) len 478, exp->buf, (READ from fd 7 offset 244736 len 0), buf->net, +OK!
5: *WRITE from 0 (0) len 0, wr: net->buf, *buf->exp, (WRITE to fd 7 offset 0 len 0), +OK!
6: *WRITE from 0 (0) len 0, wr: net->buf, *buf->exp, (WRITE to fd 7 offset 0 len 0), +OK!
7: *** Message: Disconnect request received.

And, now it works fine.
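The step 4 change can be factored into a small helper.  This is a
sketch, not code from nbd-server; set_blocking() and set_blocking_demo()
are hypothetical names.  It reads the file status flags with F_GETFL and
writes them back with only O_NONBLOCK cleared, as the patch does at the
start of negotiate().  On the question of why Cygwin needs this: the
Linux accept(2) manual page states that on Linux the socket returned by
accept() does not inherit file status flags such as O_NONBLOCK from the
listening socket, that this differs from the canonical BSD sockets
implementation, and that portable programs should explicitly set all
required flags on the accepted socket.  If Cygwin follows the BSD
behavior with a non-blocking listening socket, that would account for
the symptom, but this is an assumption, not something confirmed here.

```c
/* Sketch: clear O_NONBLOCK on a descriptor, as the step 4 patch does
 * for client->net, while preserving its other file status flags. */
#include <assert.h>   /* used by the usage check below */
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 on success, -1 (with errno set by fcntl) on failure. */
int set_blocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    if (!(flags & O_NONBLOCK))
        return 0;                       /* already blocking */
    return fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) == -1 ? -1 : 0;
}

/* Usage sketch: make one end of a socketpair non-blocking, then undo it.
 * Returns 1 if O_NONBLOCK is clear afterwards, 0 otherwise. */
int set_blocking_demo(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 0;

    int flags = fcntl(sv[0], F_GETFL, 0);
    int ok = flags != -1
          && fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) != -1
          && (fcntl(sv[0], F_GETFL, 0) & O_NONBLOCK)     /* now set */
          && set_blocking(sv[0]) == 0;
    if (ok) {
        int after = fcntl(sv[0], F_GETFL, 0);
        ok = after != -1 && !(after & O_NONBLOCK);
    }

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Following the accept(2) advice, such a call would go right after
accept() returns the client socket, before any data is exchanged; in
the patch above, that corresponds to the top of negotiate(), before the
first write() to the client.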

======================================================

5.  On Cygwin in XP, a big 7.5G file worked fine, but when I had
    DODBG defined to 1 (in config.h) and it printed the size, the
    printed size was exactly 2^32 too small, so I assume the printed
    value just wraps at 2^32.  To rule that out as the problem, besides
    testing smaller pieces and such, I first put in a simple test to
    check that the variable itself was at least correct.  Then I
    improved it a bit so that it actually prints the right value.
    You'll note my beautifully written new function, strnm(), which
    formats any integer as a decimal string in a static buffer.  Now,
    when debugging, it prints the right size (in my case, 7641252864).
    It also works correctly with another device of size 35082608640.
    The 7641252864-byte one was originally a 7.5GB file on an NTFS
    filesystem, but I shrank it and moved it to its own partition.
    They all worked fine.

--- nbd-2.9.11/nbd-server.c.~1~ 2008-05-01 12:04:44.000000000 -0700
+++ nbd-2.9.11/nbd-server.c     2008-07-20 09:17:27.093750000 -0700
@@ -239,6 +239,12 @@
                                  is PARAM_BOOL. */
 } PARAM;

+typedef unsigned long long big_t; /* any int type works (e.g., signed) */
+#define MXSTRNMSZ (21) /* max signed long long is -9223372036854775808\0 */
+char *strnm(big_t x){static char s[MXSTRNMSZ],*c;register big_t y,z;c=s;for(
+  z=1,y=x;y/=10;)z*=10;y=x/z;x-=y*z;if(y<0){*c++='-';x*=-1;y*=-1;}*c++='0'+y;
+  z/=10;while(z){*c++='0'+(y=x/z);x-=y*z;z/=10;}*c='\0';return s;}
+
 /**
  * Check whether a client is allowed to connect. Works with an authorization
  * file which contains one line per machine, no wildcards.
@@ -1257,7 +1263,7 @@
                client->exportsize = client->server->expected_size;
        }

-       msg3(LOG_INFO, "Size of exported file/device is %Lu", (unsigned long long)client->exportsize);
+       msg3(LOG_INFO, "Size of exported file/device is %s", (strnm((unsigned long long)client->exportsize)));
        if(multifile) {
                msg3(LOG_INFO, "Total number of files: %d", i);
        }
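
A likely cause of the wrapped size, offered here as an assumption: the
original line prints an unsigned long long with %Lu.  The C standard
defines the L length modifier only for floating-point conversions (long
double); for unsigned long long the standard modifier is ll, as in
%llu, and for a uint64_t the <inttypes.h> macro PRIu64 expands to the
right conversion.  A formatter that does not treat %Lu as %llu could
consume only the low 32 bits of the argument, which would be consistent
with the value printing exactly 2^32 too small.  The sketch below uses
hypothetical names and is not a tested nbd-server patch: size_str() is
a standard-printf replacement for strnm() (also returning a static
buffer), and low_32_bits() shows the truncation.

```c
/* Sketch: print a 64-bit size portably with PRIu64 instead of %Lu. */
#include <assert.h>   /* used by the usage check below */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical replacement for strnm(): formats size into a static
 * buffer (20 digits + NUL covers any 64-bit unsigned value). */
const char *size_str(uint64_t size)
{
    static char buf[21];
    snprintf(buf, sizeof buf, "%" PRIu64, size);
    return buf;
}

/* What a formatter that consumed only the low 32 bits would show. */
uint32_t low_32_bits(uint64_t size)
{
    return (uint32_t)size;      /* i.e. size mod 2^32 */
}

/* Self-check against the sizes reported in this message. */
int size_str_selfcheck(void)
{
    return strcmp(size_str(UINT64_C(7641252864)), "7641252864") == 0
        && strcmp(size_str(UINT64_C(35082608640)), "35082608640") == 0
        && low_32_bits(UINT64_C(7641252864)) == UINT32_C(3346285568);
}
```

In nbd-server itself, the same idea would presumably be
msg3(LOG_INFO, "Size of exported file/device is %llu", (unsigned long long)client->exportsize);
keeping the existing cast so the argument matches %llu.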

======================================================

Please remember to read the README for configuration file options.  I
think installation ought to put the sample config file from the README
in "CONFDIR/nbd-server/config.sample", but that's just me.  I put it
there on my system.

What else?  It just seems to work.  Let me know of any changes.



