
Re: [Nbd] nbd hangs when disconnecting the network



Roy,

Thanks for the quick and insightful reply.

echo "5" > /proc/sys/net/ipv4/tcp_retries2

brought the wait down to just a few seconds.
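
For a rough sense of why a lower tcp_retries2 shortens the wait, here is a minimal sketch of the retransmission backoff arithmetic. It assumes the retransmission timeout starts at 200 ms, doubles after each attempt, and is capped at 120 s; those are the bounds used in current kernel documentation, and the real timeout is derived from the measured round-trip time, so treat the result as a hypothetical lower bound rather than what any particular kernel will do. The function name and its parameters are made up for this illustration.

```python
def estimated_timeout_seconds(retries, rto_min=0.2, rto_max=120.0):
    """Sum the first wait plus `retries` exponentially backed-off waits.

    rto_min and rto_max are assumed bounds (200 ms and 120 s), not values
    read from the running kernel.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)  # double each time, capped at rto_max
    return total


for n in (5, 15):
    print(f"tcp_retries2={n}: about {estimated_timeout_seconds(n):.1f} s")
```

Under these assumptions the default of 15 works out to about 924.6 seconds and a value of 5 to about 12.6 seconds, so the observed wait on a quiet LAN should land in the same ballpark but need not match exactly.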

My only worry is what happens if the network gets busy -- will it start dropping connections? I suppose this is probably extremely unlikely with the two machines connected to the same Gb switch. Are there any other issues with setting the retry value so low?

I don't know about being generally applicable, but it seems like any time nbd is involved in a raid device a tunable timeout parameter would be valuable. I would be happy to test any patch that might be written!

Steven

Roy Keene wrote:

You could try changing the value of /proc/sys/net/ipv4/tcp_retries2.

The problem is that nbd-client hands over control of the device to the kernel through an ioctl() call (ioctl(..., NBD_DO_IT)) and if the connection dies after that, it's that kernel code's job to notice this and return an error after it times out.

Since it's in kernel code and not in nbd-client code, we can't just set an alarm and cancel it if we get keep-alives, since we're not handling any of that.

So the only knob we can easily tune is the TCP retransmit timeout values.
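
As an illustration of turning that knob from a script, here is a small sketch that reads the current value and writes a new one. The helper names are hypothetical; only the /proc path comes from this thread. Writing requires root, the setting applies to every TCP connection on the machine (not just nbd), and it does not survive a reboot.

```python
from pathlib import Path

# Path taken from the thread; everything else here is illustrative.
TCP_RETRIES2 = Path("/proc/sys/net/ipv4/tcp_retries2")


def get_retries(path=TCP_RETRIES2):
    """Return the current retry count as an integer."""
    return int(Path(path).read_text().strip())


def set_retries(value, path=TCP_RETRIES2):
    """Write a new retry count and return the previous one.

    Returning the old value makes it easy to restore it afterwards.
    """
    if value < 1:
        raise ValueError("retry count must be at least 1")
    previous = get_retries(path)
    Path(path).write_text(f"{value}\n")
    return previous
```

Calling set_retries(5) as root would have the same effect as the echo command, and the returned value can be passed back to set_retries() to undo the change.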

Failing that, we can look at patching the kernel NBD code with a tuneable timeout parameter.

On Thu, 19 Jan 2006, Steven Yelton wrote:

I have a problem with the nbd-client hanging when the network cable is removed from the server.  Here is my setup:

storage1:
exporting raid1a
exporting raid1c

storage2:
exporting raid1b

client machine:
md0 is raid5 with nbd{0,1,2}

The raid builds and runs fine. If I kill the nbd-server on 'storage2' the raid immediately goes into a 'degraded' state (exactly as I would expect). However, if I just pull the network connection from 'storage2', md0 just hangs (even `cat /proc/mdstat` hangs). After several minutes (10, maybe) the client seems to notice the server is dead (Error: Connect: No route to host) and the raid is degraded.
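
For anyone who wants to watch for the degraded state from a script, here is a minimal sketch that parses /proc/mdstat-style text and reports arrays with missing members. The sample text is invented for illustration (including which nbd device is marked failed), and the function name is hypothetical. Since reading /proc/mdstat itself hung during the stall described above, this can only report the degradation once the kernel has given up on the connection.

```python
import re

# Hypothetical sample, not output captured from the setup in this thread.
SAMPLE_MDSTAT = """\
Personalities : [raid5]
md0 : active raid5 nbd2[2] nbd1[1](F) nbd0[0]
      2096896 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded array name to its member status string, e.g. 'U_U'."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries "[expected/active] [U_U]" for the array.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = status.group(3)
            current = None
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real system the text would come from reading /proc/mdstat, ideally in a separate process or thread so that a hung read does not block the caller.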

What can I do to decrease the time it takes for nbd-client to realize it can't get to the storage machine anymore?

Thanks in advance,
Steven




_______________________________________________
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general



