
Re: [Nbd] nbd-server hanging issue



----- Original Message ----- 
From: "Wouter Verhelst" <wouter@...3...>
To: "JaniD++" <djani22@...60...>
Cc: <nbd-general@lists.sourceforge.net>
Sent: Friday, April 14, 2006 5:38 PM
Subject: Re: [Nbd] nbd-server hanging issue


> On Thu, Apr 13, 2006 at 09:52:30AM +0200, JaniD++ wrote:
> > Hello, list
> >
> > I use the 2.8.4 server, and the -persist capable client.
> > Today, one of my nodes hangs, and i cannot kill the server!
>
> Err, that shouldn't happen. Ever.
>
> > [root@...83... root]# killall -KILL nbd-server
> > [root@...83... root]# ps fax | grep nbd
> > 27330 pts/2    S      0:00          \_ grep nbd
> > 17001 ?        D<   230:27 nbd-server 1230 /dev/md0 2097000
> > 26959 ?        D<     0:00 nbd-server 1230 /dev/md0 2097000
> > [root@...83... root]# killall -KILL nbd-server
> > [root@...83... root]# killall -KILL nbd-server
> > [root@...83... root]# killall -KILL nbd-server
> > [root@...83... root]# killall -KILL nbd-server
> > [root@...83... root]# ps fax | grep nbd
> > 27336 pts/2    S      0:00          \_ grep nbd
> > 17001 ?        D<   230:27 nbd-server 1230 /dev/md0 2097000
> > 26959 ?        D<     0:00 nbd-server 1230 /dev/md0 2097000
> > [root@...83... root]#
> >
> > No dmesg message, just hang.
> > (I know this caused by some garbage from the network, but i cannot
> > reproduce...)
>
> Which obviously is going to make debugging rather hard. Still, I'd like
> to try to reproduce this; could you send me a bit more information on
> how this problem occurred that might help me out?

OK, I'll try.

At this point my client is completely down, and nobody is using the server.

The issue is basically in the new e1000 driver (7.0.? series) on my client.
It sometimes generates garbage on the network.

dmesg:
nfs: server 192.168.2.1 not responding, still trying
e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <ac>
  TDT                  <9a>
  next_to_use          <9a>
  next_to_clean        <ab>
buffer_info[next_to_clean]
  time_stamp           <4bcea2>
  next_to_watch        <af>
  jiffies              <4bd086>
  next_to_watch.status <0>
NETDEV WATCHDOG: eth2: transmit timed out
e1000: eth2: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
nfs: server 192.168.2.1 not responding, still trying
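
To read the register dump above more easily, here is a minimal Python sketch that pulls the `<...>` values out of the e1000 message and works out how many jiffies had passed since the stuck descriptor was time-stamped. The function name `parse_tx_hang` is made up for illustration, and the values are treated as hexadecimal (they contain digits such as `ac` and `9a`). Converting jiffies to seconds needs the kernel's HZ setting, which is not stated in this thread, so the sketch stops at the raw difference.

```python
import re

# Matches lines such as "  TDH                  <ac>": a field name,
# whitespace, then a hexadecimal value in angle brackets.
FIELD_RE = re.compile(r"^\s*(.+?)\s+<([0-9a-fA-F]+)>\s*$")


def parse_tx_hang(dump: str) -> dict:
    """Return {field name: integer value} for every '<hex>' line in the dump."""
    fields = {}
    for line in dump.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


# The dump as posted in this message.
dump = """\
e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <ac>
  TDT                  <9a>
  next_to_use          <9a>
  next_to_clean        <ab>
buffer_info[next_to_clean]
  time_stamp           <4bcea2>
  next_to_watch        <af>
  jiffies              <4bd086>
  next_to_watch.status <0>
"""

fields = parse_tx_hang(dump)
# Jiffies elapsed between the descriptor's time stamp and the hang report.
stalled_jiffies = fields["jiffies"] - fields["time_stamp"]
print("TDH=%d TDT=%d stalled for %d jiffies"
      % (fields["TDH"], fields["TDT"], stalled_jiffies))
```
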

When this happens, some nodes get errors, dropped packets, and bad frames.

On the server side (on 4/4 nodes) there is the same driver, but a different
version of the e1000 chip, and the issue has not appeared there.
The server (node) kernel is 2.6.16.1.

Next time I get this, I will use SysRq+d to dump the debug info. ;-)
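
A note on the `ps fax` output quoted above: the `D` in the STAT column means uninterruptible sleep (usually waiting on I/O), and `<` marks a high-priority process. A process in state `D` does not act on signals, SIGKILL included, until it leaves that state, which is why the repeated `killall -KILL` had no visible effect. The following is only a rough Python sketch for spotting such processes in saved `ps` output. The function name `uninterruptible` is hypothetical, and the column layout (PID, TTY, STAT, TIME, COMMAND) is assumed to match the transcript in this thread.

```python
def uninterruptible(ps_output: str):
    """Return (pid, stat, command) for each process whose STAT starts with 'D'."""
    stuck = []
    for line in ps_output.splitlines():
        # Assumed columns: PID TTY STAT TIME COMMAND (COMMAND may contain spaces).
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip headers, shell prompts, and blank lines
        pid, _tty, stat, _time, command = fields
        if stat.startswith("D"):
            stuck.append((int(pid), stat, command.strip()))
    return stuck


# Process lines copied from the transcript in this thread.
ps_output = r"""
27330 pts/2    S      0:00          \_ grep nbd
17001 ?        D<   230:27 nbd-server 1230 /dev/md0 2097000
26959 ?        D<     0:00 nbd-server 1230 /dev/md0 2097000
"""

stuck = uninterruptible(ps_output)
for pid, stat, command in stuck:
    print(pid, stat, command)
```

On this transcript it picks out both nbd-server processes and skips the `grep` line.
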

Cheers,
Janos

>
> If the system is still running in this broken way, could you run
> "netstat -t" and see whether that returns any high numbers in either the
> "Recv-Q" or "Send-Q" columns?
>
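
The check Wouter asks for here can also be done over saved `netstat -t` output. A minimal Python sketch follows, assuming the usual column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The function name `busy_sockets`, the `threshold` parameter, and the sample host names and queue sizes are all invented for illustration and are not taken from the affected machine.

```python
def busy_sockets(netstat_output: str, threshold: int = 0):
    """Return (local, remote, recv_q, send_q) for TCP sockets with queued data."""
    busy = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only lines that start with "tcp" describe connections;
        # this skips the two header lines.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        if recv_q > threshold or send_q > threshold:
            busy.append((fields[3], fields[4], recv_q, send_q))
    return busy


# Hypothetical sample, NOT output from the machine discussed in this thread.
sample = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 node1:1230              client:33012            ESTABLISHED
tcp    65160      0 node1:1230              client:33014            ESTABLISHED
"""

for local, remote, recv_q, send_q in busy_sockets(sample):
    print("%s <- %s Recv-Q=%d Send-Q=%d" % (local, remote, recv_q, send_q))
```

A Recv-Q that stays large on the server's port would suggest that data has arrived but the server process is not reading it.
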
> -- 
> Fun will now commence
>   -- Seven Of Nine, "Ashes to Ashes", stardate 53679.4
>
>



