Bug#594923: marked as done (Zero writeback interval sends flush processes into busy loop)

To: Jonathan Nieder <jrnieder@gmail.com>
Subject: Bug#594923: marked as done (Zero writeback interval sends flush processes into busy loop)
From: owner@bugs.debian.org (Debian Bug Tracking System)
Date: Thu, 05 Apr 2012 20:54:08 +0000
Message-id: <handler.594923.D594923.133365917522764.ackdone@bugs.debian.org>
References: <20120405205235.GA7490@burratino> <20100829175324.5559.98564.reportbug@localnet>

Your message dated Thu, 5 Apr 2012 15:52:35 -0500
with message-id <20120405205235.GA7490@burratino>
and subject line Re: Noflushd causes flush- processes to eat all CPU
has caused the Debian Bug report #594923,
regarding Zero writeback interval sends flush processes into busy loop
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
594923: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=594923
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems

--- Begin Message ---

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: noflushd: Noflushd causes flush- processes to eat all CPU
From: Xavier Roche <xavier@debian.org>
Date: Sun, 29 Aug 2010 19:53:24 +0200
Message-id: <20100829175324.5559.98564.reportbug@localnet>

Package: noflushd
Version: 2.8-1
Severity: important

I think the problem may still be present when some monitored disks go idle automatically (or are set to do so through "hdparm -S242").

Note that the affected disks apparently do not need to have pending writes for the problem to be reproducible.

I managed to reproduce the issue after a clean reboot (and after
removing some potentially new options from the grsecurity kernel - to be
sure that this was not a possible cause) on a fresh 2.6.34.4 kernel.

I started noflushd, waited for some time, and the problem appeared again. The monitored disks are all configured to go idle after a while (using "hdparm -S242 /dev/.." at startup).

In this state, the noflushd daemon is still running (and not consuming
CPU), but the flush-* processes are:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13604 root      20   0     0    0    0 R 48.1  0.0   8:19.70 flush-34:0
13605 root      20   0     0    0    0 R 48.1  0.0   5:42.76 flush-8:0

After a while, more flush- processes appear, and the load increases.

The noflushd daemon appears to be still running (it is NOT stuck, even if
the flush-* kernel jobs are stuck), and every 5 seconds it attempts fsync() calls:

nanosleep({5, 0}, {5, 0})               = 0
time(NULL)                              = 1283100653
_llseek(5, 0, [0], SEEK_SET)            = 0
read(5, "   3      64 hdb 98217 251654 278"..., 1024) = 1024
read(5, "0 0 0 0 0 0 0 0 0 0\n"..., 1024) = 20
read(5, ""..., 1024)                    = 0
time(NULL)                              = 1283100653
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "major minor  #blocks  name\n\n   3 "..., 1024) = 354
fsync(6)                                = 0
fsync(7)                                = 0
fsync(10)                               = 0
fsync(11)                               = 0
fsync(12)                               = 0
fsync(13)                               = 0
fsync(14)                               = 0
fsync(15)                               = 0
read(3, ""..., 1024)                    = 0
fsync(16)                               = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({5, 0}, ..
(repeated endlessly, i.e. it does not wait 60 seconds as it used to do
before)
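
The reads on fd 5 in the trace above return lines shaped like /proc/diskstats entries (major, minor, device name, then I/O counters). A minimal sketch of how two such samples could be compared to spot a disk with no I/O in between follows. It assumes the standard /proc/diskstats field order; the helper names are hypothetical, and whether noflushd itself uses these exact fields is not confirmed by the report.

```python
# Sketch only: assumes the /proc/diskstats layout (major, minor, name,
# then counters). Helper names are hypothetical, not taken from noflushd.

def parse_diskstats(text):
    """Map device name -> (reads_completed, writes_completed)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated lines
        name = fields[2]
        reads_completed = int(fields[3])   # 4th column of /proc/diskstats
        writes_completed = int(fields[7])  # 8th column of /proc/diskstats
        stats[name] = (reads_completed, writes_completed)
    return stats


def idle_devices(previous, current):
    """Return devices whose counters did not change between two samples."""
    return sorted(name for name, counters in current.items()
                  if previous.get(name) == counters)
```

Calling parse_diskstats() on two successive reads of the file and passing both results to idle_devices() gives the list of disks that saw no completed reads or writes during the interval.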

(/proc/<pid-of-flush-process>/wchan gives 0)

No I/O activity on disk, but the load increases as flush- processes appear.

After touching the mounted directory corresponding to the idle disk to force a disk spinup (an "ls" takes several seconds until the disk is back to normal), the load goes back to zero, and the stuck system sync processes return.
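
For anyone trying to reproduce this, the check above (find the flush-* threads, then look at their wchan) can be scripted. The following is a rough sketch, assuming a kernel that names its writeback threads flush-MAJOR:MINOR as in the top output above and that exposes /proc/<pid>/comm and /proc/<pid>/wchan. The function name and the proc_root parameter are hypothetical, added only so the code can be pointed at a test directory.

```python
import os


def flush_threads(proc_root="/proc"):
    """Return sorted (pid, name, wchan) tuples for threads named flush-*."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
            if not name.startswith("flush-"):
                continue
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            continue  # process exited or file not readable
        found.append((int(entry), name, wchan))
    return sorted(found)
```

A wchan of 0 means the thread is not sleeping in the kernel at that moment, which is consistent with the R state and high %CPU shown by top.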

The noflushd process then goes back to a 60 second loop:

time(NULL)                              = 1283100976
_llseek(5, 0, [0], SEEK_SET)            = 0
read(5, "   3      64 hdb 98222 251654 278"..., 1024) = 1024
read(5, "0 0 0 0 0 0 0 0 0 0\n"..., 1024) = 20
read(5, ""..., 1024)                    = 0
time(NULL)                              = 1283100976
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "major minor  #blocks  name\n\n   3 "..., 1024) = 354
fsync(6)                                = 0
fsync(7)                                = 0
fsync(10)                               = 0
fsync(11)                               = 0
fsync(12)                               = 0
fsync(13)                               = 0
fsync(14)                               = 0
fsync(15)                               = 0
read(3, ""..., 1024)                    = 0
fsync(16)                               = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({5, 0},

{5, 0})               = 0
time(NULL)                              = 1283100981
_llseek(5, 0, [0], SEEK_SET)            = 0
read(5, "   3      64 hdb 98222 251654 278"..., 1024) = 1024
read(5, "0 0 0 0 0 0 0 0 0 0\n"..., 1024) = 20
read(5, ""..., 1024)                    = 0
time(NULL)                              = 1283100981
time(NULL)                              = 1283100981
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "major minor  #blocks  name\n\n   3 "..., 1024) = 354
fsync(8)                                = 0
fsync(9)                                = 0
_llseek(4, 0, [0], SEEK_SET)            = 0
write(4, "500\n"..., 4)                 = 4
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({60, 0},

At this point, I think that the suspend mode might be the root of all
evil; I don't know how it can affect noflushd, though. Setting up the disks
to automatically enter standby mode (hdparm -S242 /dev/hd${dev}) appears to be the cause.

Using noflushd 2.8-1; Linux kernel 2.6.34.4.
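
The write(4, "500\n") in the trace above looks like a writeback interval being restored to 5 seconds (500 centiseconds), and the bug title refers to a zero writeback interval. A small sketch for inspecting that setting follows. It assumes the value lives in /proc/sys/vm/dirty_writeback_centisecs; that fd 4 refers to this file is an assumption, since the trace does not show the open() call. Both function names are hypothetical.

```python
# Sketch only: the path is the standard vm.dirty_writeback_centisecs sysctl;
# that noflushd's fd 4 points at it is an assumption, not shown in the trace.

def read_writeback_centisecs(path="/proc/sys/vm/dirty_writeback_centisecs"):
    """Return the periodic writeback interval in centiseconds."""
    with open(path) as f:
        return int(f.read().strip())


def describe_interval(centisecs):
    """Turn the raw value into a short human-readable description."""
    if centisecs == 0:
        # The kernel documents 0 as "periodic writeback disabled".
        return "periodic writeback disabled"
    return "periodic writeback every %.1f s" % (centisecs / 100.0)
```

Printing describe_interval(read_writeback_centisecs()) while the flush-* threads are spinning would show whether the interval is zero at that moment.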

I'm available to do more tests if necessary.


-- System Information:
Debian Release: 5.0.5
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 2.6.34.4-grsec (SMP w/1 CPU core)
Locale: LANG=fr_FR.UTF-8@euro, LC_CTYPE=fr_FR.UTF-8@euro (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages noflushd depends on:
ii  debconf [debconf-2.0]         1.5.24     Debian configuration management sy
ii  ed                            0.7-3      The classic unix line editor
ii  libc6                         2.11.2-2   Embedded GNU C Library: Shared lib

noflushd recommends no packages.

noflushd suggests no packages.

-- debconf information:
  noflushd/expert: false
* noflushd/disks: /dev/hdb /dev/hde /dev/hdg
  noflushd/params:
* noflushd/timeout: 60

--- End Message ---

--- Begin Message ---

To: Xavier Roche <roche@httrack.com>
Cc: 594923-done@bugs.debian.org, Daniel Kobras <kobras@debian.org>
Subject: Re: Noflushd causes flush- processes to eat all CPU
From: Jonathan Nieder <jrnieder@gmail.com>
Date: Thu, 5 Apr 2012 15:52:35 -0500
Message-id: <20120405205235.GA7490@burratino>
In-reply-to: <20120304064938.GE14725@burratino>
References: <4C7AC4DD.3000101@httrack.com> <handler.594812.B594812.128311656529112.ackinfo@bugs.debian.org> <alpine.DEB.1.10.1008301323100.7203@linux.localnet> <20100830190931.GB12220@hamnixda.de> <20110903052152.GA1355@elie> <20111017112117.GA26050@elie.hsd1.il.comcast.net> <4F51CD6A.9080900@httrack.com> <20120303135551.GA2348@burratino> <4F52777F.7040100@httrack.com> <20120304064938.GE14725@burratino>

Version: 2.6.32-42

> Xavier Roche wrote:
>> On 03/03/2012 14:55, Jonathan Nieder wrote:

>>> Does the attached patch help?
>>
>> Yes, mostly.

Applied in 2.6.32.59.  Confirmation either way about the fix would
still be welcome, as always.
--- End Message ---
