
Bug#588675: Summary of observations of #588675



Control: retitle 588675 SCSI subsystem loses name of root device on boot
Control: severity 588675 normal
Control: found 588675 3.2.78-1
Control: found 588675 3.16.7-ckt20-1+deb8u3
Control: found 588675 3.16.7-ckt25-2
Control: found 588675 2.6.18

According to the advanced information on the BTS, under severity levels:

   wishlist
          for any feature request, and also for any bugs that are very
          difficult to fix due to major design considerations.

The first condition is untrue; this is definitely a bug.  While the
damage may not be that major, it is pretty widespread.  If the Debian
kernel maintainers were to claim this wasn't a problem, then I would be
forced to report another bug against src:linux since the kernel build
scripts themselves are confused by this behavior!

The second condition requires a judgement call to evaluate, but looking
at things I'm pretty sure it is untrue.  I'm guessing this is simply one
crucial field that needs to be copied by the SCSI subsystem, but is not.
Since many other subsystems manage to copy the value, almost certainly
the change is small.  I'd be surprised if it took more than 4 lines to
fix (two of them blank and one a comment).  I will concede
this may need expertise on how /proc/mounts works and the interface
between that and the driver subsystems (alternatively simply looking for
one field which is ignored may be enough), but with that this should be a
simple fix.

Meanwhile the damage from this bug may not be that large, but it is
rather widespread.  I know of 4 reports where this is the root cause and
I imagine there are others I do not know of.  There may also be many
utilities that already work around this bug and hundreds of scripts that
are similarly forced to do so.
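
As a rough illustration of the kind of workaround such scripts end up
carrying (a sketch only, not taken from any of the affected utilities):
instead of trusting the name in /proc/mounts, look the device up by its
major:minor number under /sys/dev/block.  The function name
devname_from_devno and its optional second argument are hypothetical, and
this assumes a kernel recent enough to provide /sys/dev/block and a
DEVNAME line in the uevent file.

```shell
# devname_from_devno MAJ:MIN [SYSFS_ROOT]
# Print /dev/NAME for the block device with the given device number.
# SYSFS_ROOT defaults to /sys; it is only a parameter so the function
# can be tried against a fake tree (hypothetical, for illustration).
devname_from_devno() {
    devno="$1"
    sysfs="${2:-/sys}"
    uevent="$sysfs/dev/block/$devno/uevent"

    # No sysfs entry for this device number: give up.
    [ -r "$uevent" ] || return 1

    # The uevent file is KEY=value lines; pick out the DEVNAME value.
    name=$(sed -n 's/^DEVNAME=//p' "$uevent")
    [ -n "$name" ] || return 1

    printf '/dev/%s\n' "$name"
}
```

On a live system this would be called along the lines of
devname_from_devno "$(mountpoint -d /)", where mountpoint -d prints the
major:minor of the device mounted on the given directory.  Every script
that needs the real name has to carry something like this, which is
exactly the duplicated effort I am complaining about.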

This bug has also cost me a great deal of time in trying to figure out
where to attribute the issue.  My earliest observations were close to a decade
ago, but I didn't feel confident placing blame anywhere.  Then more
recently I had to spend time building several kernels to confirm the
conditions under which the problem occurred.


Unaffected systems:

This group consists of all systems where the root filesystem is NOT on a
device that directly plugs into the SCSI subsystem.  It does not matter
whether an initial ramdisk is used or not.  This includes systems like:

root on Linux software RAID:
$ awk '$2 == "/" && $1 != "rootfs"' < /proc/mounts
/dev/md0 / ext3 ro 0 0
$ 
I recall this system being in service from around 2.6.5(?) to 2.6.18 or
so.  Even though the immediate driver was the MD subsystem, underlying
this were SCSI devices.  This is long in the past, but I'd already been
observing the bug by then (and wondering where to point the finger).

root on olde IDE devices, on the olde IDE subsystem:
$ awk '$2 == "/" && $1 != "rootfs"' < /proc/mounts
/dev/hda1 / ext3 ro 0 0
$ 
I think this system managed to remain in service into the 2.6.29
timeframe, but is also no longer in service.  This does give an example
of the root filesystem being on a different subsystem though.  Crucially
this is prior to the olde IDE subsystem being retired and the driver for
PATA devices which plugged into the SCSI subsystem coming into service.

root on MTD devices:
$ awk '$2 == "/" && $1 != "rootfs"' < /proc/mounts
/dev/mtdblock4 / jffs2 rw,relatime 0 0
$ 
A very different system here.  Different filesystem and rather different
device.  This one hasn't been tried with kernels earlier than 3.2, but
seems to echo other observations.  This one is in active service and due
to interesting setup allows for testing of some interesting scenarios.

root on BLK_DEV_IDE_PMAC (olde Mac IDE subsystem?):
This is Christian Kujau's report in bug #588675.  I believe
BLK_DEV_IDE_PMAC would be a PowerMac analog of the x86 IDE driver which
had its own subsystem and which didn't plug into the SCSI subsystem.


Affected systems:

This group consists of all systems where the root filesystem is on a
device that directly plugs into the SCSI subsystem and the system
directly mounts that device at boot.  On such systems:

$ awk '$2 == "/" && $1 != "rootfs"' < /proc/mounts
/dev/root / <somefs> ro,relatime 0 0
$ 

Most of my systems are running ext3, but Christian Kujau confirmed this
with ext4 and jfs.  Christian Kujau also observed this with the
PATA_MACIO driver, which I believe is a Macintosh equivalent of the x86
PATA driver which plugs into the SCSI subsystem.  I've observed this on
many different systems with devices which plug into the SCSI subsystem,
including a 3ware card, SATA disks, USB flash drives and genuine SCSI
disks.
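
For anyone wanting to check a machine against the two groups, the awk
test used throughout this message can be wrapped up as below.  This is
only a sketch; the function names root_source and root_name_lost are
hypothetical, and the optional file argument exists purely so a saved
copy of /proc/mounts can be examined.

```shell
# root_source [FILE]
# Print the device field of the root entry in a /proc/mounts-style
# file (default /proc/mounts), ignoring the "rootfs" pseudo-entry.
# If several entries are mounted on "/", the last one listed wins.
root_source() {
    awk '$2 == "/" && $1 != "rootfs" { dev = $1 }
         END { if (dev != "") print dev }' "${1:-/proc/mounts}"
}

# root_name_lost [FILE]
# Succeed (exit 0) when the root entry shows the "/dev/root"
# placeholder rather than a real device name.
root_name_lost() {
    [ "$(root_source "$1")" = "/dev/root" ]
}
```

A system in the affected group makes root_name_lost succeed; one in the
unaffected group, or one booted via the workaround described in the next
section, makes it fail.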


Workaround:

The workaround that bypasses the problem is to initially mount some other
device as root, then pivot_root or such onto the real root.  Using an
initial ramdisk is one example of this.  From the DebWRT project I'm also
aware that booting onto a root on MTD and then doing a pivot_root onto a
USB flash key works around the issue.

$ awk '$2 == "/" && $1 != "rootfs"' < /proc/mounts
/dev/sda1 / ext3 ro,noatime,nodiratime,acl,barrier=1 0 0
$ 

The problem is that this only works around the underlying cause.  On a system
with limited memory and little non-SCSI storage (think embedded systems)
it could be impossible to avoid directly mounting the real SCSI root
filesystem on boot.  Anyone who needs to build a custom kernel for
various reasons will likely know the root device and want to build a
kernel which directly mounts it.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         EHeM+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445

