
VM global patch / fixing blockdev deadlocks in Kernel 2.4.5?



Hello Andrea (hope I found the correct address),

I'm working on a Linux distribution running entirely from CD, using the
(de)compressed loopback device for which I'm co-author, which is
basically the Kernel 2.2 loop.c with a gzip-like decompressor built in. I
successfully ported this to 2.4.5 without any problems, at first.

Back when using Kernel 2.2.18, I came across a strange effect when
doing some parallelized IO on the cloop-mounted device. The IO suddenly
stops without any kernel panic or other error message; furthermore,
not only the cloop device hangs but all other mounted block devices as
well. I never found out what exactly happened, but spent a lot of time
tracing, rewriting and speed-improving cloop.o without finding
any obvious error. The location where it hung was definitely in
ll_rw_blk(), but I lost track once schedule() was called.

Applying your 2.2.15pre VM-global-patch apparently solved the
problem (at least, it has never occurred again since then).

However, it's back in 2.4.5! When randomly accessing files (such as
doing a "tar cpPvf - /mounted_cloop_iso9660 | dd of=/dev/null"), IO on
all mounted block devices sometimes suddenly stops after a while.
This does not seem to be related to the amount of memory/cache/swap,
read errors on the CD-ROM or other obvious things. I suspect a deadlock
between concurrently running device IO queues.
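
To make the access pattern concrete, here is a minimal sketch of the
kind of parallel read load described above. The function name
parallel_read, the reader count and the directory argument are
illustrative assumptions, not the exact commands used; it simply runs
several copies of the tar | dd pipeline at once and waits for them.

```shell
#!/bin/sh
# Hypothetical helper (not part of cloop): start several concurrent
# readers over one directory tree and wait for all of them.
#   $1 = directory to read (e.g. the cloop mount point)
#   $2 = number of parallel readers (arbitrary choice)
parallel_read() {
    dir=$1
    readers=$2
    pids=""
    i=0
    while [ "$i" -lt "$readers" ]; do
        # Each reader streams the whole tree to /dev/null,
        # like the tar | dd pipeline quoted above.
        tar cpPf - "$dir" 2>/dev/null | dd of=/dev/null 2>/dev/null &
        pids="$pids $!"
        i=$((i + 1))
    done
    # If block IO stalls, this wait never returns.
    for pid in $pids; do
        wait "$pid" || return 1
    done
    echo "all $readers readers finished"
}
```

Called as "parallel_read /mounted_cloop_iso9660 4", the function prints
its final message only if every reader completes; on an affected
system the wait would be expected to block indefinitely instead.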

I looked into your patch and found that some of your changes have been
incorporated into 2.4.5, but others have not (especially your improved
locking mechanism for cached buffers). Is there a patch for 2.4.5
equivalent to the one you made for 2.2.18, or would it be a sane
approach for me to port the same kind of changes from your 2.2.18
patch to 2.4.5, step by step?

So, maybe you already worked on this and have a hint for me?

Since I'm running out of time (the press date for our CD is next week),
I will try to apply some of your blkdev patches from
ftp://ftp.de.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/ and
see what happens.

Of course I would appreciate any kind of hint or insight from you
regarding this deadlock condition, which does not directly seem to be
related to the loopback/cloop device but rather to the way the kernel
handles buffered blocks in general, or maybe it's even a VM issue
(though I would expect kernel panics if this was the case).

Thanks in advance!

-Klaus Knopper
PS: The source of cloop can be found at
http://www.knopper.net/knoppix/sources/cloop.tar.bz2, in case you want
to have a look. I'm positive that the cause of the error is not in
there; the hang only occurs during ll_rw_blk().
---
Klaus Knopper                  LinuxTag 2001 - Europes largest Linux Expo
Technical Solutions                                 Where .com meets .org
knopper@linuxtag.de                               http://www.linuxtag.de/
Phone +49-(0)180-5-546898                         Fax +49-(0)180-5-546893

