
Bug#339080: Frequent crash in handle_IRQ_event on alpha with kernel 2.6



Package: linux-2.6
Tags: patch

Since the beginning of 2005 I have tried various 2.6.x kernel images on my AlphaStation 500/500:
=============================================================================================
cpu                     : Alpha
cpu model               : EV56
cpu variation           : 7
cpu revision            : 0
cpu serial number       :
system type             : Alcor
system variation        : Alcor
system revision         : 0
system serial number    :
cycle frequency [Hz]    : 500000000
timer frequency [Hz]    : 1024.00
page size [bytes]       : 8192
phys. address bits      : 40
max. addr. space #      : 127
BogoMIPS                : 994.44
kernel unaligned acc    : 0 (pc=0,va=0)
user unaligned acc      : 0 (pc=0,va=0)
platform string         : Digital AlphaStation 500/500
cpus detected           : 1
L1 Icache               : 8K, 1-way, 32b line
L1 Dcache               : 8K, 1-way, 32b line
L2 cache                : 96K, 3-way, 64b line
L3 cache                : 8192K, 1-way, 64b line

The kernels tried were 2.6.8-1, 2.6.8-2, 2.6.10, 2.6.12, ... All of them crash on this machine with the following message (decoded with ksymoops):
ksymoops 2.4.9 on alpha 2.6.8-2-generic.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.6.8-2-generic/ (default)
     -m /boot/System.map-2.6.8-2-generic (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Error (regular_file): read_ksyms stat /proc/ksyms failed
No modules in ksyms, skipping objects
No ksyms, skipping lsmod
Trace:
[<fffffc000031a164>] handle_IRQ_event+0x74/0xf0
[<fffffc000031ab50>] handle_irq+0xe0/0x1c0
[<fffffc0000329b04>] srm_device_interrupt+0x24/0x40
[<fffffc000031b1f4>] do_entInt+0xf4/0x140
[<fffffc0000315260>] ret_from_sys_call+0x0/0x10
[<fffffc0000316e30>] default_idle+0x0/0x10
[<fffffc0000316e98>] cpu_idle+0x58/0x80
[<fffffc0000316e30>] default_idle+0x0/0x10
[<fffffc0000316e30>] default_idle+0x0/0x10
[<fffffc0000310234>] rest_init+0x34/0x50
[<fffffc000031001c>] __start+0x1c/0x20
Code: 243f0010 245f0020 21c10100 21a20200 a4490008 a4290000 <b4410008> b4220000
Using defaults from ksymoops -t elf64-alpha -a alpha


Trace; fffffc000031a164 <handle_IRQ_event+74/f0>
Trace; fffffc000031ab50 <handle_irq+e0/1c0>
Trace; fffffc0000329b04 <srm_device_interrupt+24/40>
Trace; fffffc000031b1f4 <do_entInt+f4/140>
Trace; fffffc0000315260 <ret_from_sys_call+0/10>
Trace; fffffc0000316e30 <default_idle+0/10>
Trace; fffffc0000316e98 <cpu_idle+58/80>
Trace; fffffc0000316e30 <default_idle+0/10>
Trace; fffffc0000316e30 <default_idle+0/10>
Trace; fffffc0000310234 <rest_init+34/50>
Trace; fffffc000031001c <_stext+1c/20>

Code;  ffffffffffffffe8 <END_OF_CODE+3ffff9a83a8/????>
0000000000000000 <_PC>:
Code;  ffffffffffffffe8 <END_OF_CODE+3ffff9a83a8/????>
   0:   10 00 3f 24       ldah t0,16
Code;  ffffffffffffffec <END_OF_CODE+3ffff9a83ac/????>
   4:   20 00 5f 24       ldah t1,32
Code;  fffffffffffffff0 <END_OF_CODE+3ffff9a83b0/????>
   8:   00 01 c1 21       lda  s5,256(t0)
Code;  fffffffffffffff4 <END_OF_CODE+3ffff9a83b4/????>
   c:   00 02 a2 21       lda  s4,512(t1)
Code;  fffffffffffffff8 <END_OF_CODE+3ffff9a83b8/????>
  10:   08 00 49 a4       ldq  t1,8(s0)
Code;  fffffffffffffffc <END_OF_CODE+3ffff9a83bc/????>
  14:   00 00 29 a4       ldq  t0,0(s0)
Code;  0000000000000000 Before first symbol
  18:   08 00 41 b4       stq  t1,8(t0)
Code;  0000000000000004 Before first symbol
  1c:   00 00 22 b4       stq  t0,0(t1)

Kernel panic: Aiee, killing interrupt handler!
=============================================================================================
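
The Code: line in the oops can be cross-checked by hand. The following is a minimal sketch, assuming the standard Alpha memory instruction format (6-bit opcode, two 5-bit register fields, 16-bit signed displacement); the helper names alpha_decode_mem and alpha_format_mem are made up for illustration and are not part of ksymoops or the kernel. Only the four opcodes that appear in this oops are named.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fields of an Alpha memory-format instruction:
 * bits 31..26 opcode, 25..21 Ra, 20..16 Rb, 15..0 signed displacement. */
struct alpha_mem_insn {
    unsigned opcode;
    unsigned ra;
    unsigned rb;
    int disp;
};

/* Hypothetical helper: split a 32-bit instruction word into its fields. */
static struct alpha_mem_insn alpha_decode_mem(uint32_t word)
{
    struct alpha_mem_insn insn;

    insn.opcode = (word >> 26) & 0x3f;
    insn.ra = (word >> 21) & 0x1f;
    insn.rb = (word >> 16) & 0x1f;
    insn.disp = (int)(word & 0xffff);
    if (insn.disp & 0x8000)          /* sign-extend the 16-bit field */
        insn.disp -= 0x10000;
    return insn;
}

/* Only the opcodes seen in the oops above are named here. */
static const char *alpha_mnemonic(unsigned opcode)
{
    switch (opcode) {
    case 0x08: return "lda";
    case 0x09: return "ldah";
    case 0x29: return "ldq";
    case 0x2d: return "stq";
    default:   return "?";
    }
}

/* Software names of the 32 integer registers. */
static const char *alpha_regname(unsigned reg)
{
    static const char *const names[32] = {
        "v0", "t0", "t1", "t2", "t3", "t4", "t5", "t6",
        "t7", "s0", "s1", "s2", "s3", "s4", "s5", "fp",
        "a0", "a1", "a2", "a3", "a4", "a5", "t8", "t9",
        "t10", "t11", "ra", "pv", "at", "gp", "sp", "zero"
    };
    return names[reg & 0x1f];
}

/* Hypothetical helper: render "mnemonic ra,disp(rb)" into buf. */
static void alpha_format_mem(uint32_t word, char *buf, size_t len)
{
    struct alpha_mem_insn insn = alpha_decode_mem(word);

    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%s %s,%d(%s)", alpha_mnemonic(insn.opcode),
             alpha_regname(insn.ra), insn.disp, alpha_regname(insn.rb));
}
```

Feeding the word marked with <...> in the Code: line, 0xb4410008, to alpha_format_mem() gives "stq t1,8(t0)", which agrees with the ksymoops disassembly of the faulting instruction. Unlike ksymoops, this sketch always prints the base register, so the ldah lines come out as "ldah t0,16(zero)".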

When the crash happens depends on several factors: sometimes after 3 hours, sometimes after two days, but mostly while the machine is idle. No particular device driver seems to be at fault, because the crash always occurs in the function handle_IRQ_event() in arch/alpha/kernel/irq.c. The problem could arise after the call into a driver's interrupt handler, but the crash also happens after changing hardware and drivers (2 different SCSI drivers, 3 different Ethernet drivers, with/without SATA, with/without USB). Nevertheless, here is the hardware configuration:

=============================================================================================
0000:00:06.0 Ethernet controller: Digital Equipment Corporation DECchip 21040 [Tulip] (rev 26)
        Flags: bus master, medium devsel, latency 255, IRQ 29
        I/O ports at 9400 [size=128]
        Memory at 00000000022dd000 (32-bit, non-prefetchable) [size=128]

0000:00:07.0 RAID bus controller: Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
        Subsystem: Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3114 SATARaid Controller
        Flags: bus master, 66MHz, medium devsel, latency 240, IRQ 24
        I/O ports at 9810 [size=8]
        I/O ports at 9820 [size=4]
        I/O ports at 9818 [size=8]
        I/O ports at 9824 [size=4]
        I/O ports at 9800 [size=16]
        Memory at 00000000022db000 (32-bit, non-prefetchable) [size=1K]
        Expansion ROM at 0000000002200000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 2

0000:00:08.0 VGA compatible controller: Digital Equipment Corporation PBXGB [TGA2] (rev 22) (prog-if 00 [VGA])
        Flags: bus master, medium devsel, latency 255, IRQ 32
        Memory at 0000000002400000 (32-bit, prefetchable) [size=4M]
        Expansion ROM at 00000000022d0000 [disabled] [size=32K]

0000:00:09.0 SCSI storage controller: QLogic Corp. ISP1020 Fast-wide SCSI (rev 02)
        Flags: bus master, medium devsel, latency 248, IRQ 28
        I/O ports at 9000 [size=256]
        Memory at 00000000022d8000 (32-bit, non-prefetchable) [size=4K]
        Expansion ROM at 00000000022c0000 [disabled] [size=64K]

0000:00:0a.0 Non-VGA unclassified device: Intel Corporation 82375EB/SB PCI to EISA Bridge (rev 15)
        Flags: bus master, medium devsel, latency 248

0000:00:0b.0 Ethernet controller: Digital Equipment Corporation DECchip 21140 [FasterNet] (rev 20)
        Subsystem: Digital Equipment Corporation: Unknown device 500a
        Flags: bus master, medium devsel, latency 255, IRQ 16
        I/O ports at 9480 [size=128]
        Memory at 00000000022de000 (32-bit, non-prefetchable) [size=128]
        Expansion ROM at 0000000002280000 [disabled] [size=256K]

0000:00:0c.0 USB Controller: NEC Corporation USB (rev 43) (prog-if 10 [OHCI])
        Subsystem: NEC Corporation USB
        Flags: bus master, medium devsel, latency 252, IRQ 20
        Memory at 00000000022d9000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [40] Power Management version 2

0000:00:0c.1 USB Controller: NEC Corporation USB (rev 43) (prog-if 10 [OHCI])
        Subsystem: NEC Corporation USB
        Flags: bus master, medium devsel, latency 252, IRQ 21
        Memory at 00000000022da000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [40] Power Management version 2

0000:00:0c.2 USB Controller: NEC Corporation USB 2.0 (rev 04) (prog-if 20 [EHCI])
        Subsystem: HaSoTec GmbH: Unknown device 2928
        Flags: bus master, medium devsel, latency 252, IRQ 22
        Memory at 00000000022dc000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [40] Power Management version 2
=============================================================================================

But I have a fix:
Looking through the kernel sources of the different architectures, I have seen that almost all of them use the same irq.c code. In newer kernels (>2.6.8), x86, ia64, amd64, powerpc and parisc, for example, have switched to generic IRQ handling code. Of the remaining architectures, some have not switched yet and others have their own IRQ handlers.

Alpha has not switched yet either, but it uses the same IRQ code, with some "small" changes that are important for this bug. The code of handle_IRQ_event() seems to be somewhat outdated (apparently unmodified since the 2.2 kernels!), whereas all other architectures have changed it since 2.2.

Converting alpha to the generic code was too much work, but what helped was copying the handle_IRQ_event() code from x86 to alpha. With that change it works: the machine ran 40 days with 2.6.8, 90 days with 2.6.10, since September with 2.6.12 and since yesterday with 2.6.14 (all kernels with this patch applied). Had I not shut down for kernel updates, the first patched 2.6.8 would surely still be running today :-)

==================================================================================================================
diff -ru
--- arch/alpha/kernel/irq.c     2005-03-02 08:38:18.000000000 +0100
+++ arch/alpha/kernel/irq.c     2005-05-15 23:32:09.000000000 +0200
@@ -79,29 +79,27 @@
        .end            = no_irq_enable_disable,
 };

-int
-handle_IRQ_event(unsigned int irq, struct pt_regs *regs,
-                struct irqaction *action)
+int handle_IRQ_event(unsigned int irq, struct pt_regs *regs,
+                               struct irqaction *action)
 {
-       int status = 1; /* Force the "do bottom halves" bit */
-       int ret;
+       int ret, retval = 0, status = 0;

-       do {
-               if (!(action->flags & SA_INTERRUPT))
-                       local_irq_enable();
-               else
-                       local_irq_disable();
+       if (!(action->flags & SA_INTERRUPT))
+               local_irq_enable();

+       do {
                ret = action->handler(irq, action->dev_id, regs);
                if (ret == IRQ_HANDLED)
                        status |= action->flags;
+               retval |= ret;
                action = action->next;
        } while (action);
+
        if (status & SA_SAMPLE_RANDOM)
                add_interrupt_randomness(irq);
        local_irq_disable();

-       return status;
+       return retval;
 }

 /*
=====================================================================================================================
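
The behavioural difference between the two versions of the loop in the diff above can be sketched in user space. This is only a mock under stated assumptions, not kernel code: everything prefixed with mock_ (the flag value, the interrupt-state variable, the action struct and its handler) is invented for illustration, and the SA_SAMPLE_RANDOM step is left out. It does not try to reproduce the crash; it only shows what changes for the handlers and for the return value.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for the kernel definitions (values are assumptions). */
#define MOCK_SA_INTERRUPT 0x20000000u
#define MOCK_IRQ_NONE     0
#define MOCK_IRQ_HANDLED  1

static int mock_irqs_enabled;   /* 1 = interrupts on, 0 = off */

static void mock_local_irq_enable(void)  { mock_irqs_enabled = 1; }
static void mock_local_irq_disable(void) { mock_irqs_enabled = 0; }

struct mock_irqaction {
    unsigned int flags;
    int result;                   /* value the mock handler returns */
    int saw_enabled;              /* interrupt state seen by the handler */
    struct mock_irqaction *next;  /* next handler sharing the IRQ line */
};

/* Mock handler: record the interrupt state, return the preset result. */
static int mock_handler(struct mock_irqaction *action)
{
    action->saw_enabled = mock_irqs_enabled;
    return action->result;
}

/* Old alpha loop: the interrupt state is switched for every action,
 * and the return value is the accumulated flags with bit 0 forced. */
static int old_handle_irq_event(struct mock_irqaction *action)
{
    int status = 1;
    int ret;

    assert(action != NULL);
    do {
        if (!(action->flags & MOCK_SA_INTERRUPT))
            mock_local_irq_enable();
        else
            mock_local_irq_disable();
        ret = mock_handler(action);
        if (ret == MOCK_IRQ_HANDLED)
            status |= action->flags;
        action = action->next;
    } while (action);
    mock_local_irq_disable();
    return status;
}

/* Patched loop (taken from x86): the interrupt state is decided once,
 * from the first action, and the handlers' return values are OR-ed. */
static int new_handle_irq_event(struct mock_irqaction *action)
{
    int ret, retval = 0;

    assert(action != NULL);
    if (!(action->flags & MOCK_SA_INTERRUPT))
        mock_local_irq_enable();
    do {
        ret = mock_handler(action);
        retval |= ret;
        action = action->next;
    } while (action);
    mock_local_irq_disable();
    return retval;
}
```

With a chain of two actions where only the second has MOCK_SA_INTERRUPT set and neither handles the interrupt, the old loop runs the first handler with interrupts on and the second with interrupts off, and returns 1. The patched loop runs both handlers with interrupts on and returns MOCK_IRQ_NONE, so the caller can tell that nobody claimed the interrupt.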

The patch is also available on the machine itself: http://alpha.thetaphi.de/alpha-irq.patch

The best solution would be to move alpha to the generic IRQ code as well (if I have time, I would help with that), but this patch helps in the meantime.

Another person also had this crash, but said it only happens with udev/hotplug running, so this could be the cause: the old kernel 2.2 code is not compatible with hotplug features. I did not test this because my machine needs hotplug for USB, and not all drivers are in the initrd or /etc/modules.



