Bug#631187: Kernel panics when removing external hard drive

To: Alexander Kurtz <kurtz.alex@googlemail.com>
Cc: Ben Hutchings <ben@decadent.org.uk>, 631187@bugs.debian.org
Subject: Bug#631187: Kernel panics when removing external hard drive
From: Jonathan Nieder <jrnieder@gmail.com>
Date: Tue, 5 Jul 2011 17:51:29 -0500
Message-id: <20110705225129.GA8701@elie>
Reply-to: Jonathan Nieder <jrnieder@gmail.com>, 631187@bugs.debian.org
In-reply-to: <1308736758.3716.12.camel@localhost>
References: <1308647311.4144.14.camel@localhost> <1308710440.3093.249.camel@localhost> <1308736758.3716.12.camel@localhost>

Hi,

Alexander Kurtz wrote:
> On Wed, 2011-06-22 at 03:40 +0100, Ben Hutchings wrote:

>> The panic message shows there was an earlier kernel warning; please can
>> you provide that.
>
> Thanks to netconsole (a really great tool!) I was able to do so. The
> attached kernel log starts right before I plug the drive in.
> Surprisingly the kernel didn't crash the first time, but after trying
> again, everything went as expected (see lines 17 and 35).

Sorry for the long silence.  Let's see:

> [ 1421.182657] sd 7:0:0:0: [sdc] Attached SCSI disk
> [ 1454.865926] WARNING! power/level is deprecated; use power/control instead

Seems harmless enough.

> [ 1478.728383] sd 8:0:0:0: [sdc] Attached SCSI disk
> [ 1491.693027] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
> [ 1491.693229] IP: [<ffffffff8118b2e3>] elv_completed_request+0x38/0x47

The panic.

[...]
> [ 1491.696825] Code: 40 74 35 83 7e 44 01 74 04 a8 40 74 2b 83 e0 11 ff c8 0f 95 c0 83 e0 01 48 05 fc 00 00 00 ff 4c 87 04 f6 46 41 04 74 10 48 8b 02
> [ 1491.696825] <48> 8b 40 48 48 85 c0 74 04 41 58 ff e0 59 c3 48 8d be 80 00 00
> [ 1491.696825] RIP  [<ffffffff8118b2e3>] elv_completed_request+0x38/0x47

Disassembly, for convenience (following the hints from
Documentation/oops-tracing.txt):

| <+0>:     rex je 0x6008b8 <str+56>
| <+3>:     cmpl   $0x1,0x44(%rsi)
| <+7>:     je     0x60088d <str+13>
| <+9>:     test   $0x40,%al
| <+11>:    je     0x6008b8 <str+56>
| <+13>:    and    $0x11,%eax
| <+16>:    dec    %eax
| <+18>:    setne  %al
| <+21>:    and    $0x1,%eax
| <+24>:    add    $0xfc,%rax
| <+30>:    decl   0x4(%rdi,%rax,4)
| <+34>:    testb  $0x4,0x41(%rsi)
| <+38>:    je     0x6008b8 <str+56>
| <+40>:    mov    (%rdx),%rax
| <+43>:    cmp    %ah,0x40(%rdx)
| <+46>:    rex.W
| <+47>:    test   %rax,%rax
| <+50>:    je     0x6008b8 <str+56>
| <+52>:    pop    %r8
| <+54>:    jmpq   *%rax
| <+56>:    pop    %rcx
| <+57>:    retq   
| <+58>:    lea    0x80(%rsi),%rdi
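
For anyone who wants to reproduce a listing like this one: the `str` symbol in the jump targets suggests the bytes were compiled into a C string and disassembled with gdb, which is the approach oops-tracing.txt describes. The sketch below builds such a source file from the hex bytes of a Code: line. The function name is made up for illustration, and it assumes only the hex bytes are passed in (no timestamp or "Code:" prefix); angle brackets around the faulting byte, if present, are stripped.

```python
def code_dump_to_c_source(hex_dump, symbol="str"):
    """Turn the hex bytes of an oops 'Code:' line into a tiny C source file.

    hex_dump: whitespace-separated hex bytes, e.g. "40 74 35 <48> 8b".
    symbol:   name of the char array to disassemble in gdb (hypothetical default).
    """
    escaped = []
    for token in hex_dump.split():
        # The kernel marks the byte at the faulting IP as "<xx>"; drop the marks.
        value = int(token.strip("<>"), 16)
        if not 0 <= value <= 0xFF:
            raise ValueError("not a single byte: %r" % token)
        escaped.append("\\x%02x" % value)
    literal = "".join(escaped)
    return 'char %s[] = "%s";\nint main(void) { return 0; }\n' % (symbol, literal)


if __name__ == "__main__":
    # First few bytes of the dump quoted above, just to show the shape.
    print(code_dump_to_c_source("40 74 35 83 7e 44 01"))
```

Compiling the result with "gcc -g" and running "disassemble str" in gdb should then give a listing with offsets relative to the first dumped byte, not to the start of the function.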

So offset 0x38 is the jump in

		if ((rq->cmd_flags & REQ_SORTED) &&

As for why that involves an access to the address 0x48: well, that
is beyond my depth.  rq->cmd_flags was already accessed in the check

	if (blk_account_rq(rq))

Maybe the actual cause of the fault is some different instruction and
the instruction pointer is not to be trusted (?).  I suppose if I were
in this situation, I'd sprinkle block/elevator.c::elv_completed_request
with printk calls to be able to witness exactly what happens.
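
One way to cross-check which instruction the reported RIP points at is to convert listing offsets into function offsets. The sketch below assumes the x86-64 oops code dumps 43 bytes before the faulting instruction pointer (the usual split for the default 64-byte dump, but treat that number as an assumption). The helper names are illustrative only.

```python
PROLOGUE_BYTES = 43  # assumption: bytes dumped before the faulting IP


def listing_to_function_offset(listing_offset, rip_offset, prologue=PROLOGUE_BYTES):
    """Map an offset in the 'Code:' listing to an offset inside the function."""
    return rip_offset - prologue + listing_offset


def function_to_listing_offset(function_offset, rip_offset, prologue=PROLOGUE_BYTES):
    """Inverse mapping: function offset to offset in the 'Code:' listing."""
    return function_offset - rip_offset + prologue


if __name__ == "__main__":
    rip_offset = 0x38  # from "elv_completed_request+0x38/0x47"
    for listing_offset in (0, 38, 43, 57):
        func_offset = listing_to_function_offset(listing_offset, rip_offset)
        print("<+%d> -> elv_completed_request+0x%x" % (listing_offset, func_offset))
```

Under that assumption the listing starts at elv_completed_request+0xd and <+43> corresponds to elv_completed_request+0x38. The retq at <+57> maps to +0x46, the last byte of a 0x47-byte function, which is at least consistent with the size reported in the oops.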

Sorry for the trouble, and hope that helps.
Jonathan
