
Bug#776192: Linux null-pointer deref in 3.16.7-ckt2-1 (was: Bug#776192: upgrade-reports wheezy to jessie boot problem)



Control: tag -1 patch

On Mon, 2015-04-06 at 20:50 +0100, Ben Hutchings wrote:
> Control: tag -1 moreinfo upstream
> Control: forwarded -1 http://thread.gmane.org/gmane.linux.ubuntu.devel.kernel.general/39123/
> 
> On Thu, 2015-04-02 at 01:01 -0700, Bill MacAllister wrote:
> > 
> > --On Sunday, January 25, 2015 11:25:34 AM +0100 Niels Thykier <niels@thykier.net> wrote:
> > 
> > > I have CC'ed the Debian linux maintainers as I noticed your kernel
> > > reports a null pointer dereference in the kernel (see below for the
> > > trace).  I have taken the liberty of reassigning it to the linux package
> > > as well.
> > >   @linux maintainers: if you suspect that the null pointer issue is
> > > unrelated to Bill's boot problem, please clone the bug and throw the bug
> > > back to upgrade-reports for further analysis.
> > >
> > > Bug link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776192
> > >
> > > Thanks,
> > > ~Niels
> > 
> > Any news on this problem?  I am still seeing this problem even though
> > we have moved on to 3.16.7-ckt7-1.
> > 
> > I had the thought to look at the kernel modules that support the
> > PERC controller on these Dell systems.  Explicitly specifying the
> > mpt* modules and updating initramfs does not fix the problem.
> > 
> > We have plenty of these 1950s.  I really need to come up with a
> > workaround or a solution to this problem.  Any ideas about what
> > I should try next?
> 
> It looks the same as this problem:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705
> http://thread.gmane.org/gmane.linux.ubuntu.devel.kernel.general/39123/
> 
> Can you confirm that the LSI controller (mptsas driver) takes more than
> 30 seconds to scan for devices at boot time when running the wheezy
> (Linux 3.2) kernel?
> 
> If so then we need a change to the udev rules to increase the timeout
> for this driver module.  We also need to fix the driver so that it fails
> cleanly if it still hits the timeout.

Also, does this patch work around the problem?

(Instructions for rebuilding the Debian kernel package:
http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official )

Ben.

-- 
Ben Hutchings
Power corrupts.  Absolute power is kind of neat.
                           - John Lehman, Secretary of the US Navy 1981-1987

From: Oleg Nesterov <oleg@redhat.com>
Subject: Re: please fix FUSION (Was: [v3.13][v3.14][Regression]
	kthread: make kthread_create() killable)
Date: Fri, 21 Mar 2014 19:34:43 +0100
Origin: http://permalink.gmane.org/gmane.linux.kernel/1671312
Bug-Debian: https://bugs.debian.org/776192

On 03/20, Oleg Nesterov wrote:
>
> On 03/20, Joseph Salisbury wrote:
> >
> > There was some testing done with your test kernel.  The data collected
> > while running your kernel is available in the bug report:
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705/comments/58
>
> Joseph, thanks a lot.
>
> I'll try to read the logs tomorrow, but at first glance Tetsuo was right,
> I do not see a "long" sleep in that log.

Yes, it seems that it actually needs > 30 secs. It spends most of the time
(30.13286 seconds) in

	msleep+0x20/0x30
	WaitForDoorbellInt+0x103/0x130 [mptbase]
	WaitForDoorbellReply+0x42/0x220 [mptbase]
	mpt_handshake_req_reply_wait+0x1dc/0x2c0 [mptbase]
	SendPortEnable.constprop.23+0x94/0xc0 [mptbase]

WaitForDoorbellInt() does msleep(1) in a loop. This trace starts at line
6001, and it is repeated 3792 times, until line 176686, which apparently
shows the trace of the 2nd WaitForDoorbellInt() in WaitForDoorbellReply().
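
A quick sanity check on those numbers: 3792 passes through the loop in
30.13286 seconds works out to roughly 7.9 ms per msleep(1) call rather
than 1 ms. The arithmetic can be sketched in C as below; the helper name
is hypothetical and exists only for this illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the driver): average wall-clock time
 * spent per loop pass, in milliseconds.
 */
static double avg_ms_per_iteration(double total_seconds, long iterations)
{
	assert(iterations > 0);
	return total_seconds * 1000.0 / (double)iterations;
}
```

Calling avg_ms_per_iteration(30.13286, 3792) gives a value just under
8 ms. That would be consistent with msleep(1) being rounded up to timer
tick granularity, but that explanation is an assumption; the log itself
does not state the tick rate.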

SendPortEnable:

	if (ioc->ir_firmware || ioc->bus_type == SAS) {
		rc = mpt_handshake_req_reply_wait(ioc, req_sz,
		(u32*)&port_enable, reply_sz, (u16*)&reply_buf,
		300 /*seconds*/, sleepFlag);
	} else {
		rc = mpt_handshake_req_reply_wait(ioc, req_sz,
		(u32*)&port_enable, reply_sz, (u16*)&reply_buf,
		30 /*seconds*/, sleepFlag);
	}

I am wondering which branch calls mpt_handshake_req_reply_wait(), the
else's timeout=30 (passed to the 1st WaitForDoorbellInt) suspiciously
matches the time WaitForDoorbellInt() actually runs... But everything
works fine and at first glance the potential timeout error should be
propagated correctly. So "timeout" is probably 300. And probably this
is all fine.
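
One way to reason about which timeout applies is to look at how the
wait loop counts. The sketch below assumes (this is an assumption about
mptbase.c, not something shown in this thread) that the timeout in
seconds is turned into a countdown of 1000 passes per second with one
msleep(1) per pass, so the limit is counted in loop passes rather than
in elapsed wall-clock time. All names here are hypothetical, and the
sleep is left out so that the sketch runs instantly in user space.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for "has the doorbell interrupt fired yet?". */
typedef bool (*poll_fn)(void *ctx);

/*
 * Countdown-style poll loop: the timeout is a budget of passes
 * (1000 per second), not a wall-clock deadline.
 * Returns the number of unsuccessful passes before the event was
 * seen, or -1 if the budget ran out first.
 */
static long wait_for_event(poll_fn ready, void *ctx, int timeout_secs)
{
	long cntdn;
	long count = 0;

	assert(timeout_secs > 0);
	cntdn = 1000L * timeout_secs;
	while (--cntdn) {
		if (ready(ctx))
			return count;
		/* the real loop would sleep here, e.g. msleep(1) */
		count++;
	}
	return -1;
}

/* Illustration only: reports "ready" once *remaining reaches zero. */
static bool ready_after(void *ctx)
{
	long *remaining = ctx;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

Under that assumption, 3792 passes is far below the 30000 passes that a
30-second budget would allow, so a wait of that length would return
success rather than a timeout with either branch. This fits the
observation that everything works when the process is not killed, but
it does not by itself show which branch was taken.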

All I can suggest is a dirty hack for now. User-space should be
changed too, I think, but that is another story.

Tetsuo, what do you think?

Oleg.
---


--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -5393,6 +5393,8 @@ static struct pci_driver mptsas_driver = {
 #endif
 };
 
+#include <linux/signal.h>
+
 static int __init
 mptsas_init(void)
 {
@@ -5422,7 +5424,31 @@ mptsas_init(void)
 	mpt_event_register(mptsasDoneCtx, mptsas_event_process);
 	mpt_reset_register(mptsasDoneCtx, mptsas_ioc_reset);
 
-	error = pci_register_driver(&mptsas_driver);
+	{
+		sigset_t full, save;
+		/*
+		 * KILL ME. THIS IS THE DIRTY AND WRONG HACK WAITING FOR THE
+		 * FIX FROM MAINTAINERS.
+		 *
+		 * - This driver needs a lot of time to complete SendPortEnable()
+		 *   but systemd-udevd sends SIGKILL after 30 seconds, see
+		 *   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705
+		 *
+		 *   Probably user-space should be changed, but:
+		 *
+		 * - Since commit 786235eeba0e "kthread: make kthread_create()
+		 *   killable" scsi_host_alloc() becomes killable and this SIGKILL
+		 *   crashes the kernel.
+		 *
+		 *   If scsi_host_alloc() fails mptsas_probe() blindly calls
+		 *   mptscsih_remove() and scsi_remove_host() hits host == NULL.
+		 */
+		sigfillset(&full);
+		sigprocmask(SIG_SETMASK, &full, &save);
+		error = pci_register_driver(&mptsas_driver);
+		sigprocmask(SIG_SETMASK, &save, NULL);
+	}
+
 	if (error)
 		sas_release_transport(mptsas_transport_template);
 
