
Re: LSI MegaRAID SAS 9240-4i hangs system at boot

On Fri, 18 May 2012 17:47:54 -0500, Stan Hoeppner wrote:

> On 5/18/2012 9:23 AM, Ramon Hofer wrote:
>> Hi all
>> I finally got my LSI 9240-4i and the Intel SAS expander.
>> Unfortunately it prevents the system from booting. I only got this
>> message on the screen:
>> megasas: INIT adapter done
>> hub 4-1:1.0 over-current condition on port 7
>> hub 4-1:1.0 over-current condition on port 8
> These over-current errors are reported by USB, not megasas.  Unplug all
> of your USB devices until you get everything else running.

Even when I unplug the chassis USB connector and leave only the 
mainboard's onboard USB ports, with nothing connected to them, the 
message remains.

This is device 1 on bus 4, right? So it should be ID 1d6b:0002 Linux 
Foundation 2.0 root hub?
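
For reference, here is a minimal sketch of how a kernel name like 
"4-1:1.0" can be decoded. It assumes the usual sysfs convention 
<bus>-<port>[.<port>...]:<config>.<interface>; the function name 
parse_usb_interface_name is made up for illustration. If that assumption 
holds, "4-1" names whatever sits on port 1 of bus 4's root hub, not 
device number 1, so it may be the Intel rate matching hub rather than 
the root hub itself.

```python
import re

# Assumed sysfs naming: <bus>-<port>[.<port>...]:<config>.<interface>
_NAME = re.compile(r"^(\d+)-(\d+(?:\.\d+)*):(\d+)\.(\d+)$")


def parse_usb_interface_name(name):
    """Split a name such as '4-1:1.0' into its numeric parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError("not a USB interface name: %r" % name)
    bus, ports, config, interface = match.groups()
    return {
        "bus": int(bus),
        # Chain of hub ports leading from the root hub to the device.
        "ports": [int(p) for p in ports.split(".")],
        "config": int(config),
        "interface": int(interface),
    }


# The hub named in the over-current messages.
print(parse_usb_interface_name("4-1:1.0"))
```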

>> I also got the over-current messages when the LSI card is removed.
>> Here's the output of lsusb:
>> Bus 004 Device 003: ID 046d:c517 Logitech, Inc. LX710 Cordless Desktop Laser
>> Bus 004 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
>> Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
>> Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
>> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
>> Bus 003 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
>> Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
> Again, this is because the over-current issue has nothing to do with
> the HBA, but the USB subsystem.

Yes, this might have nothing to do with the problem. But I still wanted 
to mention it because I didn't know whether it's related, or whether I 
should worry about it.

Mainboards sometimes say strange things :-)
On my HTPC I always get a "CPU fan error" message, probably because I 
have a big passive cooler and use the chassis fans to cool it.
And that hasn't caused any problems so far either.

>> Nevertheless I think the module for the card should be loaded but then
>> it somehow hangs.
> You're assuming it's the HBA/module hanging the system.  I see no
> evidence of that so far.

I came to that conclusion because when the card is installed the system 
stops during boot.
When the card is removed the system boots.
There's this over-current problem that could be causing something.
And maybe the PCIe slots have something to do with it. But I have 
plugged the LSI card into both PCIe x16 slots on the mainboard and both 
times the system didn't boot.
And the expander only uses the slot to draw its power.

I also tried switching the LSI BIOS off.

These are the things I tried in order to isolate the problem, but 
unfortunately I don't have any other ideas.
I will now thoroughly study the LSI documentation...

>> And after a while there are more messages which I don't understand. I
>> have taken a picture:
>> http://666kb.com/i/c3wf606sc1qkcvgoc.jpg
> It shows that udev is having serious trouble handling one of the USB
> devices.

Yes, but only when the LSI card is installed. When it's removed the 
messages don't appear. And I don't even have anything connected to the 
USB ports. Really confusing...
I think I had the same messages with the Supermicro AOC-SASLP-MV8 
cards :-?
But when I switched to the bpo amd64 kernel it _seemed_ ok.

This is why I hoped it would be the same with the megaraid module.

Btw, just left of the Ext. LED connector the CR1 LED blinks constantly 
(from the moment the system is powered on), 1 sec on / 1 sec off. I 
couldn't find the meaning of this LED in the LSI documents. But to be 
honest I haven't read through the 500 page manual yet. Which I will do 
now :-)

>> Then there are lots of messages like this:
>> INFO: task modprobe:123 blocked for more than 120 seconds. "echo 0..."
>> disables this message
>> Instead of modprobe:123 also modprobe:124, 125, 126, 127, 135, 137 and
>> kworker/u:1:164, 165 are listed.
> Posting log snippets like this is totally useless.  Please post your
> entire dmesg output to pastebin and provide the link.

Yesterday it didn't occur to me that I could use the files under 
/var/log. I was stuck because I couldn't type dmesg in a terminal at the 
moment the error occurs.

But I have posted some logs in my previous post. I hope these help more.
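
To pull the hung-task reports out of a saved kernel log afterwards, 
something like the following sketch might help. It only assumes the 
message format quoted above ("INFO: task NAME:PID blocked for more than 
N seconds"); the function names are made up, and the log path in the 
comment (/var/log/kern.log) is my assumption for a default Debian 
install, not something I have verified here.

```python
import re
from collections import Counter

# Greedy name group so task names containing colons (kworker/u:1) work.
HUNG = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def hung_tasks(lines):
    """Count hung-task reports per (task name, pid) in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = HUNG.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def scan_file(path):
    """Scan a saved log, e.g. scan_file('/var/log/kern.log') (assumed path)."""
    with open(path, errors="replace") as handle:
        return hung_tasks(handle)


# Illustrative lines in the format of the messages on the screen.
sample = [
    "INFO: task modprobe:123 blocked for more than 120 seconds.",
    "INFO: task kworker/u:1:164 blocked for more than 120 seconds.",
    "megasas: INIT adapter done",
]
for (name, pid), count in sorted(hung_tasks(sample).items()):
    print(name, pid, count)
```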

>> I can enter the BIOS of the card just fine. It detect the disks and by
>> defaults sets jbod option for them. This is fine because I want to use
>> linux RAID.
> Sure, because the card and expander are working properly.

Yes, now I only have to convince the OS to accept this :-)

>> May this problem be the same:
>> http://www.spinics.net/lists/raid/msg30359.html
>> Should I try a firmware upgrade?
> Your hang problem seems unrelated to the HBA.  Exhaust all other
> possibilities before attempting a firmware upgrade.  If there is some
> other system level problem, it could botch the FW upgrade and brick the
> card, leaving you in a far worse situation than you are now.
> Post your FW version here.  It's likely pretty recent already.

The FW version is 2.70.04-0862.

I am a little confused by LSI's versioning. On their homepage [1] they 
list the firmware named 4.6 - 10M09 P24 as the newest. The filename of 
this file is 
20.10.1-0077_SAS_2008_FW_Image_APP-2.120.244-1482.zip. According to the 
readme, the leading number 20.10.1-0077 is the newest version. The 
filename ends with 2.120.244-1482, which looks more like the format of 
the version listed in my card's BIOS.
[1] http://www.lsi.com/downloads/Public/MegaRAID%20Common%
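
To make the two numbering schemes easier to compare, here is a rough 
sketch that splits the download's filename into its leading and 
trailing version strings and compares the trailing one with the version 
my card reports. Both the filename pattern and the field-by-field 
numeric comparison are my own assumptions, not something taken from the 
LSI readme, and the function names are made up.

```python
import re

# Assumed pattern: <package version>_..._APP-<firmware version>.zip
_FILE = re.compile(r"^(\d+(?:\.\d+)*-\d+)_.*_APP-(\d+(?:\.\d+)*-\d+)\.zip$")


def split_versions(filename):
    """Return (leading version, trailing APP version) from the filename."""
    match = _FILE.match(filename)
    if match is None:
        raise ValueError("unexpected filename: %r" % filename)
    return match.group(1), match.group(2)


def version_key(version):
    """Turn '2.70.04-0862' into (2, 70, 4, 862) for numeric comparison."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


package, firmware = split_versions(
    "20.10.1-0077_SAS_2008_FW_Image_APP-2.120.244-1482.zip"
)
installed = "2.70.04-0862"  # version shown in the card's BIOS
print(package, firmware, version_key(installed) < version_key(firmware))
```

If the assumption about the fields is right, the trailing number is the 
one to compare against what the card's BIOS shows.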

>> This card was recommended to me by the list:
>> http://lists.debian.org/debian-user/2012/05/msg00104.html
> Yes, I recommended it.  It's the best card available in its class.

Yes, I'm really thankful for the recommendation.
And somehow I hoped you could jump in and help me :-)
But I didn't know if it's ok to ask you by name.

So thanks already for that too :-)

>> I hope I can get some hints here :-)
> When troubleshooting potential hardware issues, always disconnect
> everything you can to isolate the component you believe may have an
> issue.  If that device still has a problem, work until you resolve that
> problem.  Then add your other hardware back into the system one device
> at a time until you run into the next problem.  Rinse, repeat, until all
> problems are resolved.  Isolating components during testing is the key.
>  This is called "process of elimination" testing--eliminate everything
> but the one device you're currently testing.

Thanks for the advice!
This is what I tried to do. I was at the point where I couldn't 
disconnect anything more. Maybe there are ways to further isolate the 
problem which I couldn't figure out myself.

Best regards
