
Re: LSI MegaRAID SAS 9240-4i hangs system at boot



On 6/12/2012 8:40 AM, Ramon Hofer wrote:
> On Sun, 10 Jun 2012 17:30:08 -0500
> Stan Hoeppner <stan@hardwarefreak.com> wrote:

>> Try the Wheezy installer.  Try OpenSuSE.  Try Fedora.  If any of these
>> work without lockup we know the problem is Debian 6.  However...
> 
> I didn't do this because the LSI worked with the Asus mobo and
> Debian squeeze. And because I couldn't install either OpenSuSE or
> Fedora. But I will give it another try...

Your problem may involve more than just the two variables.  The problem
may be mobo+LSI+distro_kernel, not just mobo+LSI.  This is why I
suggested trying to install other distros.

>> Please call LSI support before you attempt any additional
>> BIOS/firmware updates.

Note I stated "call".  You're likely to get more/better
information/assistance speaking to a live person.

> It sounds like the issue is related to the bootstrap, so to resolve
> the issue you will either have to free up the option ROM space or
> limit the number of devices during POST."

This is incorrect advice, as the hang occurs with the LSI BIOS both
enabled and disabled.  Apparently you didn't convey this in your email.

> This is what you've already told me.
> If I understand it right you already told me to try both: free up the
> option ROM and limit the number of devices, right?

No, this person is not talented.  You only have one HBA with BIOS to
load.  There should be plenty of free memory in the ROM pool area.  This
is the case with any mobo.  The LSI ROM is big, but not so big that it
eats up all available space.  Please don't ask me to explain how option
(i.e. add in card) ROMs are mapped into system memory.  That information
is easily found on Wikipedia and in other places.  My point here is that
the problem isn't related to insufficient space for mapping ROMs.

> You've convinced me: I will mount the expander properly to the case :-)

There are many SAS expanders that can only be mounted to the chassis,
such as this one:

http://www.hellotrade.com/astek-corporation/serial-attached-scsi-expanders-sas-expander-add-in-card.html

> Ok understood. RAID arrays containing partitions are bad.

Not necessarily.  It depends on the system.  In your system they'd serve
no purpose, and simply complicate your storage stack.

> Nono, I was aware that I can have several RAID arrays.
> My initial plan was to use four disks with the same size and have
> several RAID5 devices. 

This is what you should do.  I usually recommend RAID10 for many
reasons, but I'm guessing you need more than half of your raw storage
space.  RAID10 eats 1/2 of your disks for redundancy.  It also has the
best performance by far, and the lowest rebuild times by far.  RAID5
eats 1 disk for redundancy, RAID6 eats 2.  Both are very slow compared
to RAID10, and both have long rebuild times which increase severely as
the number of drives in the array increases.  The drive rebuild time for
RAID10 is the same whether your array has 4 disks or 40 disks.

> But Cameleon from the debian list told me to not
> use such big disks (>500 GB) because reshaping takes too long and
> another failure during reshaping will kill the data. So she proposed to
> use 500 GB partitions and RAID6 with them.

I didn't read the post you refer to, but I'm guessing you misunderstood
what Camaleón stated, as such a thing is simply silly.  Running multiple
md arrays on the same set of disks is also silly, and can be detrimental
to performance.  For a deeper explanation of this see my recent posts to
the Linux-RAID list.

If you're more concerned with double drive failure during rebuild (not
RESHAPE as you stated) than usable space, make 4 drive RAID10 arrays or
4 drive RAID6s, again, without partitions, using the command examples I
provided as a guide.
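
As a sketch only (the md number and drive letters are placeholders for
whatever devices you actually have), a 4 drive RAID6 or RAID10 built
from bare disks looks like this:

~$ mdadm -C /dev/md3 -c 128 -n4 -l6 /dev/sd[ijkl]   <-- 4 drive RAID6
~$ mdadm -C /dev/md3 -c 128 -n4 -l10 /dev/sd[ijkl]  <-- or 4 drive RAID10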

> Is there some documentation why partitions aren't good to use?
> I'd like to learn more :-)

Building md arrays from partitions on disks is a means to an end.  Do
you have an end that requires these means?  If not, don't use
partitions.  The biggest reason to NOT use partitions is misalignment on
advanced format drives.  The partitioning utilities shipped with
Squeeze, AFAIK, don't do automatic alignment on AF drives.

If you misalign the partitions, RAID5/6 performance will drop by a
factor of 4, or more, during RMW operations, i.e. modifying a file or
directory metadata.  The latter case is where you really take the
performance hit as metadata is modified so frequently.  Creating md
arrays from bare AF disks avoids partition misalignment.
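
If you do end up putting partitions on the AF drives anyway, at least
check the alignment afterward.  A sketch, with a placeholder device and
partition number 1:

~$ parted /dev/sdb align-check optimal 1  <-- reports whether partition 1 is aligned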

There have been dozens, maybe hundreds, of articles and blog posts
covering this issue, so I won't elaborate further.

>>> the moment I have four Samsung HD154UI (1.5 TB) and four WD20EARS (2
>>> TB).
>>
>> You create two 4 drive md RAID5 arrays, one composed of the four
>> identical 1.5TB drives and the other composed of the four identical
>> 2TB drives.  Then concatenate the two arrays together into an md
>> --linear array, similar to this:
>>
>> ~$ mdadm -C /dev/md1 -c 128 -n4 -l5 /dev/sd[abcd]  <-- 2.0TB drives
> 
> May I ask what the -c 128 option means? The mdadm man page says that -c
> is to specify the config file?

Read further down, in the "create, build, or grow" section.  There '-c'
is short for '--chunk'.
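
Spelled out in long form, that same create command (same drive letters
as in my example) would read:

~$ mdadm --create /dev/md1 --chunk=128 --raid-devices=4 --level=5 /dev/sd[abcd]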

> 
>> ~$ mdadm -C /dev/md2 -c 128 -n4 -l5 /dev/sd[efgh]  <-- 1.5TB drives
>> ~$ mdadm -C /dev/md0 -n2 -l linear /dev/md[12]
> 
> This is very interesting. I didn't know that this is possible :-o

It's called 'nested RAID' and it's quite common on large scale storage
systems (dozens to hundreds of drives) where any single array type isn't
suitable for such disk counts.

> Does it work as well with hw RAID devices from the LSI card?

Your LSI card is an HBA with full RAID functions.  It is not however a
full blown RAID card--its ASIC is much lower performance and it has no
cache memory.  For RAID1/10 it's probably a toss up at low disk counts
(4-8).  At higher disk counts, or with parity RAID, md will be faster.
But given your target workloads you'll likely not notice a difference.

> Since you tell me that RAIDs with partitions aren't wise I'm thinking
> about creating hw RAID5 devices with four equally sized disks.

If your drives were enterprise units with ERC/TLER I'd say go for it.
However, you have 8 drives of the "green" persuasion.  Hardware RAID
controllers love to kick drives and mark them as "bad" due to timeouts.
The WD Green drives in particular park heads at something like 6
seconds, and spin down the motors automatically at something like 30
seconds.  When accessed, they'll exceed the HBA timeout period before
spinning up and responding, and get kicked from the array.
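
If you want to see for yourself whether a drive supports ERC/TLER,
smartmontools can query it.  A sketch with a placeholder device; the
green drives will most likely report that SCT Error Recovery Control is
unsupported or disabled:

~$ smartctl -l scterc /dev/sdb        <-- show current ERC read/write timeouts
~$ smartctl -l scterc,70,70 /dev/sdb  <-- try to set 7.0 second timeouts (may not be supported)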

I recommended this card in response to your inquiry about a good HBA for
md RAID.  My recommendation was that you use it in HBA mode, not RAID
mode.  It's not going to work well, if at all, with these drives in RAID
mode.  I thought we already discussed this.  Maybe not.

> The -C option means that mdadm creates a new array with the
> name /dev/md1.

It creates it with the <raiddevice> name you specify.  See above.

> Is it wise to use other names, e.g. /dev/md_2T, /dev/md_1T5
> and /dev/md_main?

The md device file names are mostly irrelevant.  But I believe the names
are limited to 'md' and a minor number of 0-127.  And in fact, I believe
the example I gave above may not work, as I said to create md1 and md2
before md0.  mdadm may require the first array you create to be called md0.

Regardless, what you're saying is that you want to be able to tell
which disks an array contains from its name.  If you forget what
md0/1/2/etc is made of, simply run 'mdadm -D /dev/mdX'.

> And is a linear raid array the same as RAID0?

No.  Please see the Wikipedia mdadm page.
http://en.wikipedia.org/wiki/Mdadm

>> Then make a write aligned XFS filesystem on this linear device:
>>
>> ~$ mkfs.xfs -d agcount=11,su=131072,sw=3 /dev/md2
> 
> Are there similar options for jfs?

Dunno.  Never used it, as XFS is superior in every way.  JFS hasn't seen
a feature release since 2004.  It's been in bug fix only mode for 8
years now.  XFS has a development team of about 30 people working at all
the major Linux distros, SGI, and IBM, yes, IBM.  It has seen constant
development since its initial release on IRIX in 1994 and its port to
Linux in the early 2000s.

> I decided to use jfs when I set up the old server because it's easier
> to grow the filesystem.

Easier than what?  EXT?

> But when I see the xfs_grow below I'm not sure if xfs wouldn't be the
> better choice. 

It is, but for dozens more reasons.

> Especially because I read in wikipedia that xfs is
> integrated in the kernel and to use jfs one has to install additional
> packages.

You must have misread something.  The JFS driver was still in mainline
as of 3.2.6, and I'm sure it's still in 3.4 though I've not confirmed
it.  So you can build JFS right into your kernel, or as a module.  I'd
never use it, nor recommend it; I'm just setting the record straight.

> Btw it seems very complicated with all the allocation groups, stripe
> units and stripe width.

Powerful flexibility is often accompanied by a steep learning curve.

> How do you calculate these number?

Beginning users don't.  You use the defaults.  You are confused right
now because I lifted the lid and you got a peek inside more advanced
configurations.  Reading the '-d' section of 'man mkfs.xfs' tells you how
to calculate sunit/swidth, su/sw for different array types and chunk sizes.

Please read the following very carefully.  IF you do not want a single
filesystem spanning both 4 disk arrays, and the future 12 disks you may
install in that chassis, you CAN format each md array with its own
XFS filesystem using the defaults.  In that case, mkfs.xfs will read the
md geometry and create the filesystem with all the correct
parameters--automatically.  So there's nothing to calculate, no confusion.
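
In that simple case it really is a one liner per array, for example
(md names as in my earlier example):

~$ mkfs.xfs /dev/md1   <-- reads the md geometry, sets su/sw by itself
~$ mkfs.xfs /dev/md2   <-- same for the second array

Once each filesystem is mounted, 'xfs_info <mountpoint>' shows the
sunit/swidth values it chose.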

However, you don't want 2 or 6 separate filesystems mounted as something
like:

/data1
...
/data6

in your root directory.  You want one big filesystem mounted in your
root as something like '/data' to create subdirs and put files in,
without worrying about how much space you have left in each of 6
filesystems/arrays.  Correct?

The advanced configuration I previously gave you allows for one large
XFS across all your arrays.  mkfs.xfs is not able to map out the complex
storage geometry of nested arrays automatically, which is why I lifted
the lid and showed you the advanced configuration.

With it you'll get a minimum filesystem bandwidth of ~300MB/s per single
file IO and a maximum of ~600MB/s with 2 or more parallel file IOs, with
two 4-drive arrays.  Each additional 4 drive RAID5 array grown into the
md linear array and then into XFS will add ~300MB/s of parallel file
bandwidth, up to a maximum of ~1.5GB/s.  This should far exceed your needs.
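
When you add that next 4 drive RAID5, the growth path would look
roughly like this (md3 and the drive letters are placeholders, /data is
wherever the XFS filesystem is mounted):

~$ mdadm -C /dev/md3 -c 128 -n4 -l5 /dev/sd[ijkl]  <-- the new 4 drives
~$ mdadm --grow /dev/md0 --add /dev/md3            <-- extend the linear array
~$ xfs_growfs /data                                <-- grow XFS into the new space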

> And why do both arrays have a stripe width of 384 KB?

You already know the answer.  Or you should, anyway:

chunk size            = 128KB
RAID level            = 5
No. of disks          = 4
No. of data disks     = 4-1 = 3
3 * 128KB             = 384KB

> Is it also true that I will get better performance with two hw RAID5
> arrays?

Assuming for a moment your drives will work in RAID mode with the 9240,
which they won't, the answer is no.  Why?  Your CPU cores are far faster
than the ASIC on the 9240, and the board has no battery backed cache RAM
to offload write barriers.

If you step up to one of the higher end full up RAID boards with BBWC,
and the required enterprise drives, then the answer would be yes up to
the 20 drives your chassis can hold.   As you increase the drive count,
at some point md RAID will overtake any hardware RAID card, as the
533-800MHz single/dual core RAID ASIC just can't keep up with the cores
in the host CPU.

> What if I lose a complete raid5 array which was part of the linear
> raid array? Will I lose the whole content from the linear array as I
> would with lvm?

Answer1:  Are you planning on losing an entire RAID5 array?  Planning,
proper design, and proper sparing prevent this.  If you lose a drive,
replace it and rebuild IMMEDIATELY.  Keep a spare drive on hand, or
better yet in standby.  Want to eliminate this scenario?  Use RAID10 or
RAID6, and live with the lost drive space.  And still replace/rebuild a
dead drive immediately.
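
A standby spare with md is nothing more than an extra drive added to a
healthy array; md grabs it automatically when a member dies.  A sketch
with a placeholder drive:

~$ mdadm /dev/md1 --add /dev/sdm  <-- sits as a hot spare until needed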

Answer2:  It depends.  If this were to happen, XFS would shut down the
filesystem.  At that point you run xfs_repair.  If the array
that died contained the superblock and AG0 you've probably lost
everything.  If it did not, the repair may simply shrink the filesystem
and repair any damaged inodes, leaving you with whatever was stored on
the healthy RAID5 array.
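
If you ever find yourself there, run xfs_repair in no-modify mode first
to see what it would do (the filesystem must be unmounted; md0 is the
linear device from the earlier examples):

~$ xfs_repair -n /dev/md0  <-- check only, report what would be repaired
~$ xfs_repair /dev/md0     <-- actually repair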

> I'm still aware that 3 TB raid5 rebuilds take long. 

3TB drive rebuilds take forever, period.  As I mentioned, it takes ~8
hours to rebuild a mirror.

> Nevertheless I think
> I will risk using normal (non-green) disks for the next expansion.

What risk?  Using 'normal' drives avoids the RAID related problems that
green drives tend to cause.

> If I'm informed correctly there are not only green drives and normal
> desktop drives but also server disks with a higher quality than
> desktop disks.

Yes, and higher performance.  They're called "enterprise" drives.  There
are many enterprise models: 7.2K SATA/SAS, 10K SATA/SAS, 15K SAS, in
both 2.5" and 3.5" form factors.

> But still I don't want to "waste" energy. 

Manufacturing a single drive consumes as much energy as 4 drives running
for 3 years.  Green type drives tend to last half as long due to all the
stop/start cycles wearing out the spindle bearings.  Do the math.  The
net energy consumption of 'green' drives is therefore equal to or higher
than 'normal' drives.  The only difference is that a greater amount of
power is consumed by the drive before you even buy it.  The same
analysis is true of CFL bulbs.  They consume more total energy through
their life cycle than incandescents.

> Would the Seagate Barracuda
> 3TB disks be a better choice?

Is your 10.5TB full already?  You don't even have the system running yet...

> My needs are probably *much* less demanding than yours.
> Usually it only has to do read access to the files. Additionally copying
> bluray rips to it. But most of the time it sits around doing nothing
> (the raid). MythTV records almost all of the time but to a non RAID
> disk.
> So I hope with non-green 3 TB disks I can get some security from the
> redundancy and still get a lot of disk space.

If you have a good working UPS, good airflow (that case does), and
decent quality drives, you shouldn't have to worry much.  I'm unsure of
the quality of the 3TB Barracuda, haven't read enough about it.

Are you planning on replacing all your current drives with 4x 3TB
drives?  Or going with the linear over RAID5 architecture I recommended,
and adding 4x 3TB drives into the mix?

> This was exactly what I had in mind in the first place. But the
> suggestion from Cameleon was so tempting :-)

Cameleon helps many people with many Debian/Linux issues and is very
knowledgeable in many areas.  But I don't recall anyone accusing her of
being a storage architect. ;)

> Btw I have another question:
> Is it possible to attach the single (non raid) disk I now have in my old
> server for the mythtv recordings to the LSI controller and still have
> access to the content when it's configured as jbod?
> Since these are recordings that wouldn't be very bad to lose, I'd
> like to avoid backing them up.

Drop it in a drive sled, plug it into the backplane, and find out.  If
you configure it for JBOD the LSI shouldn't attempt writing any metadata
to it.

-- 
Stan


