
Re: LSI MegaRAID SAS 9240-4i hangs system at boot



On 6/13/2012 2:22 PM, Ramon Hofer wrote:
> On Tue, 12 Jun 2012 17:30:43 -0500
> Stan Hoeppner <stan@hardwarefreak.com> wrote:

This chain is so long I'm going to liberally snip lots of stuff already
covered.  Hope that's ok.

>> Note I stated "call".  You're likely to get more/better
>> information/assistance speaking to a live person.
> 
> I didn't have enough confidence in my oral english :-(

Understood.  Didn't realize that could be an issue.  Apologies for my
"cultural insensitivity". ;)

>> This is incorrect advice, as it occurs with the LSI BIOS both enabled
>> and disabled.  Apparently you didn't convey this in your email.
> 
> I will write it to them again.
> But to be honest I think I'll leave the Supermicro and use it for my
> Desktop.

If you're happy with an Asus+LSI server and SuperMicro PC, and it all
works the way you want, I'd not bother with further troubleshooting either.

>> Building md arrays from partitions on disks is a means to an end.  Do
>> you have an end that requires these means?  If not, don't use
>> partitions.  The biggest reason to NOT use partitions is misalignment
>> on advanced format drives.  The partitioning utilities shipped with
>> Squeeze, AFAIK, don't do automatic alignment on AF drives.
> 
> Ok, I was just confused because most the tutorials (or at least most of
> the ones I found) use partitions over the whole disk...

Most of the md tutorials were written long before AF drives became
widespread, which is a relatively recent phenomenon, only within the
last 2 years or so.

It seems md atop partitions is recommended by two classes of users:

1.  Ultra cheap bastards who buy "drive of the week".
2.  Those who want to boot from disks in an md array.

I'd rather not fully explain this due to space.  If you reread your
tutorials and other ones, you'll start to understand.

>> If you misalign the partitions, RAID5/6 performance will drop by a
>> factor of 4, or more, during RMW operations, i.e. modifying a file or
>> directory metadata.  The latter case is where you really take the
>> performance hit as metadata is modified so frequently.  Creating md
>> arrays from bare AF disks avoids partition misalignment.
> 
> So if I can make things simpler I'm happy :-)

Simpler is not always better, but it is most of the time.
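
For illustration only (the device names here are placeholders, not your
actual drives), creating the array straight on bare disks just means
handing mdadm the whole-disk device nodes, with no fdisk/parted step at
all.  Something like:

    ~$ mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[bcde]

No partition table means no partition offset, so there's nothing to
misalign on AF drives.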

The only caveat to using md on bare drives is that all members should
ideally be of identical size.  If they're not, md takes the sector count
of the smallest drive and uses that number of sectors on all the others.
If you try to add a drive later whose sector count is less, it won't
work.  This is where the "drive of the week" buyer gets bitten. ;)
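
If you want to check this before building an array or adding a member,
compare sector counts first, e.g. (again, substitute your own devices):

    ~$ blockdev --getsz /dev/sdb
    ~$ blockdev --getsz /dev/sdc

Any drive reporting fewer sectors than the existing members won't go
into that array.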

More savvy users don't add drives to and reshape their arrays.  They add
an entire new array, add it to an existing umbrella linear array, then
grow their XFS filesystem over it.  There is zero downtime or degraded
access to current data with this method.  Reshaping runs for a day or
more and data access, especially writes, is horribly slow during the
process.
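
Roughly, that expansion looks like the sketch below.  The md device
names and member disks are placeholders, and it assumes the umbrella
linear array and the XFS on it already exist:

    ~$ mdadm --create /dev/md2 --level=5 --raid-devices=4 /dev/sd[fghi]
    ~$ mdadm --grow /dev/md10 --add /dev/md2
    ~$ xfs_growfs /your/mountpoint

The existing arrays keep serving data at full speed the whole time.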

Misguided souls who measure their array performance exclusively with
single stream 'dd' reads instead of real workloads will balk at this
approach.  They're also the crowd that promotes using md over
partitions.  ;)

> You're right.
> I just had the impression that you'd suggested that I'd use the hw raid
> capability of the lsi at the beginning of this conversation.

I did.  And if you could, you should.  And you did HW RAID with the SM
board, but the Debian kernel locks up.  With the Asus board you can't
seem to get into the HBA BIOS to configure HW RAID.  So it's really not an
option now.  The main reason for it is automatic rebuild on failure.
But since you don't have dedicated spare drives that advantage goes out
the window.  So md RAID is fine.

> I must have read outdated wikis (mostly from the mythtv project).

Trust NASA more than MythTV users?  From:
http://www.nas.nasa.gov/hecc/resources/columbia.html

Storage
    Online: DataDirect Networks® and LSI® RAID, 800 TB (raw)
        ...
        Local SGI XFS

That 800TB is carved up into a handful of multi-hundred TB XFS
filesystems.  It's mostly used for scratch space during sim runs.  They
have a multi-petabyte CXFS filesystem for site wide archival storage.
NASA is but one of many sites with multi-hundred TB XFS filesystems
spanning hundreds of disk drives.

IBM unofficially abandoned JFS on Linux, which is why it hasn't seen a
feature release since 2004.  Enhanced JFS, called JFS2, is proprietary,
and is only available on IBM pSeries servers.

MythTV users running JFS are simply unaware of these facts, and use JFS
because it still works for them, and that's great.  Choice and freedom
are good things.  But if they're stating it's better than XFS they're
hitting the crack pipe too often. ;)

> Translated: Since kernel version 2.6 it's an official part of the
> kernel.
> 
> Maybe I misunderstood this sentence in what the writer meant or maybe
> it's even wrong what they wrote in the first place :-?

What they wrote is correct.  JFS has been in Linux mainline since the
release of Linux 2.6, which was ... December 2003, 8.5 years ago.  Then
IBM abandoned Linux JFS not long after.

> Ok if I read it right it divides the array into 11 allocation groups,
> with 131072 byte blocks and 3 stripe units as stripe width.
> But where do you know what numbers to use?
> Maybe I didn't read the man carefully enough then I'd like to
> appologize :-)

'man mkfs.xfs' won't tell you how to calculate how many AGs you need.
mkfs.xfs creates agcount and agsize automatically using an internal
formula unless you manually specify valid values.  Though I can tell you
how it works.  Note: the current max agsize=1TB

1.  Defaults to 4 AGs if the device is < 4TB and not a single level
    md striped array.  This is done with single disks, linear arrays,
    hardware RAIDs, SANs.  Linux/XFS have no standard interface to
    query hardware RAID device parms.  There's been talk of an
    industry standard interface but no publication/implementation.
    So for hardware RAID you may need to set some parms manually for best
    performance.  You can always use mkfs.xfs defaults and it will work.
    You simply don't get all the performance of the hardware.

2.  If device is a single level md striped array, AGs=16, unless the
    device size is > 16TB.  In that case AGs=device_size/1TB.

3.  What 'man mkfs.xfs' does tell you is how to manually configure the
    stripe parms.  It's easy.  You match the underlying RAID parms.
    E.g. 16 drive RAID 10 with 64KB chunk.  RAID 10 has n/2 stripe
    spindles.  16/2 = 8

    ~$ mkfs.xfs -d su=64k,sw=8 /dev/sda

    E.g. 8 drive RAID6 with 128KB chunk.  RAID6 has n-2 stripe
    spindles.  8-2 = 6

    ~$ mkfs.xfs -d su=128k,sw=6 /dev/sda

    E.g. 3 drive RAID5 with 256KB chunk.  RAID5 has n-1 stripe
    spindles.  3-1 = 2

    ~$ mkfs.xfs -d su=256k,sw=2 /dev/sda

The above are basic examples and we're letting mkfs.xfs choose the
number of AGs based on total capacity.  You typically only specify
agcount or agsize manually in advanced configurations when you're tuning
XFS to a storage architecture for a very specific application workload,
such as a high IOPS maildir server.  I've posted examples of such
advanced storage architectures and mkfs.xfs previously on the dovecot
and XFS lists if you care to search for them.  In them I show how to
calculate a custom agcount to precisely match the workload IO pattern to
each disk spindle, using strictly allocation group layout to achieve
full workload concurrency without any disk striping, only mirroring.
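
If you want to see what mkfs.xfs chose, or verify values you set by
hand, xfs_info on the mounted filesystem reports agcount, agsize, sunit
and swidth (the mount point below is a placeholder):

    ~$ xfs_info /your/mountpoint

Note that sunit/swidth are printed in filesystem blocks, so e.g. a
128KB su on 4KB blocks shows up as sunit=32.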

>> The advanced configuration I previously gave you allows for one large
>> XFS across all your arrays.  mkfs.xfs is not able to map out the
>> complex storage geometry of nested arrays automatically, which is why
>> I lifted the lid and showed you the advanced configuration.
> 
> Ok, this is very nice!
> But will it also work for any disk size (1.5, 2 and 3 TB drives)?

All of the disks in each md array should be the same size, preferably
identical disks from the same vendor, for the best outcome.  But each
array can use different size disks, such as what you have now.  One
array of 4x1.5TB, another array of 4x2TB.  Your next array could be
4x1TB or 4x3TB.  You could go with more or fewer drives per array, but
if you do it will badly hose your XFS stripe alignment, and performance
to the new array will be so horrible that you will notice it, big time,
even though your performance needs are low.  Stick to adding sets of 4
drives with the same md RAID5 parms and you'll be happy.  Deviate from
that and you'll be very sad, ask me for help, and then I'll be angry,
as it's impossible to undo this design and start over.  This isn't
unique to XFS.
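
If you can't remember the parms of the existing arrays when it's time
to add the next set, mdadm will tell you (placeholder device name):

    ~$ mdadm --detail /dev/md0 | grep -E 'Level|Chunk|Raid Devices'

Then create the new array with the same level, device count, and chunk
size.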

>> chunk size            = 128KB
> 
> This is what I don't know.
> Is this a characteristic of the disk?

No.  I chose this based on your workload description.  The mdadm default
is 64KB.  Different workloads work better with different chunk sizes.
There is no book or table with headings "workload" and "chunk size" to
look at.  People who set a manual chunk/strip size either have a lot of
storage education, either self or formal, or they make an educated
guess--or both.  Multi-streaming video capture to high capacity drives
typically works best with an intermediate strip/chunk size with few
stripe members in the array.  If you had 8 drives per array I'd have
left it at 64KB, the default.  I'm sure you can find many
recommendations on strip/stripe size in the MythTV forums.  They may
vary widely, but if you read enough posts you'll find a rough consensus.
 And it may even contradict what I've recommended.  I've never used
MythTV.  My recommendation is based on general low level IO for
streaming video.
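
To make that concrete (a sketch only, with placeholder device names), a
4 drive RAID5 with a 128KB chunk would be created like this, and the
filesystem on top would then use su=128k,sw=3 per the n-1 rule above:

    ~$ mdadm --create /dev/md1 --level=5 --raid-devices=4 \
         --chunk=128 /dev/sd[bcde]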

> Just another question: The linear raid will distribute the data to the
> containing raid5 arrays?

Unfortunately you jumped the gun and created your XFS atop a single
array, but with the agcount I gave you for the two arrays combined.  As
I mentioned in a previous reply (which was off list I think), you now
have too many AGs.  To answer your question, the first dir you make is
created in AG0, the second dir in AG1, and so on, until you hit AG11.
The next dir you make will be in AG0 and the cycle begins anew.

Since you're copying massive dir counts and files to the XFS, your files
aren't being spread across all 6 drives of two RAID5s.  Once you've
copied all the data over, wiped those 1.5s, created an md RAID5, grown
it into the linear array, and grown XFS, only the new dirs and files you
create AFTER the grow operation will be able to hit the new set of 3
disks.  On top of that, because your agcount is way too high, XFS will
continue creating new dirs and files in the original RAID5 array until
it fills up.  At that point it will write all new stuff to the second
RAID5.

This may not be a problem as you said your performance needs are very
low.  But that's not the way I designed it for you.  I was working under
the assumption you would have both RAID5s available from the beginning.
 If that had been so, your dirs/files would have been spread fairly
evenly over all 6 disks of the two RAID5 arrays, and only the 3rd future
array would get an unbalanced share.

> Or will it fill up the first one and continue with the second and so on?

I already mostly answered this above.  Due to what has transpired it
will behave more in this fashion than by the design, which would have
given a fairly even spread across all disks.

>> Manufacturing a single drive consumes as much energy as 4 drives
>> running for 3 years.  Green type drives tend to last half as long due
>> to all the stop/start cycles wearing out the spindle bearings.  Do
>> the math.  The net energy consumption of 'green' drives is therefore
>> equal to or higher than 'normal' drives.  The only difference is that
>> a greater amount of power is consumed by the drive before you even
>> buy it.  The same analysis is true of CFL bulbs.  They consume more
>> total energy through their life cycle than incandescents.
> 
> Hmm, I knew that for hybrid cars but never thought about this for
> hdds.

Take a tour with me...

Drive chassis are made from cast aluminum ingots with a CNC machine.
Melting point of Al is 660 °C.

Drive platters are made of glass and aluminum, and coated with a
specially formulated magnetic film.
Melting point of Si is 1400 °C.

It takes a tremendous amount of natural gas or electricity--depending on
the smelting furnace type--to generate the 660 °C and 1400 °C temps
needed to melt these materials.  Then you burn the fuel to ship the
ingots and platters from the foundries to the drive factories, possibly
an overseas trip.  Then you have all the electricity consumed by the
milling machines, stamping/pressing machines, screw guns, etc.  Then we
have factory lighting, air conditioning, particle filtration systems,
etc.  Then we have fuel consumed by the cars and buses that transport
the workforce to and from the factory, the trucks that drive the pallets
of finished HDDs to the port, the cranes that load them on the ships,
and the fuel the ships burn bringing the drives from Thailand,
Singapore, and China to Northern Europe and the US.

As with any manufacturing, there is much energy consumption involved.

>>> Would the Seagate Barracuda
>>> 3TB disks be a better choise?
>>
>> Is your 10.5TB full already?  You don't even have the system running
>> yet...
> 
> No, but I like living in the future ;-)

It may be 2-3 years before you need new drives.  All the current models
and their reputations will have changed by then.  Ask a few months
before your next drive purchase.  Now is too early.

> I'm planning to keep the drives I have now and add 4x 3TB into the mix.

The more flashing LEDs the better. :)  Fill-r-up.

> Her suggestion seemed very tempting because it would give me a raid6
> without having to loose too much storage space.
> She really knows a lot so I was just happy with her suggesting me this
> setup.

That's the "ghetto" way of getting what you wanted.  And there are many
downsides to it, which is why I suggested a much better, more sane way.

> Again thanks alot for all your help and your patience with me.
> Certainly not always easy ;-)

You're very welcome Ramon.

Nah, if I've seemed short or frustrated at times, that's my fault, not
yours.  I'm glad to help.  And besides, I was obliged to because the
unique hardware combo I personally, specifically, recommended wasn't
working for you, in a mobo it *should* work in.  And I recommended the
md linear over stripe w/XFS, and nobody here would have been able to
help you with that either.  But in the end you're going to have
something unique that fits your needs, performs well enough, and is
built from best of breed hardware and software.  So you can be proud of
this solution and show it off if you have like-minded friends.  And you
can tell them you're running what NASA supercomputers run. ;)

Sorry I didn't meet my goal of making this shorter than previous replies. ;)

-- 
Stan

