
Re: LSI MegaRAID SAS 9240-4i hangs system at boot



On Thu, 14 Jun 2012 03:29:25 -0500
Stan Hoeppner <stan@hardwarefreak.com> wrote:

> On 6/13/2012 2:22 PM, Ramon Hofer wrote:
> > On Tue, 12 Jun 2012 17:30:43 -0500
> > Stan Hoeppner <stan@hardwarefreak.com> wrote:
> 
> This chain is so long I'm going to liberally snip lots of stuff
> already covered.  Hope that's ok.

Sure. Your mail still blew my mind :-)


> >> This is incorrect advice, as it occurs with the LSI BIOS both
> >> enabled and disabled.  Apparently you didn't convey this in your
> >> email.
> > 
> > I will write it to them again.
> > But to be honest I think I'll leave the Supermicro and use it for my
> > Desktop.
> 
> If you're happy with an Asus+LSI server and SuperMicro PC, and it all
> works the way you want, I'd not bother with further troubleshooting
> either.

Well, the only differences are:

1. I can't enter the LSI BIOS to set up hardware RAID, which I don't
need anyway. So no problem.

2. I can't see the network activity LEDs in the front of the case,
which is a gadget I don't really need. If there are problems I can
check the mobo LEDs for LAN activity. So no problem either.


> >> Building md arrays from partitions on disks is a means to an end.
> >> Do you have an end that requires these means?  If not, don't use
> >> partitions.  The biggest reason to NOT use partitions is
> >> misalignment on advanced format drives.  The partitioning
> >> utilities shipped with Squeeze, AFAIK, don't do automatic
> >> alignment on AF drives.
> > 
> > Ok, I was just confused because most of the tutorials (or at least
> > most of the ones I found) use partitions over the whole disk...
> 
> Most of the md tutorials were written long before AF drives became
> widespread, which has been a relatively recent phenomenon, the last 2
> years or so.

AF drives are Advanced Format drives with more than 512 bytes per
physical sector, right?
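
I guess I could check this on my disks with something like this (sda
is just an example device name):

~$ cat /sys/block/sda/queue/physical_block_size
~$ cat /sys/block/sda/queue/logical_block_size

If the first one says 4096 and the second 512, it should be a
512-emulation AF drive, if I understand it correctly.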


> > I must have read outdated wikis (mostly from the mythtv project).
> 
> Trust NASA more than MythTV users?  From:
> http://www.nas.nasa.gov/hecc/resources/columbia.html

I don't trust anybody ;-)


> Storage
>     Online: DataDirect Networks® and LSI® RAID, 800 TB (raw)
>         ...
>         Local SGI XFS
> 
> That 800TB is carved up into a handful of multi-hundred TB XFS
> filesystems.  It's mostly used for scratch space during sim runs.
> They have a multi-petabyte CXFS filesystem for site wide archival
> storage. NASA is but one of many sites with multi-hundred TB XFS
> filesystems spanning hundreds of disk drives.
> 
> IBM unofficially abandoned JFS on Linux, which is why it hasn't seen a
> feature release since 2004.  Enhanced JFS, called JFS2, is
> proprietary, and is only available on IBM pSeries servers.
> 
> MythTV users running JFS are simply unaware of these facts, and use
> JFS because it still works for them, and that's great.  Choice and
> freedom are good things.  But if they're stating it's better than XFS
> they're hitting the crack pipe too often. ;)

Here's what I was referring to:
http://www.mythtv.org/docs/mythtv-HOWTO-3.html

"Filesystems

MythTV creates large files, many in excess of 4GB. You must use a 64 or
128 bit filesystem. These will allow you to create large files.
Filesystems known to have problems with large files are FAT (all
versions), and ReiserFS (versions 3 and 4).

Because MythTV creates very large files, a filesystem that does well at
deleting them is important. Numerous benchmarks show that XFS and JFS
do very well at this task. You are strongly encouraged to consider one
of these for your MythTV filesystem. JFS is the absolute best at
deletion, so you may want to try it if XFS gives you problems. MythTV
incorporates a "slow delete" feature, which progressively shrinks the
file rather than attempting to delete it all at once, so if you're more
comfortable with a filesystem such as ext3 (whose delete performance
for large files isn't that good) you may use it rather than one of the
known-good high-performance file systems. There are other ramifications
to using XFS and JFS - neither offer the opportunity to shrink a
filesystem; they may only be expanded.

NOTE: You must not use ReiserFS v3 for your recordings. You will get
corrupted recordings if you do.

Because of the size of the MythTV files, it may be useful to plan for
future expansion right from the beginning. If your case and power
supply have the capacity for additional hard drives, read through the
Advanced Partition Formatting sections for some pointers."


So they say XFS and JFS are about equally good. But this page must be
several years old, at least without any changes to this paragraph.

I additionally found a forum post from four years ago where someone
states that XFS has problems when the power supply is interrupted:
http://www.linuxquestions.org/questions/linux-general-1/xfs-or-jfs-685745/#post3352854

"I only advise XFS if you have any means to guarantee uninterrupted
power supply. It's not the most resistant fs when it comes to power
outages."

I usually don't have blackouts, at least not ones long enough that the
PC turns off. But I don't have a UPS.


> > Ok if I read it right it divides the array into 11 allocation
> > groups, with 131072 byte blocks and 3 stripe units as stripe width.
> > But how do you know what numbers to use?
> > Maybe I didn't read the man carefully enough; then I'd like to
> > apologize :-)
> 
> 'man mkfs.xfs' won't tell you how to calculate how many AGs you need.
> mkfs.xfs creates agcount and agsize automatically using an internal
> formula unless you manually specify valid values.  Though I can tell
> you how it works.  Note: the current max agsize=1TB

This is very interesting. I hope I get everything right :-)
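
(I found that I can also see what mkfs.xfs actually chose on an
existing filesystem with something like:

~$ xfs_info /mnt/storage

where /mnt/storage is just my guess for the mount point. If I read the
man page right it prints agcount, agsize and the sunit/swidth values.)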


> 1.  Defaults to 4 AGs if the device is < 4TB and not a single level
>     md striped array.  This is done with single disks, linear arrays,
>     hardware RAIDs, SANs.  Linux/XFS have no standard interface to
>     query hardware RAID device parms.  There's been talk of an
>     industry standard interface but no publication/implementation.
>     So for hardware RAID you may need to set some parms manually for
>     best performance.  You can always use mkfs.xfs defaults and it
>     will work.  You simply don't get all the performance of the
>     hardware.

I will get better performance if I have the correct parameters.


> 2.  If device is a single level md striped array, AGs=16, unless the
>     device size is > 16TB.  In that case AGs=device_size/1TB.

So a single level md striped array is any Linux RAID built directly
from disks, like my RAID5?
In contrast, my linear array containing one or more RAID5s would not
be one?
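
(I suppose I can check how md reports my arrays with:

~$ cat /proc/mdstat

which should show "linear" for the concatenated array and "raid5" plus
the chunk size for the striped arrays, if I read it right.)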


> 3.  What 'man mkfs.xfs' does tell you is how to manually configure the
>     stripe parms.  It's easy.  You match the underlying RAID parms.
>     E.g. 16 drive RAID 10 with 64KB chunk.  RAID 10 has n/2 stripe
>     spindles.  16/2 = 8
> 
>     ~$ mkfs.xfs -d su=64k,sw=8 /dev/sda
> 
>     E.g. 8 drive RAID6 with 128KB chunk.  RAID6 has n-2 stripe
>     spindles.  8-2 = 6
> 
>     ~$ mkfs.xfs -d su=128k,sw=6 /dev/sda
> 
>     E.g. 3 drive RAID5 with 256KB chunk.  RAID5 has n-1 stripe
>     spindles.  3-1 = 2
> 
>     ~$ mkfs.xfs -d su=256k,sw=2 /dev/sda
> 
> The above are basic examples and we're letting mkfs.xfs choose the
> number of AGs based on total capacity.  You typically only specify
> agcount or agsize manually in advanced configurations when you're
> tuning XFS to a storage architecture for a very specific application
> workload, such as a high IOPS maildir server.  I've posted examples
> of these advanced storage architectures and mkfs.xfs previously on the
> dovecot and XFS lists if you care to search for them.  In them I show
> how to calculate a custom agcount to precisely match the workload IO
> pattern to each disk spindle, using strictly allocation group layout
> to achieve full workload concurrency without any disk striping, only
> mirroring.

Ok, the chunk (= stripe unit) size is already set to 128 kB when
creating the RAID5 with the command you provided earlier:

~$ mdadm -C /dev/md1 -c 128 -n4 -l5 /dev/sd[abcd]

Then the mkfs.xfs parameters are adapted to this.
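
If I got the rule right, my 4-disk RAID5 has 4-1 = 3 stripe spindles,
so with the 128 kB chunk the stripe part would be something like this
(leaving out the agcount for the linear setup, and the md device name
is just my guess):

~$ mkfs.xfs -d su=128k,sw=3 /dev/md1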


> >> The advanced configuration I previously gave you allows for one
> >> large XFS across all your arrays.  mkfs.xfs is not able to map out
> >> the complex storage geometry of nested arrays automatically, which
> >> is why I lifted the lid and showed you the advanced configuration.
> > 
> > Ok, this is very nice!
> > But will it also work for any disk size (1.5, 2 and 3 TB drives)?
> 
> All of the disks in each md array should be the same size,
> preferably identical disks from the same vendor, for the best
> outcome.  But each array can use different size disks, such as what
> you have now.  One array of 4x1.5TB, another array of 4x2TB.  Your
> next array could be 4x1TB or 4x3TB.  You could go with more or fewer
> drives per array, but if you do it will badly hose your xfs stripe
> alignment, and performance to the new array will be so horrible that
> you will notice it, big time, even though you need no performance.
> Stick to adding sets of 4 drives with the same md RAID5 parms and
> you'll be happy.  Deviate from that, and you'll be very sad, ask me
> for help, and then I'll be angry, as it's impossible to undo this
> design and start over.  This isn't unique to XFS.

I'll try not to make you angry :-)


> >> chunk size            = 128KB
> > 
> > This is what I don't know.
> > Is this a characteristic of the disk?
> 
> No.  I chose this based on your workload description.  The mdadm
> default is 64KB.  Different workloads work better with different
> chunk sizes. There is no book or table with headings "workload" and
> "chunk size" to look at.  People who set a manual chunk/strip size
> either have a lot of storage education, either self or formal, or
> they make an educated guess--or both.  Multi-streaming video capture
> to high capacity drives typically works best with an intermediate
> strip/chunk size with few stripe members in the array.  If you had 8
> drives per array I'd have left it at 64KB, the default.  I'm sure you
> can find many recommendations on strip/stripe size in the MythTV
> forums.  They may vary widely, but if you read enough posts you'll
> find a rough consensus. And it may even contradict what I've
> recommended.  I've never used MythTV.  My recommendation is based on
> general low level IO for streaming video.

Ok, cool!
Probably at some point I will understand how to choose chunk sizes. In
the meantime I will just be happy with the number you provided :-)

Btw: I wasn't clear about MythTV. I don't use the RAID for the
recordings; I have a separate disk just for them.
Everyone recommends not using RAID for the recordings, but to be
honest I don't remember the reason anymore :-(

The raid is used for my music and video collection. Of course
everything is owned by me and backed up to disk.

And I also use the raid for backups of the mythtv database and many
other backups.

But by far its main use is streaming multimedia content. So the
backups can be neglected.


> > Just another question: The linear raid will distribute the data to
> > the containing raid5 arrays?
> 
> Unfortunately you jumped the gun and created your XFS atop a single
> array, but with the agcount I gave you for the two arrays combined.
> As I mentioned in a previous reply (which was off list I think), you
> now have too many AGs.  To answer your question, the first dir you
> make is created in AG0, the second dir in AG1, and so on, until you
> hit AG11. The next dir you make will be in AG0 and the cycle begins anew.
> 
> Since you're copying massive dir counts and files to the XFS, your
> files aren't being spread across all 6 drives of two RAID5s.  Once
> you've copied all the data over, when you wipe those 1.5s, create an
> md RAID5, grow it into the linear array, and grow XFS, only new dirs and
> files you create AFTER the grow operation will be able to hit the new
> set of 3 disks.  On top of that, because your agcount is way too
> high, XFS will continue creating new dirs and files in the original
> RAID5 array until it fills up.  At that point it will write all new
> stuff to the second RAID5.
> 
> This may not be a problem as you said your performance needs are very
> low.  But that's not the way I designed it for you.  I was working
> under the assumption you would have both RAID5s available from the
> beginning. If that had been so, your dirs/files would have been
> spread fairly evenly over all 6 disks of the two RAID5 arrays, and
> only the 3rd future array would get an unbalanced share.

This may really be no problem. But since I have an expert at hand and
am starting the storage from scratch, I want to do it right :-)

I stopped the copy process and will create the XFS again with the
correct number of AGs. Would 6 be a good number for the linear array
containing the single RAID5 with 4x 2TB disks?

XFS seems really intelligent. So it spreads the load if it can, but it
won't shuffle existing data around when a new disk, or in my case a
new RAID5, is added?
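
If I understood your earlier instructions right, the plan would then
be something like this (the device names and the mount point are just
my guesses, md0 being the linear array on top of the md1 RAID5):

~$ mkfs.xfs -f -d agcount=6,su=128k,sw=3 /dev/md0

And later, when the 4x 1.5TB RAID5 (say md2) exists, grow the linear
array and then the filesystem:

~$ mdadm --grow /dev/md0 --add /dev/md2
~$ xfs_growfs /mnt/storage

Please correct me if I got any of that wrong.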


> >> Manufacturing a single drive consumes as much energy as 4 drives
> >> running for 3 years.  Green type drives tend to last half as long
> >> due to all the stop/start cycles wearing out the spindle
> >> bearings.  Do the math.  The net energy consumption of 'green'
> >> drives is therefore equal to or higher than 'normal' drives.  The
> >> only difference is that a greater amount of power is consumed by
> >> the drive before you even buy it.  The same analysis is true of
> >> CFL bulbs.  They consume more total energy through their life
> >> cycle than incandescents.
> > 
> > Hmm, I knew that for hybrid cars but never thought about this for
> > hdds.
> 
> Take a tour with me...
> 
> Drive chassis are made from cast aluminum ingots with a CNC machine
> Melting point of Al is 660 °C
> 
> Drive platters are made of glass and aluminum, and coated with a
> specially formulated magnetic film.
> Melting point of Si is 1400 °C
> 
> It takes a tremendous amount of natural gas or electricity, depending
> on the smelting furnace type, to generate the 660 °C and 1400 °C temps
> needed to melt these materials.  Then you burn the fuel to ship the
> ingots and platters from the foundries to the drive factories,
> possibly an overseas trip.  Then you have all the electricity
> consumed by the milling machines, stamping/pressing machines, screw
> guns, etc.  Then we have factory lighting, air conditioning, particle
> filtration systems, etc.  Then we have fuel consumed by the cars and
> buses that transport the workforce to and from the factory, the
> trucks that drive the pallets of finished HDDs to the port, the
> cranes that load them on the ships, and the fuel the ships burn
> bringing the drives from Thailand, Singapore, and China to Northern
> Europe and the US.
> 
> As with any manufacturing, there is much energy consumption involved.

This is very convincing.
But I thought that a green drive lives at least as long as a normal
drive, or even longer, because it *should* wear less since it's asleep
more often. If that assumption were correct, the same amount of energy
would be used to produce the disks, but they would use less energy
during operation and would have to be replaced less often, so the
production energy would have to be spent less frequently.
If all of this were true, I would be willing to pay the price of lower
performance and a higher rate of RAID problems.

But I believe you that these disks don't live as long as normal drives.
So the picture changes and I won't buy green drives again :-)


> >>> Would the Seagate Barracuda
> >>> 3TB disks be a better choise?
> >>
> >> Is your 10.5TB full already?  You don't even have the system
> >> running yet...
> > 
> > No, but I like living in the future ;-)
> 
> It may be 2-3 years before you need new drives.  All the current
> models and their reputations will have changed by then.  Ask a few
> months before your next drive purchase.  Now is too early.

True.
I will gladly do :-)


> > I'm planning to keep the drives I have now and add 4x 3TB into the
> > mix.
> 
> The more flashing LEDs the better. :)  Fill-r-up.

Maybe I will solder something up so the LAN LEDs in the front of the
case work too :-D


> > Again thanks alot for all your help and your patience with me.
> > Certainly not always easy ;-)
> 
> You're very welcome Ramon.
> 
> Nah, if I've seemed short or frustrated at times, that's my fault, not
> yours.  I'm glad to help.  And besides, I was obliged to because the
> unique hardware combo I personally, specifically, recommended wasn't
> working for you, in a mobo it *should* work in.  And I recommended the
> md linear over stripe w/XFS, and nobody here would have been able to
> help you with that either.  But in the end you're going to have
> something unique that fits your needs, performs well enough, and is
> built from best of breed hardware and software.  So you can be proud
> of this solution, show it off if you have like-minded friends.  And
> you can
> tell them you're running what NASA supercomputers run. ;)
> 
> Sorry I didn't meet my goal of making this shorter than previous
> replies. ;)

I never was frustrated because of your help. If I was unhappy, it was
only because of my lack of knowledge and luck.

If you weren't here to suggest things and help me, in the worst case I
would have ended up with a system I couldn't use, or with one that
eats my data (because of the Supermicro AOC-SASLP-MV8 controllers I
initially had).

In the end I'm very happy and proud of my system. Of course I show it
to my friends and they are jealous for sure :-)


So thanks very much again and please let me know how I can buy you a
beer or two!


Cheers
Ramon

