
Re: LSI MegaRAID SAS 9240-4i hangs system at boot



On 6/14/2012 8:02 AM, Ramon Hofer wrote:

> AF drives are Advanced Format drives with more than 512 bytes per
> sector right?

Correct.  Advanced Format is the industry wide name chosen for drives
that have 4096B physical sectors, but present 512B sectors at the
interface level, doing translation internally, "transparently".
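
If you want to check whether a given drive is AF, the kernel exposes
both sector sizes in sysfs.  A quick look (the device name here is just
an example):

~$ cat /sys/block/sdb/queue/logical_block_size
~$ cat /sys/block/sdb/queue/physical_block_size

A 512e AF drive should report 512 for the logical size and 4096 for the
physical size; an old-style drive reports 512 for both.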

> I don't trust anybody ;-)

Good for you! :)

> Here's what I was referring to:
> http://www.mythtv.org/docs/mythtv-HOWTO-3.html

> JFS is the absolute best at
> deletion, so you may want to try it if XFS gives you problems.

Interesting.  Let's see:

~$ time dd if=/dev/zero of=myth-test bs=8192 count=512000
512000+0 records in
512000+0 records out
4194304000 bytes (4.2 GB) copied, 50.1455 s, 83.6 MB/s

real    0m50.167s
user    0m1.560s
sys     0m43.915s

-rw-r--r--  1 root root 4.0G Jun 15 04:52 myth-test

~$ echo 3 > /proc/sys/vm/drop_caches
~$ time rm myth-test; sync

real    0m0.027s
user    0m0.000s
sys     0m0.004s

XFS and the kernel block layer needed only 4ms of CPU time to perform
the 4GB file delete; the remaining ~23ms of the 27ms wall clock time
was disk access.  What does this say about the JFS claim?  I simply
don't get the "if XFS gives you problems" bit.  The author was clearly
nothing close to a filesystem expert.


> 
> I additionally found a foum post from four years ago were someone
> states that xfs has problems with interrupted power supply:
> http://www.linuxquestions.org/questions/linux-general-1/xfs-or-jfs-685745/#post3352854

"I found a forum post from 4 years ago"

Myths, lies, and fairy tales.  There was an XFS bug related to power
fail that was fixed over a year before this forum post was made.  Note
that nobody in that thread posted anything from the authoritative
source, as I do here:

http://www.xfs.org/index.php/XFS_FAQ#Q:_Why_do_I_see_binary_NULLS_in_some_files_after_recovery_when_I_unplugged_the_power.3F

> "I only advise XFS if you have any means to guarantee uninterrupted
> power supply. It's not the most resistant fs when it comes to power
> outages."

I advise using a computer only if you have a UPS, no matter what
filesystem you use.  It's incredible that this guy would make such a
statement instead of promoting the use of UPS devices.  Abrupt power
loss, or worse, the voltage "bumping" that often accompanies brownout
conditions, is not good for any computer equipment, especially PSUs and
mechanical hard drives, regardless of what filesystem one uses.

The only data lost due to power failure is inflight write data.  The
vast majority of that is going to be due to Linux buffer cache.  No
matter what FS you use, if you're writing, especially a large file, when
power dies the write has failed and you've lost that file.  EXT3 was a
bit more "resilient" to power loss because of a bug, not a design goal.
 The same bug caused horrible performance with some workloads because of
the excessive hard coded syncs.
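
As a side note, the dd numbers above also include the buffer cache; dd
can exit before everything is on the platters.  If you want the timing
to reflect data actually hitting disk, GNU dd can fsync the file before
it exits, something like:

~$ time dd if=/dev/zero of=myth-test bs=8192 count=512000 conv=fsync

(Same test as above, just with conv=fsync added.)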

> I usually don't have blackouts. At least as long that the PC turn off.
> But I don't have a UPS.

Get one.  Best investment you'll ever make computer-wise.  For your
Norco, we'll assume all 20 bays are filled for sizing purposes.   One of
these should be large enough to run your server and your desktop:

http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=BR900G-GR&total_watts=200
http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=BR900GI&total_watts=200

(Sorry if I misguessed your native language as German instead of French
or Italian.)  I listed both units because I don't know which power plug
configuration you need.
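
Rough sizing sketch, using assumed figures rather than measurements: a
7.2k SATA drive idles at something like 5-8W, so 20 of them come to
roughly 100-160W; add 50-100W or so for the board, CPU, and fans and
the server idles somewhere around 200W, which is the load figure baked
into the links above.  A 900VA consumer unit leaves headroom for the
desktop on top of that.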

If these UPSes seem expensive, consider the fact that they may continue
working for 20+ years.  I bought my home office APC SU1400RMNET used in
2003 for US $250 ($1000+ new) after it had been in corporate service for
3 years on lease.  It's at least 12 years old and I've been running it
for 9 years continuously.  I've replaced the batteries ($80) twice,
about every 4 years.  Buying this unit used, at a steal of a price, is
one of the best investments I ever made.  I expect it to last at least
another 8 years, if not more.


> I will get better performance if I have the correct parameters.

Yes.

> 
>> 2.  If device is a single level md striped array, AGs=16, unless the
>>     device size is > 16TB.  In that case AGs=device_size/1TB.
> 
> A single level md striped array is any linux raid containing disks.
> Like my raid5.

I use "single level" simply to differentiate from a nested array, which
is multi-level.

> In contrast would be my linear raid containing one or more raids?

This is called a "nested" array.  The term comes from "nested loop" in
programming.
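
To make the nesting concrete, a minimal sketch of the layout we've been
discussing (device names are only placeholders):

~$ mdadm -C /dev/md1 -c 128 -n4 -l5 /dev/sd[abcd]
~$ mdadm -C /dev/md2 -c 128 -n4 -l5 /dev/sd[efgh]
~$ mdadm -C /dev/md0 -n2 -llinear /dev/md1 /dev/md2

The two RAID5s are the inner level; the linear array concatenating them
is the outer level.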

> Ok, the chunck (=stripe) 

chunk = "strip", not "stripe"

"Chunk" and "strip" are two words for the same thing.  Linux md uses the
term "chunk".  LSI and other hardware vendors use the term "strip".
They describe the amount of data written to an individual array disk
during a striped write operation.  A stripe is equal to all of the
chunks/strips added together.

E.g.  A 16 disk RAID10 has 8 stripe spindles (8 are mirrors).  Each
spindle has a chunk/strip size of 64KB.  8*64KB = 512KB.  So the
"stripe" size is 512KB.

> size is already set 128 kB when creating the
> raid5 with the command you provided earlier:
> 
> ~$ mdadm -C /dev/md1 -c 128 -n4 -l5 /dev/sd[abcd]
> 
> Then the mkfs.xfs parameters are adapted to this.

Correct.  If you were just doing a single level RAID5 array, and not
nesting it into a linear array, mkfs.xfs would read the md RAID5 parms
and do all of this stuff automatically.  It doesn't if you nest a linear
array on top, as we have.
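
In that case you pass the geometry by hand.  For a 4-drive RAID5 with a
128KB chunk there are 3 data spindles, so the stripe unit is 128KB and
the stripe width is 3, roughly like this (device name again just an
example):

~$ mkfs.xfs -d su=128k,sw=3 /dev/md0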

> I'll try not to make you angry :-)

I'm not Bruce Banner, so don't worry. ;)

> Ok, cool!
> Probably some time I will understand how to choose chunck sizes. In the
> meantime I will just be happy with the number you provided :-)

For your target workloads, finding the "perfect" chunk size isn't
critical.  What is critical is aligning XFS to the array geometry, and
the array to the AF disk geometry, which is, again, why I recommended
using bare disks, no partitions.
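
You can sanity check the result after mkfs with xfs_info on the mounted
filesystem (mount point is just an example):

~$ xfs_info /mnt/data | grep sunit

With a 128KB chunk and 3 data spindles you should see something like
sunit=32 and swidth=96, counted in 4KB filesystem blocks: 32 x 4KB =
128KB, and 3 x 32 = 96.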

> Btw: I wasn't clear about mythtv. For the recordings I don't use the
> raid. I have another disk just for it.
> Everyone recommends to not use raids for the recordings. But to be
> honest I don't remember the reaosn anymore :-(

I've never used MythTV, but it probably has to do with the fact that
most MythTV users have 3-4 slow green SATA drives on mobo SATA ports
using md RAID5 with the default CFQ elevator.  Not a great combo for
doing multiple concurrent read/write A/V streams.

Using a $300-400 USD 4-8 port RAID controller with 512MB write cache,
4-8 enterprise 7.2k SATA drives in RAID5, and the noop or deadline
elevator allows one to handle multiple concurrent streams easily.  So
does using twice as many
7.2k drives in software RAID10 with deadline.  Both are far more
expensive than simply adding one standalone drive for recording.
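
Switching the elevator is trivial, by the way.  Per device, at runtime
(device name is an example):

~$ cat /sys/block/sdb/queue/scheduler
~$ echo deadline > /sys/block/sdb/queue/scheduler

The cat shows the available schedulers with the active one in brackets;
the echo switches it on the fly.  Add elevator=deadline to the kernel
command line if you want it as the global default.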

>> On top of that, because your agcount is way too
>> high, XFS will continue creating new dirs and files in the original
>> RAID5 array until it fills up.  At that point it will write all new
>> stuff to the second RAID5.

I should have been more clear above.  Directories and files would be
written to AGs on *both* RAID5s until the first one filled up, then
everything would go to AGs on the 2nd RAID5.  Above it sounds like the
2nd RAID5 wouldn't be used until the first one filled up, and that's not
the case.

> The xfs seems really intelligent. So it spreads the load if it can but
> it won't copy everything around when a new disk or in my case raid5 is
> added?

Correct.  But it's not "spreading the load".  It's simply distributing
new directory creation across all available AGs in a round robin
fashion.  When you grow the XFS, it creates new AGs on the new disk
device.  After that it simply does what it always does, distributing new
directory creation across all AGs until some AGs fill up.  This behavior
is more static than adaptive, so it's not really all that intelligent.
The design is definitely intelligent, and it's one of the primary
reasons XFS has such great parallel performance.
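
The grow step itself is a one liner on the mounted filesystem (mount
point is an example):

~$ xfs_growfs /mnt/data
~$ xfs_info /mnt/data | grep agcount

The second command just shows the new agcount after XFS has created AGs
on the added space.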


> But I thought that a green drive lives at least as long as a normal

With the first series of WD Green drives this wasn't the case.  They had
a much higher failure rate.  Newer generations are probably much better.
 And, all of the manufacturers are adding smart power management
features to most of their consumer drive lines.

> drive or even longer because it *should* wear less because it's more
> often asleep. 

The problem is what is called "thermal cycling".  When the spindle motor
is spinning the platters at 5-7K RPM and then shuts down for 30 seconds
or more, and then spins up again, the bearings expand and shrink, expand
and shrink, very slightly, fractions of a millimeter.  But this is
enough to cause premature bearing wobble, which affects head flying
height, and thus problems with reads/writes, yielding sector errors (bad
blocks).  This excess bearing wear over time can cause the drive to fail
prematurely if the heads begin impacting the platter surface, which is
common when bearings develop sufficient wobble.

Long before many people on this list were born, systems managers
discovered that drives lasted much longer if left running 24x7x365,
which eliminated thermal cycling.  It's better for drives to run "hot"
all the time than to power them down over night and up the next day.  15
years ago, constant running would extend drive life by up to 5 years.
With the tighter tolerances of today's drives you may not gain that
much.  I leave all of my drives running and disable all power savings
features on all my systems.  I had a pair of 9GB Seagate Barracuda SCSI
drives that were still running strong after 14 years of continuous 7.2k
RPM service when I decommissioned the machine.  They probably won't spin
up now that they've been in storage for many years.
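
Disabling the power savings is mostly a matter of hdparm (device name
is an example):

~$ hdparm -S 0 /dev/sdb     # turn off the standby/spindown timer
~$ hdparm -B 255 /dev/sdb   # turn off APM, if the drive supports it

Note that the WD Greens also have a firmware idle timer that parks the
heads every few seconds; as far as I know hdparm doesn't touch that
one, WD's own wdidle3 utility (or the idle3-tools package) does.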

> If all of this would have been true than I would be willing to pay the
> price of less performance and higher raid problem rate.

Throttling an idle CPU down to half its normal frequency saves more
electricity than spinning down your hard drives, until you have 10 or
more, and that depends on which CPU you have.  If it's a 130W Intel
burner, it'll be more like 15 drives.
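
Rough numbers, assumed rather than measured: a spun-up 7.2k SATA drive
idles at maybe 5-8W and next to nothing when spun down, so spinning
down 10-15 drives saves on the order of 50-100W.  Letting a 95-130W
desktop CPU drop into its idle states saves a comparable amount, which
is where the 10-15 drive break-even above comes from.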

> But I believe you that the disks don't live as long as normal drives.
> So everything is different and I won't buy green drives again :-)

I'd "play it by ear".  These problems may have been worked out on the
newer "green" drives.  Bearings can be built to survive this more rapid
thermal cycling; wheel bearings on vehicles do it daily.  Once they get
the
bearings right, these drives should last just as long.

> Maybe I will solder a flash light for the LAN LEDs in the front of the
> case too :-D

Just look at the LEDs on the switch it's plugged into.  If the switch
is on the other side of the room, buy a mini switch and set it on top.
About $10 USD for a 10/100 and $20 for a GbE switch.

> I never was frustrated because of your help. If I was a unhappy it was
> only because of my missing knowledge and luck.

Well, your luck should have changed for the better.  You've got all good
quality gear now and it should continue to work well together, barring
future bugs introduced in the kernel.

> If you weren't here to suggest things and help me I would have ended up
> with a case that I couldn't use in the worst case. Or one that eats my
> data (because of the Supermicro AOC-SASLP-MV8 controllers I initially
> had).

That controller has caused so many problems for Linux users I cannot
believe SM hasn't put a big warning on their site, or simply stopped
selling it, replacing it with something that works.  Almost all of their
gear is simply awesome.  This one board gives SM a black eye.

> In the end I'm very happy and proud of my system. Of course I show it
> to my friends and they are jealous for sure :-)

That's great. :)

> So thanks very much again and please let me know how I can buy you a
> beer or two!

As always, you're welcome.  And sure, feel free to donate to my beer
fund. ;)

-- 
Stan

