
Re: Storage server



On 9/7/2012 3:16 PM, Bob Proulx wrote:

> Agreed.  But for me it isn't about the fsck time.  It is about the
> size of the problem.  If you have a full 100G filesystem and there is a
> problem then you have a 100G problem.  It is painful.  But you can
> handle it.  If you have a full 10T filesystem and there is a problem
> then there is a *HUGE* problem.  It is so much more than painful.

This depends entirely on the nature of the problem.  Most filesystem
problems are relatively easy to fix even on 100TB+ filesystems,
sometimes with some data loss, often with only a file or a few being lost
or put in lost+found.  If you have a non-redundant hardware device
failure that roasts your FS, then you replace the hardware, make a new
FS, and restore from D2D or tape.  That's not painful, that's procedure.
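
For the garden-variety case it's a dry-run check first, then decide.
Rough sketch of what I mean, in Python -- the device and mount point
are hypothetical, and xfs_repair -n only inspects, it writes nothing:

#!/usr/bin/env python3
# Sketch: dry-run check of an unmounted XFS device, then list anything
# a real repair pass dropped into lost+found.  Device and mount point
# are hypothetical examples.
import os
import subprocess

DEVICE = "/dev/sdb1"       # hypothetical data device, must be unmounted
MOUNTPOINT = "/bigdata"    # hypothetical mount point

# "-n" = no-modify mode: report problems without touching the metadata.
check = subprocess.run(["xfs_repair", "-n", DEVICE])
print("xfs_repair -n exit status:", check.returncode)

# After an actual repair and remount, disconnected files land in
# lost+found at the root of the filesystem.
lf = os.path.join(MOUNTPOINT, "lost+found")
if os.path.isdir(lf):
    for name in os.listdir(lf):
        print("orphaned:", name)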

> Therefore when practical I like to compartmentalize things so that
> there is isolation between problems.  Whether the problem is due to a
> hardware failure, a software failure or a human failure.  All of which
> are possible.  Having compartmentalization makes dealing with the
> problem easier and smaller.

Sounds like you're mostly trying to mitigate human error.  When you
identify that solution, let me know, then patent it. ;)

>> What?  Are you talking crash recovery boot time "fsck"?  With any
>> modern journaled FS log recovery is instantaneous.  If you're talking
>> about an actual structure check, XFS is pretty quick regardless of inode
>> count as the check is done in parallel.  I can't speak to EXTx as I
>> don't use them.
> 
> You should try an experiment and set up a terabyte ext3 and ext4
> filesystem and then perform a few crash recovery reboots of the
> system.  It will change your mind.  :-)

As I've never used EXT3/4 and thus have no opinion, it'd be a bit
difficult to change my mind.  That said, putting root on a 1TB
filesystem is a brain dead move, regardless of FS flavor.  A Linux
server doesn't need more than 5GB of space for root.  With /var, /home,
and /bigdata on other filesystems, crash recovery fsck should be quick.
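
If you want to put a number on that, something like the following
(Python; the mount list is just the layout I described, adjust for
your box) shows how small root stays once the big consumers live
elsewhere:

#!/usr/bin/env python3
# Sketch: report size and usage per mount point to confirm root stays
# small when /var, /home and the data filesystem are separate.  The
# mount list below is the hypothetical layout described above.
import os

MOUNTS = ["/", "/var", "/home", "/bigdata"]

for mnt in MOUNTS:
    try:
        st = os.statvfs(mnt)
    except OSError:
        continue                       # not mounted on this box
    total = st.f_frsize * st.f_blocks  # filesystem size in bytes
    used = total - st.f_frsize * st.f_bfree
    print("%-10s %7.1f GB total, %7.1f GB used"
          % (mnt, total / 1e9, used / 1e9))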

> XFS has one unfortunate missing feature.  You can't resize a
> filesystem to be smaller.  You can resize them larger.  But not
> smaller.  This is a missing feature that I miss as compared to other
> filesystems.

If you ever need to shrink a server filesystem: "you're doing IT wrong".
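
Growing, on the other hand, is a one-liner against the mounted
filesystem.  Minimal sketch, mount point hypothetical:

#!/usr/bin/env python3
# Sketch: grow a mounted XFS filesystem to fill its (already enlarged)
# underlying device.  There is no shrink counterpart -- xfs_growfs
# only goes up.  The mount point is a hypothetical example.
import subprocess

MOUNTPOINT = "/bigdata"    # hypothetical XFS mount point

# With no size argument, xfs_growfs expands the data section to the
# maximum the underlying block device allows.  Run as root.
subprocess.run(["xfs_growfs", MOUNTPOINT], check=True)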

> Unfortunately I have some recent FUD concerning xfs.  I have had some
> recent small idle xfs filesystems trigger kernel watchdog timer
> recoveries recently.  Emphasis on idle.

If this is the bug I'm thinking of, "Idle" has nothing to do with the
problem, which was fixed in 3.1 and backported to 3.0.  The fix didn't
hit Debian 2.6.32.  I'm not a Debian kernel dev, so ask them why--
likely it's simply too old.  Upgrading to the BPO 3.2 kernel should
fix this and give you some nice additional performance enhancements.
2.6.32 is ancient BTW,
released almost 3 years ago.  That's 51 in Linux development years. ;)

If you're going to recommend against XFS to someone, please
qualify/clarify that you're referring to 3-year-old XFS, not the
current release.

> Definitely XFS can handle large filesystems.  And definitely when
> there is a good version of everything all around it has been a very
> good and reliable performer for me. I wish my recent bad experiences
> were resolved.

The fix is quick and simple: install the BPO 3.2 kernel.  Why haven't
you already?
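
Assuming squeeze-backports is already in your sources.list, it's the
usual two commands and a reboot, nothing more.  Sketch in Python, with
the standard amd64 metapackage name assumed -- adjust for your arch:

#!/usr/bin/env python3
# Sketch: pull the 3.2 backports kernel on squeeze.  Assumes
# squeeze-backports is already configured in sources.list and that the
# amd64 metapackage is what you want; adjust for your architecture.
import subprocess

subprocess.run(["apt-get", "update"], check=True)
subprocess.run(["apt-get", "-t", "squeeze-backports",
                "install", "linux-image-amd64"], check=True)
# Reboot into the new kernel afterwards.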

> But for large filesystems such as that I think you need a very good
> and careful administrator to manage the disk farm.  And that includes
> disk use policies as much as it includes managing kernel versions and
> disk hardware.  Huge problems of any sort need more careful management.

Say I have a 1.7TB filesystem and a 30TB filesystem.  How do you feel
the two should be managed differently, and why would the 30TB
filesystem need kid gloves?

>> When using correctly architected reliable hardware there's no reason one
>> can't use a single 500TB XFS filesystem.
> 
> Although I am sure it would work I would hate to have to deal with a
> problem that large when there is a need for disaster recovery.  I
> guess that is why *I* don't manage storage farms that are that large. :-)

The only real difference at this scale is that your backup medium is
tape, not disk, and you have much phatter pipes to the storage host.  A
500TB filesystem will reside on over 1000 disk drives.  It isn't going
to be transactional or primary storage, but nearline or archival
storage.  It takes a tape silo and intelligent software to back it up,
but a full restore after catastrophe doesn't have (many) angry users
breathing down your neck.

On the other hand, managing a 7TB transactional filesystem residing on
48x 300GB SAS drives in a concatenated RAID10 setup, housing, say,
corporate mailboxes for 10,000 employees, including the CxOs, is a much
trickier affair.  If you wholesale lose this filesystem and must do a
full restore, you are red meat, and everyone is going to take a bite out
of your ass.  And you very well may get a pink slip depending on the
employer.

Size may matter WRT storage/filesystem management, but it's the type
of data you're storing and the workload that matter more.

-- 
Stan


