
Re: Single root filesystem evilness decreasing in 2010? (on workstations) [LONG]



On Thu, 4 Mar 2010, thib wrote:

OTOH - I haven't studied XFS - but from the little overviews I read about
it, I suppose its allocation groups are a way to scale with this problem
(along with other unrelated advantages like parallelism in multithreaded
environments).  What happens if a filesystem doesn't have anything like it?

Filesystems will hit scale problems at some point. As you note, AGs in XFS help it to scale a lot, but you do need to be careful in selecting the number. Too many and you can become CPU bound.

Maybe no-one cares because we currently don't have filesystems big enough to
actually see the problem?

Some people definitely do.

I agree with that, but I know it's because I, personally, *need* to know
what's going on, all the time.  Some people are OK with letting a program
(even such a critical one) do some magic;  and without having tested any
"complex" one, I suspect they try to KIS for the user.

The problem is that if a backup system breaks you get to keep both pieces :) Failing to understand your backup system and how you can do DR under the worst case is a serious risk.

The problem is, if there's a problem with the backup system itself, then
it's going to be a long night.  If there's no need for such software, I,
again, agree, there's no use to take risks, even if they're minimal.

Amanda is a good example. It keeps backup state information at the beginning of the tapes and allows the information to be dumped to a text file easily. I have done a 10TB SAN DR with Amanda and used printed-out pages of the tape state information to guide me. It was relatively painless considering the amount of data I was bringing back.

Considering your experience, I have to believe you;  we can always backup
very simply, even very large systems.  It's just weird to picture, all these
complex backup systems would be useless?  (I know, it's not a binary answer,
but you know what I mean.)

I'm not saying they are useless, but I think organisations do need to take more time considering DR. Large organisations will have fully operational DR sites, and they can afford to run a database for their backup system since they can expect at least one of their sites to be operational at any given time.

I have known people who run a copy of the backup DB on a laptop which is supposedly kept offsite. These laptops likely come on site occasionally and they are a prime candidate for bitrot.

Anything that gets between me and data restoration makes me nervous :)

And for those people who think that off-site/off-line backups aren't needed anymore because you can just replicate data across the network, I'll give you 5 minutes to find the flaw in that plan :)

I guess I'm perfectly OK with that, but are we still talking about
workstations?  :-)

I'm talking about servers. There is no substitute for offsite/offline backups and there never will be. This is one of the few topics where I will use absolute statements like this.

You can never predict the nature of the failure. If you try to figure out how a failure will occur then you will sooner or later run into a failure of imagination.

The only way to guarantee against a single disaster of a certain size is to physically separate the data stores by a sufficient distance and keep the backups offline.

No technology can change this fundamental truth since our understanding of the possible failure modes will always be incomplete.

My understanding is that the "cached" column of the output of free(1) is the
sum of all pages, clean and dirty.  The "buffers" column would be the

Right. It might be nice if free displayed them separately; it would confuse people less :) /proc certainly presents the info. Check out the source of 'free' - it is a really simple application.
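
For anyone who wants to see the separate figures without reading the source of free, here is a minimal Python sketch that pulls them out of /proc/meminfo-style text. It assumes the usual "Name: value kB" layout and the Buffers, Cached, Dirty and Writeback field names found on Linux; the function names parse_meminfo and cache_breakdown are made up for this illustration. Note that Dirty is a system-wide figure, so the sketch only reports the numbers side by side and does not try to derive a "clean cached" value.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_breakdown(fields):
    """Pick out the cache-related counters, defaulting to 0 if absent."""
    return {
        "buffers_kb": fields.get("Buffers", 0),
        "cached_kb": fields.get("Cached", 0),
        "dirty_kb": fields.get("Dirty", 0),
        "writeback_kb": fields.get("Writeback", 0),
    }


if __name__ == "__main__":
    # Assumes a Linux system where /proc/meminfo is readable.
    with open("/proc/meminfo") as f:
        for key, value in cache_breakdown(parse_meminfo(f.read())).items():
            print(f"{key}: {value}")
```

Run as a script on a Linux box, it prints the four counters in kB, one per line.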

Since there's no "cached" column for the swapspace, I guess no clean page
gets pushed there, although it could be useful if that space is on a
significantly faster volume.  Anyway, the "used" column should be the total,
actual swapspace used, so your comment kind of confuses me.  Am I really
wrong here?

I'd recommend doing some reading. The cached system memory and the swap space displayed by free are really unrelated concepts (at least at the level we're talking about here).

If you want to chat on IRC about fun subjects like caching and swap space sometime you can find me as Solver on Freenode & OFTC.

Cheers,

Rob

--
Email: robert@timetraveller.org
IRC: Solver
Web: http://www.practicalsysadmin.com
I tried to change the world but they had a no-return policy

