Re: backup archive format saved to disk
On Tue, Dec 05, 2006 at 10:53:13AM +0100, Johannes Wiedersich wrote:
> Douglas Tutty wrote:
> > I'm going to be backing up to a portable ruggedized hard drive.
> > Currently, my backups end up in tar.bz2 format.
> > It would be nice if there was some redundancy in the data stream to
> > handle blocks that go bad while the drive is in storage (e.g. archive).
> > How is this handled on tape? Is it built-into the hardware
> > compression?
> > Do I need to put a file system on a disk partition if I'm only saving
> > one archive file or can I just write the archive to the partition
> > directly (and read it back) as if it was a scsi tape?
> > Is there an archive or compression format that includes the ability to
> > not only detect errors but to correct them? (e.g. store ECC data
> > elsewhere in the file) If there was, and I could write it directly to
> > the disk, then that would solve the blocks-failing-while-drive-stored
> > issue.
> Now, to something completely different....
> If data integrity is your concern, then maybe a better solution than
> compression is to copy all your data with rsync or another backup tool
> that 'mirrors' your files instead of packing them all together in one
> large file. If something goes wrong with this large file you might lose
> the backup of all your files. If something goes wrong with the
> transmission of one file in the rsync case you will only 'lose' the
> backup of that one file and just restart the rsync command.
> Well, at least I much prefer to spend a bit more on storage and have all
> my files copied individually. It adds the benefit that it is
> straightforward to verify the integrity of the backup via 'diff -r'.
> As far as redundancy is concerned I would prefer to use a second disk
> (and while you are at it store it in a different location, miles away
> from the other). I have one backup at home and another one at my
> mother's house, adding several layers of security to my data.
Yes I use JFS for my file systems. I have raid1 on my main drives. I
will have one portable drive at home, so several layers of backup here.
The issue is off-site backup and that's where the disk in the bank comes in.
The problem is that a journal on a hard disk only protects the
filesystem from an inconsistent state due to power failure. It does
nothing to protect the data if it was written correctly 5 years ago and
never mounted since. If a block or two goes bad then that piece of data
is lost. It could make the filesystem unmountable.
I haven't been able to find a filesystem that provides redundancy that
is free. The companies that pioneered disk-based virtual tape servers
have their own (e.g. Veritas). This is why I'm looking at archive formats.
The idea is that a format with built-in error-correcting would scatter
the redundancy around the disk so that if a few blocks are bad, the data
can still be retrieved.
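To make the idea concrete, here is a minimal Python sketch of one way to do it: one XOR parity block per group of data blocks, plus a CRC32 per block to spot which block went bad. The names xor_blocks, protect and repair are made up for illustration and are not taken from any existing tool. It only survives one bad block per group, so it is a toy, not a proposal.

```python
import zlib
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def protect(blocks):
    """For one group of data blocks, return (parity_block, list_of_crc32)."""
    return xor_blocks(blocks), [zlib.crc32(b) for b in blocks]

def repair(blocks, parity, crcs):
    """Rebuild at most one damaged block in the group from the parity block."""
    bad = [i for i, b in enumerate(blocks) if zlib.crc32(b) != crcs[i]]
    if len(bad) > 1:
        raise ValueError("more than one bad block in this group")
    if bad:
        good = [b for i, b in enumerate(blocks) if i != bad[0]]
        # XOR of the parity block with all good blocks yields the missing one.
        blocks[bad[0]] = xor_blocks(good + [parity])
    return blocks
```

With N data blocks per group the overhead is 1/N plus the checksums, and the parity block can be written far away from the data it covers so that a run of adjacent bad sectors is less likely to hit both.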
Even raid1 doesn't accomplish this. With raid1 and two disks, if both
disks have bad blocks appear, even if they are on different spots on
each drive, as far as I can tell raid1 can't create a virtual pristine
partition out of several damaged ones.
Searching aptitude, there seem to be a few packages that address this
issue obliquely (given two corrupted archives, can create a single
pristine archive) but need two complete archive sets. I have to look at
the par spec.
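For what it's worth, the two-copies approach is easy to sketch in Python, assuming a list of per-block digests was saved when the backup was made and is itself intact. The function names and the 4096-byte block size are my own assumptions for illustration; I am not claiming this is how any of those packages work.

```python
import hashlib

def block_digests(data, block=4096):
    """SHA-256 digest of every block, computed when the backup is written."""
    return [hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)]

def merge_copies(copy_a, copy_b, digests, block=4096):
    """Build one good image by taking each block from whichever copy verifies."""
    out = bytearray()
    for n, want in enumerate(digests):
        lo = n * block
        for candidate in (copy_a[lo:lo + block], copy_b[lo:lo + block]):
            if hashlib.sha256(candidate).digest() == want:
                out += candidate
                break
        else:
            # Neither copy has a good version of this block.
            raise ValueError(f"block {n} is bad in both copies")
    return bytes(out)
```

This works as long as no block is damaged in both copies, which is exactly the case plain raid1 doesn't seem to exploit. The cost is still a full second copy, i.e. 100% overhead.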
Basically, I want to do for my archives what ECC does for memory. With
ECC memory, for every 8 bits, there's one extra bit of storage (in
practice, 8 check bits per 64-bit word). It can fix single-bit errors.
If I'm remembering my math right, that ratio means ECC adds 12.5%
to the size of an archive __prior_to_compression__. It's impossible to
do with less than 1:1 (100%) on compressed data. It's therefore best
done from __within__ the compression algorithm. Take a block of data
from the input stream, make the ECC data, compress the block of data,
append the ECC, and spit this to the output stream and write the ECC
data to an ECC stream. At the end of the input stream, take the ECC
stream, make ECC data for that, compress the ECC stream, append the ECC
for that, spit this to the output stream.
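The single-bit correction itself can be sketched in Python with a Hamming(7,4) code. This is a deliberately tiny stand-in: it protects 4 data bits with 3 check bits, whereas memory ECC uses a longer code of the same family with far less overhead. The function names are hypothetical and the bit layout is the textbook one (check bits at positions 1, 2 and 4).

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (an int 0..15) as a list of 7 bits, positions 1..7."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    return code[1:]

def hamming74_decode(bits):
    """Correct up to one flipped bit and return the original 4 data bits."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1             # a non-zero syndrome is the bad position
    d = (code[3], code[5], code[6], code[7])
    return sum(bit << i for i, bit in enumerate(d))
```

A code like this fixes isolated bit flips, but a bad disk block wipes out thousands of consecutive bits, so an archive format would have to interleave the codewords across blocks or use a block-level scheme instead.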
If par doesn't do what I need and I can't find an alternative, I'll just
write my own, modeled first in Python, then done in Fortran 77 for speed.
If I go to all this trouble, I'd probably throw in AES for good measure.
It would make a fun project but I hate reinventing perfectly good
wheels. Then again, I know people who jump out of perfectly good
airplanes. Go figure.