
Re: backup archive format saved to disk



On Tue, Dec 05, 2006 at 05:01:36PM -0500, Douglas Tutty wrote:
> On Tue, Dec 05, 2006 at 10:53:13AM +0100, Johannes Wiedersich wrote:
> > Douglas Tutty wrote:
> > > I'm going to be backing up to a portable ruggedized hard drive.
> > > Currently, my backups end up in tar.bz2 format.
> > > 
> > > It would be nice if there was some redundancy in the data stream to
> > > handle blocks that go bad while the drive is in storage (e.g. archive).
> > > 
> > > How is this handled on tape?  Is it built-into the hardware
> > > compression?
> > > 
> > > Do I need to put a file system on a disk partition if I'm only saving
> > > one archive file or can I just write the archive to the partition
> > > directly (and read it back) as if it was a scsi tape?
> > > 
> > > Is there an archive or compression format that includes the ability to
> > > not only detect errors but to correct them? (e.g. store ECC data
> > > elsewhere in the file)  If there was, and I could write it directly to
> > > the disk, then that would solve the blocks-failing-while-drive-stored
> > > issue.
> > 
> > Now, to something completely different....
> > If data integrity is your concern, then maybe a better solution than
> > compression is to copy all your data with rsync or another backup tool
> > that 'mirrors' your files instead of packing them all together in one
> > large file. If something goes wrong with that one large file, you could
> > lose the backup of all your files. If something goes wrong with the
> > transmission of one file in the rsync case, you only 'lose' the backup
> > of that one file and can just restart the rsync command.
> > 
> > Well, at least I much prefer to spend a bit more on storage and have
> > all my files copied individually. This has the added benefit that it is
> > straightforward to verify the integrity of the backup via 'diff -r'.
> > 
> > As far as redundancy is concerned, I would prefer to use a second disk
> > (and, while you are at it, store it in a different location, miles away
> > from the other). I have one backup at home and another at my mother's
> > house, adding several layers of security to my data.
> > 
> > Johannes
> > 
> Thanks Johannes,
> 
> Yes, I use JFS for my file systems.  I have raid1 on my main drives.  I
> will have one portable drive at home, so several layers of backup here.
> The issue is off-site backup, and that's where the disk in the bank
> comes in.
> 
> The problem is that a journal on a hard disk only protects the
> filesystem from an inconsistent state due to power failure.  It does
> nothing to protect the data if it was written correctly 5 years ago and
> never mounted since.  If a block or two goes bad, then that piece of
> data is lost.  It could even make the filesystem unmountable.
> 
> I haven't been able to find a free filesystem that provides redundancy.
> The companies that pioneered disk-based virtual tape servers have their
> own (e.g. Veritas).  This is why I'm looking at archive formats.
> 
> The idea is that a format with built-in error correction would scatter
> the redundancy around the disk so that, if a few blocks go bad, the
> data can still be retrieved.
> 
> Even raid1 doesn't accomplish this.  With raid1 and two disks, if bad
> blocks appear on both disks, even in different spots on each drive, as
> far as I can tell raid1 can't assemble a pristine virtual partition out
> of two damaged ones.
> 
> Searching aptitude, there seem to be a few packages that address this
> issue obliquely (given two corrupted archives, they can create a single
> pristine archive), but they need two complete archive sets.  I have to
> look at the par spec.
> 
> Basically, I want to do for my archives what ECC does for memory.  With
> ECC memory, for every 8 bits of data there's one extra bit of storage,
> and it can fix single-bit errors; that works out to about 12.5% added
> to the size of an archive __prior_to_compression__.  On compressed
> data, it's impossible to do with less than 1:1 (100%) overhead.  It's
> therefore best done from __within__ the compression algorithm: take a
> block of data from the input stream, compute the ECC data for it,
> compress the block, append the ECC, and spit this to the output stream,
> also writing the ECC data to a separate ECC stream.  At the end of the
> input stream, take the ECC stream, compute ECC data for it, compress
> the ECC stream, append its ECC, and spit this to the output stream.

You need to add the ECC *after* the compression.  ECC adds redundancy 
that allows one to recover from a small amount of damage.
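
To make that concrete: a Hamming(7,4) code (the kind of scheme ECC
memory is built on) stores 4 data bits in 7 bits and can repair any
single flipped bit.  A rough Python sketch -- the function names are
just mine for illustration:

    def hamming74_encode(d):            # d = [d1, d2, d3, d4], bits
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]

    def hamming74_decode(c):            # c = the 7 bits read back
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4,5,6,7
        pos = s1 + 2 * s2 + 4 * s3      # 0 = clean, else error position
        if pos:
            c[pos - 1] ^= 1             # repair the single damaged bit
        return [c[2], c[4], c[5], c[6]] # recover d1..d4

    word = hamming74_encode([1, 0, 1, 1])
    word[5] ^= 1                        # one bad bit "in storage"
    assert hamming74_decode(word) == [1, 0, 1, 1]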

If you add ECC before compression, and, say, a single bit gets changed 
in the compressed archive, decompressing it will likely not yield a 
block with a small amount of damage; it will more likely yield total 
gibberish -- and ECC on that is not likely to help.
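
You can see this with the bzip2 format your archives already use.  A
rough sketch with Python's bz2 module, flipping one bit mid-stream:

    import bz2

    data = b"backup payload " * 50000        # stand-in for the archive
    packed = bytearray(bz2.compress(data))
    packed[len(packed) // 2] ^= 0x01         # one flipped bit, mid-stream

    try:
        bz2.decompress(bytes(packed))
        print("stream still decoded")        # essentially never happens
    except Exception as exc:                 # bzip2's per-block CRC trips:
        print("decompression failed:", exc)  # the whole stream is rejected,
                                             # not just the region around
                                             # the bad bit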

If you add ECC after compression, and a single bit gets changed, then 
ECC will make it possible to correct the compressed block, after which 
decompression will work.
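
The same idea scales up to whole archives.  A rough sketch (my own
naming, nothing standard): compress first, then cut the *compressed*
bytes into fixed-size chunks, keep a CRC per chunk plus one XOR-parity
chunk.  A chunk that goes bad in storage is spotted by its CRC, rebuilt
from the others RAID-5 style, and only then is the stream decompressed:

    import bz2, os, zlib

    CHUNK = 4096

    def pack_with_parity(raw):
        packed = bz2.compress(raw)               # compression comes first
        chunks = [packed[i:i + CHUNK].ljust(CHUNK, b"\0")
                  for i in range(0, len(packed), CHUNK)]
        crcs = [zlib.crc32(c) for c in chunks]   # to spot the bad chunk
        parity = bytes(CHUNK)
        for c in chunks:                         # XOR of all chunks
            parity = bytes(a ^ b for a, b in zip(parity, c))
        return chunks, crcs, parity, len(packed)

    def unpack_with_parity(chunks, crcs, parity, packed_len):
        bad = [i for i, c in enumerate(chunks) if zlib.crc32(c) != crcs[i]]
        if len(bad) == 1:                        # one parity chunk can
            rebuilt = bytearray(parity)          # repair one bad chunk
            for i, c in enumerate(chunks):
                if i != bad[0]:
                    rebuilt = bytearray(a ^ b for a, b in zip(rebuilt, c))
            chunks = list(chunks)
            chunks[bad[0]] = bytes(rebuilt)
        packed = b"".join(chunks)[:packed_len]
        return bz2.decompress(packed)            # the whole archive is back

    raw = os.urandom(3 * CHUNK + 100)            # stand-in for backup data
    chunks, crcs, parity, n = pack_with_parity(raw)
    chunks[2] = os.urandom(CHUNK)                # simulate a bad disk block
    assert unpack_with_parity(chunks, crcs, parity, n) == raw

The par/parchive tools you mention do essentially this, but with
Reed-Solomon recovery blocks, so more than one bad block can be
repaired.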

If you want to be able to recover data despite damage, it is in general 
not wise to compress it at all: with uncompressed data, different parts 
are damaged independently, and the undamaged parts remain readable.  
Squeezing out redundancy makes different parts of the data dependent on 
one another for interpretation, so localized damage can make much more 
than the damaged region unreadable.

-- hendrik


