
Re: backup archive format saved to disk



Douglas Tutty wrote:
Thanks Mike,

If I can attempt to summarize a portion of what you said:

	If the issue is resistance to data block errors, it doesn't
	matter whether I use a file system or not, so I may as well use
	a file system; then, if I have difficulty, rip multiple copies
	of the file system bit by bit and do majority rules.

Well, not quite. "Majority rules" is a simple BCH code which anyone
could do with a simple program. All you need is an odd number of
copies. The distance of the code is the number of copies, so the
number of correctable errors (per bit position, I mean) is (n-1)/2.
So, with three copies, one can correct any single error, with five
copies one can correct double errors (in a single bit position), etc.
It's not the most efficient, but it is very, very simple.
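
As a rough sketch (not any particular package), here is the
majority-rules idea in Python, assuming three byte-for-byte copies of
the same file; the file names are made up for the example:

    #!/usr/bin/env python3
    # Minimal sketch of "majority rules" over three copies of a file.
    # Assumes all copies are the same length.
    import sys

    def majority_merge(paths, out_path):
        data = [open(p, "rb").read() for p in paths]
        if len(set(len(d) for d in data)) != 1:
            sys.exit("copies differ in length; this simple scheme can't cope")
        merged = bytearray()
        for column in zip(*data):
            # Vote bit by bit across the copies; with an odd number of
            # copies, up to (n-1)/2 wrong copies of any bit are outvoted.
            byte = 0
            for bit in range(8):
                ones = sum((b >> bit) & 1 for b in column)
                if ones * 2 > len(column):
                    byte |= 1 << bit
            merged.append(byte)
        open(out_path, "wb").write(bytes(merged))

    if __name__ == "__main__":
        majority_merge(["copy1", "copy2", "copy3"], "recovered")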

		There's a package (forget the name) that will do this
		with files: take multiple damaged copies and make one
		good copy if possible.

Multiple copies on a single medium are not, IMO, advisable. There are
errors which can occur which prevent reading a single bit anywhere.
If you sit on a CDROM, you may make it completely unreadable :-)

Does the kernel software RAID in RAID1 mode do this?  Would there be
any advantage/disadvantage to putting three partitions on the drive and
setting them up as RAID1 (and recording the partition table [sfdisk -d]
separately)?

I am not familiar with the internals of Linux's software RAID.

Googling this topic, I find sporadic posts on different forums wishing
for something like this, but there doesn't seem to be anything
off-the-shelf for Linux.  It seems to be what the data security
companies get paid for (e.g. the Veritas filesystem).  Do you know of
anything?

I have used Veritas on some telephony equipment, and it worked well
enough.

I understand your description of FEC and I guess that's what we're
talking about.  In the absence of a filesystem that does it, I want a
program that takes a data stream (e.g. a tar.bz2 archive) and embeds FEC
data in it so it can be stored, then later can take that data and
regenerate the original data stream.
Do I understand you correctly that, to be effective, the FEC-embedded
data stream will be three times the size of the input stream?

That is correct. The larger the Hamming distance used in the code, the
more redundancy is required. It is unfortunate that the very best codes
we have been able to design fall quite a bit short of the theoretical
limit, which is itself somewhat disappointing to those with naive
expectations. Generally, one uses about 1/3 data and 2/3 check bits.
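
Back-of-the-envelope, with a made-up archive size, that rate-1/3
figure works out like this (the triple-copy scheme above has exactly
this rate):

    # Storage needed at code rate ~1/3; the archive size is made up.
    archive_bytes = 700 * 1024**2          # a CD-sized tar.bz2
    code_rate = 1.0 / 3                    # ~1/3 data, ~2/3 check bits
    total_bytes = archive_bytes / code_rate
    print(f"{total_bytes / 1024**2:.0f} MB on disk for a 700 MB archive")
    # -> roughly 2100 MB, i.e. about three discs' worth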

Does it matter if this FEC data is embedded with the data or appended?

Depends on what you mean by the question. First, you presuppose that
in an arbitrary code, the original data survive intact, and some
additional redundancy is added. This would be called a systematic
code. Not all codes are systematic, though the very best practical
block[1] codes we have can all be put into systematic form. The very
best block codes we've cooked up for experimentation purposes are *not*
systematic, and the original data do not survive in any distinct form
apart from the entire code word. IOW, in these codes there aren't
distinguishable "data bits" and "check bits", there are just
"code bits", and without going through the decoding process, there
is no simple way to get the data bits back. AFAIK, none of these
experimental codes has ever been applied in a practical system,
as coding and decoding are too problematic.[2] But they are much
more compact codes, and more closely approach the theoretical
limits on the total code word size required to achieve a given
degree of correction ability.
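
For concreteness, here is a toy systematic code in Python (a
Hamming(7,4); not what par2 or any real archiver uses): the four data
bits pass through untouched and three check bits are appended, which is
exactly the "systematic" property described above. It corrects any
single flipped bit per 7-bit word.

    # Tiny systematic Hamming(7,4) sketch: data bits survive intact,
    # three check bits are appended.

    def encode(d):                       # d = [d1, d2, d3, d4]
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [d1, d2, d3, d4, p1, p2, p3]

    # syndrome -> index of the bit in error (absent means no error)
    SYNDROMES = {
        (1, 1, 0): 0, (1, 0, 1): 1, (0, 1, 1): 2, (1, 1, 1): 3,
        (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6,
    }

    def decode(r):                       # r = received 7-bit word
        d1, d2, d3, d4, p1, p2, p3 = r
        s = (p1 ^ d1 ^ d2 ^ d4, p2 ^ d1 ^ d3 ^ d4, p3 ^ d2 ^ d3 ^ d4)
        r = list(r)
        if s in SYNDROMES:               # flip the single bad bit back
            r[SYNDROMES[s]] ^= 1
        return r[:4]                     # data bits read off directly

    word = encode([1, 0, 1, 1])
    word[2] ^= 1                         # corrupt one bit
    assert decode(word) == [1, 0, 1, 1]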

Second, one of the best ways to make a code which can correct
bursts as well as independent bit errors (many media are
susceptible to bursts) is to use interleaved codes.

Here's a simple example. Let's go back to the "transmit each bit
three times" example I gave of CDROMs. That code is an example
of an interleaved code, which is one of the reasons it has such
good burst correcting capabilities. If we used a non-interleaved
code, we would simply record each bit three times on the disc,
and let it span multiple discs. Then it would have single bit
correcting ability, but no burst correcting ability.

If, OTOH, we have three discs which are each a complete copy,
then we can recover from any single burst whose length is less than
or equal to that of a single disc. This is an interleaved code.

If we use 5x transmission, we can recover from any double bit
error, and any burst of up to two bits as well. But if we use
5 separate discs, once again we can recover from any single burst
which spans up to two discs.
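
A small Python sketch of the contrast (the bit values are made up, and
erasures are modelled as flipped bits for simplicity):

    # Why interleaving buys burst protection from a plain repetition
    # code.  Three "discs" each hold one complete copy of the data; a
    # burst that wipes out a stretch of one disc damages only one of
    # the three votes for each bit, so majority voting still recovers.

    data = [1, 0, 1, 1, 0, 0, 1, 0]          # the original bits
    discs = [data[:], data[:], data[:]]      # one full copy per disc

    # Burst: a scratch destroys bits 2..6 of disc 0.
    for i in range(2, 7):
        discs[0][i] ^= 1

    recovered = [1 if sum(col) >= 2 else 0 for col in zip(*discs)]
    assert recovered == data

    # Non-interleaved layout for contrast: the three copies of each bit
    # sit next to each other in one stream, so the same five-bit burst
    # destroys all three copies of at least one bit and the vote fails.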

The codes used on CDs for music are interleaved Reed-Solomon codes,
one used for "fingerprints" (local errors) and another, which spans
about half the circumference, for "track errors" or whatever they
are called. Long bursts, anyway. That's AIUI; I'm not an expert on
CDs by any means.

If this doesn't exist for Linux, do you know of any open-source
non-Linux implementations that just need some type of porting?  I've
found a couple of technical papers discussing the algorithms
(Reed-Solomon) used in the par2 archive that I'll study.

I do not. Most implementations are Reed-Solomon, or some other
variation of BCH codes (of which the RS codes are a subset),
and are closely guarded secrets.

Well, I guess there are some RAID systems which are open source
and which could be used for that sort of thing. Again, using
the idea of CDROMs, one could make a bunch of them look like
a RAID system. I suppose that the software would require some
serious hacking to make it able to deal with that sort of set
up.

[1] I've been presuming block codes as opposed to stream
codes in all that I've written so far, though most of
what I've said also applies to stream codes.

[2] These codes all require enormous look-up tables and
CPU time, or even larger look-up tables if one wants to
use less CPU. The code words are essentially random bits
which seemingly cannot be generated by some algorithmic
procedure. Any code which would be capable of encoding
any reasonable amount of data would require a table which
wouldn't fit into any reasonable amount of memory. IOW,
a code capable of encoding 128 byte blocks, would require
perhaps another 200 bytes of check information, and the
code word would be perhaps 320 bytes long. To do decoding would
require a table of 2560 bits in each entry, and would require
2^128 entries. Hardly a practical code. These values
(which I made up) are extrapolations from what we observe
in creating some small block codes which are efficient
in code word size. This all points up that the word
"efficient" has different meanings in different contexts.

Mike
--
p="p=%c%s%c;main(){printf(p,34,p,34);}";main(){printf(p,34,p,34);}
This message made from 100% recycled bits.
You have found the bank of Larn.
I can explain it for you, but I can't understand it for you.
I speak only for myself, and I am unanimous in that!


