
Re: Invalid UTF-8 byte? (was: Re: utf)



On Tue, 03 Apr 2018, Michael Lange wrote:
> I believe (please anyone correct me if I am wrong) that "text" files
> won't contain any null byte; many text editors even refuse to open such a

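Depends on the encoding.  For ASCII, ISO-8859-* and UTF-8 (and any other
modern byte-oriented encoding AFAIK, other than modified UTF-8), any zero
bytes map one-to-one to the NUL character/code point.  UTF-16 and UTF-32
are the obvious exceptions: plain ASCII text encoded in them is full of
zero bytes that are not NULs.  I don't recall how it is in other common
encodings of the 80's and 90's, though.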
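
A quick way to check this for yourself, as a rough Python sketch.  The
sample string is made up, and UTF-16 is only there for contrast (it is
not one of the byte-oriented encodings discussed above):

```python
# Count the zero bytes each codec produces for a string with exactly one NUL.
sample = "A\x00B"  # hypothetical sample: 'A', NUL, 'B'

def zero_bytes(text, codec):
    """Return how many bytes with the value zero the encoded form contains."""
    return text.encode(codec).count(0)

for codec in ("ascii", "iso-8859-1", "utf-8", "utf-16-le"):
    print(codec, zero_bytes(sample, codec))
```

The first three codecs should report one zero byte (the NUL itself);
UTF-16 reports more, because every ASCII character carries a zero byte
of its own there.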

Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
bytes with the value of zero when encoding characters, so NUL is encoded
by a different sequence (the two bytes C0 80 in modified UTF-8), and you
can safely use a byte with the value of zero for out-of-band control
(like zero-terminated strings that can contain NULs, etc).  Note that
NUL is a character, and in a particular encoding it might be represented
by a sequence of bytes that has nothing to do with zeroes...
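
To make that concrete, here is a minimal Python sketch of the NUL
handling only.  The function names are mine, and it assumes the input is
already a proper Unicode string.  Java's flavour of modified UTF-8 also
encodes characters outside the BMP as surrogate pairs; this sketch
leaves those as plain UTF-8.

```python
def to_modified_utf8(text):
    """Encode text as UTF-8, but write NUL (U+0000) as the two bytes C0 80."""
    # In standard UTF-8 a zero byte can only come from U+0000, so a plain
    # byte-level replace is enough.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def from_modified_utf8(data):
    """Reverse the above; the byte C0 never occurs in valid standard UTF-8."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")

encoded = to_modified_utf8("a\x00b")
print(encoded)                      # b'a\xc0\x80b'
print(0 in encoded)                 # False: no zero byte in the stream
print(from_modified_utf8(encoded) == "a\x00b")  # True
```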

(in fact, C strings are *zero-terminated*, not NUL-terminated, but most
of the time this is irrelevant :p).

Also, a text file MAY contain NULs (the character); it is just
considered bad practice (nowadays?).  Don't assume you won't see any.
For example, received e-mail is *more* likely to have NULs in it than
normal text due to the quality of some mail agents out there.  I recall
Postfix would reject a *lot* of crap when we configured it to refuse to
accept NULs outside of 8-bit bodies, because Cyrus-IMAPd *refuses* any
such crap, and we wanted it bounced as early as possible.

(note that NULs are forbidden in MIME-compliant email text and ESMTP,
unless encoded or guarded by an 8-bit transfer area of known size, so
there you have it: NULs in one text format that actually forbids them!).

> Probably it is the same with some other control characters like 04 (End
> of Transmission). When I look at https://en.wikipedia.org/wiki/ASCII
> it seems like 1C (File Separator) or 1E (Record Separator) might be 
> appropriate choices for you. I'm no expert on this, though.

Well, ASCII control characters were inherited by ISO-8859-* and Unicode,
so yes, you can use them.  But so could the data file.  It would be
perfectly ok for a text data file to use the record separator control
characters to delimit records in a table, for example...
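
A tiny Python illustration of that collision (the payloads are invented
for the example):

```python
RS = "\x1e"  # ASCII/Unicode Record Separator (U+001E)

# Hypothetical payloads; the second one already uses RS internally.
payloads = ["plain text", "row1" + RS + "row2"]

joined = RS.join(payloads)
pieces = joined.split(RS)
print(len(payloads), len(pieces))  # 2 payloads in, 3 pieces out
```

Two payloads go in, three pieces come out, so the separator cannot be
told apart from the data.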

Here's a good definition of them (follow the hyperlinks for the
definition of each control character):
https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)


Here is also a proper solution: use modified UTF-8 (which encodes NUL so
that zero bytes are *never* present in the stream).  Recode every input
to modified UTF-8, then add the zero-byte separators you want.

You'll have to normalize the input data set into known charsets/encodings
and then recode them to modified UTF-8, of course.  You can't blindly
call any random data "UTF-8" (let alone modified UTF-8) and expect
things not to break horribly.
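
Roughly, the whole thing could look like the Python sketch below.  The
helper names and sample records are hypothetical, the charset of every
input is assumed to be known up front, and only the NUL part of modified
UTF-8 is handled (no surrogate-pair encoding as in Java's variant):

```python
def to_modified_utf8(text):
    # Standard UTF-8, except that NUL (U+0000) is written as C0 80.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def pack(records):
    """records: iterable of (raw_bytes, charset) pairs with a KNOWN charset."""
    encoded = [to_modified_utf8(raw.decode(charset)) for raw, charset in records]
    # The zero byte is now free to act as an out-of-band separator.
    return b"\x00".join(encoded)

def unpack(stream):
    """Split on the zero-byte separator and decode each record back to str."""
    return [part.replace(b"\xc0\x80", b"\x00").decode("utf-8")
            for part in stream.split(b"\x00")]

# Hypothetical inputs: one Latin-1 record and one UTF-8 record containing a NUL.
records = [("caf\xe9".encode("iso-8859-1"), "iso-8859-1"),
           ("x\x00y".encode("utf-8"), "utf-8")]
stream = pack(records)
print(unpack(stream))
```

If a record's declared charset is wrong, the decode step either raises
or silently produces garbage, which is exactly the "break horribly" case.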

-- 
  Henrique Holschuh

