Re: Invalid UTF-8 byte? (was: Re: utf)

To: debian-user@lists.debian.org
Subject: Re: Invalid UTF-8 byte? (was: Re: utf)
From: Henrique de Moraes Holschuh <hmh@debian.org>
Date: Wed, 4 Apr 2018 08:18:23 -0300
Message-id: <20180404111823.rcrwymrtpctajffi@khazad-dum.debian.net>
In-reply-to: <20180403135833.3156da4df8b9e11298ae6306@freenet.de>
References: <92aa2f6d-d39f-61a6-311b-f0c45b00b9c9@gmx.com> <201804020837.54725.rhkramer@gmail.com> <20180403004328.f49e19cbe32cfd5773b9e5e7@freenet.de> <201804030743.02707.rhkramer@gmail.com> <20180403135833.3156da4df8b9e11298ae6306@freenet.de>

On Tue, 03 Apr 2018, Michael Lange wrote:
> I believe (please anyone correct me if I am wrong) that "text" files
> won't contain any null byte; many text editors even refuse to open such a

Depends on the encoding.  For ASCII, ISO-8859-* and UTF-8 (and any other
modern ASCII-compatible encoding AFAIK, other than modified UTF-8), a
zero byte always maps one-to-one to the NUL character/code point.
UTF-16 and UTF-32 are another story: there, zero bytes show up inside
ordinary characters.  I don't recall how it is on other common
encodings of the 80's and 90's, though.
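
A quick way to check the one-to-one claim for the three encodings named
above is a few lines of Python.  This is only a sketch; the codec names
are the ones Python's standard library uses, and the variable names are
mine:

```python
# Encodings named in the text, under Python's codec names.
ENCODINGS = ("ascii", "iso-8859-1", "utf-8")

# A zero byte decodes to the NUL code point (U+0000) in each of them...
decoded = {enc: b"a\x00b".decode(enc) for enc in ENCODINGS}

# ...and NUL encodes back to exactly one zero byte.
encoded = {enc: "\x00".encode(enc) for enc in ENCODINGS}

for enc in ENCODINGS:
    print(enc, repr(decoded[enc]), encoded[enc])
```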

Some even-more-modern encodings (modified UTF-8 :p) simply do NOT use
bytes with the value of zero when encoding characters, so NUL is encoded
by a different sequence (the two bytes 0xC0 0x80 in modified UTF-8), and
you can safely use a byte with the value of zero for some out-of-band
control (like zero-terminated strings that can contain NULs, etc.) --
note that NUL is a character, and it might be represented by a sequence
of bytes that has nothing to do with zeroes in a particular encoding...
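
To make that concrete, here is a minimal sketch of the NUL handling
only.  The function names are made up for illustration.  Note that
Java's full modified UTF-8 also encodes characters outside the BMP as
surrogate pairs, which this sketch does not attempt:

```python
def encode_mutf8_nul(text: str) -> bytes:
    """Encode as UTF-8, but write NUL (U+0000) as the two bytes C0 80.

    In valid UTF-8 a zero byte can only come from U+0000, so a plain
    byte-level replace is enough.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_mutf8_nul(data: bytes) -> str:
    """Reverse of encode_mutf8_nul.

    The byte C0 never occurs in valid UTF-8, so the pair C0 80 cannot be
    confused with anything else.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


sample = "a\x00b"
wire = encode_mutf8_nul(sample)
print(wire)                      # the NUL shows up as \xc0\x80
print(b"\x00" in wire)           # no zero byte left in the stream
```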

(in fact, C strings are *zero-terminated*, not NUL-terminated, but most
of the time this is irrelevant :p).

Also, a text file MAY contain NULs (the character); it is just
considered bad practice (nowadays?).  Don't assume you won't see any.
For example, received e-mail is *more* likely to have NULs in it than
normal text, due to the quality of some mail agents out there.  I recall
postfix would reject a *lot* of crap when we configured it to refuse to
accept NULs outside of 8-bit bodies, because Cyrus-IMAPd *refuses* any
such crap, and we wanted it bounced as early as possible.

(note that NULs are forbidden in MIME-compliant email text and ESMTP,
unless encoded or guarded by an 8-bit transfer area of known size, so
there you have it: NULs in one text format that actually forbids them!).

> Probably it is the same with some other control characters like 04 (End
> of Transmission). When I look at https://en.wikipedia.org/wiki/ASCII
> it seems like 1C (File Separator) or 1E (Record Separator) might be 
> appropriate choices for you. I'm no expert on this, though.

Well, ASCII control characters were inherited by ISO-8859-* and Unicode,
so yes, you can use them.  But so could the data file.  It would be
perfectly ok for a text data file to use the record separator control
characters to delimit records in a table, for example...
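
A small sketch of why that bites.  The "table" below is a hypothetical
data file that already uses RS/US internally; if you then pick RS as
your own outer separator, the boundaries become ambiguous:

```python
RS = "\x1e"  # ASCII Record Separator
US = "\x1f"  # ASCII Unit Separator

# Hypothetical text data file that uses US between fields, RS between rows.
table = RS.join(US.join(row) for row in [["id", "name"], ["1", "Ana"]])

# Glue two inputs together with RS as the "outer" separator...
combined = RS.join([table, "plain text"])

# ...and the split can no longer tell a file boundary from a row boundary:
# two inputs went in, three pieces come out.
parts = combined.split(RS)
print(len(parts), parts)
```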

Here's a good definition of them (follow the hyperlinks for the
definition of each control character):
https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)

Here is also a proper solution: use modified UTF-8 (which encodes NUL so
that zero bytes are *never* present in the stream).  Encode every input
format to modified UTF-8, then add the zero-byte separators you want.

You'll have to normalize the input data set into known charset/encodings
and then recode them to modified UTF-8, of course.  You can't blindly
call any random data "UTF-8" (let alone modified UTF-8) and expect
things not to break horribly.
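
The whole recipe could look roughly like the sketch below.  It assumes
you already know the charset of each input; the helper names and the
sample inputs are invented for illustration, and only the NUL part of
modified UTF-8 is handled (no surrogate-pair encoding of non-BMP
characters):

```python
SEP = b"\x00"  # out-of-band separator; never occurs inside a record


def to_mutf8(raw: bytes, charset: str) -> bytes:
    # Decode with the charset the input is *known* to use.  Wrongly
    # labelled input raises UnicodeDecodeError instead of passing through.
    text = raw.decode(charset)
    # Re-encode as UTF-8, writing NUL as C0 80 so no zero byte remains.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def pack(inputs):
    """inputs: iterable of (raw_bytes, charset) pairs."""
    return SEP.join(to_mutf8(raw, charset) for raw, charset in inputs)


def unpack(stream: bytes):
    return [
        chunk.replace(b"\xc0\x80", b"\x00").decode("utf-8")
        for chunk in stream.split(SEP)
    ]


inputs = [
    ("caf\u00e9".encode("iso-8859-1"), "iso-8859-1"),
    (b"has\x00nul", "ascii"),
    ("na\u00efve".encode("utf-8"), "utf-8"),
]
stream = pack(inputs)
records = unpack(stream)
print(records)
```

The decode step is where "blindly calling it UTF-8" fails: feeding the
ISO-8859-1 bytes above to to_mutf8() with charset "utf-8" raises
UnicodeDecodeError rather than producing garbage.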

-- 
  Henrique Holschuh
