To: debian-user@lists.debian.org
Subject: Re: Invalid UTF-8 byte? (was: Re: utf)
From: <tomas@tuxteam.de>
Date: Mon, 2 Apr 2018 20:40:55 +0200
Message-id: <20180402184055.GA582@tuxteam.de>
In-reply-to: <20180402181838.koldjxemi4yegcqx@khazad-dum.debian.net>
References: <92aa2f6d-d39f-61a6-311b-f0c45b00b9c9@gmx.com> <201804020837.54725.rhkramer@gmail.com> <20180402130552.3warkbxqma2fhbjo@khazad-dum.debian.net> <201804021341.28941.rhkramer@gmail.com> <20180402181838.koldjxemi4yegcqx@khazad-dum.debian.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Apr 02, 2018 at 03:18:38PM -0300, Henrique de Moraes Holschuh wrote:
> On Mon, 02 Apr 2018, rhkramer@gmail.com wrote:
> > The wikipedia article is rather interesting, in a quick skim, I learned some 
> > interesting things about UTF-8, especially the property of self-
> > synchronization.
> 
> Yes, UTF-8 is a brilliant design.

Possibly relevant, definitely entertaining: Rob Pike's account
of UTF-8's gestation [1].

Yeah. Elegant design. Until the Unicode Consortium let Microsoft
near it (Byte Order Mark, I'm looking at you!).

[...]

> > I guess I have a followup question--are those two bytes (or either one of 
> > them) also unused in all possible "code pages"?  

I'm not sure what you mean here. There are (at least) two layers at work
if you have UTF-8 encoded Unicode: the byte-level encoding and the
Unicode code points it carries. As Henrique says, if you require both
to be "correct", then even more byte sequences become illegal. But
sometimes the UTF-8 encoding is used for other things (notably, Emacs
encodes a superset of Unicode, to be able to express "raw byte values"
next to "Unicode characters").
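
To make the two layers concrete, here is a small Python sketch. It is
my own illustration, not something from the thread, and the helper name
classify() is made up. It assumes Python 3's built-in "utf-8" codec,
which is strict about both layers by default, and its "surrogatepass"
error handler, which accepts surrogate code points that follow the
UTF-8 bit pattern but are not legal Unicode scalar values:

```python
def classify(raw: bytes) -> str:
    """Say at which layer (if any) a byte string fails."""
    try:
        # Strict decoding: the byte pattern AND the code points must be valid.
        raw.decode("utf-8")
        return "valid UTF-8 encoded Unicode"
    except UnicodeDecodeError:
        pass
    try:
        # Relaxed decoding: tolerate encoded surrogates (U+D800..U+DFFF).
        raw.decode("utf-8", errors="surrogatepass")
        return "UTF-8 bit pattern, but not valid Unicode"
    except UnicodeDecodeError:
        return "not UTF-8 at all"


print(classify("é".encode("utf-8")))   # well-formed on both layers
print(classify(b"\xed\xa0\x80"))       # encoded surrogate U+D800
print(classify(b"\xff"))               # byte that never occurs in UTF-8
```

The third case is the kind of byte the question is about: it fails
already at the encoding layer, whatever the code points are.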

> > The problem is that I copy snippets of text from all kinds of sources into 
> > those text files (which are formatted like mbox files), so I might find one or 
> > both of those bytes in the file already.
> 
> Then it isn't a valid unicode text file in UTF-8 format, and it needs to
> be converted (or fixed) first to be encoded in UTF-8 :-)

Agreed: if you don't know what's coming in, you'd better plan for anything :)
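
One way to "plan for anything" is to check what is already in the file
before relying on some byte being unused. The following is only a
sketch under my own assumptions: the function name
invalid_utf8_offsets() is hypothetical, and it simply asks Python's
strict "utf-8" decoder where it gives up, skips that byte, and tries
again.

```python
def invalid_utf8_offsets(raw: bytes) -> list:
    """Return the offsets of bytes that a strict UTF-8 decoder rejects."""
    offsets = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break                      # the rest of the data is clean
        except UnicodeDecodeError as err:
            bad = pos + err.start      # err.start is relative to the slice
            offsets.append(bad)
            pos = bad + 1              # resume right after the bad byte
    return offsets


sample = b"abc\xffdef\xfe"
print(invalid_utf8_offsets(sample))
```

An empty list means the data decodes cleanly as UTF-8. If the stray
bytes have to survive a round trip instead of being fixed, Python's
"surrogateescape" error handler (used for both decoding and encoding)
keeps them, which is similar in spirit to what Emacs does with raw
bytes.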

Cheers
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlrCeTcACgkQBcgs9XrR2kbRtgCfaRHoodlkFFt8Gm0Oq438ymvg
0oMAn2NkpsqMJ3Tcy5BvAJIpTvfG8mdj
=iVqF
-----END PGP SIGNATURE-----
