Re: Invalid UTF-8 byte? (was: Re: utf)

To: debian-user@lists.debian.org
Subject: Re: Invalid UTF-8 byte? (was: Re: utf)
From: <tomas@tuxteam.de>
Date: Mon, 2 Apr 2018 15:02:13 +0200
Message-id: <20180402130213.GB22111@tuxteam.de>
In-reply-to: <201804020837.54725.rhkramer@gmail.com>
References: <92aa2f6d-d39f-61a6-311b-f0c45b00b9c9@gmx.com> <0a5c15a9-0dfc-1ef3-1f64-1880def0ff1e@transient.nz> <20180402073904.GB19322@aym.net2.nerim.net> <201804020837.54725.rhkramer@gmail.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, Apr 02, 2018 at 08:37:54AM -0400, rhkramer@gmail.com wrote:
> On Monday, April 02, 2018 03:39:05 AM Andre Majorel wrote:
> > > Why? UTF (especially UTF-8) is vastly superior for all purposes:
> > I wouldn't say that. UTF-8 breaks a number of assumptions. For
> > instance,
> > 1) every character has the same size,
> > 2) every byte sequence is a valid character,
> 
> A few weeks ago, I was looking for a byte that, in UTF-8, would be a totally 
> invalid byte (not an invalid sequence of bytes).

If you look at utf-8(7) (man 7 utf-8), you'll find this for the encoding:

   Encoding
       The following byte sequences are used to represent a character.
       The sequence to be used depends on the UCS code number of the character:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

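Going by this table, 1111111x (0xFE and 0xFF) are the only two
bytes that never occur. Current UTF-8 (RFC 3629) is stricter: it
stops at 0x10FFFF, so the lead bytes 0xF5 to 0xFD are illegal too,
and 0xC0 and 0xC1 are out because they could only start an overlong
encoding. All of that holds at least currently. I don't know what
will happen once we need more code points than that. Perhaps Klingon
has an ideographic variant our linguists don't know of yet (the
alphabetical variant is in the Private Use Area and an official
place seems to be in the works [1]).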
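
A quick way to check which bytes can never show up is to encode
every code point and see which byte values are left over. Here is a
minimal sketch in Python. Note that its codec follows the current
definition (RFC 3629: four bytes at most, nothing above 0x10FFFF, no
surrogates), so it is stricter than the six-byte table quoted from
the man page. The function name is made up for this example:

```python
def bytes_used_by_utf8():
    """Collect every byte value that occurs in some valid UTF-8 sequence."""
    used = set()
    for cp in range(0x110000):          # all Unicode code points
        if 0xD800 <= cp <= 0xDFFF:      # surrogates cannot be encoded
            continue
        used.update(chr(cp).encode("utf-8"))
    return used

# Byte values that no valid UTF-8 text (per RFC 3629) can contain.
unused = sorted(set(range(256)) - bytes_used_by_utf8())
print(" ".join(f"{b:02X}" for b in unused))
```

On a current Python 3 this should list thirteen values: C0, C1 and
F5 through FF. The loop visits about 1.1 million code points, so it
takes a moment.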

So I wouldn't build my house on it. And oh, there are better
serialization "protocols" than relying on an arbitrary byte as
record separator. Why not use the ASCII control characters meant
for the job (FS, GS, RS, US at 0x1C to 0x1F) plus an escape
mechanism?
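
Something along these lines, say: a rough Python sketch that
terminates each record with ASCII RS (0x1E) and uses ESC (0x1B) as
the escape byte. The choice of ESC and the function names are mine,
purely for illustration, and this is not any standard framing:

```python
RS = 0x1E   # ASCII record separator, terminates each record
ESC = 0x1B  # escape byte (an arbitrary choice for this sketch)

def join_records(records):
    """Serialize an iterable of bytes objects into one byte string."""
    out = bytearray()
    for rec in records:
        for b in rec:
            if b in (RS, ESC):      # protect bytes with special meaning
                out.append(ESC)
            out.append(b)
        out.append(RS)
    return bytes(out)

def split_records(data):
    """Inverse of join_records; raises ValueError on truncated input."""
    records, cur, escaped = [], bytearray(), False
    for b in data:
        if escaped:                 # byte after ESC is taken literally
            cur.append(b)
            escaped = False
        elif b == ESC:
            escaped = True
        elif b == RS:
            records.append(bytes(cur))
            cur = bytearray()
        else:
            cur.append(b)
    if escaped or cur:
        raise ValueError("truncated input")
    return records
```

Because the escaping works on bytes, the payload may be UTF-8, some
other encoding, or binary, and nothing depends on a byte being
"impossible" in the data.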

Cheers

[1] http://www.klingonwiki.net/En/Unicode
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlrCKdUACgkQBcgs9XrR2kbBnQCfe5c4WVNYCcpZbsgg5dDZwBHR
XNkAn2CSUkMY59VE2zLciII/3kUz8W45
=aPFG
-----END PGP SIGNATURE-----
