
Re: Invalid UTF-8 byte? (was: Re: utf)


On Mon, Apr 02, 2018 at 08:37:54AM -0400, rhkramer@gmail.com wrote:
> On Monday, April 02, 2018 03:39:05 AM Andre Majorel wrote:
> > > Why? UTF (especially UTF-8) is vastly superior for all purposes:
> > I wouldn't say that. UTF-8 breaks a number of assumptions. For
> > instance,
> > 1) every character has the same size,
> > 2) every byte sequence is a valid character,
> A few weeks ago, I was looking for a byte that, in UTF-8, would be a totally 
> invalid byte (not an invalid sequence of bytes).

If you look at the utf-8(7) man page, you'll find, for the encoding:

       The following byte sequences are used to represent a character.
       The sequence to be used depends on the UCS code number of the character:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

This means that 1111111x (0xFE and 0xFF) are the only two bytes
that never occur in this scheme, at least currently. (RFC 3629
is stricter: it forbids overlong forms and stops at U+10FFFF, so
0xC0, 0xC1 and 0xF5-0xFD can't occur either.) I don't know what
will happen once we need code points beyond 0x7fffffff -- perhaps
Klingon has an ideographic variant our linguists don't know of
yet (the alphabetical variant is in the private use area and an
official place seems to be in the works [1]).
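
You can check the table mechanically. Here is a minimal sketch in
Python that classifies each of the 256 byte values by the bit
patterns quoted from utf-8(7). The name classify_byte and the
labels are my own, and it models only the 1-6 byte table, not the
stricter RFC 3629 rules:

```python
def classify_byte(b: int) -> str:
    """Classify one byte value by the lead/continuation patterns in utf-8(7)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b < 0x80:
        return "ascii"         # 0xxxxxxx
    if b < 0xC0:
        return "continuation"  # 10xxxxxx
    if b < 0xE0:
        return "lead-2"        # 110xxxxx
    if b < 0xF0:
        return "lead-3"        # 1110xxxx
    if b < 0xF8:
        return "lead-4"        # 11110xxx
    if b < 0xFC:
        return "lead-5"        # 111110xx
    if b < 0xFE:
        return "lead-6"        # 1111110x
    return "unused"            # 1111111x matches no row of the table


# Collect every byte value that fits no pattern in the table.
never_used = [hex(b) for b in range(256) if classify_byte(b) == "unused"]
print(never_used)  # ['0xfe', '0xff']
```

A byte that is "unused" here is invalid on its own, regardless of
its neighbours. Every other byte can still be part of an invalid
sequence, which is a separate check.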

So I wouldn't build my house on it. And oh, there are better
serialization "protocols" than relying on an arbitrary byte as
record separator. Why not use one of the low ASCII control
characters (FS, GS, RS, US) plus an escape mechanism?
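
To make that concrete, here is a rough sketch in Python, assuming
RS (0x1E) as the record terminator and ESC (0x1B) as the escape
byte. Both choices, and the names join_records and split_records,
are mine and purely illustrative. UTF-8 never uses bytes below
0x80 inside a multi-byte sequence, so these two values can only
show up in a payload as the literal control characters, and those
get escaped:

```python
RS = 0x1E   # ASCII "record separator", used here as record terminator
ESC = 0x1B  # ASCII "escape", marks the next byte as literal payload


def join_records(records):
    """Serialize an iterable of bytes objects into one byte stream."""
    out = bytearray()
    for rec in records:
        for b in rec:
            if b in (RS, ESC):
                out.append(ESC)  # escape payload bytes that look like framing
            out.append(b)
        out.append(RS)           # every record ends with an unescaped RS
    return bytes(out)


def split_records(data):
    """Inverse of join_records; raises ValueError on truncated input."""
    records, cur, escaped = [], bytearray(), False
    for b in data:
        if escaped:
            cur.append(b)        # byte after ESC is always payload
            escaped = False
        elif b == ESC:
            escaped = True
        elif b == RS:
            records.append(bytes(cur))
            cur = bytearray()
        else:
            cur.append(b)
    if escaped or cur:
        raise ValueError("input ends in the middle of a record")
    return records


payloads = ["naïve".encode("utf-8"), b"a\x1eb\x1bc", b""]
assert split_records(join_records(payloads)) == payloads
```

The point of the escape is that any payload round-trips, including
one that happens to contain the separator, so nothing depends on
some byte staying invalid forever.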


[1] http://www.klingonwiki.net/En/Unicode
-- t