
Re: utf



> On Mon, Apr 02, 2018 at 09:39:05AM +0200, Andre Majorel wrote:
> >I wouldn't say that. UTF-8 breaks a number of assumptions. For
> >instance,
> >1) every character has the same size,
> >2) every byte sequence is a valid character,
> >3) the equality or inequality of two characters comes down to
> >  the equality or inequality of the bytes they encode to.

I am sure you do not realize that none of these assumptions is really met
by any encoding, and that none of them actually buys you anything. They
are rehashed, poor arguments used to rationalize a fear of change by
people afraid their long-earned knowledge will become obsolete. I do not
know whether you fit in that category; odds are you have simply been
misinformed after running into the usual trouble during the transition.
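
A quick, purely illustrative Python sketch of the third assumption
failing before any encoding even enters the picture: canonically
equivalent texts can be made of different code points, so comparing
bytes (or code points) is not the same thing as comparing text.

    import unicodedata

    nfc = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    nfd = "e\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

    print(nfc == nfd)                                # False: different code points
    print(nfc.encode("utf-8"), nfd.encode("utf-8"))  # different bytes in any UTF
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True: same text once normalized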

Darac Marjal (2018-04-03):
> If these things matter to you, it's better to convert from UTF-8 to Unicode,
> first.

"Convert to Unicode" does not mean anything. Unicode is not a format,
and therefore you cannot convert something to it.

Unicode is a catalog of "infolinguistic entities". I do not say
characters, because they are not all characters: most of Unicode is
characters, but not all. As it stands, the principle of Unicode is that
any "pure" text can be represented as a sequence of Unicode code points.
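
A quick Python illustration (nothing more than a sketch): a string is a
sequence of code points, and some entries in the catalog are not
characters in the everyday sense.

    import unicodedata

    s = "e\u0301\u200d"   # a letter, a combining mark, a zero width joiner
    for cp in s:
        print(f"U+{ord(cp):04X}", unicodedata.name(cp))
    # U+0065 LATIN SMALL LETTER E
    # U+0301 COMBINING ACUTE ACCENT
    # U+200D ZERO WIDTH JOINER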

For storage in a file or transmission over the network, this sequence of
code points must be converted into a sequence of octets. UTF-8 is by far
the best choice for that, because it has many interesting properties.
Other encodings suffer from being incomplete, being incompatible with
ASCII, being sensitive to endianness problems, being subject to
corruption, or all of the above.
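
To make those properties concrete, a rough Python sketch (illustrative
only): ASCII passes through UTF-8 unchanged, a legacy 8-bit encoding is
incomplete, and UTF-16 drags endianness along with it.

    print("abc".encode("utf-8"))      # b'abc': ASCII bytes are left untouched
    print("héllo".encode("utf-8"))    # b'h\xc3\xa9llo': multi-byte only where needed

    try:
        "€".encode("latin-1")         # incomplete: no euro sign in ISO 8859-1
    except UnicodeEncodeError as err:
        print(err)

    print("abc".encode("utf-16-le"))  # b'a\x00b\x00c\x00'
    print("abc".encode("utf-16-be"))  # b'\x00a\x00b\x00c': same text, different bytes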

>	 I tend to think of Unicode as an arbitrarily large code page. Each
> character maps to a number, but that number could be 1, 1000 or 500_000
> (Unicode seems to be growing with no end in sight).

The twentieth century just called, it wants the "code page" idiom back.

>							    Internally, you
> might store those code points as Integers or Quad Words or whatever you
> like. Only once you're ready to transfer the text to another process (print
> on screen, save to a file, stream across a network), do you convert the
> Unicode back into UTF-8.

Internally, this is a reasonable choice in some cases, but not actually
that useful in most. Your reasoning is based on the assumption that
accessing a single Unicode code point in the string would be useful.
Most of the time, it is not. Remember that a Unicode code point is not
necessarily a character.
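
A small Python sketch of what goes wrong when you index code points and
expect characters (purely illustrative):

    s = "e\u0301"                   # one user-perceived character, two code points
    print(len(s))                   # 2
    print(s[0])                     # 'e': the accent has been silently dropped

    flag = "\U0001F1EB\U0001F1F7"   # two regional indicators, rendered as one flag
    print(len(flag), flag[:1])      # 2, and the slice is no longer a flag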

You need to choose the data structure based on the operations you intend
to perform on the text. And actually, most of the time an array of
octets in UTF-8 is the best choice for the internal representation too.
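
For instance, plain substring search works directly on the UTF-8 octets,
because a valid UTF-8 needle can only match at character boundaries:
lead bytes and continuation bytes occupy disjoint ranges. This is one
reason languages such as Rust and Go store their strings as UTF-8 byte
sequences. A hedged Python sketch:

    haystack = "naïve café".encode("utf-8")
    needle = "café".encode("utf-8")
    print(haystack.find(needle))    # byte offset of the match, -1 if absent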

> Basically, you consider UTF-8 to be a transfer-only format (like Base64). If
> you want to do anything non-trivial with it, decode it into Unicode.

No, definitely not. If somebody wants to do anything non-trivial with
text, then either they already know what they are doing better than
this, and do not need that advice, or they do not and they will get it
wrong.

Use a library. And use whatever text format that library uses.
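
A minimal sketch of what that means in practice, assuming Python's
standard unicodedata module (real projects often reach for something
like ICU): even "are these two strings the same text?" already needs
normalization and case folding, which is exactly the kind of detail a
library gets right for you.

    import unicodedata

    def same_text(a, b):
        # Canonical normalization plus case folding; still only an
        # approximation of "human equality", which is locale-dependent.
        norm = lambda s: unicodedata.normalize("NFC", s).casefold()
        return norm(a) == norm(b)

    print(same_text("Straße", "STRASSE"))   # True: ß case-folds to "ss"
    print(same_text("e\u0301", "\u00e9"))   # True: composed vs decomposed form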

The problem does not come from UTF-8 or Unicode or anything
computer-related; it comes from the very nature of written human text:
writing systems are insanely complex.

Regards,

-- 
  Nicolas George
