Re: utf

To: debian-user@lists.debian.org
Subject: Re: utf
From: Ben Caradoc-Davies <ben@transient.nz>
Date: Wed, 4 Apr 2018 08:47:50 +1200
Message-id: <53b3878e-424c-73c5-5cb5-bf5728b18b24@transient.nz>
In-reply-to: <20180403085551.GA30859@darac.org.uk>
References: <92aa2f6d-d39f-61a6-311b-f0c45b00b9c9@gmx.com> <0a5c15a9-0dfc-1ef3-1f64-1880def0ff1e@transient.nz> <20180402073904.GB19322@aym.net2.nerim.net> <20180403085551.GA30859@darac.org.uk>

On 03/04/18 20:55, Darac Marjal wrote:

> If these things matter to you, it's better to convert from UTF-8 to Unicode, first.

Fixed-length encodings like UTF-32 will not fix broken assumptions about some relationship between byte length and number of characters, because Unicode contains things like combining characters. What is the length of a string? Are you trying to count the number of glyphs? I do not think that you can do this by naïvely counting code points, regardless of encoding.
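
As a rough illustration of the point above, here is a minimal Python sketch. The helper names describe and count_non_combining are made up for this example, and skipping combining marks is only an approximation of "glyphs", not full grapheme cluster segmentation:

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

def describe(s):
    """Report the sizes of s in code points and in two encodings (hypothetical helper)."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf32_bytes": len(s.encode("utf-32-le")),  # "-le" variant writes no BOM
    }

def count_non_combining(s):
    """Count code points that are not combining marks.

    Assumption: this is only a crude stand-in for a glyph count; it
    ignores other cases such as joiners and emoji sequences.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(describe(composed))
print(describe(decomposed))
print(count_non_combining(composed), count_non_combining(decomposed))
```

Both strings render as the same accented letter, yet they differ in code point count and in byte length under UTF-8 and UTF-32 alike.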

Because there is more than one way to represent an accented character, Unicode string comparison is nontrivial:

https://en.wikipedia.org/wiki/Unicode_equivalence
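
A small Python sketch of what this means in practice, assuming canonical equivalence (NFC) is the notion of equality you want. The helper name nfc_equal is made up for this example:

```python
import unicodedata

composed = "caf\u00e9"     # ends with precomposed é
decomposed = "cafe\u0301"  # ends with e + COMBINING ACUTE ACCENT

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A plain comparison looks at code points, so these differ.
print(composed == decomposed)

# After normalization the two spellings compare equal.
print(nfc_equal(composed, decomposed))
```

Whether NFC, NFD, or one of the compatibility forms (NFKC, NFKD) is appropriate depends on the application; this only shows the canonical case.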

Kind regards,

--
Ben Caradoc-Davies <ben@transient.nz>
Director
Transient Software Limited <https://transient.nz/>
New Zealand
