Re: utf

To: debian-user@lists.debian.org
Subject: Re: utf
From: Stefan Monnier <monnier@iro.umontreal.ca>
Date: Wed, 04 Apr 2018 17:58:17 -0400
Message-id: <jwvmuyimyid.fsf-monnier+gmane.linux.debian.user@gnu.org>
References: <20180402073904.GB19322@aym.net2.nerim.net> <20180403085551.GA30859@darac.org.uk> <53b3878e-424c-73c5-5cb5-bf5728b18b24@transient.nz> <20180403205143.GA1711857@phare.normalesup.org> <20180403205635.sod2mqtsyiciwc4h@eeg.ccf.org> <20180403211956.GA1741492@phare.normalesup.org> <pa304e$m0n$1@blaine.gmane.org> <20180404170701.GA2415172@phare.normalesup.org> <pa322p$5s1$1@blaine.gmane.org> <20180404173537.GA2430845@phare.normalesup.org> <20180404174521.byk3wpwe5jlidxc3@eeg.ccf.org>

> You just seem to have Decided, for reasons known only to you, that
> The Character Length Of A String Is Not Useful.  Despite literally
> decades of programs that have used strlen() in various ways.

strlen was mostly used in contexts where char-length = byte-length =
display-width.  Most of those calls to strlen have nothing to do with
char-length as such: what they actually need is display-width or
byte-length.

In the context of Unicode, using utf-8 doesn't make byte-length any
harder to compute than with ASCII.  And display-width is a lot more
complex than strlen regardless of which encoding you use, because any
given Unicode char can have a display-width of 0, 1, or 2 (even if you
disregard proportional fonts and other fancy rendering tricks).  So
utf-8 doesn't make the computation of display-width any more complex
than utf-32.
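
To make the three notions of "length" concrete, here is a rough Python
sketch.  The names lengths() and display_width() are made up for this
illustration, and display_width() is only a heuristic built on the
unicodedata module (marks and format characters count as 0 columns,
East Asian wide/fullwidth characters as 2, everything else as 1); it
is not a substitute for a real wcwidth().

```python
import unicodedata


def display_width(s):
    """Approximate number of terminal columns for s (heuristic only)."""
    width = 0
    for ch in s:
        # Non-spacing marks, enclosing marks and format characters
        # (e.g. combining accents, zero-width joiner) take no column.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide / Fullwidth characters take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


def lengths(s):
    """Return the different 'lengths' of the same string."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf32_bytes": len(s.encode("utf-32-le")),  # no BOM
        "code_points": len(s),
        "display_width": display_width(s),
    }


if __name__ == "__main__":
    for sample in ("abc", "e\u0301", "\u65e5\u672c"):
        print(repr(sample), lengths(sample))
```

Byte-length is a single len() call in either encoding, while the
width computation has to inspect every character whether the input
arrived as utf-8 or utf-32.  Note that "e" followed by U+0301 is two
code points but one column, and the two CJK characters are two code
points but four columns, so char-length matches neither of the other
two.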

> What if the question is "Find all the English words that have an E
> in the 5th position and a U in the 7th"?

That can be answered just as easily and efficiently from a utf-8
representation of the string as from a utf-32 representation.
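
As a sketch of how that could look when working directly on utf-8
bytes: count characters by skipping continuation bytes (those of the
form 10xxxxxx) and compare the lead byte at the wanted positions.
The helper names matches() and find_words() are hypothetical, the
wanted letters are assumed to be ASCII, and "position" here means
code point position, not grapheme cluster.

```python
def is_continuation(b):
    """True if byte b is a utf-8 continuation byte (10xxxxxx)."""
    return (b & 0xC0) == 0x80


def matches(word, wanted):
    """Check a utf-8 encoded word against position constraints.

    word   -- bytes, assumed to be valid utf-8
    wanted -- dict mapping 1-based character position to the ASCII
              byte value expected at that position
    """
    pos = 0
    remaining = len(wanted)
    for b in word:
        if is_continuation(b):
            continue  # still inside the previous character
        pos += 1
        if pos in wanted:
            # An ASCII byte in utf-8 is always a whole character, so
            # comparing against the lead byte is sufficient.
            if b != wanted[pos]:
                return False
            remaining -= 1
            if remaining == 0:
                return True
    return False  # word too short to satisfy every constraint


def find_words(words, wanted):
    """Return the utf-8 encoded words that satisfy wanted."""
    return [w for w in words if matches(w, wanted)]


if __name__ == "__main__":
    wanted = {5: ord("e"), 7: ord("u")}
    # "\u00f1andexus" is a made-up string, only there to exercise a
    # multi-byte character before position 5.
    words = [w.encode("utf-8") for w in
             ("prosecute", "persecute", "pasteur", "\u00f1andexus")]
    print([w.decode("utf-8") for w in find_words(words, wanted)])
```

The scan stops as soon as the last constrained position has been
checked, so for this question it only ever looks at the first few
bytes of each word.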


        Stefan
