Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Bug#826256: locales: wrong width for hexagrams (and possibly others) in 2.22
From: Thorsten Glaser <tg@mirbsd.de>
Date: Fri, 03 Jun 2016 19:29:27 +0200
Message-id: <146497496703.29811.8004706135756649870.reportbug@tglase.lan.tarent.de>
Reply-to: Thorsten Glaser <tg@mirbsd.de>, 826256@bugs.debian.org

Package: locales
Version: 2.22-0experimental0
Severity: normal
Tags: upstream

Starting with locales 2.22-0experimental0, some chars have the wrong
width; downgrading locales to 2.21-9 fixes the bugs.

Test program:

tglase@tglase:~ $ cat x.c
#define _XOPEN_SOURCE
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define D(x) printf("%04X %d\n",(x),wcwidth(x))

int
main(void)
{
	setlocale(LC_ALL, "");

	D(0x41);
	D(0x0300);
	D(0x3000);
	D(0x4DC0);
	D(0xFFFD);
	return (0);
}
tglase@tglase:~ $ gcc x.c
tglase@tglase:~ $ rm -rf tloc; mkdir tloc
tglase@tglase:~ $ localedef -i en_US -c -f UTF-8 tloc/en_US.UTF-8
tglase@tglase:~ $ LOCPATH=$PWD/tloc LC_ALL=en_US.UTF-8 ./a.out
0041 1
0300 0
3000 2
4DC0 1
FFFD 1

Output when the locale was compiled by localedef (into tlocx) while
locales_2.21-9_all.deb was installed:

tglase@tglase:~ $ LOCPATH=$PWD/tlocx LC_ALL=en_US.UTF-8 ./a.out
0041 1
0300 0
3000 2
4DC0 2
FFFD 1

This is because /usr/share/i18n/charmaps/UTF-8.gz now lacks
entries for 4DC0‥4FFF.
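
Since other ranges may be affected as well, a scan along the following
lines can list them. This is only a sketch: report_width_mismatches is
a hypothetical helper name, not part of any library, and the width
function is passed in so that wcwidth(3) or any other implementation
can be checked the same way.

```c
#define _XOPEN_SOURCE
#include <stdio.h>
#include <wchar.h>

/*
 * Hypothetical helper: walk the code points first..last (inclusive),
 * print each one whose width as reported by wf differs from expect,
 * and return how many such code points were found.
 * Assumes last is smaller than the largest unsigned long value.
 */
static unsigned long
report_width_mismatches(unsigned long first, unsigned long last,
    int expect, int (*wf)(wchar_t))
{
	unsigned long cp, n = 0;

	for (cp = first; cp <= last; ++cp) {
		int w = wf((wchar_t)cp);

		if (w != expect) {
			/* same output format as the D() macro above */
			printf("%04lX %d\n", cp, w);
			++n;
		}
	}
	return (n);
}
```

Called as report_width_mismatches(0x4DC0, 0x4DFF, 2, wcwidth) after
setlocale(LC_ALL, ""), it lists every hexagram whose width is not 2 in
the current locale; the output from two locale builds can then be
compared with diff(1).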

According to my own code implementing Unicode in another operating
system, with focus on wcwidth(3), after parsing EastAsianWidth.txt
special handling is needed to set widths of 0xFF00, 0x3248‥0x324F,
and 0x4DC0‥0x4DFF to “wide”, as they’re “neutral” normally – which
can be either – but display on a fixed-width terminal is otherwise
impossible. (Chars outside the BMP were not considered – there may
be others needing such handling… personally, I’d say at least all
emoji need to be fullwidth, but there’s no standard backing that
yet.)
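
A minimal sketch of that post-processing step, assuming the width
derived from EastAsianWidth.txt is already available as base_width.
The struct, table and function names are hypothetical (this is not the
MirBSD or glibc code); only the ranges are the ones named above.

```c
#include <stddef.h>

/* Hypothetical type: an inclusive range of code points */
struct width_range {
	unsigned long first;
	unsigned long last;
};

/* "Neutral" ranges from this report that get forced to wide */
static const struct width_range force_wide[] = {
	{ 0x3248, 0x324F },
	{ 0x4DC0, 0x4DFF },	/* hexagrams */
	{ 0xFF00, 0xFF00 }
};

/*
 * Return 2 if cp lies in one of the forced ranges; otherwise pass
 * through base_width, i.e. whatever the EastAsianWidth.txt parser
 * produced for this code point.
 */
static int
apply_wide_override(unsigned long cp, int base_width)
{
	size_t i;

	for (i = 0; i < sizeof(force_wide) / sizeof(force_wide[0]); ++i)
		if (cp >= force_wide[i].first && cp <= force_wide[i].last)
			return (2);
	return (base_width);
}
```

With this, apply_wide_override(0x4DC0, 1) yields 2, while characters
outside the table, such as 0x41 with a base width of 1, keep their
width. A linear scan is enough for three ranges; a larger table would
want a sorted array and a binary search.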

Rationale here: compatibility with wcwidth(3) implementations such
as the one in xterm. (I’ve done the code in MirBSD to generate the
data for my new wcwidth(3) implementation carefully so that – when
using the same Unicode version as Markus Kuhn did – both implemen‐
tations return the same width for all characters.)

This is especially important as I happen to use ䷐ (U+4DD0) for UI
elements, and now all I get is a half-width replacement character,
due to X11 font selection choosing the half-width font part, for a
full-width character cell with an empty right half.

-- System Information:
Debian Release: stretch/sid
  APT prefers unreleased
  APT policy: (500, 'unreleased'), (500, 'buildd-unstable'), (500, 'unstable')
Architecture: x32 (x86_64)
Foreign Architectures: i386, amd64

Kernel: Linux 4.5.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=C, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/lksh
Init: sysvinit (via /sbin/init)

Versions of packages locales depends on:
ii  debconf [debconf-2.0]  1.5.59
ii  libc-bin               2.22-10
ii  libc-l10n              2.22-10

locales recommends no packages.

locales suggests no packages.

-- debconf information:
  locales/locales_to_be_generated:
  locales/default_environment_locale: None
