Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages

To: debian-www@lists.debian.org
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
From: Tomohiro KUBOTA <debian@tmail.plala.or.jp>
Date: Wed, 08 Jan 2003 10:23:58 +0900 (JST)
Message-id: <20030108.102358.91281532.debian@tmail.plala.or.jp>
In-reply-to: <20030107.212924.60854018.debian@tmail.plala.or.jp>
References: <20030105.093611.01369432.debian@tmail.plala.or.jp> <20030107113137.GB28722@cibalia.gkvk.hr> <20030107.212924.60854018.debian@tmail.plala.or.jp>

Hi,

From: Tomohiro KUBOTA <debian@tmail.plala.or.jp>
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Tue, 07 Jan 2003 21:29:24 +0900 (JST)

> Anyway, though I don't know such a module, your way can be very easily
> implemented.  I think the easiest one is like following:
> 
>       $name =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;

I wrote a new filter which
  - assumes the input string is UTF-8 if it can be interpreted as such,
  - assumes it is ISO-8859-1 if not.

Since the UTF-8 encoding is relatively strict, a string intended as
ISO-8859-1 is unlikely to be mistaken for UTF-8.  I confirmed that
people.names contains no 8-bit octet sequence which can be interpreted
as UTF-8.  (An isolated 8-bit octet can never be valid UTF-8; in UTF-8,
8-bit octets always appear in sequences of two or more.)
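
The rule in the parenthesis can be checked with a few lines of Perl.
This is only a minimal sketch: looks_like_utf8 is a hypothetical helper,
not part of the filter, and like the filter it checks only the structure
of the sequences (lead octet plus continuation octets), not overlong
forms.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the octet string consists only of ASCII
# octets and structurally complete UTF-8 sequences.
sub looks_like_utf8 {
    my ($octets) = @_;
    return $octets =~ /\A(?:
          [\x00-\x7f]                    # ASCII
        | [\xc0-\xdf][\x80-\xbf]         # 2-octet sequence
        | [\xe0-\xef][\x80-\xbf]{2}      # 3-octet sequence
        | [\xf0-\xf7][\x80-\xbf]{3}      # 4-octet sequence
    )*\z/x ? 1 : 0;
}

# "Jos\xe9"     : e-acute as one ISO-8859-1 octet (isolated 8-bit octet)
# "Jos\xc3\xa9" : the same letter as a two-octet UTF-8 sequence
for my $name ("Jos\xe9", "Jos\xc3\xa9") {
    printf "%-10s %s\n", unpack("H*", $name),
        looks_like_utf8($name) ? "UTF-8" : "not UTF-8";
}
```

An ISO-8859-1 string is misdetected only when every 8-bit octet happens
to sit in such a sequence, for example A-tilde followed by the copyright
sign (\xc3\xa9), which is improbable in a personal name.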

With this filter, my concern is completely resolved.  Also, you don't
need to worry about future maintenance work when a new maintainer uses
8-bit characters in his/her name.

#!/usr/bin/perl

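# Convert an octet string to 7-bit SGML text: every non-ASCII character
# becomes a numeric character reference (&#NNN;).  The input is treated
# as UTF-8 if every 8-bit octet belongs to a UTF-8 sequence, and as
# ISO-8859-1 otherwise.
sub from_utf8_or_iso88591_to_sgml ($) {
    my $str=$_[0];
    my $strsave = $str;
    if ($str !~ /[\x80-\xff]/) {
	# Pure ASCII: return it unchanged to save machine time.
	return $str;
    }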
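    # Replace each multi-octet sequence with its code point, longest
    # sequences first.
    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    $str =~ s/([\xf0-\xf7])([\x80-\xbf])([\x80-\xbf])([\x80-\xbf])/
	"&#" .
	((ord($1)&0x7)* 0x40000 +
	(ord($2)&0x3f)* 0x1000 +
	(ord($3)&0x3f)* 0x40 +
	(ord($4)&0x3f)) . ";"/eg;
    # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
    $str =~ s/([\xe0-\xef])([\x80-\xbf])([\x80-\xbf])/
	"&#" .
	((ord($1)&0xf)* 0x1000 +
	(ord($2)&0x3f)* 0x40 +
	(ord($3)&0x3f)) . ";"/eg;
    # 2 octets: 110xxxxx 10xxxxxx
    $str =~ s/([\xc0-\xdf])([\x80-\xbf])/
	"&#" .
	((ord($1)&0x1f)* 0x40 +
	(ord($2)&0x3f)) . ";"/eg;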
    if ($str !~ /[\x80-\xff]/) {
	# $str is UTF-8 compliant, assume UTF-8.
	return $str;
    } else {
	# $str is not UTF-8 compliant, assume ISO-8859-1.
	$strsave =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;
	return $strsave;
    }
}

while(<>) {
    chomp($_);
    # chomp removed the newline, so put it back on output.
    print from_utf8_or_iso88591_to_sgml($_), "\n";
}
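
For comparison, the same "UTF-8 if possible, ISO-8859-1 otherwise"
decision could probably be delegated to the Encode module that ships
with Perl 5.8 and later.  This is a sketch of a different technique, not
the filter above: octets_to_sgml is a hypothetical name, and the strict
'UTF-8' decoder also rejects overlong and surrogate sequences, which the
regex-based filter accepts.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical alternative: let Encode validate the UTF-8.
sub octets_to_sgml {
    my ($octets) = @_;
    # Decode a copy, because Encode may modify its argument in place
    # when a CHECK value such as FB_CROAK is given.
    my $copy  = $octets;
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    # Strict decoding failed: treat every octet as ISO-8859-1.
    $chars = decode('ISO-8859-1', $octets) unless defined $chars;
    # Replace each non-ASCII character by a numeric character reference.
    $chars =~ s/([^\x00-\x7f])/'&#' . ord($1) . ';'/eg;
    return $chars;
}

# The same name in UTF-8 and in ISO-8859-1, plus a pure ASCII string.
for my $name ("Jos\xc3\xa9", "Jos\xe9", "plain") {
    print octets_to_sgml($name), "\n";
}
```

Both encodings of the name come out as "Jos&#233;", and the ASCII string
passes through unchanged.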
