Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Hi,
From: Tomohiro KUBOTA <debian@tmail.plala.or.jp>
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Tue, 07 Jan 2003 21:29:24 +0900 (JST)
> Anyway, though I don't know such a module, your way can be very easily
> implemented. I think the easiest one is like following:
>
> $name =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;
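For illustration, here is a minimal sketch of what that one-liner does to the same name in both encodings. The wrapper name bytes_to_sgml and the sample strings are made up for this example; they are not entries from people.names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The quoted one-liner wrapped in a sub (the sub name is hypothetical).
sub bytes_to_sgml {
    my ($name) = @_;
    # Replace every 8-bit byte with a numeric character reference.
    $name =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;
    return $name;
}

# "Jose" with e-acute, once as ISO-8859-1 and once as UTF-8.
my $latin1 = "Jos\xe9";
my $utf8   = "Jos\xc3\xa9";

print bytes_to_sgml($latin1), "\n";   # Jos&#233;
print bytes_to_sgml($utf8), "\n";     # Jos&#195;&#169;
```

The ISO-8859-1 input comes out right, but the UTF-8 input is converted byte by byte and yields two wrong characters, so the encoding has to be guessed before converting.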
I wrote a new filter which
- assumes the input string is UTF-8 if it can be interpreted as such, and
- assumes it is ISO-8859-1 otherwise.
Since the UTF-8 encoding is relatively strict, a string intended as
ISO-8859-1 is unlikely to be mistaken for UTF-8. I confirmed that
people.names contains no 8-bit octet sequence which can be interpreted
as UTF-8. (An isolated 8-bit character cannot be UTF-8; in UTF-8, 8-bit
characters always appear in runs of two or more.)
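The strictness argument can be sketched as a single anchored pattern. This is only an illustration: the helper name looks_like_utf8 and the sample names are hypothetical, and the pattern checks byte shapes only (lead byte plus the right number of continuation bytes), without rejecting overlong forms.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every 8-bit byte in $bytes belongs to
# a lead byte followed by the right number of continuation bytes, else 0.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\x00-\x7f]                  # ASCII
        | [\xc0-\xdf][\x80-\xbf]       # 2-byte sequence
        | [\xe0-\xef][\x80-\xbf]{2}    # 3-byte sequence
        | [\xf0-\xf7][\x80-\xbf]{3}    # 4-byte sequence
    )*\z/x ? 1 : 0;
}

print looks_like_utf8("M\xc3\xbcller"), "\n";   # 1: u-umlaut as UTF-8
print looks_like_utf8("M\xfcller"), "\n";       # 0: u-umlaut as ISO-8859-1
print looks_like_utf8("Ren\xe9e"), "\n";        # 0: e-acute as ISO-8859-1
```

An ISO-8859-1 string passes this test only if its accented characters happen to line up as a lead byte followed by continuation bytes (for example 0xC3 0xA9, which reads as "Ã©" in ISO-8859-1), which is rare in real names.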
With this filter, my concern is completely solved. Also, you don't need
to worry about future maintenance work when a new maintainer uses 8-bit
characters in his/her name.
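#!/usr/bin/perl
use strict;
use warnings;

# Convert a byte string to ASCII with SGML numeric character references.
# The input is treated as UTF-8 if every 8-bit byte belongs to a
# well-formed multibyte sequence, and as ISO-8859-1 otherwise.
sub from_utf8_or_iso88591_to_sgml ($) {
    my $str = $_[0];
    my $strsave = $str;
    if ($str !~ /[\x80-\xff]/) {
        # Pure ASCII: return it unchanged to save machine time.
        return $str;
    }
    # 4-byte sequences: 3 + 6 + 6 + 6 payload bits.
    $str =~ s/([\xf0-\xf7])([\x80-\xbf])([\x80-\xbf])([\x80-\xbf])/
        "&#" .
        ((ord($1)&0x7)  * 0x40000 +
         (ord($2)&0x3f) * 0x1000 +
         (ord($3)&0x3f) * 0x40 +
         (ord($4)&0x3f)) . ";"/eg;
    # 3-byte sequences: 4 + 6 + 6 payload bits.
    $str =~ s/([\xe0-\xef])([\x80-\xbf])([\x80-\xbf])/
        "&#" .
        ((ord($1)&0xf)  * 0x1000 +
         (ord($2)&0x3f) * 0x40 +
         (ord($3)&0x3f)) . ";"/eg;
    # 2-byte sequences: 5 + 6 payload bits.
    $str =~ s/([\xc0-\xdf])([\x80-\xbf])/
        "&#" .
        ((ord($1)&0x1f) * 0x40 +
         (ord($2)&0x3f)) . ";"/eg;
    if ($str !~ /[\x80-\xff]/) {
        # No 8-bit byte is left over: $str is UTF-8 compliant, assume UTF-8.
        return $str;
    } else {
        # $str is not UTF-8 compliant, assume ISO-8859-1 and convert
        # the saved original byte by byte.
        $strsave =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;
        return $strsave;
    }
}

while (<>) {
    chomp($_);
    # chomp removed the newline; print it again so that each input
    # line stays on its own output line.
    print from_utf8_or_iso88591_to_sgml($_), "\n";
}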
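As a possible alternative to the hand-written patterns, the same "UTF-8 first, ISO-8859-1 otherwise" rule could be sketched with the Encode module that ships with Perl 5.8 and later. This is a different technique from the filter above, not a drop-in copy of it: the sub name to_sgml_via_encode is hypothetical, and Encode's strict 'UTF-8' decoder also rejects overlong and out-of-range sequences, so a few inputs may be classified differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical name; same guessing rule as the filter, via Encode.
sub to_sgml_via_encode {
    my ($bytes) = @_;
    # decode() may modify its source when a CHECK value is given,
    # so work on a copy.
    my $copy  = $bytes;
    # With FB_CROAK, decode() dies on malformed UTF-8; eval traps that.
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    # Not valid UTF-8: assume ISO-8859-1, where every byte is a character.
    $chars = decode('ISO-8859-1', $bytes) unless defined $chars;
    # Replace every non-ASCII character with a numeric character reference.
    $chars =~ s/([^\x00-\x7f])/"&#".ord($1).";"/eg;
    return $chars;
}

print to_sgml_via_encode("Jos\xc3\xa9"), "\n";    # Jos&#233; (UTF-8 input)
print to_sgml_via_encode("Jos\xe9"), "\n";        # Jos&#233; (ISO-8859-1 input)
print to_sgml_via_encode("\xe2\x82\xac"), "\n";   # &#8364;   (euro sign, UTF-8)
```

Both spellings of the sample name end up as the same reference, which is the point of guessing the encoding before converting.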