
Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages



Hi,

From: Tomohiro KUBOTA <debian@tmail.plala.or.jp>
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Tue, 07 Jan 2003 21:29:24 +0900 (JST)

> Anyway, though I don't know such a module, your way can be very easily
> implemented.  I think the easiest one is like following:
> 
>       $name =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;

I wrote a new filter which
  - assumes the input string is UTF-8 if it can be interpreted as such,
  - assumes it is ISO-8859-1 if not.

Since the UTF-8 encoding is relatively strict, it is unlikely that a
string intended as ISO-8859-1 will be wrongly taken for UTF-8.  I confirmed
that people.names contains no 8bit octet sequence that can be interpreted
as UTF-8.  (A lone 8bit octet cannot be valid UTF-8; in UTF-8, 8bit octets
always appear in sequences of two or more.)
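
To illustrate the point about strictness, here is a minimal sketch of the
well-formedness test that the filter performs implicitly.  The helper name
looks_like_utf8 is hypothetical (it is not part of the filter), and it uses
the same loose byte ranges as the filter, so it does not reject overlong
forms or surrogates as a full validator would.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: returns 1 if the byte string
# consists solely of ASCII octets and well-formed 2-, 3- or 4-octet
# sequences (lead byte followed by the right number of continuation bytes),
# and 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\x00-\x7f]
        | [\xc0-\xdf][\x80-\xbf]
        | [\xe0-\xef][\x80-\xbf]{2}
        | [\xf0-\xf7][\x80-\xbf]{3}
    )*\z/x ? 1 : 0;
}

print looks_like_utf8("Jos\xc3\xa9"), "\n";   # e-acute encoded as UTF-8
print looks_like_utf8("Jos\xe9"), "\n";       # e-acute encoded as ISO-8859-1
```

The first call prints 1 and the second prints 0: in ISO-8859-1 the lone
octet 0xE9 looks like a UTF-8 lead byte, but the continuation octets it
would require are missing.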

With this filter, my concern is completely resolved.  Also, you don't need
to worry about future maintenance work when a new maintainer uses 8bit
characters in his/her name.

#!/usr/bin/perl

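# Convert a byte string to ASCII with SGML numeric character references.
# The input is treated as UTF-8 if every 8bit octet belongs to a
# well-formed multibyte sequence, and as ISO-8859-1 otherwise.
sub from_utf8_or_iso88591_to_sgml ($) {
    my $str=$_[0];
    my $strsave = $str;    # untouched copy for the ISO-8859-1 fallback
    if ($str !~ /[\x80-\xff]/) {
        # Pure ASCII: return it unchanged to save machine time.
        return $str;
    }
    # 4-octet sequences: 3 + 6 + 6 + 6 payload bits.
    $str =~ s/([\xf0-\xf7])([\x80-\xbf])([\x80-\xbf])([\x80-\xbf])/
        "&#" .
        ((ord($1)&0x7)* 0x40000 +
        (ord($2)&0x3f)* 0x1000 +
        (ord($3)&0x3f)* 0x40 +
        (ord($4)&0x3f)) . ";"/eg;
    # 3-octet sequences: 4 + 6 + 6 payload bits.
    $str =~ s/([\xe0-\xef])([\x80-\xbf])([\x80-\xbf])/
        "&#" .
        ((ord($1)&0xf)* 0x1000 +
        (ord($2)&0x3f)* 0x40 +
        (ord($3)&0x3f)) . ";"/eg;
    # 2-octet sequences: 5 + 6 payload bits.
    $str =~ s/([\xc0-\xdf])([\x80-\xbf])/
        "&#" .
        ((ord($1)&0x1f)* 0x40 +
        (ord($2)&0x3f)) . ";"/eg;
    if ($str !~ /[\x80-\xff]/) {
        # No 8bit octet is left over: $str is UTF-8 compliant, assume UTF-8.
        return $str;
    } else {
        # Stray 8bit octets remain: $str is not UTF-8 compliant,
        # assume ISO-8859-1 and convert the saved original octet by octet.
        $strsave =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;
        return $strsave;
    }
}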

while(<>) {
    chomp($_);
    # chomp removed the newline, so put it back on output.
    print from_utf8_or_iso88591_to_sgml($_), "\n";
}
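
As a worked example of the mask-and-multiply arithmetic in the filter, the
following sketch decodes a single multibyte sequence to its code point.
The helper name utf8_seq_to_codepoint is hypothetical, and it assumes its
argument is exactly one well-formed 2-, 3- or 4-octet sequence; it does no
validation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: decode one well-formed
# UTF-8 sequence (2 to 4 octets) into its code point.
sub utf8_seq_to_codepoint {
    my @bytes = map { ord($_) } split //, $_[0];
    my $lead  = shift @bytes;
    # The number of continuation octets determines how many payload bits
    # the lead octet carries: 3, 4 or 5.
    my $cp = @bytes == 3 ? $lead & 0x07
           : @bytes == 2 ? $lead & 0x0f
           :               $lead & 0x1f;
    # Each continuation octet contributes its low 6 bits.
    $cp = $cp * 0x40 + ($_ & 0x3f) for @bytes;
    return $cp;
}

printf "U+%04X\n", utf8_seq_to_codepoint("\xe3\x81\x82");  # HIRAGANA LETTER A
printf "U+%04X\n", utf8_seq_to_codepoint("\xc3\xa9");      # e with acute
```

For the octets E3 81 82 this computes 3 * 0x1000 + 1 * 0x40 + 2 = 12354,
which is U+3042, so the filter would emit &#12354; for that sequence.  The
octets C3 A9 give 3 * 0x40 + 0x29 = 233, the same &#233; that the
ISO-8859-1 branch produces for the single octet E9.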

