Package description page is not compliant to multibyte characters



Hi,

I received a report that the Japanese translations of package
description pages (by the Debian Description Translation Project)
are broken.  For example:

    http://ddtp.debian.org/packages.debian.org/stable/admin/apmd.ja.html

(It may be hard to tell that the page is broken if you
cannot read Japanese.)

Analysis:

This page seems to be generated by the script
gluck.debian.org:/org/packages.debian.org/htmlscripts/pages.pl,
which writes the first character of a package's long
description in a larger font:

    $long_desc =~ /^([^&]|&[^;]+;)/;
    $first = $1;
    $rest = substr($long_desc,length($first));
    $package_page .= "<p style=\"text-align: justify\"><font size=\"+2\">$first</font>$rest\n";

However, in multibyte encodings such as EUC-JP (Japanese),
a character may consist of multiple bytes.  On the other
hand, the expression [^&] matches one *byte* rather than
one *character*.  Thus, when the first character of the
long description is a multibyte character, $first will be
only the first byte of that character, not the entire
character.  The <font> tag then ends up in the middle of the
character, and the bytes on either side of it are no longer
valid text.
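
The effect can be reproduced outside pages.pl.  The following is
only a minimal sketch: the sample string (the two hiragana
characters U+3042 and U+3044, written as raw EUC-JP bytes) is my
own choice for illustration, and the three lines in the middle
are the ones quoted above.

```perl
use strict;
use warnings;

# Two Japanese characters in EUC-JP, two bytes each (sample data,
# not taken from a real package description).
my $long_desc = "\xA4\xA2\xA4\xA4";

# Same logic as the excerpt from pages.pl.
$long_desc =~ /^([^&]|&[^;]+;)/;
my $first = $1;
my $rest  = substr($long_desc, length($first));

# $first holds half a character; the other half is left in $rest.
printf "first: %d byte(s), rest: %d byte(s)\n",
    length($first), length($rest);
# prints "first: 1 byte(s), rest: 3 byte(s)"
```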

Solution:

The right way is to make the script multibyte-compliant.  It may
be difficult to support arbitrary encodings.  However, it may
be easy to support a limited range of multibyte encodings
which are possible candidates for Debian web pages (such as
"EUC-JP, EUC-KR, GB2312, Big5, Big5HKSCS, and UTF-8").
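
One possible shape for this, as a sketch only: decode the bytes
into characters with the Encode module (bundled with Perl 5.8 and
later), apply the same pattern, and encode the pieces again.  The
helper name split_first_char and its $charset argument are
hypothetical; I am assuming the script knows which encoding each
translated page uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not part of pages.pl.
# $bytes:   long description as raw bytes
# $charset: encoding name known to Encode, e.g. 'euc-jp' (assumed
#           to be available to the caller)
sub split_first_char {
    my ($bytes, $charset) = @_;

    # Convert bytes to a character string so that [^&] matches
    # one character instead of one byte.
    my $text = decode($charset, $bytes);

    # Same pattern as before: one character or one &entity;
    my ($first) = $text =~ /^([^&]|&[^;]+;)/
        or return ('', $bytes);
    my $rest = substr($text, length($first));

    # Convert both pieces back to the page's encoding for output.
    return (encode($charset, $first), encode($charset, $rest));
}

my ($first, $rest) = split_first_char("\xA4\xA2\xA4\xA4", 'euc-jp');
printf "first: %d byte(s), rest: %d byte(s)\n",
    length($first), length($rest);
# prints "first: 2 byte(s), rest: 2 byte(s)"
```

Going through Encode avoids writing separate lead-byte rules for
each encoding, at the cost of requiring a Perl version that ships
the module.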

I heard that there is another solution, like the following:

  <style>
  <!--
  p.description {text-align: justify;}
  p.description:first-letter {font-size: 150%;}
  -->
  </style>

and

  <p class="description">This is a long package description.</p>

Though the result of this solution is environment-dependent,
at least it never makes the content unreadable.


However, another solution is to give up the decoration of
using a larger font for the first character.  I think this
might be a good solution because "using a larger font for
the first character" cannot be truly universal.  Imagine
Arabic characters.  How can the first character of an Arabic
word be made larger?  Though we don't have an Arabic translation
yet, we may have one in the future.

Thus, my suggestion is to give up the decoration.  However,
I would appreciate any other solution that stops breaking
the content.

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/



