coreutils wc count multi bytes question

To: debian-devel@lists.debian.org
Subject: coreutils wc count multi bytes question
From: Neo Anderson <javadeveloper999@yahoo.co.uk>
Date: Fri, 6 Feb 2009 15:18:51 -0800 (PST)
Message-id: <214248.49608.qm@web24710.mail.ird.yahoo.com>
Reply-to: javadeveloper999@yahoo.co.uk

Hi

I'm not sure whether this is the right place to ask, but after looking through the mailing lists at http://www.debian.org/MailingLists/subscribe I couldn't find a better one for my question, so I'm asking it here.

My question is: can wc count multi-byte characters, such as Chinese text in Big5 or UTF-8? If not, maybe I can help modify the source so that it counts such words directly.

Env: kernel 2.6.27.8 / wc 6.10 / gcc version 4.3.2 / Debian lenny / LANG en_US.UTF-8

I have a file, named e.g. abc, which contains Chinese and English characters. It should look like the line below (I'm not sure whether it will display correctly on the mailing list):

this is a 文件 vi 打的

Counting by hand, I get 8 words, treating each Chinese character as one word. But wc -w prints 6. It seems to split tokens on white space, so Chinese characters written next to each other are treated as a single word, which gives a total word count of 6.
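
To make the two counts concrete, here is a minimal Python sketch. It is not the wc source: the function names are made up for illustration, and the CJK check assumes only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) matters. It counts white-space-separated tokens, as wc -w appears to do, and then counts each Chinese character as its own word.

```python
def whitespace_words(text):
    """Count tokens separated by white space, as wc -w appears to do."""
    return len(text.split())


def is_cjk(ch):
    """Assumption: only the basic CJK Unified Ideographs block is checked."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_aware_words(text):
    """Count each CJK ideograph as one word; other runs split on white space."""
    count = 0
    in_word = False  # True while inside a run of non-CJK, non-space characters
    for ch in text:
        if is_cjk(ch):
            count += 1
            in_word = False
        elif ch.isspace():
            in_word = False
        elif not in_word:
            count += 1
            in_word = True
    return count


line = "this is a 文件 vi 打的"
print(whitespace_words(line))  # prints 6
print(cjk_aware_words(line))   # prints 8
```

Whether one ideograph really equals one word is a separate question; the sketch only reproduces the manual count of 8 described above.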

I checked the source, and it does not seem to check whether the input characters are multi-byte (e.g. by using wchar_t). So basically I just want to check whether this has been done already.
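
Whether a counter looks at raw bytes or at decoded characters matters for this input. As a rough illustration (a sketch only, with a hypothetical helper name, and assuming the sample line has no trailing newline), the following Python compares the byte length and the character length of the same line in UTF-8 and in Big5. This is the same distinction that wc -c (bytes) and wc -m (characters) draw.

```python
def byte_and_char_counts(text, encoding):
    """Return (number of bytes, number of characters) for text in encoding."""
    data = text.encode(encoding)
    # Decoding again mimics reading the file as multi-byte characters.
    return len(data), len(data.decode(encoding))


line = "this is a 文件 vi 打的"
for enc in ("utf-8", "big5"):
    print(enc, byte_and_char_counts(line, enc))
# prints: utf-8 (26, 18)
# prints: big5 (22, 18)
```

Each of the four Chinese characters takes 3 bytes in UTF-8 and 2 bytes in Big5, while the character count stays at 18 in both encodings.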

Thanks for your help,

Follow-Ups:
- Re: coreutils wc count multi bytes question
  - From: Samuel Thibault <samuel.thibault@ens-lyon.org>
