coreutils wc multi-byte character count question
Not sure whether this is the right place to ask, but after searching the mailing lists at http://www.debian.org/MailingLists/subscribe I couldn't find a better one for my question, so I'm asking it here.
My question is: can wc count multi-byte characters, such as Big5/UTF-8 Chinese? If not, maybe I can help modify the source so that it counts words correctly.
Env: kernel 184.108.40.206 / wc 6.10 / gcc 4.3.2 / Debian lenny / LANG en_US.UTF-8
I have a file, named e.g. abc, which contains Chinese and English characters. It may display as below (not sure whether it will be visible on the mailing list):
this is a 文件 vi 打的
Counting by hand, I get 8 words (counting each Chinese character as a word), but wc -w outputs 6. It seems the input is split into tokens on whitespace, so Chinese characters written next to each other are treated as a single word, which gives a total count of 6.
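The behavior is easy to reproduce from the command line; wc -w splits on whitespace only, while wc -c shows the byte length of the UTF-8 text:

```shell
# wc -w splits on whitespace, so each run of Chinese characters
# is counted as one word regardless of locale.
printf 'this is a 文件 vi 打的\n' | wc -w   # 6
# Each of the four Chinese characters is 3 bytes in UTF-8:
printf 'this is a 文件 vi 打的\n' | wc -c   # 27
```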
I checked the source, and it does not seem to check whether the input characters are multi-byte (e.g. using wchar_t). So basically I just want to check whether this has already been done.
Thanks for the help,