
coreutils wc count multi bytes question



Hi

I am not sure whether this is the right place to ask, but after looking through the mailing lists at http://www.debian.org/MailingLists/subscribe I could not find a better one for my question, so I am asking it here.

My question is: can wc count multi-byte characters, such as Chinese in Big5 or UTF-8? If not, maybe I can help modify the source so that it counts such words directly.

Env: kernel 2.6.27.8 / wc 6.10 / gcc version 4.3.2 / Debian lenny / LANG en_US.UTF-8

I have a file, named e.g. abc, which contains Chinese and English characters. Its content is shown below (I am not sure whether it will display correctly on the mailing list):

this is a 文件 vi 打的

Counting by hand, I get 8 words (each Chinese character counted as one word), but wc -w outputs 6. It seems that wc splits the input into tokens at white space, so consecutive Chinese characters are treated as a single word, which gives a total word count of 6.
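
To make the two ways of counting concrete, here is a minimal Python sketch. It is not wc's actual code, just an illustration: the function names are made up, and treating only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as "Chinese characters" is a simplifying assumption.

```python
def is_cjk_ideograph(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def count_whitespace_words(text: str) -> int:
    # Tokens separated by white space, similar in spirit to wc -w.
    return len(text.split())


def count_words_cjk_aware(text: str) -> int:
    # Each CJK ideograph counts as one word; any other run of
    # non-space characters counts as one word.
    count = 0
    in_word = False
    for ch in text:
        if is_cjk_ideograph(ch):
            count += 1
            in_word = False
        elif ch.isspace():
            in_word = False
        elif not in_word:
            count += 1
            in_word = True
    return count


line = "this is a 文件 vi 打的"
print(count_whitespace_words(line))  # 6
print(count_words_cjk_aware(line))   # 8
```

The first function gives the 6 that wc -w reports for this line, and the second gives the 8 I get by hand.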

I checked the source, and it does not seem to check whether the input characters are multi-byte (e.g. using wchar_t). So basically I just want to check whether this has already been done.
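
As a rough illustration of why multi-byte handling matters, the Python sketch below (again not wc's code, with a made-up helper name) compares the number of characters in the sample line with the number of bytes it occupies in UTF-8 and in Big5. A counter that works byte by byte sees the second kind of number, so it has to decode the input first before it can look at individual Chinese characters.

```python
def byte_and_char_counts(text: str, encoding: str) -> tuple[int, int]:
    # Returns (number of bytes in the given encoding, number of characters).
    return len(text.encode(encoding)), len(text)


line = "this is a 文件 vi 打的"

# 14 ASCII characters plus 4 Chinese characters.
print(byte_and_char_counts(line, "utf-8"))  # (26, 18): 3 bytes per Chinese character
print(byte_and_char_counts(line, "big5"))   # (22, 18): 2 bytes per Chinese character
```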

Thanks for your help,







