
Re: coreutils wc count multi bytes question



I got the value 8 for the sentence "this is a 文件 vi 打的" by counting manually, that is, in my head. So I think it should be correct : )

With wc -w the value is 6. As you confirmed, wc treats Chinese characters that are written together as a single word, so wc -w reports 6 instead of 8. I do not mean that wc should know Chinese well enough to count this correctly, but maybe (if needed) I could write a patch (if I can work it out) so that it counts, e.g., Chinese words correctly. If I remember correctly, there is a mapping table, so possibly this can be done. But of course, perhaps this is just wishful thinking on my part.
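
To illustrate the two ways of counting, here is a minimal Python sketch. It is not a coreutils patch and does not show how wc is implemented. The function names are made up for this example, and it assumes that every character in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as one word, which is only a rough approximation of real Chinese word segmentation.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is treated as Han.
# Extension blocks and other scripts are ignored in this sketch.
_HAN = re.compile(r"[\u4e00-\u9fff]")


def count_whitespace_words(text):
    """Count runs of non-whitespace characters, similar in spirit to wc -w."""
    return len(text.split())


def count_words_han_aware(text):
    """Count whitespace-separated words, but treat each Han character as its own word."""
    # Surround every Han character with spaces, then split on whitespace.
    spaced = _HAN.sub(lambda m: " " + m.group(0) + " ", text)
    return len(spaced.split())


sentence = "this is a 文件 vi 打的"
print(count_whitespace_words(sentence))  # whitespace-only count
print(count_words_han_aware(sentence))   # each Han character counted separately
```

For the sentence above, the first function gives 6 and the second gives 8, which matches the two numbers discussed in this thread.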

My English is not very good. Hope my reply is not rude.

Many thanks for your help,


--- On Fri, 6/2/09, Samuel Thibault <samuel.thibault@ens-lyon.org> wrote:

> From: Samuel Thibault <samuel.thibault@ens-lyon.org>
> Subject: Re: coreutils wc count multi bytes question
> To: "Neo Anderson" <javadeveloper999@yahoo.co.uk>
> Cc: debian-devel@lists.debian.org
> Date: Friday, 6 February, 2009, 11:27 PM
> Hello,
> 
> Neo Anderson, le Fri 06 Feb 2009 15:18:51 -0800, a écrit :
> > this is a 文件 vi 打的
> > 
> > Counting manually, I get 8 words.
> 
> How do you count that?
> 
> > But the output of wc -w is 6. It seems the input is split into
> > tokens on white space, so Chinese characters that are written
> > together are treated as one word, which gives a total word
> > count of 6.
> 
> Well, yes.  Do you mean that wc should know Chinese well enough to
> determine whether a few kanjis form a word or not?
> 
> Samuel
